\newcommand\norm[1]{|| #1 ||}
\newcommand{\myvertiii}[1]{{\vert\kern-0.2ex\vert\kern-0.2ex\vert #1 \vert\kern-0.2ex\vert\kern-0.2ex\vert}}
\newcommand{\norme}[1]{{\myvertiii{#1}}}
\newcommand\nnz[1]{{\textrm{nnz}(#1)}}
\newcommand\E{\mathbf{E}}
\newcommand\Dist{\mathbf{Dist}}
\newcommand\reals{{\mathrm{I\!R}}}
\newcommand\R{{\mathrm{I\!R}}}
\newcommand\poly{{\textrm{poly}}}
\newcommand\rank{{\textrm{rank}}}
\newcommand\colspan{{\textrm{range}}}
\newcommand\diam{{\textrm{diam}}}
\newcommand\polylog{{\textrm{polylog}}}
\newcommand\rowspan{{\textrm{rowspan}}}
\newcommand\colbasis{{\textrm{colbasis}}}
\newcommand\argmin{{\mathrm{argmin}}}
\newcommand\Err{{\mathtt{NE}}}
\newcommand\eps{{\epsilon}}
\newcommand\bfone{{\mathbf{1}}}
\newcommand\Var{{\mathbf{Var}}}
\newcommand\stdev[1]{{\mathbf{stdev(#1)}}}
\newcommand\cgf{{\mathbf{CGF}}}
\renewcommand\bfone{{\mathbb{1}}}
\newcommand\Iden{{\mathbf{I}}}
\newcommand\Prob{{\mathbf{Pr}}}
\DeclareMathOperator\subg{\mathbf{subG}}
\newcommand\cN{{\cal N}}
\newcommand\cS{{\cal S}}
\newcommand\cD{{\cal D}}
\newcommand\cM{{\cal M}}
\newcommand\cE{{\cal E}}
\newcommand\cC{{\cal C}}
\newcommand\KJL{K_{\mathbb{JL}}}
\newcommand\tO{{\tilde O}}
\newcommand\tx{{\tilde{\mathbf{x}}}}
\newcommand\tX{{\tilde{\mathbf{X}}}}
\newcommand\tY{{\tilde{\mathbf{Y}}}}
\newcommand\tA{{\tilde{\mathbf{A}}}}
\newcommand\bA{{\mathbf{A}}}
\newcommand\ba{{\mathbf{a}}}
\newcommand\bw{{\mathbf{w}}}
\newcommand\bW{{\mathbf{W}}}
\newcommand\bR{{\mathbf{R}}}
\newcommand\bB{{\mathbf{B}}}
\newcommand\bQ{{\mathbf{Q}}}
\newcommand\bZ{{\mathbf{Z}}}
\newcommand\bz{{\mathbf{z}}}
\newcommand\bU{{\mathbf{U}}}
\newcommand\bV{{\mathbf{V}}}
\newcommand\bX{{\mathbf{X}}}
\newcommand\bS{{\mathbf{S}}}
\newcommand\hS{\hat {\mathbf{S}}}
\newcommand\bs{{\mathbf{s}}}
\newcommand\bx{{\mathbf{x}}}
\newcommand\by{{\mathbf{y}}}
\newcommand\bb{{\mathbf{b}}}
\newcommand\bT{{\mathbf{T}}}
\newcommand\bP{{\mathbf{P}}}
\newcommand\bG{{\mathbf{G}}}
\newcommand\bY{{\mathbf{Y}}}
\newcommand\hY{\hat {\mathbf{Y}}}
\newcommand\bH{{\mathbf{H}}}
\newcommand\bD{{\mathbf{D}}}
\newcommand\be{{\mathbf{e}}}
\newcommand\br{{\mathbf{r}}}
\newcommand\bSig{{\mathbf{\Sigma}}}
\newcommand\bPi{{\mathbf{\Pi}}}
\newcommand\hphi{{\hat\phi}}
Sketching for Matrix Computations:
Sparse Embeddings
Ken Clarkson
IBM Research, Almaden
- These slides at
http://kenclarkson.org/G2S3_2015/
- Enable JavaScript; navigate as in PowerPoint
- Math may take some time to render properly; if it doesn't end up looking right, try reloading
Faster embeddings: CountSketch
A sketching matrix of i.i.d. sub-Gaussians is good but slow:
- \bS is an embedding with high probability
- Computing \bS\bA takes \Omega(mnd) time
We will show a faster scheme, and use the reduction
embedding \rightarrow mat mult \rightarrow single estimate
to analyze it.
Sparse Embeddings
- Adaptation of
CountSketch
from streaming algorithms
- \bS looks like:
\qquad\qquad\left[\begin{matrix}
0 & -1 & 0 & 0 & +1 & 0
\\ +1 & 0 & 0 & +1 & 0 & 0
\\ 0 & 0 & +1 & 0 & 0 & -1
\\ 0 & 0 & 0 & 0 & 0 & 0
\end{matrix}
\right]
- One random \pm 1 per column
- Row \bA_{i*} of \bA contributes \pm\bA_{i*} to one row of \bS\bA
- m= \tO(d^2)
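As a concrete illustration of the picture above (not part of the original slides), a dense CountSketch matrix like the one displayed can be generated as follows; the function name countsketch_matrix and the use of numpy are assumptions of this sketch.

```python
import numpy as np

def countsketch_matrix(m, n, seed=None):
    """Illustrative sketch: dense m x n CountSketch matrix,
    one random +/-1 entry per column."""
    rng = np.random.default_rng(seed)
    h = rng.integers(0, m, size=n)        # bucket h_i for each column i
    s = rng.choice([-1.0, 1.0], size=n)   # sign s_i for each column i
    S = np.zeros((m, n))
    S[h, np.arange(n)] = s                # S[h_i, i] = s_i, zero elsewhere
    return S

print(countsketch_matrix(4, 6, seed=0))   # e.g. a 4 x 6 example like the display above
```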
Sparse embeddings, more precisely
For
i\in [n],
pick uniformly and independently:
- h_i\in [m]
- s_i\in\{+1,-1\}
Define
\bS\in\R^{m\times n} by
\bS_{h_i, i}\gets s_i, \text{ for } i\in [n],
and
\bS_{j,i} \gets 0 otherwise
The s_i are "sign flips"
The array h hashes to m "hash buckets"
That is,
\bS_{j,*} = \sum_{i:\,h_i=j} s_i \be_i^\top,
so
[\bS\bA]_{j,*} = \sum_{i:\,h_i=j} s_i \be_i^\top \bA = \sum_{i:\,h_i=j} s_i \bA_{i,*}
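The last formula suggests computing \bS\bA in a single pass over the rows of \bA, in O(\nnz{\bA}) time, without ever forming \bS. A minimal numpy sketch of this (illustrative; the names are not from the slides):

```python
import numpy as np

def countsketch_apply(A, m, seed=None):
    """Illustrative sketch: compute S @ A for a CountSketch S,
    in one pass over the rows of A, without forming S."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    h = rng.integers(0, m, size=n)        # h_i: bucket receiving row i
    s = rng.choice([-1.0, 1.0], size=n)   # s_i: sign applied to row i
    SA = np.zeros((m, d))
    for i in range(n):                    # [SA]_{h_i,*} += s_i * A_{i,*}
        SA[h[i]] += s[i] * A[i]
    return SA
```

For sparse \bA, the same loop need only touch the nonzero entries of each row, which is where the O(\nnz{\bA}) cost comes from.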
Sparse Embeddings, More
- For \by = \bA\bx, each row of the sparse embedding \bS:
- Collects a subset of entries of \by (into a "bucket")
- Applies random signs
- Adds
- Result is \E[\norm{\bS\by}^2] = \norm{\by }_2^2 (checked numerically below):
- Each bucket, squared, \approx squared norm of its entries
- Splitting into buckets separates big entries
- Even with m = n buckets, we don't always get \norm{\by }_2^2 exactly
- (Entry \rightarrow Bucket is not one-to-one)
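A quick Monte Carlo check of \E[\norm{\bS\by}^2] = \norm{\by }_2^2 (illustrative only; the parameters are arbitrary and numpy is assumed):

```python
import numpy as np

# Illustrative check: average ||Sy||^2 over many CountSketch draws vs ||y||^2.
rng = np.random.default_rng(0)
n, m, trials = 50, 20, 20000
y = rng.standard_normal(n)

vals = []
for _ in range(trials):
    h = rng.integers(0, m, size=n)        # fresh sketch each trial
    s = rng.choice([-1.0, 1.0], size=n)
    Sy = np.zeros(m)
    np.add.at(Sy, h, s * y)               # bucket j accumulates the s_i y_i with h_i = j
    vals.append(Sy @ Sy)

print(np.mean(vals), y @ y)               # empirical E||Sy||^2 vs ||y||^2
```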
Analysis of sparse embeddings, 1/3
From the analysis via matrix multiplication,
it's enough to analyze, for unit vector \by,
\begin{align*}
\E[(\norm{\bS\by }^2-1)^2]
& = \E[\norm{\bS\by}^4] - 2\E[\norm{\bS\by}^2] + 1
\\ & = \E[\norm{\bS\by}^4] - 1,
\end{align*}
using \E[\norm{\bS\by}^2] = \norm{\by }^2 = 1.
Let \bz = \bS\by, and write \E_s for the expectation over the sign flips s_i (with h held fixed).
\begin{align*}
\E_s[ & (\norm{\bS\by }^2 - 1)^2] + 1
= \E_s[\norm{\bz}^4]
\\ & = \E_s\left[\left[\sum_j \bz_j^2\right]^2\right]
= \sum_{j,j'} \E_s[\bz_j^2\bz_{j'}^2]
\\ & = \sum_{j,j'} \E_s[\bz_j^2]\E_s[\bz_{j'}^2] + \sum_j \left(\E_s[\bz_j^4] - \E_s[\bz_j^2]^2\right)
\\ & = 1 + \sum_j \left(\E_s[\bz_j^4] - \E_s[\bz_j^2]^2\right)
\end{align*}
Here the first step uses that \bz_j and \bz_{j'} depend on disjoint sets of sign flips for j\ne j',
and the second uses \sum_j \E_s[\bz_j^2] = \sum_j\sum_{i:\,h_i=j}\by_i^2 = \norm{\by }^2 = 1 for any fixed h.
Analysis of sparse embeddings, 2/3
We have
\bz_j = [\bS\by]_j = \sum_{i:\,h_i=j} s_i \by_i,
so
\E_s[\bz_j^4] = \sum_{i,i',i'',i''':\, h_i = h_{i'} = h_{i''} = h_{i'''} = j} \E_s[s_i s_{i'} s_{i''} s_{i'''}]\,\by_i\by_{i'}\by_{i''}\by_{i'''},
where a term with i\ne i' is zero unless
i''=i and i'''=i', or the reverse. Counting these pairings,
\E_s[\bz_j^4] = 3\sum_{i\ne i':\, h_i=h_{i'}=j}\by_i^2\by_{i'}^2 + \sum_{i:\,h_i=j}\by_i^4,
while
\E_s[\bz_j^2]^2 = \Big[\sum_{i:\,h_i=j}\by_i^2\Big]^2 = \sum_{i\ne i':\, h_i=h_{i'}=j}\by_i^2\by_{i'}^2 + \sum_{i:\,h_i=j}\by_i^4,
so
\sum_j\left(\E_s[\bz_j^4] - \E_s[\bz_j^2]^2\right) = 2\sum_{i\ne i':\, h_i=h_{i'}}\by_i^2\by_{i'}^2.
Since \Prob[h_i = h_{i'}] = 1/m for i\ne i', taking the expectation over h gives
\E[(\norm{\bS\by }^2 - 1)^2] = \frac{2}{m}\sum_{i\ne i'}\by_i^2\by_{i'}^2 \le \frac{2}{m}\norm{\by }^4 = \frac{2}{m}.
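A rough Monte Carlo sanity check of this second-moment bound, for a random unit \by (illustrative; parameters arbitrary, numpy assumed):

```python
import numpy as np

# Illustrative check: empirical E[(||Sy||^2 - 1)^2] vs the 2/m bound.
rng = np.random.default_rng(1)
n, m, trials = 100, 25, 50000
y = rng.standard_normal(n)
y /= np.linalg.norm(y)                    # unit vector

errs = []
for _ in range(trials):
    h = rng.integers(0, m, size=n)
    s = rng.choice([-1.0, 1.0], size=n)
    Sy = np.zeros(m)
    np.add.at(Sy, h, s * y)
    errs.append((Sy @ Sy - 1.0) ** 2)

print(np.mean(errs), 2.0 / m)             # empirical second moment vs the 2/m bound
```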
Analysis of sparse embeddings, 3/3
We have for \bS a sparse embedding, and unit \by ,
\norme{\Err(\by )}_{2,\bS } = \E[(\norm{\bS\by }^2 - 1)^2]^{1/2}\le K/\sqrt{m},
for constant K.
By the bound for embedding via matrix multiplication,
if K/\sqrt{m}\le\beta\delta^{1/p} (here p=2),
then \bS is a (\beta r)-embedding with failure probability \delta,
where r=\rank(\bA).
With \delta = 1/9 (so \delta^{1/2} = 1/3) and \beta = \eps/r \ge 3K/\sqrt{m}, that is, m = O(r^2/\eps^2), this yields:
For
- \eps\in (0,1)
- r\equiv\rank(\bA )
There is
- m=O(r^2/\eps^2)
- a CountSketch matrix \bS\in\R^{m\times n}
such that, with failure probability 1/9,
\bS is an \eps-embedding of \colspan(\bA ).
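An (illustrative) end-to-end check of this statement: for \bU an orthonormal basis of \colspan(\bA), \bS is an \eps-embedding of \colspan(\bA) when every singular value of \bS\bU is within roughly 1\pm\eps of 1 (up to how the \eps-embedding property is quantified). The constants below are arbitrary, and numpy is assumed.

```python
import numpy as np

# Illustrative check: singular values of S @ U should all be close to 1.
rng = np.random.default_rng(2)
n, d = 2000, 10                            # here rank(A) = d
A = rng.standard_normal((n, d))
U, _, _ = np.linalg.svd(A, full_matrices=False)   # orthonormal basis of range(A)

m = 10 * d * d                             # m on the order of r^2; constant is arbitrary
h = rng.integers(0, m, size=n)
s = rng.choice([-1.0, 1.0], size=n)
SU = np.zeros((m, d))
np.add.at(SU, h, s[:, None] * U)           # S @ U without forming S

sv = np.linalg.svd(SU, compute_uv=False)
print(sv.min(), sv.max())                  # both should be close to 1
```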
Slightly less sparse embeddings
So far: one nonzero per column.
There are sparse embeddings where
the sketch dimension m is smaller, and
the number s of nonzeros per column is slightly larger.
There are sparse subspace \eps-embedding matrices \bS\in\R^{m\times n}
with sn nonzeros for matrices \bA\in\R^{n\times d} of rank r, such
that
s=O(\polylog(r)/\eps) and m=\tO(r/\eps^2), or
s=O(1/\eps) and m=O(r^{1+\gamma}/\eps^2) for any constant \gamma > 0.
In practice, small s > 1 has better performance [D15].
Hybrid approaches are also possible, resulting in similar
performance for some problems.
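One standard way to realize s > 1 nonzeros per column is an OSNAP-style block construction: stack s independent CountSketch blocks of m/s rows each and scale by 1/\sqrt{s}. The sketch below is illustrative; the function name, parameters, and constants are assumptions, not from the slides.

```python
import numpy as np

def osnap_apply(A, m, s, seed=None):
    """Illustrative sketch: compute S @ A where S stacks s independent
    CountSketch blocks of m//s rows each, scaled by 1/sqrt(s),
    so each column of S has s nonzeros equal to +/- 1/sqrt(s)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    b = m // s                            # rows per block (assume s divides m)
    SA = np.zeros((b * s, d))
    for k in range(s):                    # block k occupies rows [k*b, (k+1)*b)
        h = k * b + rng.integers(0, b, size=n)
        sg = rng.choice([-1.0, 1.0], size=n)
        np.add.at(SA, h, (sg / np.sqrt(s))[:, None] * A)
    return SA
```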