\newcommand\norm[1]{|| #1 ||}
\newcommand{\myvertiii}[1]{{\vert\kern-0.2ex\vert\kern-0.2ex\vert #1 \vert\kern-0.2ex\vert\kern-0.2ex\vert}}
\newcommand{\norme}[1]{{\myvertiii{#1}}}
\newcommand\nnz[1]{{\textrm{nnz}(#1)}}
\newcommand\E{\mathbf{E}}
\newcommand\Dist{\mathbf{Dist}}
\newcommand\reals{{\mathrm{I\!R}}}
\newcommand\R{{\mathrm{I\!R}}}
\newcommand\poly{{\textrm{poly}}}
\newcommand\rank{{\textrm{rank}}}
\newcommand\colspan{{\textrm{range}}}
\newcommand\diam{{\textrm{diam}}}
\newcommand\polylog{{\textrm{polylog}}}
\newcommand\rowspan{{\textrm{rowspan}}}
\newcommand\colbasis{{\textrm{colbasis}}}
\newcommand\argmin{{\mathrm{argmin}}}
\newcommand\Err{{\mathtt{NE}}}
\newcommand\eps{{\epsilon}}
\newcommand\bfone{{\mathbf{1}}}
\newcommand\Var{{\mathbf{Var}}}
\newcommand\stdev[1]{{\mathbf{stdev(#1)}}}
\newcommand\cgf{{\mathbf{CGF}}}
\renewcommand\bfone{{\mathbb{1}}}
\newcommand\Iden{{\mathbf{I}}}
\newcommand\Prob{{\mathbf{Pr}}}
\DeclareMathOperator\subg{\mathbf{subG}}
\newcommand\cN{{\cal N}}
\newcommand\cS{{\cal S}}
\newcommand\cD{{\cal D}}
\newcommand\cM{{\cal M}}
\newcommand\cE{{\cal E}}
\newcommand\cC{{\cal C}}
\newcommand\KJL{K_{\mathbb{JL}}}
\newcommand\tO{{\tilde O}}
\newcommand\tx{{\tilde{\mathbf{x}}}}
\newcommand\tX{{\tilde{\mathbf{X}}}}
\newcommand\tY{{\tilde{\mathbf{Y}}}}
\newcommand\tA{{\tilde{\mathbf{A}}}}
\newcommand\bA{{\mathbf{A}}}
\newcommand\ba{{\mathbf{a}}}
\newcommand\bw{{\mathbf{w}}}
\newcommand\bW{{\mathbf{W}}}
\newcommand\bR{{\mathbf{R}}}
\newcommand\bB{{\mathbf{B}}}
\newcommand\bQ{{\mathbf{Q}}}
\newcommand\bZ{{\mathbf{Z}}}
\newcommand\bz{{\mathbf{z}}}
\newcommand\bU{{\mathbf{U}}}
\newcommand\bV{{\mathbf{V}}}
\newcommand\bX{{\mathbf{X}}}
\newcommand\bS{{\mathbf{S}}}
\newcommand\hS{\hat {\mathbf{S}}}
\newcommand\bs{{\mathbf{s}}}
\newcommand\bx{{\mathbf{x}}}
\newcommand\by{{\mathbf{y}}}
\newcommand\bb{{\mathbf{b}}}
\newcommand\bT{{\mathbf{T}}}
\newcommand\bP{{\mathbf{P}}}
\newcommand\bG{{\mathbf{G}}}
\newcommand\bY{{\mathbf{Y}}}
\newcommand\hY{\hat {\mathbf{Y}}}
\newcommand\bH{{\mathbf{H}}}
\newcommand\bD{{\mathbf{D}}}
\newcommand\be{{\mathbf{e}}}
\newcommand\br{{\mathbf{r}}}
\newcommand\bSig{{\mathbf{\Sigma}}}
\newcommand\bPi{{\mathbf{\Pi}}}
\newcommand\hphi{{\hat\phi}}
Sketching for Matrix Computations:
Sparse Embeddings
Ken Clarkson
IBM Research, Almaden
- These slides at
http://kenclarkson.org/G2S3_2015/
- Enable JavaScript; navigate as in PowerPoint
- Math may take some time to render properly; if it doesn't end up looking right, try reloading
Faster embeddings: CountSketch
A sketching matrix of i.i.d. sub-Gaussians is good but slow:
- \bS is an embedding with high probability
- Computing \bS\bA takes \Omega(mnd) time
We will show a faster scheme, and use the reduction
embedding \rightarrow mat mult \rightarrow single estimate
to analyze it.
Sparse Embeddings
- Adaptation of
CountSketch
from streaming algorithms
- \bS looks like:
\qquad\qquad\left[\begin{matrix}
0 & -1 & 0 & 0 & +1 & 0
\\ +1 & 0 & 0 & +1 & 0 & 0
\\ 0 & 0 & +1 & 0 & 0 & -1
\\ 0 & 0 & 0 & 0 & 0 & 0
\end{matrix}
\right]
- One random \pm 1 per column
- Row \bA_{i*} of \bA contributes \pm\bA_{i*} to one row of \bS\bA
- m= \tO(d^2)
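As a concrete illustration of the picture above (not part of the original slides), a dense CountSketch matrix like the one displayed can be generated as follows; the function name countsketch_matrix and the use of numpy are assumptions of this sketch.

```python
import numpy as np

def countsketch_matrix(m, n, seed=None):
    """Illustrative sketch: dense m x n CountSketch matrix,
    one random +/-1 entry per column."""
    rng = np.random.default_rng(seed)
    h = rng.integers(0, m, size=n)        # bucket h_i for each column i
    s = rng.choice([-1.0, 1.0], size=n)   # sign s_i for each column i
    S = np.zeros((m, n))
    S[h, np.arange(n)] = s                # S[h_i, i] = s_i, zero elsewhere
    return S

print(countsketch_matrix(4, 6, seed=0))   # e.g. a 4 x 6 example like the display above
```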
Sparse embeddings, more precisely
For
i\in [n],
pick uniformly and independently:
- h_i\in [m]
- s_i\in\{+1,-1\}
Define
\bS\in\R^{m\times n} by
\bS_{h_i, i}\gets s_i, \text{ for } i\in [n],
and
\bS_{j,i} \gets 0 otherwise
The s_i are "sign flips"
The array h hashes to m "hash buckets"
That is,
\bS_{j,*} = \sum_{i:\,h_i=j} s_i \be_i^\top,
so
[\bS\bA]_{j,*} = \sum_{i:\,h_i=j} s_i \be_i^\top \bA = \sum_{i:\,h_i=j} s_i \bA_{i,*}
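The last formula suggests computing \bS\bA in a single pass over the rows of \bA, in O(\nnz{\bA}) time, without ever forming \bS. A minimal numpy sketch of this (illustrative; the names are not from the slides):

```python
import numpy as np

def countsketch_apply(A, m, seed=None):
    """Illustrative sketch: compute S @ A for a CountSketch S,
    in one pass over the rows of A, without forming S."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    h = rng.integers(0, m, size=n)        # h_i: bucket receiving row i
    s = rng.choice([-1.0, 1.0], size=n)   # s_i: sign applied to row i
    SA = np.zeros((m, d))
    for i in range(n):                    # [SA]_{h_i,*} += s_i * A_{i,*}
        SA[h[i]] += s[i] * A[i]
    return SA
```

For sparse \bA, the same loop need only touch the nonzero entries of each row, which is where the O(\nnz{\bA}) cost comes from.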
Sparse Embeddings, More
- For \by = \bA\bx, each row of the sparse embedding \bS:
- Collects a subset of entries of \by (into a "bucket")
- Applies random signs
- Adds
- Result is \E[\norm{\bS\by}^2] = \norm{\by }_2^2 (checked numerically below):
- Each bucket, squared, \approx squared norm of its entries
- Splitting into buckets separates big entries
- Even with m = n buckets, we don't always get \norm{\by }_2^2 exactly
- (Entry \rightarrow Bucket is not one-to-one)
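A quick Monte Carlo check of \E[\norm{\bS\by}^2] = \norm{\by }_2^2 (illustrative only; the parameters are arbitrary and numpy is assumed):

```python
import numpy as np

# Illustrative check: average ||Sy||^2 over many CountSketch draws vs ||y||^2.
rng = np.random.default_rng(0)
n, m, trials = 50, 20, 20000
y = rng.standard_normal(n)

vals = []
for _ in range(trials):
    h = rng.integers(0, m, size=n)        # fresh sketch each trial
    s = rng.choice([-1.0, 1.0], size=n)
    Sy = np.zeros(m)
    np.add.at(Sy, h, s * y)               # bucket j accumulates the s_i y_i with h_i = j
    vals.append(Sy @ Sy)

print(np.mean(vals), y @ y)               # empirical E||Sy||^2 vs ||y||^2
```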
Analysis of sparse embeddings, 1/3
From the analysis via matrix multiplication,
it's enough to analyze, for unit vector \by,
\begin{align*}
\E[(\norm{\bS\by }^2-1)^2]
& = \E[\norm{\bS\by}^4] - 2\E[\norm{\bS\by}^2] + 1
\\ & = \E[\norm{\bS\by}^4] - 1,
\end{align*}
using \E[\norm{\bS\by}^2] = \norm{\by }^2 = 1.
Let \bz = \bS\by, and write \E_s for the expectation over the sign flips s_i (with h held fixed).
\begin{align*}
\E_s[ & (\norm{\bS\by }^2 - 1)^2] + 1
= \E_s[\norm{\bz}^4]
\\ & = \E_s\left[\left[\sum_j \bz_j^2\right]^2\right]
= \sum_{j,j'} \E_s[\bz_j^2\bz_{j'}^2]
\\ & = \sum_{j,j'} \E_s[\bz_j^2]\E_s[\bz_{j'}^2] + \sum_j \left(\E_s[\bz_j^4] - \E_s[\bz_j^2]^2\right)
\\ & = 1 + \sum_j \left(\E_s[\bz_j^4] - \E_s[\bz_j^2]^2\right)
\end{align*}
Here the first step uses that \bz_j and \bz_{j'} depend on disjoint sets of sign flips for j\ne j',
and the second uses \sum_j \E_s[\bz_j^2] = \sum_j\sum_{i:\,h_i=j}\by_i^2 = \norm{\by }^2 = 1 for any fixed h.
Analysis of sparse embeddings, 2/3
We have
\bz_j = [\bS\by]_j = \sum_{i:\,h_i=j} s_i \by_i,
so
\E_s[\bz_j^4] = \sum_{i,i',i'',i''':\, h_i = h_{i'} = h_{i''} = h_{i'''} = j} \E_s[s_i s_{i'} s_{i''} s_{i'''}]\,\by_i\by_{i'}\by_{i''}\by_{i'''},
where a term with i\ne i' is zero unless
i''=i and i'''=i', or the reverse. Counting these pairings,
\E_s[\bz_j^4] = 3\sum_{i\ne i':\, h_i=h_{i'}=j}\by_i^2\by_{i'}^2 + \sum_{i:\,h_i=j}\by_i^4,
while
\E_s[\bz_j^2]^2 = \Big[\sum_{i:\,h_i=j}\by_i^2\Big]^2 = \sum_{i\ne i':\, h_i=h_{i'}=j}\by_i^2\by_{i'}^2 + \sum_{i:\,h_i=j}\by_i^4,
so
\sum_j\left(\E_s[\bz_j^4] - \E_s[\bz_j^2]^2\right) = 2\sum_{i\ne i':\, h_i=h_{i'}}\by_i^2\by_{i'}^2.
Since \Prob[h_i = h_{i'}] = 1/m for i\ne i', taking the expectation over h gives
\E[(\norm{\bS\by }^2 - 1)^2] = \frac{2}{m}\sum_{i\ne i'}\by_i^2\by_{i'}^2 \le \frac{2}{m}\norm{\by }^4 = \frac{2}{m}.
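A rough Monte Carlo sanity check of this second-moment bound, for a random unit \by (illustrative; parameters arbitrary, numpy assumed):

```python
import numpy as np

# Illustrative check: empirical E[(||Sy||^2 - 1)^2] vs the 2/m bound.
rng = np.random.default_rng(1)
n, m, trials = 100, 25, 50000
y = rng.standard_normal(n)
y /= np.linalg.norm(y)                    # unit vector

errs = []
for _ in range(trials):
    h = rng.integers(0, m, size=n)
    s = rng.choice([-1.0, 1.0], size=n)
    Sy = np.zeros(m)
    np.add.at(Sy, h, s * y)
    errs.append((Sy @ Sy - 1.0) ** 2)

print(np.mean(errs), 2.0 / m)             # empirical second moment vs the 2/m bound
```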
Analysis of sparse embeddings, 3/3
We have for \bS a sparse embedding, and unit \by ,
\norme{\Err(\by )}_{2,\bS } = \E[(\norm{\bS\by }^2 - 1)^2]^{1/2}\le K/\sqrt{m},
for constant K.
By the bound for embedding via matrix multiplication,
if K/\sqrt{m}\le\beta\delta^{1/p} (here p=2),
then \bS is a (\beta r)-embedding with failure probability \delta,
where r=\rank(\bA).
With \delta = 1/9 (so \delta^{1/2} = 1/3) and \beta = \eps/r \ge 3K/\sqrt{m}, that is, m = O(r^2/\eps^2), this yields:
For
- \eps\in (0,1)
- r\equiv\rank(\bA )
There is
- m=O(r^2/\eps^2)
- a CountSketch matrix \bS\in\R^{m\times n}
such that, with failure probability 1/9,
\bS is an \eps-embedding of \colspan(\bA ).
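An (illustrative) end-to-end check of this statement: for \bU an orthonormal basis of \colspan(\bA), \bS is an \eps-embedding of \colspan(\bA) when every singular value of \bS\bU is within roughly 1\pm\eps of 1 (up to how the \eps-embedding property is quantified). The constants below are arbitrary, and numpy is assumed.

```python
import numpy as np

# Illustrative check: singular values of S @ U should all be close to 1.
rng = np.random.default_rng(2)
n, d = 2000, 10                            # here rank(A) = d
A = rng.standard_normal((n, d))
U, _, _ = np.linalg.svd(A, full_matrices=False)   # orthonormal basis of range(A)

m = 10 * d * d                             # m on the order of r^2; constant is arbitrary
h = rng.integers(0, m, size=n)
s = rng.choice([-1.0, 1.0], size=n)
SU = np.zeros((m, d))
np.add.at(SU, h, s[:, None] * U)           # S @ U without forming S

sv = np.linalg.svd(SU, compute_uv=False)
print(sv.min(), sv.max())                  # both should be close to 1
```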
Slightly less sparse embeddings
So far: one nonzero per column.
There are sparse embeddings where
the sketch dimension m is smaller, and
the number s of nonzeros per column is slightly larger.
There are sparse subspace \eps-embedding matrices \bS\in\R^{m\times n}
with sn nonzeros for matrices \bA\in\R^{n\times d} of rank r, such
that
s=O(\polylog(r)/\eps) and m=\tO(r/\eps^2), or
s=O(1/\eps) and m=O(r^{1+\gamma}/\eps^2) for any constant \gamma > 0.
In practice, small s > 1 has better performance [D15].
Hybrid approaches are also possible, resulting in similar
performance for some problems.
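One standard way to realize s > 1 nonzeros per column is an OSNAP-style block construction: stack s independent CountSketch blocks of m/s rows each and scale by 1/\sqrt{s}. The sketch below is illustrative; the function name, parameters, and constants are assumptions, not from the slides.

```python
import numpy as np

def osnap_apply(A, m, s, seed=None):
    """Illustrative sketch: compute S @ A where S stacks s independent
    CountSketch blocks of m//s rows each, scaled by 1/sqrt(s),
    so each column of S has s nonzeros equal to +/- 1/sqrt(s)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    b = m // s                            # rows per block (assume s divides m)
    SA = np.zeros((b * s, d))
    for k in range(s):                    # block k occupies rows [k*b, (k+1)*b)
        h = k * b + rng.integers(0, b, size=n)
        sg = rng.choice([-1.0, 1.0], size=n)
        np.add.at(SA, h, (sg / np.sqrt(s))[:, None] * A)
    return SA
```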