Sub-Gaussian Matrices, G2S3 2015, Sketching Matrices

\( \newcommand\norm[1]{|| #1 ||} \newcommand\nnz[1]{{\textrm{nnz}(#1)}} \newcommand\E{\mathbf{E}} \newcommand\Dist{\mathbf{Dist}} \newcommand\reals{{\mathrm{I\!R}}} \newcommand\R{{\mathrm{I\!R}}} \newcommand\poly{{\textrm{poly}}} \newcommand\rank{{\textrm{rank}}} \newcommand\colspan{{\textrm{range}}} \newcommand\diam{{\textrm{diam}}} \newcommand\polylog{{\textrm{polylog}}} \newcommand\rowspan{{\textrm{rowspan}}} \newcommand\colbasis{{\textrm{colbasis}}} \newcommand\argmin{{\mathrm{argmin}}} \newcommand\Err{{\mathtt{NE}}} \newcommand\eps{{\epsilon}} \newcommand\bfone{{\mathbf{1}}} \newcommand\Var{{\mathbf{Var}}} \newcommand\stdev[1]{{\mathbf{stdev(#1)}}} \newcommand\cgf{{\mathbf{CGF}}} \newcommand\bfone{{\mathbb{1}}} \newcommand\Iden{{\mathbf{I}}} \newcommand\Prob{{\mathbf{Pr}}} \DeclareMathOperator\subg{\mathbf{subG}} \newcommand\cN{{\cal N}} \newcommand\cS{{\cal S}} \newcommand\cD{{\cal D}} \newcommand\cM{{\cal M}} \newcommand\cE{{\cal E}} \newcommand\cC{{\cal C}} \newcommand\KJL{K_{\mathbb{JL}}} \newcommand\tO{{\tilde O}} \newcommand\tx{{\tilde{\mathbf{x}}}} \newcommand\tX{{\tilde{\mathbf{X}}}} \newcommand\tY{{\tilde{\mathbf{Y}}}} \newcommand\tA{{\tilde{\mathbf{A}}}} \newcommand\bA{{\mathbf{A}}} \newcommand\ba{{\mathbf{a}}} \newcommand\bw{{\mathbf{w}}} \newcommand\bW{{\mathbf{W}}} \newcommand\bR{{\mathbf{R}}} \newcommand\bB{{\mathbf{B}}} \newcommand\bQ{{\mathbf{Q}}} \newcommand\bZ{{\mathbf{Z}}} \newcommand\bz{{\mathbf{z}}} \newcommand\bU{{\mathbf{U}}} \newcommand\bV{{\mathbf{V}}} \newcommand\bX{{\mathbf{X}}} \newcommand\bS{{\mathbf{S}}} \newcommand\hS{\hat {\mathbf{S}}} \newcommand\bs{{\mathbf{s}}} \newcommand\bx{{\mathbf{x}}} \newcommand\by{{\mathbf{y}}} \newcommand\bb{{\mathbf{b}}} \newcommand\bT{{\mathbf{T}}} \newcommand\bP{{\mathbf{P}}} \newcommand\bG{{\mathbf{G}}} \newcommand\bY{{\mathbf{Y}}} \newcommand\hY{\hat {\mathbf{Y}}} \newcommand\bH{{\mathbf{H}}} \newcommand\bD{{\mathbf{D}}} \newcommand\be{{\mathbf{e}}} \newcommand\br{{\mathbf{r}}} \newcommand\bSig{{\mathbf{\Sigma}}} \newcommand\bPi{{\mathbf{\Pi}}} \newcommand\hphi{{\hat\phi}} \)

Sketching for Matrix Computations:
Sub-Gaussian Matrices

Ken Clarkson
IBM Research, Almaden

These slides at
http://kenclarkson.org/G2S3_2015/
Enable javascript, navigate as in powerpoint
Math may take some time to render properly; if it doesn't end up looking right, try reloading

The very beginning: getting the right mean

If `X` is a random variable with `\E[X] = 0`, then for `\alpha\in\R`:

`\Var[\alpha X] = \E[(\alpha X)^2] = \alpha^2\Var[X]`
If `Y` is random and independent of `X`, then `\Var[X+Y] = \Var[X] +\Var[Y]`

If `\bs\in\R^{1\times n}` is a random row vector with

`\E[\bs ] = 0`, `\Var[\bs ] = (1,1,\ldots, 1) = \bfone`, and
`\bs ` has independent coordinates

Then for `\by\in\R^n`:
`\quad\E[\bs\by ] = \E[\sum_i\bs_i\by_i] = \sum_i\E[\bs_i]\by_i = \E[\bs ]\by = 0`, and
`\quad\E[(\bs\by )^2] = \Var[\bs\by ] = \sum_i\Var[\bs_i\by_i] = \sum_i\Var[\bs_i]\by_i^2 = \norm{\by }^2`

Norm and product estimation

Suppose `\bs\in\R^{1\times n}` has
`\quad\E[\bs ] =0`, (`\bs ` has centered entries)
`\quad\Var[\bs ]= \bfone_{1\times n}`
`\quad`independent entries
Then

Norm estimation: for any fixed `\by\in\R^n`, `\E[(\bs\by )^2] = \norm{\by }^2`, and
Product estimation: `\E[\bs\bs^\top ] = \E\left[ \begin{matrix} \bs_1^2 & \bs_1\bs_2& \ldots & \bs_1\bs_n \\ \bs_2\bs_1 & \bs_2^2 & \ldots \\... \\ \bs_n\bs_1 & \bs_d\bs_2 & \ldots & \bs_n^2 \end{matrix}\right] = \Iden`,
and so for any fixed `\bx ,\by\in\R^n`, `\E[(\bs\bx )^\top \bs\by ] = \E[\bx^\top\bs^\top\bs\by ] = \bx^\top\E[\bs^\top\bs ]\by = \bx^\top\Iden\by = \bx^\top\by `

Isotropic random vectors

The conditions on `\bs` can be generalized:

A random row vector `\bs\in\R^n` is isotropic when `\E[\bs^\top \bs] = \Iden_n`.

If `\bs` has an isotropic distribution, then it yields an unbiased estimator of norms and dot products.

When `\bs` is centered, `\E[\bs^\top\bs]` is the covariance matrix of `\bs`, so
`\bs` centered and isotropic `\implies` entries of `\bs` uncorrelated.

It's sufficient but not necessary for the entries of `\bs` to be independent. Here we will just assume independence.

Concentration : averaging

Expectations are not enough: we want an estimate that holds with high probability.

We use an average of several sketches, stacking `m` scaled istropic row vectors on top of each other.

Suppose `\bS\in\R^{m\times n}` has
`\quad\E[\bS ] =0` (centered entries),
`\quad\E[\bS^\top\bS ]= \sum_j\E[\bS_{j*}^\top\bS_{j*}] = \Iden_n`, each `\bS_{j*}` having `\Var[\bS_{j*}] = \bfone/m`
`\quad`independent entries
Then

Product estimation: `\E[(\bS\bx )^\top\bS\by ] = \E[(\bx^\top\bS^\top)\bS\by ] = \bx^\top\E[\bS^\top\bS ]\by = \bx^\top\Iden\by = \bx^\top\by `
Norm estimation: letting `\bx = \by `, `\E[\norm{\bS\by }^2] = \E[(\bS\by )^\top\bS\by ] = \by^\top\by = \norm{\by }^2`

Concentration

We want \begin{equation} \Prob\{\norm{\bS\by }^2 \approx_\eps \norm{\by }^2\} \label{eq:foo} \end{equation} close to 1.

We will consider sketching vectors and matrices with entries satisfying the sub-Gaussian condition (defined shortly).

A sub-Gaussian sketching vector or matrix has i.d. sub-Gaussian entries

As discussed next:

`\bs\by `, a sum of sub-Gaussians, is also sub-Gaussian
`(\bs\by )^2` is a sub-exponential random variable
`\norm{\bS\by }^2 = \sum_{j\in [m]} (\bS_{j*}\by )^2` is :
- Also sub-exponential
- Concentrated (as in\eqref{eq:foo})

Cumulant generating functions

cumulant generating functions `\cgf_X(t)`
The cumulant generating function of real-valued random variable X is
`\cgf_X (t) = \log\E[\exp(tX)]`, for `t\in\R`
It holds that
`\cgf_X(t) = \E[X]t +\frac12\Var[X]t^2 +\sum_{i>2}\frac1{i!}\kappa_i t^i\ldots`
where for `i\ge 1`, `\kappa_i` is the `i`th cumulant

Examples:

(Gaussian) If `X\sim N(0,1)`, then `\cgf_X(t) = \frac12t^2`
(Sign) If `X= \pm 1` with equal probability, `\cgf_X(t) = \log((\exp(t) +\exp(-t))/2) = \log\cosh(t)`

additivity of `\cgf` If `X` and `Y` are independent, then `\cgf_{X+Y}(t) = \cgf_X(t) +\cgf_Y(t)`

Sub-Gaussians

`X` sub-Gaussian`\iff` there is value `\subg[X]` with `\cgf_{X-\E[X]}(t)\le\subg[X]^2 t^2/2` for all `t`

Examples:

(Gaussian) If `X\sim N(0,\sigma)` , `\subg[X]= \sigma`
(Sign) If `X= \pm 1` with equal probability, `\frac{e^t + e^{-t}}{2}\le e^{t^2/2}\implies \cgf_X(t) = \log\cosh(t)\le t^2/2 \implies \subg[X]\le 1`
(Bounded) If `|X|\le M` for some `M`, then `\cgf_{X-\E[X]} (t)\le M^2t^2/2 \implies\subg[X]\le M`

If `X` is centered sub-Gaussian, `\cgf_{\alpha X}(t) = \cgf_X (\alpha t)\le\subg[X]^2\alpha^2 t^2/2`

If `\bs` is a sub-Gaussian sketching vector with `\gamma\equiv \max_{i\in [n]} \subg[\bs_i]`,
then `\cgf_{\bs\by }(t) = \sum_{i\in [n]}\cgf_{\bs_i\by_i}(t)\le\sum_{i\in [n]}\by_i^2\gamma^2 t^2/2\le\gamma^2\norm{\by }^2t^2/2`
That is, `\bs\by ` is sub-Gaussian with `\subg[\bs\by] \le \gamma \norm{\by }`
(Recall that `\Var(\bs\by) = \sigma^2\norm{\by}^2` under analogous conditions.)

Concentration for sketching vectors

If `Y\ge 0` then for `C\gt 0`, `\Prob\{Y\ge C\}\le\E[Y]/C`.
Letting `Y= \exp(tX)` and `C= \exp(t\alpha)` for `\alpha,t\ge 0`, with `X` centered, `\Prob\{X\ge\alpha\} = \Prob\{\exp(tX)\ge\exp(t\alpha)\}\le\E[\exp(tX)]\exp(-t\alpha)`

If `X` is sub-Gaussian then
`\begin{align*} \log\Prob\{X\ge\alpha\} & \le\inf_{t\ge 0} -t\alpha +\cgf_X(t) \\ & = -\sup_{t\ge 0} t\alpha -\cgf_X(t) \le -\alpha^2/2\subg[X]^2 \end{align*}`

(Irrelevant fun fact: that `\sup` is the convex conjugate (a.k.a. Legendre-Fenchel transform) of `\cgf_X(t)`)
Putting this together with the CGF of a linear combination, a "Hoeffding-like":

If `\bs` is a sub-Gaussian sketching vector with `\subg[\bs_i]\le\sigma` for `i\in [n]`,
then
`\Prob\{\bs\by\ge\beta\} \le \exp(-\beta^2/2\sigma^2\norm{y}^2)`

Concentration for sketching matrices, 1/4

But:

We want to have an estimate closer to `\norm{\by }`
We need even yet more concentration

We turn back to `\bS\in\R^{m\times n}`, a stack of sketching vectors
The summands in `\norm{\bS\by }^2 = \sum_{j\in[m]} (\bS_{j*}\by )^2` are sub-exponential:

`X^2` is sub-exponential `\iff` `X` is sub-Gaussian

So, the tail of `Y=X^2` decays "only" exponentially:
`\log\Prob\{Y\ge\beta\}\le -\beta/2\subg[X]^2`

Concentration for sketching matrices, 2/4

For small `t`, the `\cgf` of a centered sub-exponential looks like that of a centered sub-Gaussian:

sub-exponential CGF bound (NPH) if `Y=X^2-\E[X^2]`
for sub-Gaussian `X`, then $$\cgf_Y(t)\le K_1 t^2\subg[X]^4$$ when `|t|\le K_2 /\subg[X]^2`, for constants `K_1` and `K_2`

Concentration for sketching matrices, 3/4

A "Bernstein-like" tail estimate for sums of sub-exponentials:

If `X_i`, for `i\in [m]`, are centered sub-Gaussian with `\subg[X_i]\le\sigma/\sqrt{m}` and `\E[X_i^2] = 1/m`, then `Z= \sum_{i\in [m]} X_i^2` has $$Z\approx_\eps\E[Z]=1$$ with failure probability at most $$2\exp(-K_3 m\min\{\eps^2/\sigma^4,\eps/\sigma^2\}),$$ for a constant `K_3`.

Proof: When `|t|\le K_2/\sigma^2`, $$\cgf_{Z-\E [Z]}(t) = \sum_{i\in[m]}\cgf_{X_i^2-\E [X_i^2]}(t)\le K_1 t^2\sigma^4/m,$$ So
`\begin{align*} \log\Prob\{Z-\E[Z]\ge\alpha\} & \le\inf_{t\in [0,K_2m/\sigma^2] } -t\alpha + K_1 t^2\sigma^4/m, \\ & = -\min\{m\alpha^2/4K_1\sigma^4, m\alpha K_2/2\sigma^2\} \end{align*}`
Now let `\alpha= \eps\E[Z] = \eps`. Bound for `-Z`, union bound.

Concentration for sketching matrices, 4/4

Translated to sketching matrices,

Fix unit `\by\in\R^n`. Let `\bS\in\R^{m\times n}` with i.i.d. sub-Gaussian entries, and with `\E[\norm{\bS\by }^2] = 1` and `\subg[\bS_{ij}]\le 1`. Then $$\norm{\bS\by }^2\approx_\eps 1$$ with failure probability at most $$2\exp(-\KJL m\eps^2 ),$$ for a constant `\KJL`.

That is, independent sub-Gaussian entries yield sketching matrices so that with very high probability, the norm of a single vector is preserved.

More than one vector: finite sets

We want sketching matrices that preserve the norms of large sets of vectors.

For a set of `k` vectors, $$m\ge\log(4k)/\KJL\eps^2$$ and a union bound gives a total failure probability

$$\begin{align*} k\times & 2\exp(-\KJL m\eps^2 ) \\ & \le k\times 2\exp(-\KJL\frac{\log(4k)}{\KJL\eps^2}\eps^2 ) \\ & \lt 1/2 \end{align*}$$

More than one vector: distances

For the "classic" JL result, preserving distances between rows of `\bA`, apply norm-approximation to `\lt n^2` vectors of the form `\bA_{i,*} -\bA_{j,*}, i,j\in [n]`.

We have:

Let `\bS\in\R^{m\times n}` with i.i.d. sub-Gaussian entries, and with `\E[\norm{\bS\by }^2] = 1` and `\subg[\bS_{ij}]\le 1`. For given `\bA\in\R^{n\times d}`, there is `m\le 2\KJL\eps^{-2}\log(2n)n` such that the set of `n` vectors `\bA_{i,*}, i\in [n]` has $$\norm{\bS(\bA_{i,*} - \bA_{j,*})}^2\approx_\eps \norm{\bA_{i,*} - \bA_{j,*}}$$ for all `i,j\in [n]`, with failure probability at most $1/2$.

More than one vector: dot products

For given `\bS `, let $$\Err_\bS(\ba,\bb)\equiv\frac{ |\ba^\top\bS^\top\bS\bb -\ba^\top\bb |}{\norm{\ba}\,\norm{\bb}}\text{ and }\Err_\bS(\ba)\equiv\Err_\bS(\ba,\ba).$$

So $$\norm{\bS\by }^2\approx_\eps\norm{\by }^2\iff\Err(\by )\le\eps.$$

From norm err to dot err $$\Err(\ba ,\bb ) = \Err(\frac{\ba }{\norm{\ba }},\frac{\bb }{\norm{\bb }}) \le\Err(\frac{\ba }{\norm{\ba }}+\frac{\bb }{\norm{\bb }}) +\Err(\frac{\ba }{\norm{\ba }}-\frac{\bb }{\norm{\bb }}) $$

More generally,
$$\max_{i,j\in[k]}\Err(\mathbf{v}_i,\mathbf{v}_j)\le 2\max_{i,j\in[k]}\Err(\frac{\mathbf{v}_i}{\norm{\mathbf{v}_i}}\pm\frac{\mathbf{v}_j}{\norm{\mathbf{v}_j}})$$

So an embedding for `\lt 2k^2` vectors preserves pairwise dot products of `k` vectors

More than one vector: `\eps`-nets

We want to show sketching is good for infinite sets of vectors: linear spaces.

To do this, we find a finite set of vectors `\cN` so that:

`\bS ` good on `\cN` `\implies` `\bS ` good for all.

Our set `\cN` is an `\eps`-net on a unit sphere `\cal S`.

`\eps`-nets

`\cN= \cN(\epsilon)` is an `\epsilon`-net if it is both an:

`\epsilon`-packing: all `p\in\cN` at least `\epsilon` from `\cN`
- `D(p,\cN\setminus\{p\})\ge\epsilon` for `p\in\cN`
`\epsilon`-covering: all `p\in\cal S` at most `\epsilon` from `\cN`
- `D(p,\cN)\le\epsilon` for `p\in\cal S`, so `D(\cN,{\cal S})\le\epsilon`

Not range-space `\epsilon`-nets
We only really need the covering property, but a maximal packing is a minimal covering

Sphere covering number (NPH)
The unit sphere in `\R^d` has an `\eps`-net of size at most `(1+\frac2\eps)^d`

From `\eps`-nets to `\R^d`: dilation

For `\cN` an `\eps`-net of the unit sphere `\cS` in `\R^d` and `\eps\lt 1`,
if matrix `\bW ` is symmetric, then $$\norm{\bW }_2\le\norm{\bW }_\cN/(1-2\eps),$$ where $$\norm{\bW }_\cN\equiv\sup_{\bx\in\cN} |\bx^\top\bW\bx |,$$ and recalling $$\norm{\bW }_2 = \sup_{\bx\in\cS} |\bx^\top\bW\bx |.$$

Proof:
Let unit `\bx ` have `| \bx^\top\bW\bx | = \norm{\bW }_2`. (There is such an `\bx`.)

Let `\mathbf{z}` have `\norm{\mathbf{z}}\le\eps` and `\bx -\mathbf{z}\in\cN`.

We have
`\begin{align*} \norm{\bW }_2 & = | \bx^\top\bW\bx | = | (\bx -\mathbf{z})^\top\bW (\bx -\mathbf{z}) +\mathbf{z}^\top\bW\bx +\mathbf{z}^\top\bW (\bx -\mathbf{z}) | \\ & \le\norm{\bW }_\cN + 2\eps\norm{\bW }_2. \end{align*}`

So `\norm{\bW }_2 \le\norm{\bW }_\cN/(1-2\eps)`, as claimed

Norms and embeddings

$$\norm{\bB^\top\bB -\Iden}_2 = \sup_{\bx\in\cS} |\,\norm{\bB\bx }^2 - 1| $$

Proof:
`\norm{\bB^\top\bB -\Iden}_2 = \sup_{\bx\in\cS} |\, \bx^\top(\bB^\top\bB -\Iden) \bx \,|`, as noted.

and: `\bx^\top(\bB^\top\bB -\Iden) \bx = \bx^\top\bB^\top \bB \bx - \bx^\top \bx = \norm{\bB\bx}^2 - \norm{\bx}^2`

Sketching matrix embedding `\R^n`

Let `\bS\in\R^{m\times n}` with i.i.d. sub-Gaussian entries, and with `\E[\norm{\bS\by}^2] = 1` for unit `\by`, and `\subg[\bS_{ij}]\le 1`. There is `K>0` so that with failure probability at most `\exp(-n)`,
`\bS ` is a `2\sqrt{K n/m}`-embedding for `\R^n`, for small enough `n/m`.

Pick a `(1/4)`-net `\cN` of `\cS` in `\R^n`;
the covering number bound says `|\cN|\le (1+2/(1/4))^n = 9^n`.

Let `\bW\equiv \bS^\top \bS - \Iden`.
From sketching matrix concentration, and a union bound, $$\norm{\bW}_\cN = \sup_{\bx\in\cN} |\,\bx^\top W \bx\,| = \sup_{\bx\in\cN} |\,\norm{\bS\bx }^2 - 1 |\le \eps,$$ with failure probability at most `9^ n 2\exp(-\KJL m\eps^2)`.

Assuming success, the dilation bound implies
$$\sup_{\bx\in\cS} |\,\norm{\bS\bx }^2 - 1 \,| \le \norm{\bW}_2 \le \norm{\bW}_\cN/(1-2(1/4)) \le 2\eps,$$ as claimed.

Sketching matrix embedding `\R^n`:
failure probability

For `\eps= \sqrt{Kn/m}`,
`\begin{align*} 9^ n 2\exp(-\KJL m\eps^2) & \le\exp(n\log(9) -\KJL m (\sqrt{Kn/m})^2 ) \\ & = \exp(-n(\KJL K -\log(9))) \le \exp(-n) \end{align*}`
for large enough `K`.

Subspace Embedding

Consider `\bS ` for `\colspan(\bA )\equiv\{\bA\bx\mid\bx\in\R^d\}`
`\colspan(\bA )`:
- Is isomorphic to `\R^{\rank(\bA)}`
- Has `\eps_0`-nets of size `(1+2/\eps_0)^{\rank(A)} \le (1+2/\eps_0)^d`
Apply dilation bound to bound `\norm{\bW }_2` for `\bW = \bA^\top\bS^\top\bS\bA - \bA^\top\bA = \bA^\top\bS^\top\bS\bA - \Iden` (assuming WLOG that `\bA` is orthonormal)

Let `\bS\in\R^{m\times n}` with

i.i.d. sub-Gaussian entries
`\E[\norm{\bS\by }^2] = \norm{\by}^2`
`\subg[\bS_{ij}]\le C` for a constant `C`.

For `\bA\in\R^{n\times d}`, there is `K>0` so that with failure probability at most `\exp(-d)`,
`\bS ` is a `2\sqrt{K d/m}`-embedding for `\colspan(\bA )`.
That is, there is `m=O(d/\eps^2)` so that `\bS ` is a subspace `\eps`-embedding for `\bA`.

Sketching for Matrix Computations: Sub-Gaussian Matrices

Ken Clarkson IBM Research, Almaden