\(
\newcommand\norm[1]{|| #1 ||}
\newcommand\nnz[1]{{\textrm{nnz}(#1)}}
\newcommand\E{\mathbf{E}}
\newcommand\Dist{\mathbf{Dist}}
\newcommand\reals{{\mathrm{I\!R}}}
\newcommand\R{{\mathrm{I\!R}}}
\newcommand\poly{{\textrm{poly}}}
\newcommand\rank{{\textrm{rank}}}
\newcommand\colspan{{\textrm{range}}}
\newcommand\diam{{\textrm{diam}}}
\newcommand\polylog{{\textrm{polylog}}}
\newcommand\rowspan{{\textrm{rowspan}}}
\newcommand\colbasis{{\textrm{colbasis}}}
\newcommand\argmin{{\mathrm{argmin}}}
\newcommand\Err{{\mathtt{NE}}}
\newcommand\eps{{\epsilon}}
\newcommand\bfone{{\mathbf{1}}}
\newcommand\Var{{\mathbf{Var}}}
\newcommand\stdev[1]{{\mathbf{stdev(#1)}}}
\newcommand\cgf{{\mathbf{CGF}}}
\newcommand\Iden{{\mathbf{I}}}
\newcommand\Prob{{\mathbf{Pr}}}
\DeclareMathOperator\subg{\mathbf{subG}}
\newcommand\cN{{\cal N}}
\newcommand\cS{{\cal S}}
\newcommand\cD{{\cal D}}
\newcommand\cM{{\cal M}}
\newcommand\cE{{\cal E}}
\newcommand\cC{{\cal C}}
\newcommand\KJL{K_{\mathbb{JL}}}
\newcommand\tO{{\tilde O}}
\newcommand\tx{{\tilde{\mathbf{x}}}}
\newcommand\tX{{\tilde{\mathbf{X}}}}
\newcommand\tY{{\tilde{\mathbf{Y}}}}
\newcommand\tA{{\tilde{\mathbf{A}}}}
\newcommand\bA{{\mathbf{A}}}
\newcommand\ba{{\mathbf{a}}}
\newcommand\bw{{\mathbf{w}}}
\newcommand\bW{{\mathbf{W}}}
\newcommand\bR{{\mathbf{R}}}
\newcommand\bB{{\mathbf{B}}}
\newcommand\bQ{{\mathbf{Q}}}
\newcommand\bZ{{\mathbf{Z}}}
\newcommand\bz{{\mathbf{z}}}
\newcommand\bU{{\mathbf{U}}}
\newcommand\bV{{\mathbf{V}}}
\newcommand\bX{{\mathbf{X}}}
\newcommand\bS{{\mathbf{S}}}
\newcommand\hS{\hat {\mathbf{S}}}
\newcommand\bs{{\mathbf{s}}}
\newcommand\bx{{\mathbf{x}}}
\newcommand\by{{\mathbf{y}}}
\newcommand\bb{{\mathbf{b}}}
\newcommand\bT{{\mathbf{T}}}
\newcommand\bP{{\mathbf{P}}}
\newcommand\bG{{\mathbf{G}}}
\newcommand\bY{{\mathbf{Y}}}
\newcommand\hY{\hat {\mathbf{Y}}}
\newcommand\bH{{\mathbf{H}}}
\newcommand\bD{{\mathbf{D}}}
\newcommand\be{{\mathbf{e}}}
\newcommand\br{{\mathbf{r}}}
\newcommand\bSig{{\mathbf{\Sigma}}}
\newcommand\bPi{{\mathbf{\Pi}}}
\newcommand\hphi{{\hat\phi}}
\)
Sketching for Matrix Computations:
Sub-Gaussian Matrices
Ken Clarkson
IBM Research, Almaden
- These slides at
http://kenclarkson.org/G2S3_2015/
- Enable JavaScript; navigate as in PowerPoint
- Math may take some time to render properly; if it doesn't end up looking right, try reloading
The very beginning: getting the right mean
If `X` is a random variable with `\E[X] = 0`, then for `\alpha\in\R`:
- `\Var[\alpha X] = \E[(\alpha X)^2] = \alpha^2\Var[X]`
- If `Y` is random and independent of `X`, then `\Var[X+Y] = \Var[X] +\Var[Y]`
If `\bs\in\R^{1\times n}` is a random row vector with
- `\E[\bs ] = 0`, `\Var[\bs ] = (1,1,\ldots, 1) = \bfone`, and
- `\bs ` has independent coordinates
Then for `\by\in\R^n`:
`\quad\E[\bs\by ] = \E[\sum_i\bs_i\by_i] = \sum_i\E[\bs_i]\by_i = \E[\bs ]\by = 0`, and
`\quad\E[(\bs\by )^2] = \Var[\bs\by ] = \sum_i\Var[\bs_i\by_i] = \sum_i\Var[\bs_i]\by_i^2 = \norm{\by }^2`
Norm and product estimation
Suppose `\bs\in\R^{1\times n}` has
`\quad\E[\bs ] =0` (`\bs ` has centered entries),
`\quad\Var[\bs ]= \bfone_{1\times n}`
`\quad`independent entries
Then
- Norm estimation: for any fixed `\by\in\R^n`, `\E[(\bs\by )^2] = \norm{\by }^2`, and
- Product estimation:
`\E[\bs^\top\bs ] = \E\left[
\begin{matrix}
\bs_1^2 & \bs_1\bs_2 & \ldots & \bs_1\bs_n
\\ \bs_2\bs_1 & \bs_2^2 & \ldots & \bs_2\bs_n
\\ \vdots & \vdots & \ddots & \vdots
\\ \bs_n\bs_1 & \bs_n\bs_2 & \ldots & \bs_n^2
\end{matrix}\right] = \Iden`,
and so for any fixed `\bx ,\by\in\R^n`,
`\E[(\bs\bx )^\top \bs\by ] = \E[\bx^\top\bs^\top\bs\by ] = \bx^\top\E[\bs^\top\bs ]\by = \bx^\top\Iden\by = \bx^\top\by `
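A minimal numerical check of the two estimates above (not in the original slides), assuming numpy and random-sign entries, which are one valid choice of centered, unit-variance, independent entries:
```python
# Monte Carlo check: E[(s y)^2] = ||y||^2 and E[(s x)(s y)] = x^T y
# for a random-sign sketching vector s.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n)
y = rng.standard_normal(n)

trials = 100_000
s = rng.choice([-1.0, 1.0], size=(trials, n))   # each row is one draw of s
sx, sy = s @ x, s @ y

print(np.mean(sy**2), y @ y)        # E[(s y)^2] ~ ||y||^2
print(np.mean(sx * sy), x @ y)      # E[(s x)(s y)] ~ x^T y
```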
Isotropic random vectors
The conditions on `\bs` can be generalized:
A random row vector `\bs\in\R^n` is isotropic when `\E[\bs^\top \bs] = \Iden_n`.
If `\bs` has an isotropic distribution, then it yields an unbiased estimator of norms and dot products.
When `\bs` is centered, `\E[\bs^\top\bs]` is the covariance matrix of `\bs`, so
`\bs` centered and isotropic `\implies` entries of `\bs` uncorrelated.
Independence of the entries of `\bs` is sufficient, but not necessary, for them to be uncorrelated. Here we will just assume independence.
Concentration: averaging
Expectations are not enough: we want an estimate that holds with high probability.
We use an average of several sketches, stacking `m` scaled isotropic row vectors on top of each other.
Suppose `\bS\in\R^{m\times n}` has
`\quad\E[\bS ] =0` (centered entries),
`\quad\E[\bS^\top\bS ]= \sum_j\E[\bS_{j*}^\top\bS_{j*}] = \Iden_n`, each `\bS_{j*}` having `\Var[\bS_{j*}] = \bfone/m`
`\quad`independent entries
Then
- Product estimation:
`\E[(\bS\bx )^\top\bS\by ] = \E[(\bx^\top\bS^\top)\bS\by ] = \bx^\top\E[\bS^\top\bS ]\by = \bx^\top\Iden\by = \bx^\top\by `
- Norm estimation: letting `\bx = \by `, `\E[\norm{\bS\by }^2] = \E[(\bS\by )^\top\bS\by ] = \by^\top\by = \norm{\by }^2`
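A quick illustration of the averaging effect (not in the original slides), assuming numpy and Gaussian entries `N(0,1/m)`; any centered, variance-`1/m`, independent entries would do:
```python
# ||S y||^2 is an unbiased estimate of ||y||^2, and its spread shrinks as m grows.
import numpy as np

rng = np.random.default_rng(1)
n = 100
y = rng.standard_normal(n)
true = y @ y

for m in (1, 10, 100):
    ests = np.array([np.linalg.norm((rng.standard_normal((m, n)) / np.sqrt(m)) @ y) ** 2
                     for _ in range(2000)])
    print(m, ests.mean() / true, ests.std() / true)   # mean ratio ~ 1, spread ~ sqrt(2/m)
```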
Concentration
We want
\begin{equation}
\Prob\{\norm{\bS\by }^2 \approx_\eps \norm{\by }^2\}
\label{eq:foo}
\end{equation}
close to 1.
We will consider sketching vectors and matrices whose entries satisfy the
sub-Gaussian condition (defined shortly).
A sub-Gaussian sketching vector or matrix has independent sub-Gaussian entries.
As discussed next:
- `\bs\by `, a sum of sub-Gaussians, is also sub-Gaussian
- `(\bs\by )^2` is a sub-exponential random variable
- `\norm{\bS\by }^2 = \sum_{j\in [m]} (\bS_{j*}\by )^2` is:
- Also sub-exponential
- Concentrated (as in \eqref{eq:foo})
Cumulant generating functions
cumulant generating functions `\cgf_X(t)`
The cumulant generating function of a real-valued random variable `X` is
`\cgf_X (t) = \log\E[\exp(tX)]`, for `t\in\R`
It holds that
`\cgf_X(t) = \E[X]t +\frac12\Var[X]t^2 +\sum_{i>2}\frac1{i!}\kappa_i t^i`
where for `i\ge 1`, `\kappa_i` is the `i`th cumulant
Examples:
- (Gaussian) If `X\sim N(0,1)`, then `\cgf_X(t) = \frac12t^2`
- (Sign) If `X= \pm 1` with equal probability, `\cgf_X(t) = \log((\exp(t) +\exp(-t))/2) = \log\cosh(t)`
additivity of `\cgf`
If `X` and `Y` are independent, then `\cgf_{X+Y}(t) = \cgf_X(t) +\cgf_Y(t)`
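A small numeric illustration of the sign-variable CGF and of additivity (not in the original slides), assuming numpy and empirical averages as a stand-in for the expectation:
```python
# Check log cosh(t) against t^2/2, and additivity of the CGF for two independent signs.
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(-2, 2, 9)

# Sign variable: CGF is log cosh(t), bounded by t^2/2 (used on the next slide).
print(np.all(np.log(np.cosh(t)) <= t**2 / 2 + 1e-12))

# Additivity: for independent signs X, Y, CGF_{X+Y}(t) = 2 log cosh(t).
samples = rng.choice([-1, 1], size=(200_000, 2)).sum(axis=1)
emp_cgf = np.log(np.exp(np.outer(t, samples)).mean(axis=1))
print(np.max(np.abs(emp_cgf - 2 * np.log(np.cosh(t)))))   # small Monte Carlo error
```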
Sub-Gaussians
`X` is sub-Gaussian `\iff` there is a value `\subg[X]` with `\cgf_{X-\E[X]}(t)\le\subg[X]^2 t^2/2` for all `t`
Examples:
- (Gaussian) If `X\sim N(0,\sigma^2)`, `\subg[X]= \sigma`
- (Sign) If `X= \pm 1` with equal probability, `\frac{e^t + e^{-t}}{2}\le e^{t^2/2}\implies \cgf_X(t) = \log\cosh(t)\le t^2/2 \implies \subg[X]\le 1`
- (Bounded) If `|X|\le M` for some `M`, then `\cgf_{X-\E[X]} (t)\le M^2t^2/2 \implies\subg[X]\le M`
If `X` is centered sub-Gaussian, `\cgf_{\alpha X}(t) = \cgf_X (\alpha t)\le\subg[X]^2\alpha^2 t^2/2`
If `\bs` is a sub-Gaussian sketching vector with `\gamma\equiv \max_{i\in [n]} \subg[\bs_i]`,
then `\cgf_{\bs\by }(t) = \sum_{i\in [n]}\cgf_{\bs_i\by_i}(t)\le\sum_{i\in [n]}\by_i^2\gamma^2 t^2/2\le\gamma^2\norm{\by }^2t^2/2`
That is, `\bs\by ` is sub-Gaussian with `\subg[\bs\by] \le \gamma \norm{\by }`
(Recall that `\Var(\bs\by) = \sigma^2\norm{\by}^2` under analogous conditions.)
Concentration for sketching vectors
If `Y\ge 0` then for `C\gt 0`, `\Prob\{Y\ge C\}\le\E[Y]/C`.
Letting `Y= \exp(tX)` and `C= \exp(t\alpha)` for `\alpha,t\ge 0`, with `X` centered,
`\Prob\{X\ge\alpha\} = \Prob\{\exp(tX)\ge\exp(t\alpha)\}\le\E[\exp(tX)]\exp(-t\alpha)`
If `X` is sub-Gaussian then
`\begin{align*}
\log\Prob\{X\ge\alpha\}
& \le\inf_{t\ge 0} -t\alpha +\cgf_X(t)
\\ & = -\sup_{t\ge 0} t\alpha -\cgf_X(t)
\le -\alpha^2/(2\subg[X]^2)
\end{align*}`
(Irrelevant fun fact: that `\sup` is the convex conjugate (a.k.a. Legendre-Fenchel transform) of `\cgf_X(t)`)
Putting this together with the CGF bound for `\bs\by `, we get a Hoeffding-like tail bound:
If `\bs` is a sub-Gaussian sketching vector with `\subg[\bs_i]\le\sigma` for `i\in [n]`,
then
`\Prob\{\bs\by\ge\beta\} \le \exp(-\beta^2/(2\sigma^2\norm{\by}^2))`
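A Monte Carlo check of this tail bound (not in the original slides), assuming numpy and random-sign entries, so `\subg[\bs_i]\le 1`, i.e. `\sigma = 1`:
```python
# Empirical tail of s y vs. the Hoeffding-like bound exp(-beta^2 / 2) for unit y.
import numpy as np

rng = np.random.default_rng(3)
n, trials = 50, 100_000
y = rng.standard_normal(n)
y /= np.linalg.norm(y)                                # unit norm, so the bound is exp(-beta^2/2)

sy = rng.choice([-1.0, 1.0], size=(trials, n)) @ y    # sign entries: subG(s_i) <= 1
for beta in (1.0, 2.0, 3.0):
    print(beta, np.mean(sy >= beta), np.exp(-beta**2 / 2))   # empirical tail <= bound
```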
Concentration for sketching matrices, 1/4
But:
- We want to have an estimate closer to `\norm{\by }`
- We need yet more concentration
We turn back to `\bS\in\R^{m\times n}`, a stack of sketching vectors
The summands in `\norm{\bS\by }^2 = \sum_{j\in[m]} (\bS_{j*}\by )^2` are sub-exponential:
`X^2` is sub-exponential `\iff` `X` is sub-Gaussian
So, the tail of `Y=X^2` decays "only" exponentially:
`\log\Prob\{Y\ge\beta\}\le -\beta/(2\subg[X]^2)`
Concentration for sketching matrices, 2/4
For small `t`, the `\cgf` of a centered sub-exponential looks like that of a centered sub-Gaussian:
sub-exponential CGF bound (NPH)
if `Y=X^2-\E[X^2]`
for sub-Gaussian `X`,
then $$\cgf_Y(t)\le K_1 t^2\subg[X]^4$$
when `|t|\le K_2 /\subg[X]^2`, for constants `K_1` and `K_2`
Concentration for sketching matrices, 3/4
A "Bernstein-like" tail estimate for sums of sub-exponentials:
If `X_i`, for `i\in [m]`, are independent, centered sub-Gaussian with `\subg[X_i]\le\sigma/\sqrt{m}` and `\E[X_i^2] = 1/m`,
then `Z= \sum_{i\in [m]} X_i^2` has $$Z\approx_\eps\E[Z]=1$$ with failure probability
at most $$2\exp(-K_3 m\min\{\eps^2/\sigma^4,\eps/\sigma^2\}),$$ for a constant `K_3`.
Proof: When `|t|\le K_2 m/\sigma^2`,
$$\cgf_{Z-\E [Z]}(t) = \sum_{i\in[m]}\cgf_{X_i^2-\E [X_i^2]}(t)\le K_1 t^2\sigma^4/m,$$
So
`\begin{align*}
\log\Prob\{Z-\E[Z]\ge\alpha\}
& \le\inf_{t\in [0,K_2m/\sigma^2] } -t\alpha + K_1 t^2\sigma^4/m,
\\ & \le -\min\{m\alpha^2/(4K_1\sigma^4),\ m\alpha K_2/(2\sigma^2)\}
\end{align*}`
Now let `\alpha= \eps\E[Z] = \eps`; apply the same bound to `-Z`, and take a union bound.
Concentration for sketching matrices, 4/4
Translated to sketching matrices,
Fix unit `\by\in\R^n`. Let `\bS\in\R^{m\times n}` with i.i.d. sub-Gaussian entries,
and with `\E[\norm{\bS\by }^2] = 1` and
`\subg[\bS_{ij}]\le 1/\sqrt{m}`. Then
$$\norm{\bS\by }^2\approx_\eps 1$$
with failure probability at most
$$2\exp(-\KJL m\eps^2 ),$$ for a constant `\KJL`.
That is, independent sub-Gaussian entries yield sketching matrices so
that with very high probability, the norm of a single vector
is preserved.
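An empirical look at the failure probability (not in the original slides), assuming numpy and Gaussian entries `N(0,1/m)`; the decay should behave like `\exp(-\mathrm{const}\cdot m\eps^2)`:
```python
# Failure rate of ||S y||^2 approx_eps 1 as m grows, for Gaussian sketches.
import numpy as np

rng = np.random.default_rng(4)
n, eps, trials = 50, 0.5, 5000
y = rng.standard_normal(n)
y /= np.linalg.norm(y)

for m in (5, 10, 20, 40):
    S = rng.standard_normal((trials, m, n)) / np.sqrt(m)   # one sketch per trial
    est = np.sum((S @ y) ** 2, axis=1)                     # ||S y||^2, one value per trial
    print(m, np.mean(np.abs(est - 1.0) > eps))             # drops roughly like exp(-c m eps^2)
```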
More than one vector: finite sets
We want sketching matrices that preserve the norms
of large sets of vectors.
For a set of `k` vectors, $$m\ge\log(4k)/(\KJL\eps^2)$$ and a union bound
gives a total failure probability
$$\begin{align*}
k\times & 2\exp(-\KJL m\eps^2 )
\\ & \le k\times 2\exp(-\KJL\frac{\log(4k)}{\KJL\eps^2}\eps^2 )
\\ & = 1/2
\end{align*}$$
More than one vector: distances
For the "classic" JL result, preserving distances between rows of `\bA`, apply
norm-approximation to `\lt n^2` vectors of the form `\bA_{i,*} -\bA_{j,*}, i,j\in [n]`.
We have:
Let `\bS\in\R^{m\times d}` with i.i.d. sub-Gaussian entries,
and with `\E[\norm{\bS\by }^2] = 1` for unit `\by` and
`\subg[\bS_{ij}]\le 1/\sqrt{m}`.
For given `\bA\in\R^{n\times d}`, there is `m\le 2\log(2n)/(\KJL\eps^{2})`
such that the set of `n` vectors
`\bA_{i,*}, i\in [n]` has
$$\norm{\bS(\bA_{i,*} - \bA_{j,*})}^2\approx_\eps \norm{\bA_{i,*} - \bA_{j,*}}^2$$
for all `i,j\in [n]`,
with failure probability at most $1/2$.
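A JL-style experiment (not in the original slides), assuming numpy, a Gaussian sketch, and an illustrative constant in place of `\KJL`:
```python
# Sketch the rows of A (points in R^d) and measure the worst relative error
# over pairwise squared distances.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n, d, eps = 100, 500, 0.25
A = rng.standard_normal((n, d))                      # n points: the rows of A

m = int(np.ceil(8 * np.log(2 * n) / eps**2))         # O(log(n)/eps^2); the 8 is illustrative
S = rng.standard_normal((m, d)) / np.sqrt(m)
SA = A @ S.T                                         # sketched points, rows in R^m

worst = 0.0
for i, j in combinations(range(n), 2):
    true = np.sum((A[i] - A[j]) ** 2)
    est = np.sum((SA[i] - SA[j]) ** 2)
    worst = max(worst, abs(est - true) / true)
print(m, worst)                                      # worst relative error, typically < eps
```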
More than one vector: dot products
For given `\bS `, let
$$\Err_\bS(\ba,\bb)\equiv\frac{ |\ba^\top\bS^\top\bS\bb -\ba^\top\bb |}{\norm{\ba}\,\norm{\bb}}\text{ and }\Err_\bS(\ba)\equiv\Err_\bS(\ba,\ba).$$
So
$$\norm{\bS\by }^2\approx_\eps\norm{\by }^2\iff\Err(\by )\le\eps.$$
From norm err to dot err
$$\Err(\ba ,\bb )
= \Err(\frac{\ba }{\norm{\ba }},\frac{\bb }{\norm{\bb }})
\le\Err(\frac{\ba }{\norm{\ba }}+\frac{\bb }{\norm{\bb }}) +\Err(\frac{\ba }{\norm{\ba }}-\frac{\bb }{\norm{\bb }})
$$
More generally,
$$\max_{i,j\in[k]}\Err(\mathbf{v}_i,\mathbf{v}_j)\le 2\max_{i,j\in[k]}\Err(\frac{\mathbf{v}_i}{\norm{\mathbf{v}_i}}\pm\frac{\mathbf{v}_j}{\norm{\mathbf{v}_j}})$$
So an embedding with error `\eps` for `\lt 2k^2` vectors preserves the pairwise dot products of `k` vectors to within `2\eps` (relative to the norms)
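A direct check of the polarization bound above (not in the original slides), assuming numpy and a Gaussian sketch; the inequality holds deterministically for any `\bS `:
```python
# Err(a, b) <= Err(u + v) + Err(u - v) for the normalized u = a/||a||, v = b/||b||.
import numpy as np

rng = np.random.default_rng(6)
n, m = 300, 50
S = rng.standard_normal((m, n)) / np.sqrt(m)

def err(a, b):
    return abs(a @ S.T @ S @ b - a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = rng.standard_normal(n), rng.standard_normal(n)
u, v = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(err(a, b), err(u + v, u + v) + err(u - v, u - v))   # left side <= right side
```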
More than one vector: `\eps`-nets
We want to show sketching is good for infinite sets of
vectors: linear spaces.
To do this, we find a finite set of vectors `\cN` so that:
`\bS ` good on `\cN` `\implies` `\bS ` good for all.
Our set `\cN` is an `\eps`-net of the unit sphere `\cS`.
From `\eps`-nets to `\R^d`: dilation
For `\cN` an `\eps`-net of the unit sphere `\cS` in `\R^d`
and `\eps\lt 1`,
if matrix `\bW ` is symmetric, then
$$\norm{\bW }_2\le\norm{\bW }_\cN/(1-2\eps),$$
where $$\norm{\bW }_\cN\equiv\sup_{\bx\in\cN} |\bx^\top\bW\bx |,$$
and recalling $$\norm{\bW }_2 = \sup_{\bx\in\cS} |\bx^\top\bW\bx |.$$
Proof:
Let unit `\bx ` have `| \bx^\top\bW\bx | = \norm{\bW }_2`. (There is such an `\bx`.)
Let `\mathbf{z}` have `\norm{\mathbf{z}}\le\eps` and `\bx -\mathbf{z}\in\cN`.
We have
`\begin{align*}
\norm{\bW }_2 & = | \bx^\top\bW\bx | = | (\bx -\mathbf{z})^\top\bW (\bx -\mathbf{z}) +\mathbf{z}^\top\bW\bx +\mathbf{z}^\top\bW (\bx -\mathbf{z}) |
\\ & \le\norm{\bW }_\cN + 2\eps\norm{\bW }_2.
\end{align*}`
So `\norm{\bW }_2 \le\norm{\bW }_\cN/(1-2\eps)`, as claimed.
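A small sanity check of the dilation bound (not in the original slides), assuming numpy and working in `\R^2`, where an `\eps`-net of the unit circle can be written down explicitly:
```python
# In R^2, equally spaced angles give an eps-net of the circle; check
# ||W||_2 <= ||W||_N / (1 - 2*eps) for a symmetric test matrix W.
import numpy as np

rng = np.random.default_rng(7)
eps = 0.25
delta = 4 * np.arcsin(eps / 2)                 # angular spacing: chord gaps at most eps
angles = np.arange(0.0, 2 * np.pi, delta)
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)

G = rng.standard_normal((2, 2))
W = (G + G.T) / 2                              # a symmetric test matrix
norm_2 = np.max(np.abs(np.linalg.eigvalsh(W)))                # sup over the sphere
norm_net = np.max(np.abs(np.einsum('ij,jk,ik->i', net, W, net)))  # sup over the net
print(norm_2, norm_net / (1 - 2 * eps))        # norm_2 <= norm_net / (1 - 2*eps)
```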
Norms and embeddings
$$\norm{\bB^\top\bB -\Iden}_2 = \sup_{\bx\in\cS} |\,\norm{\bB\bx }^2 - 1| $$
Proof:
`\norm{\bB^\top\bB -\Iden}_2 = \sup_{\bx\in\cS} |\, \bx^\top(\bB^\top\bB -\Iden) \bx \,|`, as noted.
and:
`\bx^\top(\bB^\top\bB -\Iden) \bx = \bx^\top\bB^\top \bB \bx - \bx^\top \bx = \norm{\bB\bx}^2 - \norm{\bx}^2`
Sketching matrix embedding `\R^n`
Let `\bS\in\R^{m\times n}` with i.i.d. sub-Gaussian entries,
and with `\E[\norm{\bS\by}^2] = 1` for unit `\by`, and
`\subg[\bS_{ij}]\le 1/\sqrt{m}`.
There is `K>0` so that with failure probability at most
`\exp(-n)`,
`\bS ` is a `2\sqrt{K n/m}`-embedding for `\R^n`, for small enough `n/m`.
Pick a `(1/4)`-net `\cN` of `\cS` in `\R^n`;
the covering number bound
says `|\cN|\le (1+2/(1/4))^n = 9^n`.
Let `\bW\equiv \bS^\top \bS - \Iden`.
From sketching matrix concentration,
and a union bound,
$$\norm{\bW}_\cN = \sup_{\bx\in\cN} |\,\bx^\top \bW \bx\,| = \sup_{\bx\in\cN} |\,\norm{\bS\bx }^2 - 1 |\le \eps,$$
with failure probability at most `9^n\cdot 2\exp(-\KJL m\eps^2)`.
Assuming success, the dilation bound implies
$$\sup_{\bx\in\cS} |\,\norm{\bS\bx }^2 - 1 \,|
\le \norm{\bW}_2
\le \norm{\bW}_\cN/(1-2(1/4))
\le 2\eps,$$
as claimed.
Sketching matrix embedding `\R^n`:
failure probability
For `\eps= \sqrt{Kn/m}`,
`\begin{align*}
9^n\cdot 2\exp(-\KJL m\eps^2)
& = 2\exp(n\log(9) -\KJL m (\sqrt{Kn/m})^2 )
\\ & = 2\exp(-n(\KJL K -\log(9)))
\le \exp(-n)
\end{align*}`
for large enough `K`.
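An empirical check of the `\R^n` embedding guarantee (not in the original slides), assuming numpy and a Gaussian sketch; the constants are illustrative:
```python
# ||S^T S - I||_2 should scale like sqrt(n/m), matching the 2*sqrt(K n/m) guarantee.
import numpy as np

rng = np.random.default_rng(8)
n = 50
for m in (200, 800, 3200):
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    gap = np.linalg.norm(S.T @ S - np.eye(n), 2)      # spectral norm
    print(m, gap, 2 * np.sqrt(n / m))                 # observed gap tracks ~2*sqrt(n/m)
```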
Subspace Embedding
- Consider `\bS ` for `\colspan(\bA )\equiv\{\bA\bx\mid\bx\in\R^d\}`
- `\colspan(\bA )`:
- Is isomorphic to `\R^{\rank(\bA)}`
- Has `\eps_0`-nets of size `(1+2/\eps_0)^{\rank(\bA)} \le (1+2/\eps_0)^d`
- Apply dilation bound to bound `\norm{\bW }_2` for `\bW = \bA^\top\bS^\top\bS\bA - \bA^\top\bA = \bA^\top\bS^\top\bS\bA - \Iden`
(assuming WLOG that `\bA` has orthonormal columns)
Let `\bS\in\R^{m\times n}` with
- i.i.d. sub-Gaussian entries
- `\E[\norm{\bS\by }^2] = \norm{\by}^2`
- `\subg[\bS_{ij}]\le C/\sqrt{m}` for a constant `C`.
For `\bA\in\R^{n\times d}`,
there is `K>0` so that with failure probability at most
`\exp(-d)`,
`\bS ` is a `2\sqrt{K d/m}`-embedding for `\colspan(\bA )`.
That is, there is `m=O(d/\eps^2)` so that `\bS ` is a subspace `\eps`-embedding for `\bA`.
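A subspace-embedding experiment (not in the original slides), assuming numpy, a Gaussian sketch, and an illustrative constant in `m=O(d/\eps^2)`:
```python
# With m = O(d/eps^2) rows, the singular values of S U (U an orthonormal basis
# for colspan(A)) should all lie close to 1, so ||S A x|| ~ ||A x|| for every x.
import numpy as np

rng = np.random.default_rng(9)
n, d, eps = 2000, 20, 0.25
A = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))   # a generic rank-d matrix
U, _ = np.linalg.qr(A)                                          # orthonormal basis for colspan(A)

m = int(np.ceil(8 * d / eps**2))                                # the 8 is illustrative
S = rng.standard_normal((m, n)) / np.sqrt(m)
sv = np.linalg.svd(S @ U, compute_uv=False)
print(m, sv.min(), sv.max())     # all singular values of S U within roughly [1 - eps, 1 + eps]
```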