\( \newcommand\norm[1]{|| #1 ||} \newcommand{\myvertiii}[1]{{\vert\kern-0.2ex\vert\kern-0.2ex\vert #1 \vert\kern-0.2ex\vert\kern-0.2ex\vert}} \newcommand{\norme}[1]{{\myvertiii{#1}}} \newcommand\nnz[1]{{\textrm{nnz}(#1)}} \newcommand\E{\mathbf{E}} \newcommand\Dist{\mathbf{Dist}} \newcommand\reals{{\mathrm{I\!R}}} \newcommand\R{{\mathrm{I\!R}}} \newcommand\poly{{\textrm{poly}}} \newcommand\rank{{\textrm{rank}}} \newcommand\colspan{{\textrm{range}}} \newcommand\diam{{\textrm{diam}}} \newcommand\polylog{{\textrm{polylog}}} \newcommand\rowspan{{\textrm{rowspan}}} \newcommand\colbasis{{\textrm{colbasis}}} \newcommand\argmin{{\mathrm{argmin}}} \newcommand\Err{{\mathtt{NE}}} \newcommand\eps{{\epsilon}} \newcommand\bfone{{\mathbf{1}}} \newcommand\Var{{\mathbf{Var}}} \newcommand\stdev[1]{{\mathbf{stdev(#1)}}} \newcommand\cgf{{\mathbf{CGF}}} \newcommand\bfone{{\mathbb{1}}} \newcommand\Iden{{\mathbf{I}}} \newcommand\Prob{{\mathbf{Pr}}} \DeclareMathOperator\subg{\mathbf{subG}} \newcommand\cN{{\cal N}} \newcommand\cS{{\cal S}} \newcommand\cD{{\cal D}} \newcommand\cM{{\cal M}} \newcommand\cE{{\cal E}} \newcommand\cC{{\cal C}} \newcommand\KJL{K_{\mathbb{JL}}} \newcommand\tO{{\tilde O}} \newcommand\tx{{\tilde{\mathbf{x}}}} \newcommand\tX{{\tilde{\mathbf{X}}}} \newcommand\tY{{\tilde{\mathbf{Y}}}} \newcommand\tA{{\tilde{\mathbf{A}}}} \newcommand\bA{{\mathbf{A}}} \newcommand\ba{{\mathbf{a}}} \newcommand\bw{{\mathbf{w}}} \newcommand\bW{{\mathbf{W}}} \newcommand\bR{{\mathbf{R}}} \newcommand\bB{{\mathbf{B}}} \newcommand\bQ{{\mathbf{Q}}} \newcommand\bZ{{\mathbf{Z}}} \newcommand\bz{{\mathbf{z}}} \newcommand\bU{{\mathbf{U}}} \newcommand\bV{{\mathbf{V}}} \newcommand\bX{{\mathbf{X}}} \newcommand\bS{{\mathbf{S}}} \newcommand\hS{\hat {\mathbf{S}}} \newcommand\bs{{\mathbf{s}}} \newcommand\bx{{\mathbf{x}}} \newcommand\by{{\mathbf{y}}} \newcommand\bb{{\mathbf{b}}} \newcommand\bT{{\mathbf{T}}} \newcommand\bP{{\mathbf{P}}} \newcommand\bG{{\mathbf{G}}} \newcommand\bY{{\mathbf{Y}}} \newcommand\hY{\hat {\mathbf{Y}}} \newcommand\bH{{\mathbf{H}}} \newcommand\bD{{\mathbf{D}}} \newcommand\be{{\mathbf{e}}} \newcommand\br{{\mathbf{r}}} \newcommand\bSig{{\mathbf{\Sigma}}} \newcommand\bPi{{\mathbf{\Pi}}} \newcommand\hphi{{\hat\phi}} \)

Sketching for Matrix Computations:
Multiplication and Embedding

Ken Clarkson
IBM Research, Almaden



  • These slides at
    http://kenclarkson.org/G2S3_2015/
  • Enable javascript, navigate as in powerpoint
  • Math may take some time to render properly; if it doesn't end up looking right, try reloading

From matrix multiplication
to subspace embedding

Recall that an advantage of oblivious sketching is that $$ \norm{\bS\bA\bx }\approx_\eps\norm{\bA\bx }\text{ for all }\bx\in\R^d\\ \iff\\ \norm{\bS\bU\bx }\approx_\eps\norm{\bx }\text{ for all }\bx\in\R^d, $$ where `\bU ` is an orthonormal basis for `\colspan(\bA )`
(here assuming `\dim\colspan(\bA )=d`).

Recall that the embedding condition is equivalent to $$\norm{\bU^\top\bS^\top\bS\bU -\Iden}_2\le\eps.$$

Since `\bU^\top\bU = \Iden`:

A spectral error bound for $$\bU^\top\bS^\top\bS\bU $$ as an estimate of $$\bU^\top\bU $$ implies a subspace embedding bound.
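As a concrete illustration (a minimal numerical sketch, not from the slides: the Gaussian sketch, `numpy`, and the sizes are all assumptions), one can form an orthonormal basis `\bU ` for `\colspan(\bA )` and measure `\norm{\bU^\top\bS^\top\bS\bU -\Iden}_2` directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10_000, 20, 400            # A is n x d; the sketch S is m x n

A = rng.standard_normal((n, d))      # any full-column-rank A
U, _ = np.linalg.qr(A)               # orthonormal basis for colspan(A)

# Gaussian sketch with entries N(0, 1/m), so E[S^T S] = I_n
S = rng.standard_normal((m, n)) / np.sqrt(m)

SU = S @ U                           # m x d
M = SU.T @ SU - np.eye(d)            # U^T S^T S U - I
eps = np.linalg.norm(M, 2)           # spectral norm
print(f"embedding parameter eps ~= {eps:.3f}")

# equivalently, ||S A x|| is within 1 +/- eps of ||A x|| for any x
x = rng.standard_normal(d)
print(np.linalg.norm(S @ A @ x) / np.linalg.norm(A @ x))
```

With these (assumed) sizes, `eps` typically lands around 0.2; increasing `m` shrinks it.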

From subspace embedding
to matrix multiplication

There is also a kind of converse:
if the sketching error of `\bS ` is well-behaved, then
`\bS ` is good for matrix multiplication.
Suppose for
  • Sketching matrix `\bS\in\R^{m\times n}`
  • Any fixed unit `\by\in\R^n`
  • Fixed `p\ge 2`
  • `X\equiv |\,\norm{\bS\by }^2-1\,| = \Err_\bS(\by)`

that $$\E[X^p]^{1/p}\le\frac12\eps\delta^{1/p}.$$ (In particular, sub-Gaussian i.d. `\bS ` satisfy this for `p=2`.)
Then for `\bA\in\R^{n\times d}`, `\bB\in\R^{n\times d'}`, $$\norm{\bA^\top\bS^\top\bS\bB -\bA^\top\bB }_F\le\eps\norm{\bA }_F\norm{\bB }_F$$ with failure probability `\delta`.

Proof: for `p=2`, about six slides from now.
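A quick numerical sanity check of the claim (illustrative only; the Gaussian sketch and the dimensions are assumptions, not part of the statement):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, dp, m = 20_000, 30, 25, 500     # A is n x d, B is n x d'; S is m x n

A = rng.standard_normal((n, d))
B = rng.standard_normal((n, dp))
S = rng.standard_normal((m, n)) / np.sqrt(m)   # E[S^T S] = I_n

exact    = A.T @ B
sketched = (S @ A).T @ (S @ B)        # A^T S^T S B, without ever forming S^T S

err   = np.linalg.norm(sketched - exact, 'fro')
scale = np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')
print(f"||A^T S^T S B - A^T B||_F / (||A||_F ||B||_F) ~= {err / scale:.4f}")
```

For this kind of sketch the reported ratio is typically on the order of `1/\sqrt{m}`.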

From embedding to matrix mult, and back

Applying this to `\bU `, an orthonormal basis for `\colspan(\bA )`, we have:
Suppose for
  • Sketching matrix `\bS\in\R^{m\times n}`
  • Fixed unit `\by\in\R^n`
  • Fixed `p\ge 2`
  • `X\equiv |\,\norm{\bS\by }^2-1\,|`

that $$\E[X^p]^{1/p}\le\beta\delta^{1/p}.$$ (In particular, sub-Gaussian i.d. `\bS ` satisfy this for `p=2`.)
Then for `\bA\in\R^{n\times d}` of rank `r`, and `\bU ` an orthonormal basis for `\colspan(\bA )`, $$\norm{\bU^\top\bS^\top\bS\bU -\Iden}_2 \le \norm{\bU^\top\bS^\top\bS\bU -\Iden}_F \le 2\beta\norm{\bU }_F^2 = 2\beta r,$$ so that `\bS ` is a `2\beta r`-embedding, with failure probability `\delta`.
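A hypothetical Monte Carlo check of the `\norm{\bU }_F^2 = r` scaling (assuming a Gaussian sketch; `gamma_hat` below estimates `\E[X^2]^{1/2}` for a single unit vector, and the `2\gamma r` comparison uses the moment bound proved over the next few slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, m, trials = 2_000, 10, 200, 300

A = rng.standard_normal((n, r))
U, _ = np.linalg.qr(A)                   # orthonormal basis, ||U||_F^2 = r
y = rng.standard_normal(n)
y /= np.linalg.norm(y)                   # fixed unit vector

sq_err_y, sq_err_U = [], []
for _ in range(trials):
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    sq_err_y.append((np.linalg.norm(S @ y) ** 2 - 1) ** 2)   # X^2 for the unit vector
    SU = S @ U
    M = SU.T @ SU - np.eye(r)
    sq_err_U.append(np.linalg.norm(M, 'fro') ** 2)

gamma_hat = np.sqrt(np.mean(sq_err_y))   # ~ E[X^2]^{1/2}
lhs       = np.sqrt(np.mean(sq_err_U))   # ~ E[ ||U^T S^T S U - I||_F^2 ]^{1/2}
print(f"gamma ~= {gamma_hat:.3f}")
print(f"moment of ||.||_F ~= {lhs:.3f}   vs   2*gamma*r = {2 * gamma_hat * r:.3f}")
```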

Moments of sub-Gaussians
and sub-exponentials

The quantity `\E[|X|^p]^{1/p}` grows slowly with `p` for sub-Gaussians.
For random variables `X,Y`, let `\norme{X}_{p,Y}\equiv\E_Y[|X|^p]^{1/p}`, or just `\norme{X}_p` when `Y` is understood.
Moments of sub-G and sub-exp (NPH)
`X` is sub-Gaussian `\iff\norme{X}_{p}\lt K_X\sqrt{p}` for all `p\ge 1`
`X` is sub-exponential `\iff\norme{X}_{p}\lt K_X p` for all `p\ge 1`
(Here `K_X` depends on `X`, independent of `p`.)
  • (Gaussian) If `X\sim N(0,1)`, then `\norme{X}_p = K\Gamma(\frac{p+1}2)^{1/p}= \Theta(\sqrt{p})`
  • (Sign) If `\mathbf{X}\in\R^n` has `\mathbf{X}_i= \pm 1` uniformly i.i.d.,
    then for unit `\by\in\R^n`,
    `\norme{\mathbf{X}^\top\by }_p = \Theta(\sqrt{p})`
    • This is an expression of Khintchine's inequality
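The `\Theta(\sqrt{p})` growth is easy to see by simulation (a rough Monte Carlo estimate; the vector length and sample count are arbitrary choices):

```python
import numpy as np
from math import sqrt

rng = np.random.default_rng(3)
n, trials = 200, 50_000

y = rng.standard_normal(n)
y /= np.linalg.norm(y)                           # fixed unit vector
X = rng.choice([-1.0, 1.0], size=(trials, n))    # rows: i.i.d. +/-1 sign vectors
dots = X @ y                                     # samples of X^T y

for p in (1, 2, 4, 8):
    moment = np.mean(np.abs(dots) ** p) ** (1.0 / p)   # estimate of E[|X^T y|^p]^{1/p}
    print(f"p = {p}:  moment norm ~= {moment:4.2f},  sqrt(p) = {sqrt(p):4.2f}")
```

The estimated moments track `\sqrt{p}` up to a constant factor.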

The norm `\norme{\cdot}_{p, X}`

The notation is not a coincidence: random variables defined on a probability space with bounded `p`-th moments form a normed linear space, a subspace of `L_p`.


$$\norme{X+Y}_{p}\le\norme{X}_{p} +\norme{Y}_{p}$$

For `\alpha\ge 0`, `\norme{\alpha X}_{p} = \alpha\norme{X}_{p}`
`\Prob\{|X|\ge t\}\le\norme{X}_p^p/t^p`.

cf. the concentration of sub-Gaussians;
for `p=1`, this is Markov's inequality, and for `p=2`, Chebyshev's.
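A small illustration of the moment form of Markov's inequality (using a standard Gaussian as a stand-in for `X`; purely an assumed example): higher moments give sharper tail bounds, in the spirit of sub-Gaussian concentration.

```python
import numpy as np

rng = np.random.default_rng(4)
samples = np.abs(rng.standard_normal(1_000_000))   # |X| for X ~ N(0,1)

t = 3.0
tail = np.mean(samples >= t)                       # empirical Pr{ |X| >= t }
print(f"empirical Pr{{|X| >= {t}}} ~= {tail:.4f}")
for p in (1, 2, 4, 8):
    bound = np.mean(samples ** p) / t ** p         # |||X|||_p^p / t^p
    print(f"p = {p}:  moment bound {bound:.4f}")
```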

From embedding to matrix multiplication:
moment norm of error

Recall that for vectors `\ba ,\bb `,
`\Err_\bS(\ba,\bb)\equiv\frac{ |\ba^\top\bS^\top\bS\bb -\ba^\top\bb | }{\norm{\ba}\,\norm{\bb}}` and `\Err_\bS(\ba)\equiv\Err_\bS(\ba,\ba)`.

Now generalize this to matrices:
`\Err_\bS(\bA ,\bB )\equiv\frac{\norm{\bA^\top\bS^\top\bS\bB -\bA^\top\bB }_F }{\norm{\bA }_F\norm{\bB }_F}` and `\Err_\bS(\bA )\equiv\Err_\bS (\bA ,\bA )`.


For fixed `\bA `, `\bB `, `p\ge 2` and `\gamma\equiv \sup_\by\norme{\Err(\by )}_{p,\bS }`, $$\norme{\Err(\bA ,\bB )}_{p,\bS }\le 2\gamma.$$

Proof (for `p=2` only):
Recall that `\Err(\ba,\bb)\le\Err(\frac{\ba}{\norm{\ba}}+\frac{\bb}{\norm{\bb}}) +\Err(\frac{\ba}{\norm{\ba}}-\frac{\bb}{\norm{\bb}})`;
for given `\ba,\bb`, the `\norme{\cdot}_2` triangle inequality gives $$\E[\Err(\ba,\bb)^2]^{1/2} = \norme{\Err(\ba,\bb)}_2\le 2\gamma,$$ since each of the two terms on the right-hand side above has `\norme{\cdot}_2` at most `\gamma`.
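The reduction to normalized vectors can be spot-checked numerically (an illustrative example with an assumed Gaussian sketch; `err` below implements `\Err_\bS `):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 1_000, 40
S = rng.standard_normal((m, n)) / np.sqrt(m)
a, b = rng.standard_normal(n), rng.standard_normal(n)

def err(u, v=None):
    """Err_S(u, v); Err_S(u) when v is omitted."""
    v = u if v is None else v
    return abs((S @ u) @ (S @ v) - u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u, v = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(err(a, b), "<=", err(u + v) + err(u - v))
```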

From embedding to matrix multiplication:
dot product to `F`-norm

Letting `\Err_{ij}\equiv\Err(\bA_{*i},\bB_{*j})` and `\mathbf{H}\equiv\bA^\top\bS^\top\bS\bB -\bA^\top\bB `, $$\norm{\mathbf{H}}_F^2 = \sum_{i\in[d], j\in[d']}\mathbf{H}_{ij}^2 = \sum_{i\in[d],j\in[d']}\norm{\bA_{*i}}^2\norm{\bB_{*j}}^2\Err_{ij}^2.$$

We have $$ \begin{align*} \E[\norm{\mathbf{H}}_F^2] & = \sum_{i\in[d],j\in[d']}\norm{\bA_{*i}}^2\norm{\bB_{*j}}^2\E[\Err_{ij}^2] \\ & \le\sum_{i\in[d],j\in[d']}\norm{\bA_{*i}}^2\norm{\bB_{*j}}^2 (2\gamma)^2 = 4\gamma^2\norm{\bA }_F^2\norm{\bB }_F^2 \end{align*} $$

So $$\norme{\Err(\bA ,\bB )}_2 = \E[\norm{\mathbf{H}}_F^2]^{1/2}/(\norm{\bA }_F\norm{\bB }_F)\le 2\gamma,$$ as claimed, for `p=2`.

For `p>2`, use `\norme{X}_p^2 = \norme{X^2}_{p/2}`, and the triangle inequality for `\norme{\cdot}_{p/2}`.
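Since the decomposition of `\norm{\mathbf{H}}_F^2` above is an exact identity, it can be confirmed directly (sketch type and sizes below are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, dp, m = 1_000, 6, 5, 50

A = rng.standard_normal((n, d))
B = rng.standard_normal((n, dp))
S = rng.standard_normal((m, n)) / np.sqrt(m)

H = A.T @ S.T @ S @ B - A.T @ B

# Err_ij from its definition, column pair by column pair
Err = np.empty((d, dp))
for i in range(d):
    for j in range(dp):
        ai, bj = A[:, i], B[:, j]
        Err[i, j] = abs((S @ ai) @ (S @ bj) - ai @ bj) / (np.linalg.norm(ai) * np.linalg.norm(bj))

colA = np.linalg.norm(A, axis=0)       # ||A_{*i}||
colB = np.linalg.norm(B, axis=0)       # ||B_{*j}||
lhs = np.linalg.norm(H, 'fro') ** 2
rhs = np.sum(np.outer(colA, colB) ** 2 * Err ** 2)
print(lhs, rhs)                        # equal up to floating-point roundoff
```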

From embedding to matrix multiplication:
`F`-norm to tail

We can now apply the moment Markov inequality with
$$\norme{\Err(\bA ,\bB )}_p\le\eps\delta^{1/p}\text{ and } t = \eps,$$ so that $$\Prob\{\Err(\bA ,\bB )\ge t\}\le\norme{\Err(\bA ,\bB )}_p^p/t^p\le (\eps\delta^{1/p})^p/\eps^p = \delta$$ to obtain the matrix multiplication result $$ \begin{align*} \sup_\by \norme{\Err(\by )}_p\le\frac12\eps\delta^{1/p} & \implies\norme{\Err(\bA ,\bB )}_p\le\eps\delta^{1/p} \\ & \implies\Prob\{\Err(\bA ,\bB )\ge\eps\}\le\delta. \end{align*} $$

Further Reading

The matrix multiplication results are cribbed from Jelani Nelson's lecture notes.
See also:
Daniel M. Kane, Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. SODA, 1195-1206, 2012.