\( \newcommand\norm[1]{|| #1 ||} \newcommand\nnz[1]{{\textrm{nnz}(#1)}} \newcommand\E{\mathbf{E}} \newcommand\Dist{\mathbf{Dist}} \newcommand\reals{{\mathrm{I\!R}}} \newcommand\R{{\mathrm{I\!R}}} \newcommand\poly{{\textrm{poly}}} \newcommand\rank{{\textrm{rank}}} \newcommand\colspan{{\textrm{range}}} \newcommand\diam{{\textrm{diam}}} \newcommand\polylog{{\textrm{polylog}}} \newcommand\rowspan{{\textrm{rowspan}}} \newcommand\colbasis{{\textrm{colbasis}}} \newcommand\argmin{{\mathrm{argmin}}} \newcommand\Err{{\mathtt{NE}}} \newcommand\eps{{\epsilon}} \newcommand\bfone{{\mathbf{1}}} \newcommand\Var{{\mathbf{Var}}} \newcommand\stdev[1]{{\mathbf{stdev(#1)}}} \newcommand\cgf{{\mathbf{CGF}}} \newcommand\bfone{{\mathbb{1}}} \newcommand\Iden{{\mathbf{I}}} \newcommand\Prob{{\mathbf{Pr}}} \DeclareMathOperator\subg{\mathbf{subG}} \newcommand\cN{{\cal N}} \newcommand\cS{{\cal S}} \newcommand\cD{{\cal D}} \newcommand\cM{{\cal M}} \newcommand\cE{{\cal E}} \newcommand\cC{{\cal C}} \newcommand\KJL{K_{\mathbb{JL}}} \newcommand\tO{{\tilde O}} \newcommand\tx{{\tilde{\mathbf{x}}}} \newcommand\tX{{\tilde{\mathbf{X}}}} \newcommand\tY{{\tilde{\mathbf{Y}}}} \newcommand\tA{{\tilde{\mathbf{A}}}} \newcommand\bA{{\mathbf{A}}} \newcommand\ba{{\mathbf{a}}} \newcommand\bw{{\mathbf{w}}} \newcommand\bW{{\mathbf{W}}} \newcommand\bR{{\mathbf{R}}} \newcommand\bB{{\mathbf{B}}} \newcommand\bQ{{\mathbf{Q}}} \newcommand\bZ{{\mathbf{Z}}} \newcommand\bz{{\mathbf{z}}} \newcommand\bU{{\mathbf{U}}} \newcommand\bV{{\mathbf{V}}} \newcommand\bX{{\mathbf{X}}} \newcommand\bS{{\mathbf{S}}} \newcommand\hS{\hat {\mathbf{S}}} \newcommand\bs{{\mathbf{s}}} \newcommand\bx{{\mathbf{x}}} \newcommand\by{{\mathbf{y}}} \newcommand\bb{{\mathbf{b}}} \newcommand\bT{{\mathbf{T}}} \newcommand\bP{{\mathbf{P}}} \newcommand\bG{{\mathbf{G}}} \newcommand\bY{{\mathbf{Y}}} \newcommand\hY{\hat {\mathbf{Y}}} \newcommand\bH{{\mathbf{H}}} \newcommand\bD{{\mathbf{D}}} \newcommand\be{{\mathbf{e}}} \newcommand\br{{\mathbf{r}}} \newcommand\bSig{{\mathbf{\Sigma}}} \newcommand\bPi{{\mathbf{\Pi}}} \newcommand\hphi{{\hat\phi}} \)

Sketching for Matrix Computations:
Embeddings and Regression

Ken Clarkson
IBM Research, Almaden

  • These slides at
  • Enable javascript, navigate as in powerpoint
  • Math may take some time to render properly; if it doesn't end up looking right, try reloading

Sketching Matrices:
Oblivious Subspace Embeddings

Matrix `\bS\in\R^{m\times n}` is an `\eps`-embedding of set `P\subset\R^n` if $$\norm{\bS\by }\approx_\eps\norm{\by }\text{ for all }\by\in P.$$ `\bS` will be called a "sketching matrix".
`a\approx_\eps b\equiv | a - b |\le\eps |b|`
For `\bA\in\R^{n\times d}`,
a matrix `\bS ` is a subspace `\eps`-embedding for `\bA ` if `\bS ` is an `\eps`-embedding for `\colspan(\bA )= \{\bA\bx\mid\bx\in\R^d\}`. That is, $$\norm{\bS\bA\bx }\approx_\eps\norm{\bA\bx }\text{ for all }\bx\in\R^d.$$ `\bS\bA` will be called a "sketch".

An oblivious subspace embedding is:
  • A probability distribution `\cal D` over matrices `\bS\in\R^{m\times n}`, so that
  • For any unknown but fixed matrix `\bA`,
    `\bS ` is a subspace `\eps`-embedding for `\bA `, with high-enough probability.

Least-squares regression

  • Input is `n` points in `d+1` dimensions
  • Points have the form `(\bA_{i*},\bb_i)`, where:
    • `\bA ` is an `n\times d` matrix
    • `\bA_{i*}\in\reals^d` is row `i` of a matrix `\bA `
    • `\bb\in\reals^n` is a column vector
  • Regression [G95][L05]:
    find a good `\bx\in\reals^d`
    so that `\bA_i\bx ` is a good prediction for `\bb_i`
  • "Good"= $$ \min_{\bx\in\reals^d}\sum_i (\bA_i\bx -\bb_i)^2 = \min_{\bx\in\reals^d}\norm{\bA\bx -\bb}_2^2 $$
    • `\br\equiv\bA\bx -\bb` is the residual
    • `\norm{\br}_2^2 = \sum_i\br_i^2`, the squared Euclidean length of `\br`

Subspace embeddings for regression

Suppose `\bS` is a subspace embedding for `\colspan([\bA\ \bb])`, so $$\norm{\bS [\bA\ \bb]\by }\approx_\eps\norm{[\bA\ \bb]\by }\text{ for }\by\in\R^{d+1}.$$ Then in particular this is true when `\by=\left[\begin{smallmatrix}x\\ -1\end{smallmatrix}\right]`, `x\in\R^d`, so $$\norm{\bS (\bA\bx -\bb)}\approx_\eps\norm{\bA\bx -\bb} = \norm{\br}.$$

That is,
if $$\begin{align*} \tx\equiv & \argmin_{\bx\in\R^d}\norm{\bS (\bA\bx -\bb)} \\ \bx^*\equiv & \argmin_{\bx\in\R^d}\norm{\bA\bx -\bb},\end{align*} $$ then $$ \begin{align*} \norm{\bA\tx-\bb} & \le\frac{\norm{\bS (\bA\tx-\bb)}}{1-\eps} \le\frac{\norm{\bS (\bA\bx^*-\bb)}} {1-\eps} \le\norm{\bA\bx^*-\bb}\frac{1+\eps}{1-\eps}, \end{align*}$$
and so for `\eps\le 1/3`, `\norm{\bA\tx-\bb}\le\norm{\bA\bx^*-\bb}(1+3\eps)`.

Solving least-squares in "sketch space"
yields a good solution in the original space.

The advantages of obliviousness

`\bS ` is oblivious: the distribution `\cal D`
does not depend on the input data.
Algorithmic advantages:
  • Construct `\bS `, then input `\bA `
  • Streaming: when an entry of `\bA ` changes, `\bS\bA ` is easy to update
  • Distributed: if `\bA = \sum_p\bA^p` over processors `p`:
    • Compute `\bS\bA^p` at each processor
    • Ship to master for summation

The advantages of obliviousness: analytic

Analysis advantages: for any condition `\cE`
$$ \cE\text{ holds for }\bS\bA\bx\text{ for all }\bx\in\R^d\\ \iff\\ \cE\text{ holds for }\bS\by\text{ for all }\by\in\colspan(\bA )\\ \iff\\ \cE\text{ holds for }\bS\bU\mathbf{z}\text{ for all }\mathbf{z}\in\R^{\rank(\bA )},\\ \qquad\qquad\bU\text{ an orthogonal basis of }\colspan(\bA ) $$

So often WLOG,
`\bA ` is full rank and has orthonormal columns, that is, is an orthonormal matrix.

Also, constraints "for free", meaning: $$\min_{\bx\in{\cal C}}\norm{\bS\bA\bx }\approx\min_{\bx\in{\cal C}}\norm{\bA\bx },$$ for any constraint set `\cal C`, such as `\cC=\{x\mid x\ge 0\}`.