\( \newcommand\norm[1]{|| #1 ||} \newcommand\nnz[1]{{\textrm{nnz}(#1)}} \newcommand\E{\mathbf{E}} \newcommand\Dist{\mathbf{Dist}} \newcommand\reals{{\mathrm{I\!R}}} \newcommand\R{{\mathrm{I\!R}}} \newcommand\poly{{\textrm{poly}}} \newcommand\rank{{\textrm{rank}}} \newcommand\colspan{{\textrm{range}}} \newcommand\diam{{\textrm{diam}}} \newcommand\polylog{{\textrm{polylog}}} \newcommand\rowspan{{\textrm{rowspan}}} \newcommand\colbasis{{\textrm{colbasis}}} \newcommand\argmin{{\mathrm{argmin}}} \newcommand\Err{{\mathtt{NE}}} \newcommand\eps{{\epsilon}} \newcommand\bfone{{\mathbf{1}}} \newcommand\Var{{\mathbf{Var}}} \newcommand\stdev[1]{{\mathbf{stdev(#1)}}} \newcommand\cgf{{\mathbf{CGF}}} \newcommand\bfone{{\mathbb{1}}} \newcommand\Iden{{\mathbf{I}}} \newcommand\Prob{{\mathbf{Pr}}} \DeclareMathOperator\subg{\mathbf{subG}} \newcommand\cN{{\cal N}} \newcommand\cS{{\cal S}} \newcommand\cD{{\cal D}} \newcommand\cM{{\cal M}} \newcommand\cE{{\cal E}} \newcommand\cC{{\cal C}} \newcommand\KJL{K_{\mathbb{JL}}} \newcommand\tO{{\tilde O}} \newcommand\tx{{\tilde{\mathbf{x}}}} \newcommand\tX{{\tilde{\mathbf{X}}}} \newcommand\tY{{\tilde{\mathbf{Y}}}} \newcommand\tA{{\tilde{\mathbf{A}}}} \newcommand\bA{{\mathbf{A}}} \newcommand\ba{{\mathbf{a}}} \newcommand\bw{{\mathbf{w}}} \newcommand\bW{{\mathbf{W}}} \newcommand\bR{{\mathbf{R}}} \newcommand\bB{{\mathbf{B}}} \newcommand\bQ{{\mathbf{Q}}} \newcommand\bZ{{\mathbf{Z}}} \newcommand\bz{{\mathbf{z}}} \newcommand\bU{{\mathbf{U}}} \newcommand\bV{{\mathbf{V}}} \newcommand\bX{{\mathbf{X}}} \newcommand\bS{{\mathbf{S}}} \newcommand\hS{\hat {\mathbf{S}}} \newcommand\bs{{\mathbf{s}}} \newcommand\bx{{\mathbf{x}}} \newcommand\by{{\mathbf{y}}} \newcommand\bb{{\mathbf{b}}} \newcommand\bT{{\mathbf{T}}} \newcommand\bP{{\mathbf{P}}} \newcommand\bG{{\mathbf{G}}} \newcommand\bY{{\mathbf{Y}}} \newcommand\hY{\hat {\mathbf{Y}}} \newcommand\bH{{\mathbf{H}}} \newcommand\bD{{\mathbf{D}}} \newcommand\be{{\mathbf{e}}} \newcommand\br{{\mathbf{r}}} \newcommand\bSig{{\mathbf{\Sigma}}} \newcommand\bPi{{\mathbf{\Pi}}} \newcommand\hphi{{\hat\phi}} \)

Sketching for Matrix Computations:
Leverage Scores & Pre-Conditioned Least Squares

Ken Clarkson
IBM Research, Almaden



  • These slides at
    http://kenclarkson.org/G2S3_2015/
  • Enable JavaScript; navigate as in PowerPoint
  • Math may take some time to render properly; if it doesn't end up looking right, try reloading

Applications of sparse embeddings:
leverage scores

So far:

  • Least-squares regression
  • Approximate matrix multiplication

Now: Leverage scores

Recall:

Given `\bA `, with `\bU ` an orthonormal basis for `\colspan(\bA )`, the `i`'th leverage score of `\bA ` is `\norm{\bU_{i*}}^2 = \norm{\be_i^\top\bU }^2`.
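As a concrete baseline (not from the original slides), here is a minimal NumPy sketch computing exact leverage scores from an orthonormal basis of `\colspan(\bA )`; the name exact_leverage_scores is illustrative.

    import numpy as np

    def exact_leverage_scores(A):
        # Exact leverage scores: squared row norms of an orthonormal
        # basis U for range(A), here obtained from a thin SVD.
        # Assumes A has full column rank, as in these slides.
        U, _, _ = np.linalg.svd(A, full_matrices=False)
        return np.sum(U**2, axis=1)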

Computing leverage scores

Given

  • Input matrix `\bA `
  • subspace `\eps`-embedding matrix `\bS\in\R^{m\times n}` for `\bA `
  • embedding matrix `\mathbf{\Pi}\in\R^{d\times m'}`,
    so `\norm{\bx\mathbf{\Pi}}\approx_\eps\norm{\bx }` for any fixed set of `n` row vectors `\bx ` (here, the rows of `\bA\bR^{-1}`)
    • `m'=O((\log n)/\eps^2)`

The algorithm:

`\bW = \bS *\bA `; // compute sketch
`[\bQ,\bR ] = \mathtt{qr}(\bW )`; // compute change-of-basis for `\bS\bA `
`\bZ = \bA *(\bR^{-1}*\mathbf{\Pi})`; // compute sketch of `\bA\bR^{-1}`
return `\mathtt{dot}(\bZ,\bZ,2)`; // return vector of squared row norms
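A minimal NumPy rendering of these four lines, with illustrative choices not fixed by the slide: a CountSketch-style `\bS ` with one nonzero per column (`s=1`) and a dense Gaussian `\mathbf{\Pi}`. Dense matrices are used throughout for clarity, so this sketch does not reflect the `\nnz{\bA }`-type running times discussed below.

    import numpy as np

    def countsketch(m, n, rng):
        # CountSketch-style sparse embedding: each column has a single
        # random +-1 entry in a uniformly random row (s = 1).
        S = np.zeros((m, n))
        S[rng.integers(0, m, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
        return S

    def approx_leverage_scores(A, m, m_prime, rng=None):
        # Approximate leverage scores of A (n x d, assumed full column rank).
        rng = np.random.default_rng() if rng is None else rng
        n, d = A.shape
        S = countsketch(m, n, rng)                  # subspace embedding for A
        W = S @ A                                   # W = S*A: compute sketch
        Q, R = np.linalg.qr(W)                      # change-of-basis for S*A
        Pi = rng.standard_normal((d, m_prime)) / np.sqrt(m_prime)  # JL map
        Z = A @ np.linalg.solve(R, Pi)              # sketch of A*R^{-1}
        return np.sum(Z**2, axis=1)                 # squared row norms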

Computing leverage scores: correctness

We have `\norm{\bA\bR^{-1}\bx }\approx_\eps\norm{\bS\bA\bR^{-1}\bx } = \norm{\bQ\bx } = \norm{\bx }`,
for all `\bx`, so `\bA\bR^{-1}` has singular values `1\pm\eps`, and can stand in as a basis for `\colspan(\bA )`.

The algorithm and this analysis assume `\rank(\bA )=d`. Let

  • `\bA = \bU\bSig\bV^\top`
  • `\bT\equiv\bR\bV\bSig^+`, so that `\bA\bR^{-1}\bT = \bU\bSig\bV^\top\bR^{-1}\bR\bV\bSig^+ = \bU `

Since for any `\bx\in\R^d`, `\begin{align*}\norm{\bT\bx } = \norm{\bQ\bT\bx } & = \norm{\bS\bA\bR^{-1}\bT\bx }\\ & \approx_\eps\norm{\bA\bR^{-1}\bT\bx } = \norm{\bU\bx } = \norm{\bx },\end{align*}`

`\bT ` has singular values `1\pm\eps`, so
`\bT^{-1}` has singular values `1\pm 2\eps` for `\eps\le 1/2`.

Therefore, the estimated row norm
$$\norm{\be_i^\top\bA\bR^{-1}\mathbf{\Pi}}\approx_\eps\norm{\be_i^\top\bA\bR^{-1}} = \norm{\be_i^\top\bU\bT^{-1}}\approx_{2\eps}\norm{\be_i^\top \bU }.$$
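A quick numerical check of this bound, using the two illustrative helpers sketched above (exact_leverage_scores and approx_leverage_scores); with `s=1`, an embedding dimension `m` on the order of `d^2/\eps^2` is appropriate for the CountSketch used there.

    rng = np.random.default_rng(0)
    n, d, eps = 2000, 20, 0.5
    A = rng.standard_normal((n, d))
    A[:5] *= 50.0      # a few heavy rows, hence some large leverage scores

    exact = exact_leverage_scores(A)
    approx = approx_leverage_scores(A, m=int(d**2 / eps**2),
                                    m_prime=int(np.log(n) / eps**2), rng=rng)
    # max relative error; a small multiple of eps (the constants hidden in m, m' matter)
    print(np.max(np.abs(approx - exact) / exact))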


Computing leverage scores: running time

`\bW = \bS *\bA `; // `O(\nnz{\bA }s)` when `\bS ` is a sparse embedding, param `s`
`[\bQ,\bR ] = \mathtt{qr}(\bW )`; // `O(d^2m)`
`\bZ = \bA *(\bR^{-1}*\mathbf{\Pi})`; // `O(d^2m') + O(\nnz{\bA }m')`
return `\mathtt{dot}(\bZ,\bZ,2)`; // `O(nm')`

Grand total, for `s=O(1/\eps)`: $$\begin{align*} O(\nnz{\bA }(m'+s)) & + O(d^2(m+m')) \\ & = O(\nnz{\bA}\eps^{-2}\log n) + O(d^2\eps^{-2}(\log(n) + d^{1+\gamma}))\end{align*}$$
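Spelling out that substitution, assuming (as in the earlier sparse-embedding discussion, not restated on this slide) an embedding dimension `m = O(d^{1+\gamma}/\eps^2)`:
$$\begin{align*} \nnz{\bA }(m'+s) & = \nnz{\bA }\left(O(\eps^{-2}\log n) + O(\eps^{-1})\right) = O(\nnz{\bA }\eps^{-2}\log n),\\ d^2(m+m') & = d^2\left(O(d^{1+\gamma}\eps^{-2}) + O(\eps^{-2}\log n)\right) = O(d^2\eps^{-2}(\log(n) + d^{1+\gamma})).\end{align*}$$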

Leverage scores, finis

Given `\bA\in\R^{n\times d}` and `\eps\gt 0`, the leverage scores of `\bA` can be computed with relative error `\eps` in $$O(\nnz{\bA}\eps^{-2}\log n) + O(d^2\eps^{-2}(\log(n) + d^{1+\gamma}))$$ expected time.

Constant `\eps` is good enough for many purposes, such as leverage score sampling.

A hybrid approach removes the `\gamma`, at the cost of `\Theta(d^3)` space.

Further Reading

The algorithm is basically due to:

Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475-3506, 2012.

Least-squares regression,
now with pre-conditioner

`\bA\bR^{-1}` is so well-conditioned (with `\sigma_i(\bA\bR^{-1})\approx_\eps 1` for small enough `\eps`) that even a simple iterative scheme converges provably fast.

We solve `\min_\bx\norm{\bA\bR^{-1}\bx -\bb}` and return `\bR^{-1}\bx `.
For concision, let `\bZ\gets\bR^{-1}`.
Choose some `\bx^{(0)}`, and let $$\bx^{(m+1)}\gets\bx^{(m)} +(\bA\bZ)^\top (\bb-\bA\bZ\bx^{(m)}).$$

We show: $$\norm{\bA\bZ (\bx^{(m+1)} -\bx_*)} \le \frac12\norm{\bA\bZ (\bx^{(m)} -\bx_*)}$$ for small enough `\eps`.

The normal equations: $$(\bA\bZ )^\top \bb = (\bA\bZ )^\top (\bA\bZ )\bx_*,\text{ that is, }\bZ^\top\bA^\top \bb = \bZ^\top\bA^\top\bA\bZ\bx_*$$

Least-squares regression,
with pre-conditioner

We have $$\begin{align*} \bA\bZ & (\bx^{(m+1)} -\bx_*) \\ & = \bA\bZ (\bx^{(m)} +\bZ^\top\bA^\top(\bb-\bA\bZ\bx^{(m)}) -\bx_*) \\ & = (\bA\bZ -\bA\bZ\bZ^\top\bA^\top\bA\bZ )(\bx^{(m)}-\bx_*)\qquad (\text{normal equations}) \\ & = \bU (\bSig -\bSig^3)\bV^\top (\bx^{(m)}-\bx_*), \end{align*}$$ where `\bA\bZ = \bU\bSig\bV^\top`, and `(\bU\bSig\bV^\top)(\bV\bSig\bU^\top)(\bU\bSig\bV^\top) = \bU\bSig (\bV^\top\bV )\bSig (\bU^\top\bU )\bSig\bV^\top = \bU\bSig^3\bV^\top`.

Since `\bSig -\bSig^3 = (\Iden -\bSig^2)\bSig` and all `\sigma_i = \bSig_{ii} = 1\pm\eps`, each diagonal entry of `\Iden -\bSig^2` satisfies `|1 - \sigma_i^2| = |1-\sigma_i|(1+\sigma_i) \le \eps(2+\eps)`, so $$\norm{\bA\bZ (\bx^{(m+1)} -\bx_*)} = \norm{\bU (\Iden -\bSig^2)\bSig\bV^\top (\bx^{(m)}-\bx_*)} \le \eps(2+\eps)\norm{\bA\bZ (\bx^{(m)} -\bx_*)} \le \frac12\norm{\bA\bZ (\bx^{(m)} -\bx_*)}$$ for small enough `\eps` (e.g. `\eps\le 1/5`), as claimed.
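To make the iteration concrete, here is a minimal NumPy sketch under the same illustrative conventions as before; `\bR ` would come from `\mathtt{qr}(\bS\bA )` as in the leverage-score algorithm. For clarity `\bA\bZ ` is formed explicitly, whereas an input-sparsity implementation would keep `\bA ` and apply `\bR^{-1}` by triangular solves each iteration.

    import numpy as np

    def precond_lsq(A, b, R, num_iters=50):
        # Solve min_x ||A x - b|| via the simple preconditioned iteration:
        # with Z = R^{-1}, A*Z is well-conditioned (sigma_i(AZ) ~ 1 +- eps).
        n, d = A.shape
        AZ = A @ np.linalg.inv(R)           # A*R^{-1}
        x = np.zeros(d)                     # x^{(0)}
        for _ in range(num_iters):
            x = x + AZ.T @ (b - AZ @ x)     # x^{(m+1)} = x^{(m)} + (AZ)^T (b - AZ x^{(m)})
        return np.linalg.solve(R, x)        # return R^{-1} x

Since the error contracts by a factor `1/2` per iteration, `O(\log(1/\delta))` iterations suffice for relative error `\delta` in `\norm{\bA\bZ (\bx^{(m)}-\bx_*)}`.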

Further Reading

A serious use of sketching for pre-conditioning iterative algorithms was:

Haim Avron, Petar Maymounkov, and Sivan Toledo. Blendenpik: Supercharging LAPACK's Least-Squares Solver. SIAM Journal on Scientific Computing 2010 32:3, 1217-1236.

The above simplified version is from:

Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. Proc.45th Annual ACM Symposium on the Theory of Computing (STOC), 81-90, 2013.