So far:
Now: Leverage scores
Recall: the leverage score of row `i` of `\bA ` is `\norm{\be_i^\top\bU }^2`, where `\bU ` is any matrix whose columns form an orthonormal basis of `\colspan(\bA )`.
Given `\bA\in\R^{n\times d}`, the algorithm:
`\bW = \bS *\bA `; | // compute sketch |
`[\bQ,\bR ] = \mathtt{qr}(\bW )`; | // compute change-of-basis for `\bS\bA ` |
`\bZ = \bA *(\bR^{-1}*\mathbf{\Pi})`; | // compute sketch of `\bA\bR^{-1}` |
return `\mathtt{dot}(\bZ,\bZ,2)`; | // return vector of squared row norms |
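A minimal NumPy sketch of the algorithm above, assuming a CountSketch-style sparse embedding (roughly the `s=1` case of the sparse embeddings with parameter `s` discussed below) for `\bS ` and a Gaussian JL matrix for `\mathbf{\Pi}`; the default sketch sizes are illustrative placeholders, not the tuned values from the analysis.

```python
import numpy as np

def approx_leverage_scores(A, m=None, m_pi=None, seed=0):
    """Estimate the row leverage scores of A (n x d), following the algorithm
    above: W = S*A, [Q,R] = qr(W), Z = A*(R^{-1}*Pi), return squared row norms
    of Z.  The default sizes m and m_pi are illustrative, not tuned."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    if m is None:
        m = 20 * d                                # rows of the sparse embedding S (assumption)
    if m_pi is None:
        m_pi = int(np.ceil(8 * np.log(n)))        # columns of the JL matrix Pi (assumption)

    # W = S*A for a CountSketch-style S: each row of A is hashed to one row of W
    # with a random sign, so forming W costs O(nnz(A)).
    rows = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    W = np.zeros((m, d))
    np.add.at(W, rows, signs[:, None] * A)

    _, R = np.linalg.qr(W)                        # change-of-basis for S*A
    Pi = rng.standard_normal((d, m_pi)) / np.sqrt(m_pi)   # JL matrix
    Z = A @ np.linalg.solve(R, Pi)                # Z = A*(R^{-1}*Pi)
    return np.einsum('ij,ij->i', Z, Z)            # squared row norms of Z
```

With a sparse embedding, forming `\bW = \bS\bA ` touches each nonzero of `\bA ` only a constant number of times, which is the source of the `O(\nnz{\bA }s)` term in the cost breakdown below.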
We have `\norm{\bA\bR^{-1}\bx }\approx_\eps\norm{\bS\bA\bR^{-1}\bx } = \norm{\bQ\bx } = \norm{\bx }` for all `\bx `,
so `\bA\bR^{-1}` has singular values `1\pm\eps`, and can stand in as a basis for `\colspan(\bA )`.
The algorithm and this analysis assume `\rank(\bA )=d`. Let `\bU ` be a matrix whose columns form an orthonormal basis of `\colspan(\bA )`, and let `\bT ` be the invertible matrix with `\bA\bR^{-1}\bT = \bU `, that is, `\bA\bR^{-1} = \bU\bT^{-1}`.
Since for any `\bx\in\R^d`, `\begin{align*}\norm{\bT\bx } = \norm{\bQ\bT\bx } & = \norm{\bS\bA\bR^{-1}\bT\bx }\\ & \approx_\eps\norm{\bA\bR^{-1}\bT\bx } = \norm{\bU\bx } = \norm{\bx },\end{align*}`
`\bT ` has singular values `1\pm\eps`, so `\bT^{-1}` has singular values `(1\pm\eps)^{-1} = 1\pm O(\eps )`. Therefore, the estimated row norm
$$\norm{\be_i^\top\bA\bR^{-1}\mathbf{\Pi}}\approx_\eps\norm{\be_i^\top\bA\bR^{-1}} = \norm{\be_i^\top\bU\bT^{-1}}\approx_{2\eps}\norm{\be_i^\top \bU }.$$
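As a hypothetical numerical sanity check of the middle equality in this chain (dropping `\mathbf{\Pi}`): the squared row norms of `\bA\bR^{-1}` should match the exact leverage scores `\norm{\be_i^\top\bU }^2` up to the embedding's distortion. The test matrix and sketch sizes below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 10000, 10, 2000                   # illustrative sizes (assumptions)
A = rng.standard_normal((n, d))
A[:5] *= 100.0                              # a few heavy rows, so leverage scores are non-uniform

# W = S*A for a CountSketch-style sparse embedding S
rows = rng.integers(0, m, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
W = np.zeros((m, d))
np.add.at(W, rows, signs[:, None] * A)
_, R = np.linalg.qr(W)

ARinv = np.linalg.solve(R.T, A.T).T         # A @ inv(R), without forming inv(R)
U, _ = np.linalg.qr(A)                      # exact orthonormal basis, for comparison only
est = np.einsum('ij,ij->i', ARinv, ARinv)   # estimated leverage scores ||e_i^T A R^{-1}||^2
exact = np.einsum('ij,ij->i', U, U)         # exact leverage scores ||e_i^T U||^2
print(np.max(np.abs(est - exact) / exact))  # relative error; shrinks as m grows
```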
`\bW = \bS *\bA `; | // `O(\nnz{\bA }s)` when `\bS ` is a sparse embedding, param `s` |
`[\bQ,\bR ] = \mathtt{qr}(\bW )`; | // `O(d^2m)` |
`\bZ = \bA *(\bR^{-1}*\mathbf{\Pi})`; | // `O(d^2m') +O(\nnz{\bA }m')` |
return `\mathtt{dot}(\bZ,\bZ,2)`; | // `O(nm')` |
Grand total, for `s=O(1/\eps)`, `m = O(d^{1+\gamma}\eps^{-2})`, and `m' = O(\eps^{-2}\log n)`: $$\begin{align*} O(\nnz{\bA }(m'+s)) & + O(d^2(m+m')) \\ & = O(\nnz{\bA}\eps^{-2}\log n) + O(d^2\eps^{-2}(\log(n) + d^{1+\gamma}))\end{align*}$$
Constant `\eps` is good enough for many purposes, such as leverage score sampling.
A hybrid approach removes the `\gamma`, at the cost of `\Theta(d^3)` space.
The algorithm is basically due to:
Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475-3506, 2012.
`\bA\bR^{-1}` is so well-conditioned (with `\sigma_i(\bA\bR^{-1})\approx_\eps 1` for small enough `\eps`) that even a simple iterative scheme converges provably fast.
We solve `\min_\bx\norm{\bA\bR^{-1}\bx -\bb}` and return `\bR^{-1}\bx `.
For concision, let `\bZ\gets\bR^{-1}`.
Choose some `\bx^{(0)}`, and let
$$\bx^{(m+1)}\gets\bx^{(m)} +(\bA\bZ)^\top (\bb-\bA\bZ\bx^{(m)}).$$
We show: $$\norm{\bA\bZ (\bx^{(m+1)} -\bx_*)} \le \frac12\norm{\bA\bZ (\bx^{(m)} -\bx_*)}$$ for small enough `\eps`, where `\bx_*` minimizes `\norm{\bA\bZ\bx -\bb}`.
The normal equations: $$(\bA\bZ )^\top \bb = (\bA\bZ )^\top (\bA\bZ )\bx_*,\text{ that is, }\bZ^\top\bA^\top \bb = \bZ^\top\bA^\top\bA\bZ\bx_*$$
We have $$\begin{align*} \bA\bZ & (\bx^{(m+1)} -\bx_*) \\ & = \bA\bZ (\bx^{(m)} +\bZ^\top\bA^\top(\bb -\bA\bZ\bx^{(m)}) -\bx_*) \\ & = (\bA\bZ -\bA\bZ\bZ^\top\bA^\top\bA\bZ )(\bx^{(m)}-\bx_*)\qquad (\text{normal equations}) \\ & = \bU (\bSig -\bSig^3)\bV^\top (\bx^{(m)}-\bx_*), \end{align*}$$ where `\bA\bZ = \bU\bSig\bV^\top` is the SVD, and `(\bU\bSig\bV^\top)(\bV\bSig\bU^\top)(\bU\bSig\bV^\top) = \bU\bSig (\bV^\top\bV )\bSig (\bU^\top\bU )\bSig\bV^\top = \bU\bSig^3\bV^\top`.
Since all `\sigma_i = \bSig_{i,i} = 1\pm\eps`, we have `|\sigma_i -\sigma_i^3| = \sigma_i\,|1 -\sigma_i^2| \le (1+\eps)(2\eps +\eps^2)\le 3\eps` (say, for `\eps\le 1/4`), so $$\norm{\bA\bZ (\bx^{(m+1)} -\bx_*)}\le\frac12\norm{\bA\bZ (\bx^{(m)} -\bx_*)}$$ for small enough `\eps` (e.g. `\eps\le 1/6`), as claimed.
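A minimal NumPy sketch of the resulting solver. For robustness of this small demo it uses a dense Gaussian sketch for `\bS ` and fixed, generous values of the sketch size and iteration count; those choices are assumptions for illustration (an input-sparsity-time implementation would use a sparse embedding, as above).

```python
import numpy as np

def sketch_preconditioned_lsq(A, b, m=None, iters=50, seed=0):
    """Solve min_x ||A x - b|| by the scheme above: take R from a QR of S*A,
    set Z = R^{-1}, iterate x <- x + (A Z)^T (b - A Z x), and return Z x.
    The Gaussian sketch and the defaults for m and iters are demo assumptions."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    if m is None:
        m = 100 * d                             # sketch rows; generous, so eps is small
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    _, R = np.linalg.qr(S @ A)                  # A R^{-1} is then well-conditioned

    x = np.zeros(d)                             # x^{(0)}, in the preconditioned coordinates
    for _ in range(iters):
        r = b - A @ np.linalg.solve(R, x)       # residual b - A Z x
        x = x + np.linalg.solve(R.T, A.T @ r)   # x + (A Z)^T r, since (A Z)^T = R^{-T} A^T
    return np.linalg.solve(R, x)                # return Z x = R^{-1} x

# Usage: compare against a direct least-squares solve.
rng = np.random.default_rng(1)
A = rng.standard_normal((3000, 15))
b = rng.standard_normal(3000)
x_hat = sketch_preconditioned_lsq(A, b)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_hat - x_ref) / np.linalg.norm(x_ref))   # should be tiny
```

In practice one would iterate until the residual stops improving; Blendenpik (below) pairs the same kind of sketch-based preconditioner with LSQR rather than this plain iteration.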
A serious use of sketching for pre-conditioning iterative algorithms was:
Haim Avron, Petar Maymounkov, and Sivan Toledo. Blendenpik: Supercharging LAPACK's Least-Squares Solver. SIAM Journal on Scientific Computing, 32(3):1217-1236, 2010.
The above simplified version is from:
Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. Proc. 45th Annual ACM Symposium on the Theory of Computing (STOC), 81-90, 2013.