\(
\newcommand\norm[1]{|| #1 ||}
\newcommand\nnz[1]{{\textrm{nnz}(#1)}}
\newcommand\E{\mathbf{E}}
\newcommand\Dist{\mathbf{Dist}}
\newcommand\reals{{\mathrm{I\!R}}}
\newcommand\R{{\mathrm{I\!R}}}
\newcommand\poly{{\textrm{poly}}}
\newcommand\rank{{\textrm{rank}}}
\newcommand\colspan{{\textrm{range}}}
\newcommand\diam{{\textrm{diam}}}
\newcommand\polylog{{\textrm{polylog}}}
\newcommand\rowspan{{\textrm{rowspan}}}
\newcommand\colbasis{{\textrm{colbasis}}}
\newcommand\argmin{{\mathrm{argmin}}}
\newcommand\Err{{\mathtt{NE}}}
\newcommand\eps{{\epsilon}}
\newcommand\Var{{\mathbf{Var}}}
\newcommand\stdev[1]{{\mathbf{stdev(#1)}}}
\newcommand\cgf{{\mathbf{CGF}}}
\newcommand\bfone{{\mathbb{1}}}
\newcommand\Iden{{\mathbf{I}}}
\newcommand\Prob{{\mathbf{Pr}}}
\DeclareMathOperator\subg{\mathbf{subG}}
\newcommand\cN{{\cal N}}
\newcommand\cS{{\cal S}}
\newcommand\cD{{\cal D}}
\newcommand\cM{{\cal M}}
\newcommand\cE{{\cal E}}
\newcommand\cC{{\cal C}}
\newcommand\KJL{K_{\mathbb{JL}}}
\newcommand\tO{{\tilde O}}
\newcommand\tx{{\tilde{\mathbf{x}}}}
\newcommand\tX{{\tilde{\mathbf{X}}}}
\newcommand\tY{{\tilde{\mathbf{Y}}}}
\newcommand\tA{{\tilde{\mathbf{A}}}}
\newcommand\bA{{\mathbf{A}}}
\newcommand\ba{{\mathbf{a}}}
\newcommand\bw{{\mathbf{w}}}
\newcommand\bW{{\mathbf{W}}}
\newcommand\bR{{\mathbf{R}}}
\newcommand\bB{{\mathbf{B}}}
\newcommand\bQ{{\mathbf{Q}}}
\newcommand\bZ{{\mathbf{Z}}}
\newcommand\bz{{\mathbf{z}}}
\newcommand\bU{{\mathbf{U}}}
\newcommand\bV{{\mathbf{V}}}
\newcommand\bX{{\mathbf{X}}}
\newcommand\bS{{\mathbf{S}}}
\newcommand\hS{\hat {\mathbf{S}}}
\newcommand\bs{{\mathbf{s}}}
\newcommand\bx{{\mathbf{x}}}
\newcommand\by{{\mathbf{y}}}
\newcommand\bb{{\mathbf{b}}}
\newcommand\bT{{\mathbf{T}}}
\newcommand\bP{{\mathbf{P}}}
\newcommand\bG{{\mathbf{G}}}
\newcommand\bY{{\mathbf{Y}}}
\newcommand\hY{\hat {\mathbf{Y}}}
\newcommand\bH{{\mathbf{H}}}
\newcommand\bD{{\mathbf{D}}}
\newcommand\be{{\mathbf{e}}}
\newcommand\br{{\mathbf{r}}}
\newcommand\bSig{{\mathbf{\Sigma}}}
\newcommand\bPi{{\mathbf{\Pi}}}
\newcommand\hphi{{\hat\phi}}
\)
Sketching for Matrix Computations:
Subsampled Randomized Hadamard Transforms
Ken Clarkson
IBM Research, Almaden
- These slides at
http://kenclarkson.org/G2S3_2015/
- Enable javascript, navigate as in powerpoint
- Math may take some time to render properly; if it doesn't end up looking right, try reloading
Fast Hadamard
- In the original JL:
- Pick random orthogonal matrix `\bQ `
- Sketching matrix `\bS\equiv` the first `m` rows of `\bQ `
- `O(mnd)` time to compute `\bS\bA `
- A faster scheme: pick a random orthogonal matrix, but:
- "Less random"
- Faster to apply
The SRHT is a matrix `\bP\bH\bD`, applied on the left, where:
- `\bD` is diagonal with i.i.d. uniform `\pm 1` entries on its diagonal
- `\bH` is a Hadamard matrix (orthogonal, fast multiply)
- `\bP\in\R^{m\times n}` uniformly samples rows of `\bH\bD`
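A minimal, dense illustration of applying `\bS =\bP\bH\bD ` (a sketch assuming NumPy/SciPy; `srht_apply` is just an illustrative name, and the `\sqrt{n/m}` rescaling and sampling without replacement are standard conventions not stated above):
```python
import numpy as np
from scipy.linalg import hadamard

def srht_apply(A, m, rng):
    """Illustrative dense SRHT: returns (P H D) A for an n x d array A, n a power of two.

    Forms H explicitly, so it is slow; the fast Hadamard multiply on the next slides
    avoids this. Sampled rows are rescaled by sqrt(n/m) so that E[S^T S] = I (an
    assumed, standard convention).
    """
    n = A.shape[0]
    D = rng.choice([-1.0, 1.0], size=n)            # random signs: the diagonal of D
    H = hadamard(n) / np.sqrt(n)                   # orthogonal Hadamard matrix
    sample = rng.choice(n, size=m, replace=False)  # P: uniform row sample
    return np.sqrt(n / m) * (H @ (D[:, None] * A))[sample]

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 10))
SA = srht_apply(A, 64, rng)                        # a 64 x 10 sketch of A
```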
Hadamard: definition
Hadamard matrices have a recursive construction
Let `\bH_0\in\R^{1\times 1}` be the matrix `[1]`.
Let `\bH_{i+1}\equiv\frac{1}{\sqrt{2}}\left[\begin{matrix}\bH_i & \bH_i\\\bH_i & -\bH_i\end{matrix}\right]` for `i\ge 0`
So
$$
\begin{align*}
\bH_1 & \equiv\frac{1}{\sqrt{2}}\left[\begin{matrix} 1 & 1\\ 1 & -1\end{matrix}\right],
\bH_2 = \frac{1}{2}\left[\begin{matrix} 1 & 1 & 1 & 1\\ 1 & -1 & 1 & -1\\1 & 1 & -1 & -1\\1 & -1 & -1 & 1\\\end{matrix}\right]
\end{align*}
$$
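A quick check of the recursion (a minimal NumPy sketch; `hadamard_rec` is an illustrative helper name):
```python
import numpy as np

def hadamard_rec(k):
    """Build H_k (2^k x 2^k) directly from the recursive definition above."""
    H = np.ones((1, 1))
    for _ in range(k):
        H = np.block([[H, H], [H, -H]]) / np.sqrt(2)
    return H

H2 = hadamard_rec(2)
print(np.round(2 * H2).astype(int))        # the 4x4 sign pattern shown above
print(np.allclose(H2.T @ H2, np.eye(4)))   # orthogonality: True
```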
Hadamard: properties
- Orthogonality: easy to see inductively
$$\bH_{i+1}^\top\bH_{i+1} = \bH_{i+1}^2 = \frac12\left[\begin{matrix}\bH_i^2 +\bH_i^2 & \bH_i^2 -\bH_i^2\\\bH_i^2 -\bH_i^2
& \bH_i^2 +\bH_i^2\end{matrix}\right] = \Iden$$
- Fast multiply: suppose `\bx =\left[\begin{smallmatrix} \bx' \\ \bx''\end{smallmatrix}\right]\in\R^{2^{i+1}}`
- Where `\bx ',\bx ''\in\R^{2^{i}}`
- Then `\bH_{i+1}\bx =\frac{1}{\sqrt{2}}\left[\begin{smallmatrix} \bH_i \bx' +\bH_i\bx'' \\ \bH_i\bx' - \bH_i\bx''\end{smallmatrix}\right]`,
so `\bH_{i+1}\bx` is computable in linear time from `\bH_i\bx'` and `\bH_i\bx''`
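The same butterfly can be written iteratively; a minimal sketch (assuming NumPy) that computes `\bH_k\bx` in `O(n\log n)` time without ever forming `\bH_k`:
```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform: H_k @ x for len(x) = 2^k.

    Each level applies the butterfly from the recursion above, including its
    1/sqrt(2) scaling, so the overall transform is orthogonal.
    """
    x = np.array(x, dtype=float)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            a = x[start:start + h].copy()
            b = x[start + h:start + 2 * h].copy()
            x[start:start + h] = (a + b) / np.sqrt(2)          # H_i x' + H_i x'', rescaled
            x[start + h:start + 2 * h] = (a - b) / np.sqrt(2)  # H_i x' - H_i x'', rescaled
        h *= 2
    return x

print(fwht([1.0, 0.0, 0.0, 0.0]))   # first column of H_2 above: [0.5, 0.5, 0.5, 0.5]
```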
Analysis of Fast Hadamard (Outline)
There is
$$m = O(\eps^{-2}(\sqrt{d}+\sqrt{\log(nd)})^2\log d) = \tO(d/\eps^2)$$
so that
with failure probability `O(1/d)`, `\bP\bH\bD ` is an `\eps`-embedding
of `\colspan(\bA )`. `\bP\bH\bD\bA ` can be computed in `O(nd\log m)` time.
Proof outline:
- Bound the leverage scores of `\bH\bD\bA ` by roughly `O(d/n)`:
$$\left(\sqrt{d/n}+\sqrt{8\log(n/\beta)/n}\right)^2$$
with failure probability `\beta`
- Use leverage score sampling, paying a factor `\eps^{-2}\log d`.
The second step (leverage score sampling) is proven elsewhere in these lectures; the sketch below checks the first step numerically.
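A small numerical check (a sketch assuming NumPy/SciPy; the sizes and `\beta` are arbitrary): with `\bA ` orthonormalized, the squared row norms of `\bH\bD\bA ` (its leverage scores) stay near `d/n` and below the stated bound.
```python
import numpy as np
from scipy.linalg import hadamard, qr

rng = np.random.default_rng(1)
n, d, beta = 1024, 10, 0.01
A, _ = qr(rng.standard_normal((n, d)), mode='economic')  # orthonormal columns (WLOG)
H = hadamard(n) / np.sqrt(n)
D = rng.choice([-1.0, 1.0], size=n)
lev = ((H @ (D[:, None] * A)) ** 2).sum(axis=1)          # leverage scores of H D A
bound = (np.sqrt(d / n) + np.sqrt(8 * np.log(n / beta) / n)) ** 2
print(lev.max(), d / n, bound)                           # max leverage vs d/n vs the bound
```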
Leverage scores of `\bH\bD\bA ` (outline)
- We assume WLOG that `\bA ` has orthonormal columns
`\implies\bH\bD\bA ` has orthonormal columns
$$(\bH\bD\bA )^\top\bH\bD\bA = \bA^\top\bD^\top\bH^\top\bH\bD\bA = \Iden$$
`\implies` it's enough to bound the row norms of `\bH\bD\bA ` (for a matrix with orthonormal columns, the leverage scores are the squared row norms)
- For fixed `i`, let `f(\bD )\equiv\norm{\be_i^\top\bH\bD\bA }`
- A function on diagonal matrices
- `\E[f(\bD )]\le\sqrt{d/n}` (next slide)
- `f` is convex, Lipschitz `\implies` tail estimate for `f(\bD )`
- Use union bound over rows `i`
`\E[f(\bD )]\le\sqrt{d/n}`
We have
$$\E[f(\bD )]^2 = \E[\norm{\be_i^\top\bH\bD\bA }]^2\le\E[\norm{\be_i^\top\bH\bD\bA }^2]$$
and
$$
\norm{\be_i^\top\bH\bD\bA }^2 = \sum_{j\in[d]} (\be_i^\top\bH\bD\bA_{*j})^2.
$$
Since
- each column `\bA_{*j}` is a unit vector
- `\be_i^\top\bH\bD ` is a vector of i.i.d. `\pm 1` entries, scaled by `1/\sqrt{n}`
we have `\E[(\be_i^\top\bH\bD\bA_{*j})^2] = \norm{\bA_{*j}}^2/n = 1/n`,
and
$$\E[f(\bD )]^2\le\sum_{j\in[d]} 1/n = d/n,$$
so `\E[f(\bD )]\le\sqrt{d/n}`, as claimed.
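A quick Monte Carlo sanity check of this bound (a sketch assuming NumPy/SciPy; the sizes are arbitrary):
```python
import numpy as np
from scipy.linalg import hadamard, qr

rng = np.random.default_rng(2)
n, d, i = 256, 8, 0
A, _ = qr(rng.standard_normal((n, d)), mode='economic')   # orthonormal columns
H = hadamard(n) / np.sqrt(n)
fs = [np.linalg.norm((H[i] * rng.choice([-1.0, 1.0], size=n)) @ A)  # f(D) = ||e_i^T H D A||
      for _ in range(2000)]
print(np.mean(fs), np.sqrt(d / n))   # empirical E[f(D)] is at most sqrt(d/n)
```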
Building a better `\bS `
- Sparse embeddings take `O(\nnz{\bA })` time to apply, but `m=O(d^2/\eps^2)` (or `O(d^{1+\gamma}+d/\eps)`, etc.) is big.
- `O(\nnz{\bA }/\eps)` to apply, with `m=O(d^{1+\gamma}/\eps^2)`, is possible
- For sparse embeddings with `s\gt 1` nonzeros per column
- Or: `O(\nnz{\bA }) +\tO(d^3/\eps^2)` to map to `m= \tO(d/\eps^2)`
- First apply a sparse embedding, then an SRHT (see the sketch after this list)
- Subspace embeddings compose
- Matrix product approximators compose
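A minimal sketch of the composition above (assuming NumPy/SciPy; `countsketch` and `srht` are illustrative helper names, and the sketch sizes are arbitrary rather than the theoretically prescribed ones): apply a sparse embedding in `O(\nnz{\bA })` time, then an SRHT to the small intermediate matrix.
```python
import numpy as np
from scipy.linalg import hadamard

def countsketch(A, m, rng):
    """Sparse embedding with one +-1 per column of S: computes S A in O(nnz(A)) time."""
    n = A.shape[0]
    buckets = rng.integers(0, m, size=n)            # hash each row of A to a sketch row
    signs = rng.choice([-1.0, 1.0], size=n)
    SA = np.zeros((m, A.shape[1]))
    np.add.at(SA, buckets, signs[:, None] * A)      # signed sums into buckets
    return SA

def srht(A, m, rng):
    """Dense illustrative SRHT (the number of rows of A must be a power of two)."""
    n = A.shape[0]
    H = hadamard(n) / np.sqrt(n)
    D = rng.choice([-1.0, 1.0], size=n)
    sample = rng.choice(n, size=m, replace=False)
    return np.sqrt(n / m) * (H @ (D[:, None] * A))[sample]

rng = np.random.default_rng(3)
A = rng.standard_normal((100_000, 20))
B = countsketch(A, 1024, rng)                       # first sketch: 1024 x 20
C = srht(B, 256, rng)                               # second sketch: 256 x 20
print(np.linalg.svd(A, compute_uv=False)[[0, -1]])  # extreme singular values of A ...
print(np.linalg.svd(C, compute_uv=False)[[0, -1]])  # ... roughly preserved: embeddings compose
```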