
### Sketching for Matrix Computations: Subsampled Randomized Hadamard Transforms

#### Ken Clarkson, IBM Research, Almaden

• These slides at
http://kenclarkson.org/G2S3_2015/

• In the original JL:
• Pick random orthogonal matrix \bQ
• Sketching matrix \bS\equiv the first m rows of \bQ
• O(mnd) time to compute \bS\bA
• A faster scheme: pick a random orthogonal matrix, but:
• "Less random"
• Faster to apply

The SRHT is a matrix $\bP\bH\bD$, applied on the left, where:

• \bD is diagonal, with independent uniform \pm 1 entries on the diagonal
• \bH is a Hadamard matrix (orthogonal, fast multiply)
• \bP\in\R^{m\times n} uniformly samples rows of \bH\bD
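
For concreteness, here is a minimal dense construction of such an \bS in NumPy (illustrative only, not the lecture's code; the \sqrt{n/m} rescaling is the usual convention that keeps the row sampling unbiased, and the sizes are arbitrary):

```python
import numpy as np
from scipy.linalg import hadamard   # n must be a power of 2

rng = np.random.default_rng(0)
n, m = 16, 4
D = np.diag(rng.choice([-1.0, 1.0], size=n))          # random signs on the diagonal
H = hadamard(n) / np.sqrt(n)                          # orthogonal Hadamard matrix
P = np.eye(n)[rng.choice(n, size=m, replace=False)]   # uniform row sampler
S = np.sqrt(n / m) * P @ H @ D                        # the SRHT, with the usual rescaling
print(S.shape)                                        # (4, 16)
```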

Hadamard matrices have a recursive construction

Let \bH_0\in\R^{1\times 1} be the matrix [1].
Let \bH_{i+1}\equiv\frac{1}{\sqrt{2}}\left[\begin{matrix}\bH_i & \bH_i\\\bH_i & -\bH_i\end{matrix}\right] for i\ge 0
So \begin{align*} \bH_1 & \equiv\frac{1}{\sqrt{2}}\left[\begin{matrix} 1 & 1\\ 1 & -1\end{matrix}\right], \bH_2 = \frac{1}{2}\left[\begin{matrix} 1 & 1 & 1 & 1\\ 1 & -1 & 1 & -1\\1 & 1 & -1 & -1\\1 & -1 & -1 & 1\\\end{matrix}\right] \end{align*}

• Orthogonality: easy to see inductively (with \bH_i^2 = \Iden) $$\bH_{i+1}^\top\bH_{i+1} = \bH_{i+1}^2 = \frac12\left[\begin{matrix}\bH_i^2+\bH_i^2 & \bH_i^2 -\bH_i^2\\\bH_i^2 -\bH_i^2 & \bH_i^2+\bH_i^2\end{matrix}\right] = \left[\begin{matrix}\bH_i^2 & 0\\ 0 & \bH_i^2\end{matrix}\right] = \Iden$$
• Fast multiply: suppose \bx =\left[\begin{smallmatrix} \bx' \\ \bx''\end{smallmatrix}\right]\in\R^{2^{i+1}}
• Where \bx ',\bx ''\in\R^{2^{i}}
• Then \bH_{i+1}\bx =\frac{1}{\sqrt{2}}\left[\begin{smallmatrix} \bH_i \bx' +\bH_i\bx'' \\ \bH_i\bx' - \bH_i\bx''\end{smallmatrix}\right],
so \bH_{i+1}\bx is computable in linear time from \bH_i\bx' and \bH_i\bx''; unrolling the recursion, \bH\bx takes O(n\log n) time for \bx\in\R^n
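
A small recursive sketch of this fast multiply (my own illustrative code; it assumes the input length is a power of 2):

```python
import numpy as np
from scipy.linalg import hadamard

def fht(x):
    """Apply the normalized Hadamard matrix H_k to x, len(x) = 2^k, in O(n log n)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if n == 1:
        return x
    half = n // 2
    top = fht(x[:half])        # H_{k-1} x'
    bot = fht(x[half:])        # H_{k-1} x''
    # linear-time combine, with the 1/sqrt(2) normalization
    return np.concatenate([top + bot, top - bot]) / np.sqrt(2)

# agrees with the explicit normalized Hadamard matrix on small sizes
x = np.random.default_rng(0).standard_normal(8)
assert np.allclose(fht(x), hadamard(8) / np.sqrt(8) @ x)
```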

### Analysis of Fast Hadamard (Outline)

There is $$m = O(\eps^{-2}(\sqrt{d}+\sqrt{\log(nd)})^2\log d) = \tO(d/\eps^2)$$ so that with failure probability O(1/d), \bP\bH\bD  is an \eps-embedding of \colspan(\bA ). \bP\bH\bD\bA  can be computed in O(nd\log m) time.
Proof outline:
1. Bound the leverage scores of \bH\bD\bA  by roughly O(d/n): each is at most $$\left(\sqrt{d/n}+\sqrt{8\log(n/\beta)/n}\right)^2$$ with failure probability \beta
2. Use leverage score sampling, paying a factor \eps^{-2}\log d.

Part 2 is proven elsewhere in these lectures.
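
As an illustrative check of the embedding guarantee above (not the lecture's code; the sizes, seed, and helper names `fht_mat`/`srht` are my own choices), apply the SRHT via the fast transform to an orthonormal basis and look at the singular values of the sketch, which should all be close to 1 for an \eps-embedding:

```python
import numpy as np

def fht_mat(X):
    """Normalized Hadamard transform of each column of X; X.shape[0] a power of 2."""
    X = X.astype(float).copy()
    n = X.shape[0]
    h = 1
    while h < n:                      # iterative butterfly, O(n log n) per column
        for start in range(0, n, 2 * h):
            a = X[start:start + h].copy()
            b = X[start + h:start + 2 * h]
            X[start:start + h] = a + b
            X[start + h:start + 2 * h] = a - b
        h *= 2
    return X / np.sqrt(n)

def srht(A, m, rng):
    """S A for S = sqrt(n/m) * P H D, in O(n d log n) time."""
    n, _ = A.shape
    signs = rng.choice([-1.0, 1.0], size=n)           # D
    HDA = fht_mat(signs[:, None] * A)                 # H D A via the fast transform
    rows = rng.choice(n, size=m, replace=False)       # P
    return np.sqrt(n / m) * HDA[rows]

rng = np.random.default_rng(1)
n, d, m = 4096, 20, 1024
U, _ = np.linalg.qr(rng.standard_normal((n, d)))      # orthonormal basis of colspan(A)
sv = np.linalg.svd(srht(U, m, rng), compute_uv=False)
print(sv.min(), sv.max())                             # both close to 1 for an eps-embedding
```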

### Leverage scores of \bH\bD\bA  (outline)

• We assume WLOG that \bA  has orthonormal columns
• Via obliviousness
\implies\bH\bD\bA  has orthonormal columns $$(\bH\bD\bA )^\top\bH\bD\bA = \bA^\top\bD^\top\bH^\top\bH\bD\bA = \bA^\top\bD^2\bA = \bA^\top\bA = \Iden$$ \implies its leverage scores are its squared row norms, so it's enough to bound the row norms of \bH\bD\bA
• For fixed i, let f(\bD )\equiv\norm{\be_i^\top\bH\bD\bA }
• A function on diagonal matrices
• \E[f(\bD )]\le\sqrt{d/n} (next slide)
• f is convex, Lipschitz \implies tail estimate for f(\bD )
• Use union bound over rows i

### \E[f(\bD )]\le\sqrt{d/n}

We have $$\E[f(\bD )]^2 = \E[\norm{\be_i^\top\bH\bD\bA }]^2\le\E[\norm{\be_i^\top\bH\bD\bA }^2]$$ and $$\norm{\be_i^\top\bH\bD\bA }^2 = \sum_{j\in[d]} (\be_i^\top\bH\bD\bA_{*j})^2.$$

Since

• each column \bA_{*j} is a unit vector
• \be_i^\top\bH\bD  is a vector of independent \pm 1 entries, scaled by 1/\sqrt{n}
we have \E[(\be_i^\top\bH\bD\bA_{*j})^2] = \norm{\bA_{*j}}^2/n = 1/n, and $$\E[f(\bD )]^2\le\sum_{j\in[d]} 1/n = d/n,$$ so \E[f(\bD )]\le\sqrt{d/n}, as claimed.
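
A quick Monte Carlo sanity check of this bound (illustrative only; the sizes and the fixed row index are arbitrary choices of mine):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, d, trials = 256, 8, 2000
A, _ = np.linalg.qr(rng.standard_normal((n, d)))       # orthonormal columns
H = hadamard(n) / np.sqrt(n)
i = 0                                                   # any fixed row index
f_vals = [np.linalg.norm((H[i] * rng.choice([-1.0, 1.0], size=n)) @ A)
          for _ in range(trials)]                       # samples of f(D) = ||e_i^T H D A||
print(np.mean(f_vals), np.sqrt(d / n))                  # empirical E[f(D)] vs sqrt(d/n)
```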

### Building a better \bS

• Sparse embeddings take O(\nnz{\bA }) time to apply, but the sketch size m=O(d^2/\eps^2), or O(d^{1+\gamma}+d/\eps), etc., is big.
• O(\nnz{\bA }/\eps) to apply, with m=O(d^{1+\gamma}/\eps^2), is possible
• For sparse embeddings with s\gt 1 nonzeros per column
• Or: O(\nnz{\bA }) +\tO(d^3/\eps^2) to map to m= \tO(d/\eps^2)
• First apply a sparse embedding, then an SRHT (a rough composition is sketched after this list)
• Subspace embeddings compose
• Matrix product approximators compose
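
A rough, self-contained sketch of that two-stage pipeline (a CountSketch-style sparse embedding followed by an SRHT; the intermediate and final sizes m1, m2, the helper names, and the dense Hadamard used here for brevity are my own illustrative choices):

```python
import numpy as np
from scipy.linalg import hadamard

def countsketch(A, m1, rng):
    """Sparse embedding: hash each row of A to one of m1 buckets with a random sign."""
    n, d = A.shape
    SA = np.zeros((m1, d))
    buckets = rng.integers(0, m1, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    np.add.at(SA, buckets, signs[:, None] * A)    # O(nnz(A)) work when A is sparse
    return SA

def srht(A, m, rng):
    """SRHT sketch; dense Hadamard for brevity (use the fast transform in practice)."""
    n, _ = A.shape
    H = hadamard(n) / np.sqrt(n)
    signs = rng.choice([-1.0, 1.0], size=n)
    rows = rng.choice(n, size=m, replace=False)
    return np.sqrt(n / m) * (H * signs)[rows] @ A

rng = np.random.default_rng(0)
A = rng.standard_normal((100_000, 10))
m1, m2 = 1024, 256                        # illustrative intermediate and final sizes
SA = srht(countsketch(A, m1, rng), m2, rng)
print(SA.shape)                           # (256, 10): the two embeddings compose
```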