\(
\newcommand\norm[1]{|| #1 ||}
\newcommand\nnz[1]{{\textrm{nnz}(#1)}}
\newcommand\E{\mathbf{E}}
\newcommand\Dist{\mathbf{Dist}}
\newcommand\reals{{\mathrm{I\!R}}}
\newcommand\R{{\mathrm{I\!R}}}
\newcommand\poly{{\textrm{poly}}}
\newcommand\rank{{\textrm{rank}}}
\newcommand\colspan{{\textrm{range}}}
\newcommand\diam{{\textrm{diam}}}
\newcommand\polylog{{\textrm{polylog}}}
\newcommand\rowspan{{\textrm{rowspan}}}
\newcommand\colbasis{{\textrm{colbasis}}}
\newcommand\argmin{{\mathrm{argmin}}}
\newcommand\Err{{\mathtt{NE}}}
\newcommand\eps{{\epsilon}}
\newcommand\bfone{{\mathbf{1}}}
\newcommand\Var{{\mathbf{Var}}}
\newcommand\stdev[1]{{\mathbf{stdev(#1)}}}
\newcommand\cgf{{\mathbf{CGF}}}
\newcommand\Iden{{\mathbf{I}}}
\newcommand\Prob{{\mathbf{Pr}}}
\DeclareMathOperator\subg{\mathbf{subG}}
\newcommand\cN{{\cal N}}
\newcommand\cS{{\cal S}}
\newcommand\cD{{\cal D}}
\newcommand\cM{{\cal M}}
\newcommand\cE{{\cal E}}
\newcommand\cC{{\cal C}}
\newcommand\KJL{K_{\mathbb{JL}}}
\newcommand\tO{{\tilde O}}
\newcommand\tx{{\tilde{\mathbf{x}}}}
\newcommand\tX{{\tilde{\mathbf{X}}}}
\newcommand\tY{{\tilde{\mathbf{Y}}}}
\newcommand\tA{{\tilde{\mathbf{A}}}}
\newcommand\bA{{\mathbf{A}}}
\newcommand\ba{{\mathbf{a}}}
\newcommand\bw{{\mathbf{w}}}
\newcommand\bW{{\mathbf{W}}}
\newcommand\bR{{\mathbf{R}}}
\newcommand\bB{{\mathbf{B}}}
\newcommand\bQ{{\mathbf{Q}}}
\newcommand\bZ{{\mathbf{Z}}}
\newcommand\bz{{\mathbf{z}}}
\newcommand\bU{{\mathbf{U}}}
\newcommand\bV{{\mathbf{V}}}
\newcommand\bX{{\mathbf{X}}}
\newcommand\bS{{\mathbf{S}}}
\newcommand\hS{\hat {\mathbf{S}}}
\newcommand\bs{{\mathbf{s}}}
\newcommand\bx{{\mathbf{x}}}
\newcommand\by{{\mathbf{y}}}
\newcommand\bb{{\mathbf{b}}}
\newcommand\bT{{\mathbf{T}}}
\newcommand\bP{{\mathbf{P}}}
\newcommand\bG{{\mathbf{G}}}
\newcommand\bY{{\mathbf{Y}}}
\newcommand\hY{\hat {\mathbf{Y}}}
\newcommand\bH{{\mathbf{H}}}
\newcommand\bD{{\mathbf{D}}}
\newcommand\be{{\mathbf{e}}}
\newcommand\br{{\mathbf{r}}}
\newcommand\bSig{{\mathbf{\Sigma}}}
\newcommand\bPi{{\mathbf{\Pi}}}
\newcommand\hphi{{\hat\phi}}
\)
Sketching for Matrix Computations:
Random Fourier Features
Ken Clarkson
IBM Research, Almaden
- These slides at
http://kenclarkson.org/G2S3_2015/
- Enable JavaScript, navigate as in PowerPoint
- Math may take some time to render properly; if it doesn't end up looking right, try reloading
Random Fourier features: kernels
- Sometimes `\Dist(\bx ,\by )` is only accurate/meaningful for close points
`\implies` better to have "all large distances equal"
- e.g. `\Dist'(\bx ,\by ) = \sqrt{1-\exp(-\Dist(\bx ,\by )^2/2)}`
- Or: starting with an appropriate kernel `K(\bx ,\by )`,
- (With `K(\bx ,\by )=K(\by ,\bx )` and `K(\bx ,\by )\ge 0` for all `\bx ,\by `)
- Distance `\Dist_K(\bx ,\by )\equiv\sqrt{K(\bx ,\bx ) + K(\by ,\by ) - 2K(\bx ,\by )}`
- e.g. the Gaussian kernel `K(\bx ,\by ) = \exp(-\norm{\bx -\by }^2/2)`
- So `\Dist_K(\bx ,\by ) = \sqrt{2}\Dist'(\bx ,\by )` for `\Dist(\bx ,\by ) = \norm{\bx -\by }`
*See here, or this book
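A quick numeric check of the last identity (a minimal numpy sketch; the test points are arbitrary): since `K(\bx ,\bx )=1` for the Gaussian kernel, `\Dist_K(\bx ,\by )^2 = 2-2K(\bx ,\by ) = 2\Dist'(\bx ,\by )^2`.
```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

K = lambda u, v: np.exp(-np.linalg.norm(u - v)**2 / 2)   # Gaussian kernel
dist_prime = np.sqrt(1 - np.exp(-np.linalg.norm(x - y)**2 / 2))
dist_K = np.sqrt(K(x, x) + K(y, y) - 2 * K(x, y))

# K(x,x) = K(y,y) = 1, so Dist_K^2 = 2 - 2 K(x,y) = 2 Dist'^2
assert np.isclose(dist_K, np.sqrt(2) * dist_prime)
```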
Random Fourier features:
kernel products and distances
Mercer kernels, roughly (NPH)
If for all finite `P= \{p_1\ldots p_n\}\subset\R^d`,
the matrix `\bW ` with `\bW_{ij}=K(p_i,p_j)` is positive semidefinite
Then there is `q` (possibly infinite) and
a map `\phi:\R^d\rightarrow\R^q` such that
$$
\begin{align*}
\phi(\bx )^\top\phi(\by ) & = K(\bx ,\by )
\\\Dist_K(\bx ,\by ) & = \norm{\phi(\bx )-\phi(\by )}
\end{align*}
$$
That is, `\Dist_K` is Euclidean.
(The eigenfunctions of `(T_K f)(\bx ) = \int f(\by ) K(\bx ,\by )\,d\by ` are `\phi_i(\bx )= \phi(\bx )_i`, extending
the eigenvectors of `\bW `)
- `\R^q\equiv` feature space, `\phi\equiv` feature map
- Since possibly `q= \infty`, use `K(\bx ,\by )` as a "black box" instead of `\phi(\bx )`
- the "kernel trick" (can be slow)
Random Fourier features:
a sketching goal
When `\Dist_K(\cdot,\cdot)` is Euclidean, there are sketches for it.
However: we want to short-circuit `\R^d\rightarrow\R^q\rightarrow\R^m`
Goal: find `\hat\phi :\R^d\rightarrow\R^m` with small `m` so that:
$$\norm{\hat\phi(\bx )-\hat\phi(\by )}\approx_\eps\norm{\phi(\bx )-\phi(\by )}$$
That is, an (implicit) approximate subspace embedding of feature space into `\R^m`
Random Fourier features:
some trig & calc
- When `Z\sim\cN(0,1)`, for `t\in\R`,
$$\begin{align*}
& \E[\exp(-iZt)] = \E[\cos(Zt) - i\sin(Zt)]\\ & = \E[\cos(Zt)] - i\cdot 0 = \exp(-t^2/2)
\end{align*}$$
- For large `t`, `Zt` wraps many times around `2\pi`, so `\cos(Zt)` averages to zero
- When `U\sim {\cal U}[0,2\pi]`, for `a,b\in\R`,
$$
\begin{align*}
& \color{red}{\E[\cos(a+U)\cos(b+U)]}
\\ & = \E[(\cos(a)\cos(U)-\sin(a)\sin(U))\\ & \quad\times (\cos(b)\cos(U)-\sin(b)\sin(U))]
\\ & = \E[\cos(U)^2]\cos(a)\cos(b)\\ & \quad +\E[\sin(U)^2]\sin(a)\sin(b)\\ & \quad +\E[\sin(U)\cos(U)](\ldots)
\\ & = {\scriptstyle\frac12}\cos(a)\cos(b) + {\scriptstyle\frac12}\sin(a)\sin(b) + 0\cdot (\ldots)
\\ & = \color{red}{{\scriptstyle\frac12}\cos(a-b)}.
\end{align*}
$$
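Both identities are easy to check by Monte Carlo; a minimal sketch (the values of `t`, `a`, `b` are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(2)
n = 10**6

# E[cos(Zt)] = exp(-t^2/2) for Z ~ N(0,1)
t = 1.3
Z = rng.standard_normal(n)
print(np.cos(Z * t).mean(), np.exp(-t**2 / 2))   # both ~ 0.43

# E[cos(a+U)cos(b+U)] = cos(a-b)/2 for U ~ Uniform[0, 2*pi]
a, b = 0.7, 2.1
U = rng.uniform(0, 2 * np.pi, n)
print((np.cos(a + U) * np.cos(b + U)).mean(), np.cos(a - b) / 2)
```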
Random Fourier features:
for the Gaussian kernel
For `\ba\in\R^d`,
`\bZ\in\R^d` with `\bZ\sim\cN(0,\Iden_d)`,
`\bZ^\top\ba\sim\norm{\ba}\cN(0,1)`
So:
Random Fourier Features, Gaussian kernel, expectation
Let
- `\bZ\in\R^d` with `\bZ\sim\cN(0,\Iden_d)`
- `U\sim{\cal U}[0,2\pi]`
- `\bx ,\by\in\R^d`
Then
`\hphi:\bx\mapsto\sqrt{2}\cos(\bZ^\top\bx + U)`
has
$$\begin{align*}
& \E[\hphi(\bx )\hphi(\by )]
\\ & = \E_\bZ\E_U[2\cos(\bZ^\top\bx + U)\cos(\bZ^\top\by + U)]
= \E_\bZ [\cos(\bZ^\top (\bx -\by ))]
\\ & \quad = \exp(-\norm{\bx -\by }^2/2)
\end{align*}$$
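A Monte Carlo check of this expectation, averaging the product of single features over many independent draws of `(\bZ , U)` (a minimal sketch; the dimension and perturbation size are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 10**6
x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)   # a nearby point

# n independent copies of hphi(.) = sqrt(2) cos(Z^T . + U)
Z = rng.standard_normal((n, d))        # rows Z ~ N(0, I_d)
U = rng.uniform(0, 2 * np.pi, n)
hx = np.sqrt(2) * np.cos(Z @ x + U)
hy = np.sqrt(2) * np.cos(Z @ y + U)

print((hx * hy).mean())                       # Monte Carlo estimate
print(np.exp(-np.linalg.norm(x - y)**2 / 2))  # K(x, y): should match
```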
Random Fourier features:
other kernels
More generally:
`K(\bx ,\by )` is translation-invariant if
$$K(\bx ,\by )=K(\bx +\bw,\by +\bw)\textrm{ for any }\bx ,\by ,\bw\in\R^d$$
We write `K(\bx -\by ) \equiv K(\bx -\by ,0) = K(\bx ,\by )`.
Translation-invariant kernels have `\hphi(\bx )` analogous to those
for Gaussian kernels.
`\hphi(\bx )` for translation-invariant (NPH)
If a Mercer kernel `K(\bx ,\by )` is translation-invariant,
then the Fourier transform `\tilde K(\bz)` of `K(\bx )` is a non-negative measure (Bochner's theorem).
With `K(\bx )` scaled so that `\tilde K(\bz)` is a probability distribution,
and `\bZ\sim\tilde K(\bz)`,
$$\E_\bZ [\cos(\bZ^\top\bx )] = K(\bx ).$$
(`\E_\bZ [\cos(\bZ^\top\bx )]` comes from the inverse Fourier transform of `\tilde K(\bz)`.)
Examples
Name | `K(\bx)` | `\tilde K(\bz)`
--- | --- | ---
Gaussian | `\exp(-\norm{\bx}^2/2)` | `(2\pi)^{-d/2}\exp(-\norm{\bz}^2/2)`
Laplacian | `\exp(-\norm{\bx}_1)` | `\prod_i \frac{1}{\pi(1+\bz_i^2)}`
Cauchy | `\prod_i \frac{2}{1+\bx_i^2}` | `\exp(-\norm{\bz}_1)`
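For instance, the Laplacian row says `\tilde K(\bz)` is a product of standard Cauchy densities, so `\bZ` has i.i.d. Cauchy coordinates; a minimal sketch checking `\E_\bZ [\cos(\bZ^\top\bx )] = K(\bx )`:
```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 10**6
x = rng.uniform(-1, 1, d)

# Laplacian kernel: tilde K = product of standard Cauchy densities,
# so draw each coordinate of Z i.i.d. standard Cauchy
Z = rng.standard_cauchy((n, d))

print(np.cos(Z @ x).mean())       # E[cos(Z^T x)]
print(np.exp(-np.abs(x).sum()))   # K(x) = exp(-||x||_1): should match
```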
Random Fourier features: concentration
For better concentration:
- `\bZ\in\R^{m\times d}` with i.i.d. rows `\bZ_{i*}\sim\tilde K(\bz)`
- `\bU\in\R^m` with `\bU_i\sim{\cal U}[0,2\pi]`
- `\hphi(\bx )_i\equiv\sqrt{2/m}\cos(\bZ_{i*}^\top\bx +\bU_i)`
Random Fourier feature concentration (NPH)
For
- `\cM\subset\R^d` of diameter `D_\cM`
- `\gamma_K` depending on `K(\cdot,\cdot)`
there is `m=O(\eps^{-2} d\log (D_\cM\gamma_K/\eps))`
so that with constant failure probability,
$$\sup_{\bx ,\by\in\cM} |\hphi(\bx )^\top\hphi(\by ) -\phi(\bx )^\top\phi(\by ) |\le\eps$$
- Bound is the same as for embedding a `d`-dimensional subspace
- ...or `d`-manifold, which `\phi(\R^d)` is
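A minimal sketch of the `m`-feature map and its empirical sup error over sampled pairs (here `\cM`, `m`, and the number of test points are arbitrary choices, not the constants of the theorem):
```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 5, 2000

def rff(X, Z, U):
    # hphi(x)_i = sqrt(2/m) cos(Z_i* . x + U_i), applied to rows of X
    return np.sqrt(2.0 / Z.shape[0]) * np.cos(X @ Z.T + U)

Z = rng.standard_normal((m, d))    # spectral samples, Gaussian kernel
U = rng.uniform(0, 2 * np.pi, m)

X = rng.uniform(-1, 1, (200, d))   # sample points from a bounded set M
F = rff(X, Z, U)                   # 200 x m feature matrix

sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K = np.exp(-sq / 2)                # exact kernel matrix
print(np.abs(F @ F.T - K).max())   # max error; typically a few / sqrt(m)
```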
Random Fourier feature concentration:
proof outline
- With `m` as given, Hoeffding implies
\begin{equation}\label{eq:rff}
|\hphi(\bx )^\top\hphi(\by ) -\phi(\bx )^\top\phi(\by ) |\le\eps
\end{equation}
with failure probability at most
$$\exp(-\eps^2m)\le (\eps/D_\cM\gamma_K)^d$$
- Use a union bound for \eqref{eq:rff} over pairs from `\cN\times\cN`, for `\cN` an `\eps`-net of `\cM`
- Use smoothness of `\phi` and `\hphi` to show that this implies \eqref{eq:rff} holds on all of `\cM`
Fastfood: SRHT-style Random Fourier features
- For the Gaussian kernel, replaces the i.i.d. Gaussian matrix with a product of Hadamard and random diagonal matrices
- Up to a scale factor:
$$\bS\bH\bG\bPi\bH\bB,$$
where:
- `\bB` is a diagonal sign matrix
- `\bH` is a Hadamard matrix
- `\bPi` is a permutation matrix
- `\bG` is a diagonal Gaussian matrix
- `\bS` is a random diagonal scaling matrix, fixing the row norms of `\bH\bG\bPi\bH\bB` to match those of a Gaussian matrix
- Computationally: compact and fast to apply
- Analysis, in outline, combines that for RFF and SRHT
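A minimal sketch of one `d\times d` Fastfood block, assuming `d` is a power of two and omitting the overall scale factor, as above; the rescaling in `\bS` (rows given `\chi_d`-distributed norms) follows the Fastfood construction:
```python
import numpy as np

rng = np.random.default_rng(6)
d = 8   # must be a power of 2 (pad x with zeros otherwise)

def fwht(v):
    # unnormalized fast Walsh-Hadamard transform, O(d log d)
    v = v.copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            a, b = v[i:i + h].copy(), v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return v

# the random diagonal/permutation factors, each O(d) to store
B = rng.choice([-1.0, 1.0], d)    # diagonal signs
Pi = rng.permutation(d)           # permutation
G = rng.standard_normal(d)        # diagonal Gaussians
# S rescales: rows of H G Pi H B all have norm sqrt(d)*||G||, so this
# gives each row of S H G Pi H B a chi_d-distributed norm
S = np.sqrt(rng.chisquare(d, d)) / (np.linalg.norm(G) * np.sqrt(d))

def fastfood(x):
    # (up to scaling) S H G Pi H B x, without forming any d x d matrix
    v = fwht(B * x)       # H B x
    v = G * v[Pi]         # G Pi H B x
    return S * fwht(v)    # S H G Pi H B x

x = rng.standard_normal(d)
print(fastfood(x))   # d Gaussian-like projections in O(d log d) time
```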