\( \newcommand\norm[1]{|| #1 ||} \newcommand\nnz[1]{{\textrm{nnz}(#1)}} \newcommand\E{\mathbf{E}} \newcommand\Dist{\mathbf{Dist}} \newcommand\reals{{\mathrm{I\!R}}} \newcommand\R{{\mathrm{I\!R}}} \newcommand\poly{{\textrm{poly}}} \newcommand\rank{{\textrm{rank}}} \newcommand\colspan{{\textrm{range}}} \newcommand\diam{{\textrm{diam}}} \newcommand\polylog{{\textrm{polylog}}} \newcommand\rowspan{{\textrm{rowspan}}} \newcommand\colbasis{{\textrm{colbasis}}} \newcommand\argmin{{\mathrm{argmin}}} \newcommand\Err{{\mathtt{NE}}} \newcommand\eps{{\epsilon}} \newcommand\bfone{{\mathbf{1}}} \newcommand\Var{{\mathbf{Var}}} \newcommand\stdev[1]{{\mathbf{stdev(#1)}}} \newcommand\cgf{{\mathbf{CGF}}} \newcommand\bfone{{\mathbb{1}}} \newcommand\Iden{{\mathbf{I}}} \newcommand\Prob{{\mathbf{Pr}}} \DeclareMathOperator\subg{\mathbf{subG}} \newcommand\cN{{\cal N}} \newcommand\cS{{\cal S}} \newcommand\cD{{\cal D}} \newcommand\cM{{\cal M}} \newcommand\cE{{\cal E}} \newcommand\cC{{\cal C}} \newcommand\KJL{K_{\mathbb{JL}}} \newcommand\tO{{\tilde O}} \newcommand\tx{{\tilde{\mathbf{x}}}} \newcommand\tX{{\tilde{\mathbf{X}}}} \newcommand\tY{{\tilde{\mathbf{Y}}}} \newcommand\tA{{\tilde{\mathbf{A}}}} \newcommand\bA{{\mathbf{A}}} \newcommand\ba{{\mathbf{a}}} \newcommand\bw{{\mathbf{w}}} \newcommand\bW{{\mathbf{W}}} \newcommand\bR{{\mathbf{R}}} \newcommand\bB{{\mathbf{B}}} \newcommand\bQ{{\mathbf{Q}}} \newcommand\bZ{{\mathbf{Z}}} \newcommand\bz{{\mathbf{z}}} \newcommand\bU{{\mathbf{U}}} \newcommand\bV{{\mathbf{V}}} \newcommand\bX{{\mathbf{X}}} \newcommand\bS{{\mathbf{S}}} \newcommand\hS{\hat {\mathbf{S}}} \newcommand\bs{{\mathbf{s}}} \newcommand\bx{{\mathbf{x}}} \newcommand\by{{\mathbf{y}}} \newcommand\bb{{\mathbf{b}}} \newcommand\bT{{\mathbf{T}}} \newcommand\bP{{\mathbf{P}}} \newcommand\bG{{\mathbf{G}}} \newcommand\bY{{\mathbf{Y}}} \newcommand\hY{\hat {\mathbf{Y}}} \newcommand\bH{{\mathbf{H}}} \newcommand\bD{{\mathbf{D}}} \newcommand\be{{\mathbf{e}}} \newcommand\br{{\mathbf{r}}} \newcommand\bSig{{\mathbf{\Sigma}}} \newcommand\bPi{{\mathbf{\Pi}}} \newcommand\hphi{{\hat\phi}} \)

Sketching for Matrix Computations:
Random Fourier Features

Ken Clarkson
IBM Research, Almaden

  • These slides at
  • Enable javascript, navigate as in powerpoint
  • Math may take some time to render properly; if it doesn't end up looking right, try reloading

Random Fourier features: kernels

  • Sometimes `\Dist(\bx ,\by )` is only accurate/meaningful for close points
    `\implies` better to have "all large distances equal"
  • e.g. `\Dist'(\bx ,\by ) = \sqrt{1-\exp(-\Dist(\bx ,\by )^2/2)}`
    • If `\Dist(\bx ,\by )\ge 3` or so, `\Dist'(\bx ,\by )\gt .99`.
    • If `\Dist(\bx ,\by )` is Euclidean, so is* `\Dist'(\bx ,\by )`
  • Or: starting with an appropriate kernel `K(\bx ,\by )`,
    • (With `K(\bx ,\by )=K(\by ,\bx )` and `K(\bx ,\by )\ge 0` for all `\bx ,\by `)
    • Distance `\Dist_K(\bx ,\by )\equiv\sqrt{K(\bx ,\bx ) + K(\by ,\by ) - 2K(\bx ,\by )}`
    • e.g. the Gaussian kernel `K(\bx ,\by ) = \exp(-\norm{\bx -\by }^2/2)`
      • So `\Dist_K(\bx ,\by ) = \sqrt{2}\Dist'(\bx ,\by )` for `\Dist(\bx ,\by ) = \norm{\bx -\by }`

*See here, or this book

Random Fourier features:
kernel products and distances

Mercer kernels, roughly (NPH)

If for all finite `P= \{p_1\ldots p_n\}\subset\R^d`,
the matrix `\bW ` with `\bW_{ij}=K(p_i,p_j)` is positive semidefinite

Then there is `q` (possibly infinite) and a map `\phi:\R^d\mapsto\R^q` such that $$ \begin{align*} \phi(\bx )^\top\phi(\by ) & = K(\bx ,\by ) \\\Dist_K(\bx ,\by ) & = \norm{\phi(\bx )-\phi(\by )} \end{align*} $$ That is, `\Dist_K` is Euclidean.

(The eigenfunctions of `T_K(f)) = \int f(\by ) K(\bx ,\by )` are `\phi_i(\bx )= \phi(\bx )_i`, extending the eigenvectors of `\bW `)

  • `R^q\equiv` feature space, `\phi\equiv` feature map
  • Since possibly `q= \infty`, use `K(\bx ,\by )` as a "black box" instead of `\phi(\bx )`
    • the "kernel trick" (can be slow)

Random Fourier features:
a sketching goal

When `\Dist_K(\cdot,\cdot)` is Euclidean, there are sketches for it.

However: we want to short-circuit `\R^d\rightarrow\R^q\rightarrow\R^m`

Goal: find `\hat\phi :\R^d\rightarrow\R^m` with small `m` so that: $$\norm{\hat\phi(\bx )-\hat\phi(\by )}\approx_\eps\norm{\phi(\bx )-\phi(\by )}$$

That is, an (implicit) approximate subspace embedding of feature space into `\R^m`

Random Fourier features:
some trig & calc

  • When `Z\sim\cN(0,1)`, for `t\in\R`, $$\begin{align*} & \E[\exp(-iZt)] = \E[\cos(Zt) + i\sin(Zt)]\\ & = \E[\cos(Zt)] + 0 = \exp(-t^2/2) \end{align*}$$
    • For large `t`, `\cos(Zt)` wraps around `2\pi`, averages to zero
  • When `U\sim {\cal U}[0,2\pi]`, for `a,b\in\R`, $$ \begin{align*} & \color{red}{\E[\cos(a+U)\cos(b+U)]} \\ & = \E[(\cos(a)\cos(U)-\sin(a)\sin(U))\\ & \quad\times (\cos(b)\cos(U)-\sin(b)\sin(U))] \\ & = \E[\cos(U)^2]\cos(a)\cos(b)\\ & \quad +\E[\sin(U)^2]\sin(a)\sin(b)\\ & \quad +\E[\sin(U)\cos(U)](\ldots) \\ & = {\scriptstyle\frac12}\cos(a)\cos(b) + {\scriptstyle\frac12}\sin(a)\sin(b) + 0\cdot (\ldots) \\ & = \color{red}{{\scriptstyle\frac12}\cos(a-b)}. \end{align*} $$

Random Fourier features:
for the Gaussian kernel

For `\ba\in\R^d`,
`\bZ\in\R^d` with `\bZ\sim\cN(0,\Iden_d)`,

Random Fourier Features, Gaussian kernel, expectation Let
  • `\bZ\in\R^d` with `\bZ\sim\cN(0,\Iden_d)`
  • `U\sim{\cal U}[0,2\pi]`
  • `\bx ,\by\in\R^d`

Then `\hphi:\bx\mapsto\sqrt{2}\cos(\bZ^\top\bx + U)`
has $$\begin{align*} & \E[\hphi(\bx )\hphi(\by )] \\ & = \E_\bZ\E_U[2\cos(\bZ^\top\bx + U)\cos(\bZ^\top\by + U)] = \E_\bZ [\cos(\bZ^\top (\bx -\by ))] \\ & \quad = \exp(-\norm{\bx -\by }^2/2) \end{align*}$$

Random Fourier features:
other kernels

More generally:
`K(\bx ,\by )` is translation-invariant if $$K(\bx ,\by )=K(\bx +\bw,\by +\bw)\textrm{ for any }\bx ,\by ,\bw\in\R^d$$ We write `K(\bx -\by ) \equiv K(\bx -\by ,0) = K(\bx ,\by )`.
Translation-invariant kernels have `\hphi(\bx )` analogous to those for Gaussian kernels.
`\hphi(\bx )` for translation-invariant (NPH)
If Mercer kernel `K(\bx ,\by )` is translation-invariant,
then the Fourier transform `\tilde K(\bz)` of `K(\bx )` is a non-negative measure.
With `K(\bx )` scaled so that `\tilde K(\bz)` is a probability distribution, and `\bZ\sim\tilde K(\bz)`, $$\E_\bZ [\cos(\bZ^\top\bx )] = K(\bx ).$$
(`\E_\bZ [\cos(\bZ^\top\bx )]` comes from the inverse Fourier transform of `\tilde K(\bz)`.)


Name `K(\bx)``\tilde K(\bz)`
Laplacian`\exp(-\norm{\bx}_1)``\prod_i \frac{1}{\pi(1+\bz_i^2)}`
Cauchy`\prod_i \frac{2}{1+\bx_i^2}``\exp(-\norm{\bz}_1)`

Random Fourier features: concentration

For better concentration:
  • `\bZ\in\R^{m\times d}` with `\bZ_{ij}\sim\tilde K(z)`
  • `\bU\in\R^m` with `\bU_i\sim\mathbf{U}[0,2\pi]`
  • `\hphi(\bx )_i\equiv\cos(\bZ_{i*}^\top\bx +\bU_i)`
Random Fourier feature concentration (NPH) For
  • `\cM\subset\R^d` of diameter `D_\cM`
  • `\gamma_K` depending on `K(\cdot,\cdot)`

there is `m=O(\eps^{-2} d\log (D_\cM\gamma_K/\eps))`
so that with constant failure probability, $$\sup_{\bx\in\cM} |\hphi(\bx )^\top\hphi(\bx ) -\phi(\bx )^\top\phi(\bx ) |\le\eps$$
  • Bound is the same as for embedding a `d`-dimensional subspace
  • ...or `d`-manifold, which `\phi(\R^d)` is

Random Fourier feature concentration:
proof outline

  1. With `m` as given, Hoeffding implies \begin{equation}\label{eq:rff} |\hphi(\bx )^\top\hphi(\bx ) -\phi(\bx )^\top\phi(\bx ) |\le\eps \end{equation} with failure probability at most $$\exp(-\eps^2m)\le (\eps/D_\cM\gamma_K)^d$$
  2. Use union bound for\eqref{eq:rff} over `\bx\in\cN`, for `\cN` an `\eps`-net
  3. Use smoothness of `\phi` and `\hphi` to show that this implies\eqref{eq:rff} holds in `\cM`

Further Reading

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in neural information processing systems. 2007.

Fast Food: SRHT Random Fourier features

  • For Gaussian kernel, replaces i.i.d. Gaussians with Hadamard and diagonal Gaussians
  • Up to a scale factor: $$\bS\bH\bG\bPi\bH\bB,$$ where:
    • `\bB` is a diagonal sign matrix
    • `\bH` is a Hadamard matrix
    • `\bPi` is a permutation matrix
    • `\bG` is a diagonal Gaussian matrix
    • `\bS` is a diagonal Gaussian matrix, re-scaling `\bG`
  • Computationally: compact and fast to apply
  • Analysis, in outline, combines that for RFF and SRHT