\( \newcommand\norm[1]{|| #1 ||} \newcommand\nnz[1]{{\textrm{nnz}(#1)}} \newcommand\E{\mathbf{E}} \newcommand\Dist{\mathbf{Dist}} \newcommand\reals{{\mathrm{I\!R}}} \newcommand\R{{\mathrm{I\!R}}} \newcommand\poly{{\textrm{poly}}} \newcommand\rank{{\textrm{rank}}} \newcommand\sr{{\textrm{sr}}} \newcommand\colspan{{\textrm{colspan}}} \newcommand\diam{{\textrm{diam}}} \newcommand\polylog{{\textrm{polylog}}} \newcommand\rowspan{{\textrm{rowspan}}} \newcommand\colbasis{{\textrm{colbasis}}} \newcommand\argmin{{\mathrm{argmin}}} \newcommand\Err{{\mathtt{NE}}} \newcommand\eps{{\epsilon}} \newcommand\bfone{{\mathbf{1}}} \newcommand\Var{{\mathbf{Var}}} \newcommand\stdev[1]{{\mathbf{stdev(#1)}}} \newcommand\cgf{{\mathbf{CGF}}} \newcommand\bfone{{\mathbb{1}}} \newcommand\Iden{{\mathbf{I}}} \newcommand\Prob{{\mathbf{Pr}}} \DeclareMathOperator\subg{\mathbf{subG}} \DeclareMathOperator\tr{\mathbf{tr}} \DeclareMathOperator\dist{\mathrm{dist}} \DeclareMathOperator\cost{\mathrm{cost}} \DeclareMathOperator\SF{\mathrm{SF}} \newcommand\cN{{\cal N}} \newcommand\cS{{\cal S}} \newcommand\cD{{\cal D}} \newcommand\cM{{\cal M}} \newcommand\cE{{\cal E}} \newcommand\cu{{\cal U}} \newcommand\cC{{\cal C}} \newcommand\KJL{K_{\mathbb{JL}}} \newcommand\tO{{\tilde O}} \newcommand\tL{{\tilde L}} \newcommand\maybemathbf[1]{\mathbf{#1}} \newcommand\matA{\maybemathbf{A}} \newcommand\matB{\maybemathbf{B}} \newcommand\matC{\maybemathbf{C}} \newcommand\matD{\maybemathbf{D}} \newcommand\matSigma{\maybemathbf{\Sigma}} \newcommand\matU{\maybemathbf{U}} \newcommand\matV{\maybemathbf{V}} \newcommand\matE{\maybemathbf{E}} \newcommand\matzero{\maybemathbf{0}} \newcommand\matW{\maybemathbf{W}} \newcommand\matZ{\maybemathbf{Z}} \newcommand\matT{\maybemathbf{T}} \newcommand\matY{\maybemathbf{Y}} \newcommand\matX{\maybemathbf{X}} \newcommand\matR{\maybemathbf{R}} \newcommand\matS{\maybemathbf{S}} \newcommand\matN{\maybemathbf{N}} \newcommand\matP{\maybemathbf{P}} \newcommand\vecx{\maybemathbf{x}} \newcommand\vecb{\maybemathbf{b}} \newcommand\hvecb{\hat{\maybemathbf{b}}} \newcommand\vecy{\maybemathbf{y}} \newcommand{\xdownarrow}[1]{{\left\downarrow\vbox to #1{}\right.\kern-\nulldelimiterspace}} \newcommand{\da}[1]{\Bigg\downarrow\raise.5ex\rlap{\scriptstyle#1}} \newcommand\tmatW{\tilde{\maybemathbf{W}}} \newcommand\tmatA{\tilde{\maybemathbf{A}}} \newcommand\tmatB{\tilde{\maybemathbf{B}}} \newcommand\tmatZ{\tilde{\maybemathbf{Z}}} \newcommand\tmatY{\tilde{\maybemathbf{Y}}} \newcommand\tmatX{\tilde{\maybemathbf{X}}} \newcommand\tvecx{\tilde{\maybemathbf{x}}} \newcommand\hmatS{{ \maybemathbf{\hat S}}} \newcommand\hmatR{{ \maybemathbf{\hat R}}} \newcommand\hmatA{{ \maybemathbf{\hat A}}} \newcommand\twomat[2]{\left[\begin{smallmatrix} #1 \\ #2 \end{smallmatrix} \right] } \newcommand\twomatr[2]{\left[\begin{smallmatrix} #1 & #2 \end{smallmatrix} \right] } \newcommand\tx{{\tilde{\mathbf{x}}}} \newcommand\tX{{\tilde{\mathbf{X}}}} \newcommand\tY{{\tilde{\mathbf{Y}}}} \newcommand\tA{{\tilde{\mathbf{A}}}} \newcommand\tW{{\tilde{\mathbf{W}}}} \newcommand\tZ{{\tilde{\mathbf{Z}}}} \newcommand\hS{{\hat {\mathbf{S}}}} \newcommand\hR{{\hat {\mathbf{R}}}} \newcommand\hA{{\hat {\mathbf{A}}}} \newcommand\hY{\hat {\mathbf{Y}}} \newcommand\bA{{\mathbf{A}}} \newcommand\bN{{\mathbf{N}}} \newcommand\ba{{\mathbf{a}}} \newcommand\bw{{\mathbf{w}}} \newcommand\W{{\mathbf{W}}} \newcommand\R{{\mathbf{R}}} \newcommand\bB{{\mathbf{B}}} \newcommand\bQ{{\mathbf{Q}}} \newcommand\bZ{{\mathbf{Z}}} \newcommand\bz{{\mathbf{z}}} \newcommand\bU{{\mathbf{U}}} 
\newcommand\bV{{\mathbf{V}}} \newcommand\bX{{\mathbf{X}}} \newcommand\bS{{\mathbf{S}}} \newcommand\bs{{\mathbf{s}}} \newcommand\bx{{\mathbf{x}}} \newcommand\by{{\mathbf{y}}} \newcommand\bb{{\mathbf{b}}} \newcommand\bT{{\mathbf{T}}} \newcommand\bP{{\mathbf{P}}} \newcommand\bG{{\mathbf{G}}} \newcommand\bY{{\mathbf{Y}}} \newcommand\bH{{\mathbf{H}}} \newcommand\bD{{\mathbf{D}}} \newcommand\be{{\mathbf{e}}} \newcommand\br{{\mathbf{r}}} \newcommand\Sig{{\mathbf{\Sigma}}} \newcommand\bLam{{\mathbf{\Lambda}}} \newcommand\bPi{{\mathbf{\Pi}}} \newcommand\hphi{{\hat\phi}} \newcommand\OPT{{\mathtt{OPT}}} \newcommand\sd{{\mathtt{sd}}} \)

Low-Rank PSD Approximation in Input-Sparsity Time

Ken Clarkson
IBM Research, Almaden


Joint with

David Woodruff
IBM Research, Almaden



Fitting data: low-rank approximation

  • Given: points `\matA_{i*}` as rows of matrix `\matA`
    Fit: a `k`-dimensional subspace `L^*` minimizing $$\cost(\matA,L) \equiv \sum_{i\in [n]} \dist(\matA_{i*},L)^2$$ among all `k`-dimensional subspaces
  • That is, $$\min_{\substack{\matY\in\R^{n\times k}\\ \matX\in\R^{k\times d}}} \norm{\matA - \matY\matX}_F^2$$
    • `\matY\matX` is rank-`k`
    • `L=\rowspan(\matX)`
    • `\norm{\matE}_F^2 \equiv \sum_{i,j} E_{ij}^2`
  • Solve using Singular Value Decomposition (SVD) (PCA, LSI, EigenX,...); a direct computation is sketched after this list
  • For fixed `\matY`, least-squares for each column of `\matX`
  • For fixed `\matX`, least-squares for each row of `\matY`
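
For concreteness, a minimal numpy sketch (not the paper's algorithm) that solves the objective above directly via the truncated SVD; sizes are made up:

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k fit via the truncated SVD: returns Y (n x k) and X (k x d)
    with A ~ Y X, where rowspan(X) is the fitted k-dimensional subspace L*."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]

A = np.random.default_rng(0).standard_normal((100, 30))
Y, X = best_rank_k(A, 5)
print(np.linalg.norm(A - Y @ X, "fro") ** 2)   # cost(A, L*) = ||A - Y X||_F^2
```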

Approximation, Input-sparsity time

Given `\eps\gt 0`, find `\tmatY,\tmatX` (with `k` columns, rows) with $$\norm{\matA - \tmatY\tmatX}_F^2 \le (1+\eps)\norm{\matA-\matA_k}_F^2,$$ a `(1+\eps)`-approximation. Here `\matA_k` is the best rank-`k` approximation.

With fixed probability near one,
a `(1+\eps)`-approximation can be found in $$ O(\nnz{\matA}) + \tO(k^2\eps^{-1}(n+d)) + \poly(\eps^{-1}k) $$ time.
`\nnz{\matA}\equiv` the Number of Non-Zero entries of `\matA`

Similar results hold for versions that are
regularized, robust, and/or kernelized

Positive semidefinite low-rank approximation

  • Suppose we want to reduce the dimension of a pointset,
    approximately preserving all dot products:
    • Pointset is rows of `\matB`
    • Find `\tmatY` such that ` \tmatY_{i*} \tmatY_{j*}^\top\approx \matB_{i*} \matB_{j*}^\top`
  • Sometimes `\matA = \matB\matB^\top` is the input


Positive Semidefinite (PSD): A square matrix `\matA` is PSD when `\vecx^\top \matA\vecx\ge 0` for all `\vecx`

`\matA=\matB\matB^\top` is PSD: `\vecx^\top\matB\matB^\top\vecx = (\matB^\top\vecx)^\top \matB^\top\vecx = \norm{\matB^\top\vecx}^2\ge 0`

Examples: covariance matrices, Laplacians, kernel matrices

In low-rank approximation `\matY\matX\approx\matA`,
for PSD need `\matX=\matY^\top`, want `\matY\matY^\top \approx \matB\matB^\top`

Non-PSD inputs

These arise due to roundoff error,
or via multi-dimensional scaling of not-very-Euclidean distances

Our results do not require PSD inputs, unlike many methods

Low-rank PSD approximation

Given `\matA\in\R^{n\times n}` and integer `k`, let $$\matA_{k,+}\equiv \matY\matY^\top, \text{ where } \matY \equiv \argmin_{\matY\in\R^{n\times k}} \norm{\matA - \matY\matY^\top}_F^2$$

Let `\matA^\text{sym} \equiv (\matA+\matA^\top)/2`
Then `\matA_{k,+} = \matA^\text{sym}_{k,+}`
`\matA = \matA^\text{sym} + \matA^\text{asym}`, where `\matA^\text{asym} \equiv (\matA-\matA^\top)/2`

When `\matB` is symmetric, $$\norm{\matB-\matA}_F^2 = \norm{\matB - \matA^\text{sym}}_F^2 + \norm{\matA^\text{asym}}_F^2$$ using `\tr(\matB - \matA^\text{sym})\matA^\text{asym} = 0` and Pythagorean theorem

Note that `\nnz{\matA^\text{sym}} \le 2\,\nnz{\matA}`
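
A quick numeric sanity check of the decomposition above (a throwaway example, not part of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))                      # arbitrary square input
B = rng.standard_normal((50, 50)); B = (B + B.T) / 2   # any symmetric B
A_sym, A_asym = (A + A.T) / 2, (A - A.T) / 2
lhs = np.linalg.norm(B - A, "fro") ** 2
rhs = np.linalg.norm(B - A_sym, "fro") ** 2 + np.linalg.norm(A_asym, "fro") ** 2
print(np.isclose(lhs, rhs))                            # True: the symmetric part suffices
```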

What is `\matA_{k,+}`?

Suppose symmetric `\matA` has eigen-decomposition `\matA=\matU\matD\matU^\top`
(`\matU` is orthogonal, `\matD` is diagonal)
Then `\matA_{k,+} = \matU\matD_{k,+}\matU^\top`
Using `\norm{\matU^\top \matB}_F = \norm{\matB\matU}_F = \norm{\matB}_F` for all `\matB` and all orthogonal `\matU`

`\matD_{k,+}` is diagonal, nonzero entries are top `k` nonnegative entries of `\matD`


So computing the top `k` nonnegative eigenvalues and eigenvectors suffices (but might be expensive)
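
As a reference point, a direct (non-sketched) computation of `\matA_{k,+}` via a full eigendecomposition; this is the expensive route the talk's results avoid:

```python
import numpy as np

def psd_low_rank(A, k):
    """Best rank-k PSD approximation A_{k,+} of (the symmetric part of) A,
    via a full eigendecomposition."""
    w, U = np.linalg.eigh((A + A.T) / 2)     # A_sym = U diag(w) U^T
    idx = np.argsort(w)[::-1][:k]            # k largest eigenvalues
    w_top = np.clip(w[idx], 0.0, None)       # keep only nonnegative ones
    return (U[:, idx] * w_top) @ U[:, idx].T

A = np.random.default_rng(0).standard_normal((100, 100))
print(np.linalg.norm(A - psd_low_rank(A, 5), "fro"))
```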

Our results, 1/2

Given `\matA\in\R^{n\times n}`, integer `k`, and `\eps\gt 0`, in $$O(\nnz{\matA} + n\poly(k/\eps))$$ time, we find:
a rank-`k` PSD matrix `\tmatA_{k,+}` such that $$\norm{\matA-\tmatA_{k,+}}_F \le (1+\eps)\norm{\matA - \matA_{k,+} }_F$$

In comparison:

[KMT]
Nyström method (uniform sampling) is very fast, but needs more assumptions for quality guarantees (e.g., incoherence)
[GM]
Similar speed, but quality bound is $$\norm{\matA - \matA'}_F \le \norm{\matA - \matA_{k,+}}_F + \eps\norm{\matA-\matA_{k,+}}_*$$ where `\norm{\cdot}_*` is the nuclear (trace) norm
[WLZ]
Running time `\Omega(n^2k/\eps)` and approximation has rank `\Omega(k/\eps)`

Our results, 2/2

Given symmetric `\matA\in\R^{n\times n}`, integer `k`, `\eps\gt 0`,
in $$O(\nnz{\matA}\log(n) + n\poly(\log(n)k/\eps))$$ time, we find:
`\qquad` a matrix `\matC` with `O(k/\eps)` columns from `\matA`,
`\qquad` a rank-`k` matrix `\matW`
such that $$\norm{\matA-\matC\matW\matC^\top}_F \le (1+\eps)\norm{\matA - \matA_{k,+} }_F$$

Related work: same as above

Sketching Matrices:
Oblivious Subspace Embeddings

Matrix `\matS\in\R^{m\times n}` is an `\eps`-embedding of set `P\subset\R^n` if $$\norm{\matS\vecy }= (1\pm\eps)\norm{\vecy }\text{ for all }\vecy\in P.$$ `\matS` will be called a "sketching matrix".
For `\matA\in\R^{n\times d}`,
a matrix `\matS ` is a subspace `\eps`-embedding for `\matA ` if `\matS ` is an `\eps`-embedding for `\colspan(\matA )= \{\matA\vecx\mid\vecx\in\R^d\}`. That is, $$\norm{\matS\matA\vecx }= (1\pm\eps)\norm{\matA\vecx }\text{ for all }\vecx\in\R^d.$$ `\matS\matA` will be called a "sketch".

An oblivious subspace embedding is:
  • A probability distribution `\cal D` over matrices `\matS\in\R^{m\times n}`, so that
  • For any unknown but fixed matrix `\matA`,
    `\matS ` is a subspace `\eps`-embedding for `\matA `, with high-enough probability.
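
A minimal sparse sketching matrix of this flavor (a CountSketch-style construction using numpy/scipy as a stand-in; the sizes and check are illustrative, not a proof of the embedding property):

```python
import numpy as np
import scipy.sparse as sp

def countsketch(m, n, rng):
    """Sparse sketching matrix: one random +/-1 per column, in a random row,
    so S @ A can be applied in O(nnz(A)) time."""
    rows = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    return sp.csr_matrix((signs, (rows, np.arange(n))), shape=(m, n))

# Rough check that norms of vectors in colspan(A) are roughly preserved.
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 10))
S = countsketch(400, 2000, rng)
x = rng.standard_normal(10)
print(np.linalg.norm(S @ (A @ x)) / np.linalg.norm(A @ x))   # close to 1
```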

Approach to (non-PSD) low-rank approximation

Using thin sketching matrices `\matR, \matS,\matS_\ell, \matS_r`:



$$\begin{align*} & \min\ \norm{\matA - \matY \matX }_F \\ & \qquad\qquad\Bigg\downarrow \\ & \min_{\rank(\matZ)=k}\ \norm{\matA - \matA\matR\, \matZ\, \matS\matA}_F \\ & \qquad\qquad\Bigg\downarrow \\ & \min_{\rank(\matZ)=k}\ \norm{ \matS_\ell (\matA - \matA\matR\, \matZ\, \matS\matA)\, \matS_r}_F \end{align*}$$

To get symmetry, we need `\matR=\matS^\top`,
but this breaks the analysis.
We need a better understanding of what sketches can do.

Sketch approximation guarantees

The approximation properties of sketches have long been used:

Let `\matP` project onto the rowspace of `\matS\matA`
For a large enough sketching matrix `\matS`, with constant prob.,
`\norm{\matA-\matA\matP}_F^2\le (1+\eps)\norm{\matA-\matA_k}_F^2`

Note: `\matP = (\matS\matA)^+\matS\matA`,
where `\matB^+` is the Moore-Penrose pseudo-inverse of `\matB`
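
A small numeric illustration of this classical guarantee, with a plain Gaussian sketch and made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k, m = 300, 200, 5, 60
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))             # nearly rank-k input
S = rng.standard_normal((m, n)) / np.sqrt(m)         # i.i.d. Gaussian sketch
P = np.linalg.pinv(S @ A) @ (S @ A)                  # projection onto rowspan(S A)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
best_k = (U[:, :k] * s[:k]) @ Vt[:k]                 # A_k
print(np.linalg.norm(A - A @ P, "fro"),              # close to the optimum below
      np.linalg.norm(A - best_k, "fro"))
```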


Here we use a sharper bound:

Spectral-Frobenius approx `\SF(\eps,k)`: Projection `\matP` has the `\SF(\eps,k)` property when
`\norm{\matA-\matA\matP}_2^2\le \frac{\eps}{k}\norm{\matA-\matA_k}_F^2`
Spectral norm: `\norm{\matA}_2 \equiv \sup_{\vecx} \norm{\matA\vecx}/\norm{\vecx}`

`\SF(\eps,k)` is stronger

`\SF(\eps,k)\implies` Frobenius: If `\matP` is `\SF(\eps,k)`,
then `\norm{\matA-\matA\matP}_F^2\le (1+O(\eps))\norm{\matA-\matA_k}_F^2`

By the Pythagorean theorem, $$\begin{align*} \norm{\matA(\Iden-\matP)}_F^2 & = \norm{(\matA-\matA_k)(\Iden - \matP)}_F^2 + \norm{\matA_k(\Iden - \matP)}_F^2 \\ & \le \norm{\matA-\matA_k}_F^2 + \norm{\matA_k(\Iden - \matP)}_F^2 \end{align*}$$

and
$$\begin{align*} \norm{\matA_k(\Iden - \matP)}_F^2 & \le k\,\norm{\matA_k(\Iden - \matP)}_2^2 & & \color{red}{\rank(\matA_k)=k} \\ & \le k\,\norm{\matA(\Iden - \matP)}_2^2 & & \color{red}{\norm{\matA\vecx}\ge\norm{\matA_k\vecx}\text{ for all }\vecx} \\ & \le k\,\frac{\eps}{k} \norm{\matA - \matA_k}_F^2 & & \color{red}{\matP\text{ is }\SF(\eps,k)} \\ & = \eps\,\norm{\matA - \matA_k}_F^2 \end{align*}$$
So
$$\norm{\matA(\Iden-\matP)}_F^2\le \norm{\matA-\matA_k}_F^2 + \norm{\matA_k(\Iden - \matP)}_F^2 \le (1+\eps)\norm{\matA-\matA_k}_F^2$$

`\SF(\eps,k)` projections give good solutions

`\SF(\eps,k)\implies` symmetric approx: If `\matP` is `\SF(\eps,k)`, then $$\norm{\matA - \matP \matA_{k,+}\matP}_F^2 \le (1+O(\eps)) \norm{\matA - \matA_{k,+}}_F^2$$ So the same upper bound holds for $$\tX := \argmin_{\matX\text{ rank-k PSD}} \norm{\matA - \matP\matX\matP}_F^2$$
The proof uses the fact that,
for `\matA,\matB` symmetric, `(\matA-\matB)\matB = 0`, and `\matP` a projection, $$\norm{\matA-\matP \matB \matP}_F^2 = \norm{\matA - \matB}_F^2 + \norm{\matB - \matP \matB \matP}_F^2 + 2\tr(\matA - \matB)(\Iden - \matP) \matB \matP$$ (via standard facts)

Letting `\matB := \matA_{k,+}`, so that `(\matA - \matA_{k,+}) \matA_{k,+} = 0`, $$\begin{align*} \norm{\matA-\matP \matA_{k,+} \matP}_F^2 & = \norm{\matA - \matA_{k,+}}_F^2 \\ & + \norm{ \matA_{k,+} - \matP \matA_{k,+} \matP}_F^2 \\ & + 2\tr(\matA - \matA_{k,+})(\Iden - \matP) \matA_{k,+} \matP \end{align*} $$ The latter two terms are small, via `\SF(\eps,k)`

Getting and using `\SF(\eps,k)` projections

Examples: `\SF(\eps,k)` projections via sketching. Via [CEMMP],
many "standard" sketching matrices `\matS`
yield `\SF(\eps,k)` projections `\matP_{\matS\matA} \equiv (\matS\matA)^+\matS\matA`:
  • i.i.d. Gaussian
  • OSEs (e.g., CountSketch): sketching takes `O(\nnz{\matA})` time
  • Subsampled Fast Hadamard Transform


So: $$\argmin_{\matX\text{ rank-k PSD}} \norm{\matA - \matP_{\matS\matA}\matX\matP_{\matS\matA}}_F^2$$ is a good solution. Since $$\matP_{\matS\matA}\matX\matP_{\matS\matA} = (\matS\matA)^\top\left[ (\matS\matA)^{+\top} \matX (\matS\matA)^+\right]\matS\matA,$$ and `\matZ = (\matS\matA)^{+\top} \matX (\matS\matA)^+` is rank-`k` PSD, it's enough to find $$\argmin_{\matZ\text{ rank-k PSD}} \norm{\matA - \matA\matS^\top \matZ \matS\matA}_F^2$$
We can use `\matR=\matS^\top`.
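
A dense toy check of this reduction (a Gaussian sketch stands in for a sparse OSE; `psd_low_rank` is the eigendecomposition routine from the earlier sketch, repeated inline so the snippet runs on its own):

```python
import numpy as np

def psd_low_rank(M, k):
    # rank-k PSD truncation via eigendecomposition (as in the earlier sketch)
    w, U = np.linalg.eigh((M + M.T) / 2)
    idx = np.argsort(w)[::-1][:k]
    return (U[:, idx] * np.clip(w[idx], 0.0, None)) @ U[:, idx].T

rng = np.random.default_rng(0)
n, k, m = 300, 4, 40
B = rng.standard_normal((n, k))
A = B @ B.T + 0.01 * rng.standard_normal((n, n)); A = (A + A.T) / 2
S = rng.standard_normal((m, n)) / np.sqrt(m)     # stand-in for a sparse OSE
P = np.linalg.pinv(S @ A) @ (S @ A)              # P_{SA}: projection onto rowspan(S A)
approx = psd_low_rank(P @ A @ P, k)              # equals P X P at the optimal rank-k PSD X
print(np.linalg.norm(A - approx, "fro"),
      np.linalg.norm(A - psd_low_rank(A, k), "fro"))   # near the direct A_{k,+} error
```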

Approach to low-rank approximation, for PSD

Almost as before:


$$\begin{align*} & \min\ \norm{\matA - \matY \matY^\top }_F \\ & \qquad\qquad\Bigg\downarrow \\ & \min_{\rank(\matZ)=k}\ \norm{\matA - \matA\matS^\top \matZ\, \matS\matA }_F \\ & \qquad\qquad\Bigg\downarrow \\ & \min_{\rank(\matZ)=k}\ \norm{ \matS_\ell (\matA - \matA\matS^\top \matZ\, \matS\matA)\, \matS_r}_F \end{align*}$$


However, `\matS_\ell \ne \matS_r`, and they must differ; what to do?

Approach to low-rank approximation, for PSD, revised

Express `\matA\matS^\top = \matU\matT`,
where `\matU` has nearly orthogonal columns, `\matT` is triangular


$$\begin{align*} & \min_{\rank(\matZ)=k}\ \norm{\matA - \matU\matT\, \matZ\, \matT^\top\matU^\top}_F \\ & \qquad\qquad\Bigg\downarrow \\ & \min_{\rank(\matW)=k}\ \norm{ \matS_\ell (\matA - \matU \matW \matU^\top)\, \matS_r}_F \end{align*}$$
Using the fact that `\matS_\ell\matU` and `\matU^\top \matS_r` are "almost" orthogonal,
this reduces to finding the best rank-`k` approx to
$$(\matS_\ell \matU)^+\matS_\ell\matA\matS_r(\matU^\top \matS_r)^+$$
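
A simplified dense sketch of this pipeline, under loose assumptions: Gaussian matrices stand in for the fast sketches, numpy's QR supplies `\matU`, and a rank-`k` PSD truncation (the `psd_low_rank` routine from earlier, repeated inline) stands in for the exact constrained solve:

```python
import numpy as np

def psd_low_rank(M, k):
    # rank-k PSD truncation via eigendecomposition (as in the earlier sketch)
    w, U = np.linalg.eigh((M + M.T) / 2)
    idx = np.argsort(w)[::-1][:k]
    return (U[:, idx] * np.clip(w[idx], 0.0, None)) @ U[:, idx].T

rng = np.random.default_rng(2)
n, k = 400, 5
m, ml, mr = 4 * k, 8 * k, 8 * k                      # sketch sizes (illustrative)
B = rng.standard_normal((n, k))
A = B @ B.T + 0.01 * rng.standard_normal((n, n)); A = (A + A.T) / 2

S  = rng.standard_normal((m, n)) / np.sqrt(m)        # stand-ins for the sparse sketches
Sl = rng.standard_normal((ml, n)) / np.sqrt(ml)
Sr = rng.standard_normal((n, mr)) / np.sqrt(mr)

U, T = np.linalg.qr(A @ S.T)                         # A S^T = U T, U has orthonormal columns
M = np.linalg.pinv(Sl @ U) @ (Sl @ A @ Sr) @ np.linalg.pinv(U.T @ Sr)
W = psd_low_rank(M, k)                               # rank-k PSD W
A_approx = U @ W @ U.T
print(np.linalg.norm(A - A_approx, "fro"),
      np.linalg.norm(A - psd_low_rank(A, k), "fro")) # compare to the direct A_{k,+}
```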

Low-rank PSD column selection

  • There are sampling matrices `\matS`,
    selecting `O(k/\eps)` rows,
    that are `\SF(\eps,k)`
    • Via [CEMMP], with some additional work
    • Uses [BSS]
  • All else as above

Also

  • For approximation that is only required to be symmetric (not necessarily PSD), the same runtime and guarantees hold
  • A slightly different performance guarantee:
Better approx for decaying spectra: Let `t\equiv 2k/\eps`.
A rank-`k` PSD matrix `\tA` can be found in $$O(\nnz{\matA} + n\poly(k/\eps))$$ time with $$\norm{\matA-\tA}_F^2 \le \norm{\matA-\matA_{k,+}}_F^2 + \norm{\matA_t - \matA_{t+k}}_F^2$$

Note that $$\norm{\matA_t - \matA_{t+k}}_F^2 = \sum_{t \lt i\le t+k} \sigma_i^2(\matA) \le \frac{k}{t-k}\sum_{j \gt k} \sigma_j^2(\matA) \le \eps \norm{\matA-\matA_k}_F^2,$$ since each `\sigma_i^2(\matA)` with `i \gt t` is at most the average of the `t-k` values `\sigma_{k+1}^2(\matA),\ldots,\sigma_t^2(\matA)`
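
A tiny numeric check of this tail bound, with a made-up spectrum:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, eps = 500, 4, 0.5
t = int(2 * k / eps)
s = np.sort(np.abs(rng.standard_normal(n)))[::-1]   # stand-in singular values, sorted
lhs = np.sum(s[t:t + k] ** 2)                       # ||A_t - A_{t+k}||_F^2
rhs = eps * np.sum(s[k:] ** 2)                      # eps * ||A - A_k||_F^2
print(lhs <= rhs)                                   # True
```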

Conclusions

  • First rank-`k` PSD approximation of an arbitrary matrix in `O(\nnz{\matA})` time
  • Optimal `O(k/\eps)` rows for rank-`k` PSD subset selection, in `O(\nnz{\matA}\log n)+\ldots` time
    • Shave off `\log`?
  • Other structured approximations?
    • PSD-rank, one-bit, NMF,...