\(
\newcommand\norm[1]{|| #1 ||}
\newcommand\nnz[1]{{\textrm{nnz}(#1)}}
\newcommand\E{\mathbf{E}}
\newcommand\Dist{\mathbf{Dist}}
\newcommand\reals{{\mathrm{I\!R}}}
\newcommand\R{{\mathrm{I\!R}}}
\newcommand\poly{{\textrm{poly}}}
\newcommand\rank{{\textrm{rank}}}
\newcommand\sr{{\textrm{sr}}}
\newcommand\colspan{{\textrm{colspan}}}
\newcommand\diam{{\textrm{diam}}}
\newcommand\polylog{{\textrm{polylog}}}
\newcommand\rowspan{{\textrm{rowspan}}}
\newcommand\colbasis{{\textrm{colbasis}}}
\newcommand\argmin{{\mathrm{argmin}}}
\newcommand\Err{{\mathtt{NE}}}
\newcommand\eps{{\epsilon}}
\newcommand\bfone{{\mathbf{1}}}
\newcommand\Var{{\mathbf{Var}}}
\newcommand\stdev[1]{{\mathbf{stdev(#1)}}}
\newcommand\cgf{{\mathbf{CGF}}}
\newcommand\Iden{{\mathbf{I}}}
\newcommand\Prob{{\mathbf{Pr}}}
\DeclareMathOperator\subg{\mathbf{subG}}
\DeclareMathOperator\tr{\mathbf{tr}}
\DeclareMathOperator\dist{\mathrm{dist}}
\DeclareMathOperator\cost{\mathrm{cost}}
\DeclareMathOperator\SF{\mathrm{SF}}
\newcommand\cN{{\cal N}}
\newcommand\cS{{\cal S}}
\newcommand\cD{{\cal D}}
\newcommand\cM{{\cal M}}
\newcommand\cE{{\cal E}}
\newcommand\cu{{\cal U}}
\newcommand\cC{{\cal C}}
\newcommand\KJL{K_{\mathbb{JL}}}
\newcommand\tO{{\tilde O}}
\newcommand\tL{{\tilde L}}
\newcommand\maybemathbf[1]{\mathbf{#1}}
\newcommand\matA{\maybemathbf{A}}
\newcommand\matB{\maybemathbf{B}}
\newcommand\matC{\maybemathbf{C}}
\newcommand\matD{\maybemathbf{D}}
\newcommand\matSigma{\maybemathbf{\Sigma}}
\newcommand\matU{\maybemathbf{U}}
\newcommand\matV{\maybemathbf{V}}
\newcommand\matE{\maybemathbf{E}}
\newcommand\matzero{\maybemathbf{0}}
\newcommand\matW{\maybemathbf{W}}
\newcommand\matZ{\maybemathbf{Z}}
\newcommand\matT{\maybemathbf{T}}
\newcommand\matY{\maybemathbf{Y}}
\newcommand\matX{\maybemathbf{X}}
\newcommand\matR{\maybemathbf{R}}
\newcommand\matS{\maybemathbf{S}}
\newcommand\matN{\maybemathbf{N}}
\newcommand\matP{\maybemathbf{P}}
\newcommand\vecx{\maybemathbf{x}}
\newcommand\vecb{\maybemathbf{b}}
\newcommand\hvecb{\hat{\maybemathbf{b}}}
\newcommand\vecy{\maybemathbf{y}}
\newcommand{\xdownarrow}[1]{{\left\downarrow\vbox to #1{}\right.\kern-\nulldelimiterspace}}
\newcommand{\da}[1]{\Bigg\downarrow\raise.5ex\rlap{\scriptstyle#1}}
\newcommand\tmatW{\tilde{\maybemathbf{W}}}
\newcommand\tmatA{\tilde{\maybemathbf{A}}}
\newcommand\tmatB{\tilde{\maybemathbf{B}}}
\newcommand\tmatZ{\tilde{\maybemathbf{Z}}}
\newcommand\tmatY{\tilde{\maybemathbf{Y}}}
\newcommand\tmatX{\tilde{\maybemathbf{X}}}
\newcommand\tvecx{\tilde{\maybemathbf{x}}}
\newcommand\hmatS{{ \maybemathbf{\hat S}}}
\newcommand\hmatR{{ \maybemathbf{\hat R}}}
\newcommand\hmatA{{ \maybemathbf{\hat A}}}
\newcommand\twomat[2]{\left[\begin{smallmatrix} #1 \\ #2 \end{smallmatrix} \right] }
\newcommand\twomatr[2]{\left[\begin{smallmatrix} #1 & #2 \end{smallmatrix} \right] }
\newcommand\tx{{\tilde{\mathbf{x}}}}
\newcommand\tX{{\tilde{\mathbf{X}}}}
\newcommand\tY{{\tilde{\mathbf{Y}}}}
\newcommand\tA{{\tilde{\mathbf{A}}}}
\newcommand\tW{{\tilde{\mathbf{W}}}}
\newcommand\tZ{{\tilde{\mathbf{Z}}}}
\newcommand\hS{{\hat {\mathbf{S}}}}
\newcommand\hR{{\hat {\mathbf{R}}}}
\newcommand\hA{{\hat {\mathbf{A}}}}
\newcommand\hY{\hat {\mathbf{Y}}}
\newcommand\bA{{\mathbf{A}}}
\newcommand\bN{{\mathbf{N}}}
\newcommand\ba{{\mathbf{a}}}
\newcommand\bw{{\mathbf{w}}}
\newcommand\W{{\mathbf{W}}}
\newcommand\bB{{\mathbf{B}}}
\newcommand\bQ{{\mathbf{Q}}}
\newcommand\bZ{{\mathbf{Z}}}
\newcommand\bz{{\mathbf{z}}}
\newcommand\bU{{\mathbf{U}}}
\newcommand\bV{{\mathbf{V}}}
\newcommand\bX{{\mathbf{X}}}
\newcommand\bS{{\mathbf{S}}}
\newcommand\bs{{\mathbf{s}}}
\newcommand\bx{{\mathbf{x}}}
\newcommand\by{{\mathbf{y}}}
\newcommand\bb{{\mathbf{b}}}
\newcommand\bT{{\mathbf{T}}}
\newcommand\bP{{\mathbf{P}}}
\newcommand\bG{{\mathbf{G}}}
\newcommand\bY{{\mathbf{Y}}}
\newcommand\bH{{\mathbf{H}}}
\newcommand\bD{{\mathbf{D}}}
\newcommand\be{{\mathbf{e}}}
\newcommand\br{{\mathbf{r}}}
\newcommand\Sig{{\mathbf{\Sigma}}}
\newcommand\bLam{{\mathbf{\Lambda}}}
\newcommand\bPi{{\mathbf{\Pi}}}
\newcommand\hphi{{\hat\phi}}
\newcommand\OPT{{\mathtt{OPT}}}
\newcommand\sd{{\mathtt{sd}}}
\)
Low-Rank PSD Approximation in Input-Sparsity Time
`\qquad\quad`
Ken Clarkson
IBM Research, Almaden
Joint with
David Woodruff
IBM Research, Almaden
Fitting data: low-rank approximation
- Given: points `\matA_{i*}` as rows of matrix `\matA`
- Fit: a `k`-dimensional subspace `L^*` minimizing
$$\cost(\matA,L) \equiv \sum_{i\in [n]} \dist(\matA_{i*},L)^2$$
among all `k`-dimensional subspaces
- That is,
$$\qquad\qquad\quad\min_{\substack{\matY\in\R^{n\times k}\\ \matX\in\R^{k\times d}}} \norm{\matA - \matY\matX}_F^2$$
- `\matY\matX` is rank-`k`
- `L=\rowspan(\matX)`
- `\norm{\matE}_F^2 \equiv \sum_{i,j} E_{ij}^2`
- Solve using Singular Value Decomposition (SVD) (PCA, LSI, EigenX,...)
- For fixed `\matY`, least-squares for each column of `\matX`
- For fixed `\matX`, least-squares for each row of `\matY`
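As a concrete illustration (not from the slides), a minimal numpy sketch of the SVD route; the matrix, rank, and sizes are arbitrary choices:

```python
import numpy as np

# The truncated SVD solves  min_{Y,X} ||A - Y X||_F^2  over rank-k factorizations.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))           # rows A_{i*} are the data points
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Y = U[:, :k] * s[:k]                         # n x k
X = Vt[:k, :]                                # k x d; the fitted subspace L is rowspan(X)

err = np.linalg.norm(A - Y @ X, 'fro')**2    # cost(A, L): sum of squared distances to L
print(err, np.sum(s[k:]**2))                 # the two values agree
```

The two alternating least-squares steps in the last bullets give an iterative alternative when a full SVD is too expensive.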
Approximation, Input-sparsity time
Given `\eps\gt 0`, find `\tmatY,\tmatX` (with `k` columns and `k` rows, respectively) with
$$\norm{\matA - \tmatY\tmatX}_F^2 \le (1+\eps)\norm{\matA-\matA_k}_F^2,$$
a `(1+\eps)`-approximation. Here `\matA_k` is the best rank-`k` approximation.
With fixed probability near one,
a `(1+\eps)`-approximation can be found in
$$
O(\nnz{\matA}) + \tO(k^2\eps^{-1}(n+d)) + \poly(\eps^{-1}k)
$$
time.
`\nnz{\quad}`
`\nnz{\matA}\equiv` the Number of Non-Zero entries of `\matA`
Similar results hold for versions that are
regularized, robust, and/or kernelized
Positive semidefinite low-rank approximation
- Suppose we want to reduce the dimension of a pointset,
approximately preserving all dot products:
- Pointset is rows of `\matB`
- Find `\tmatY` such that ` \tmatY_{i*} \tmatY_{j*}^\top\approx \matB_{i*} \matB_{j*}^\top`
- Sometimes `\matA = \matB\matB^\top` is the input
Positive Semidefinite (PSD)
Square matrix `\matA` is PSD when `\vecx^\top \matA\vecx\ge 0` for all `\vecx`
`\matA=\matB\matB^\top` is PSD: `\vecx^\top\matB\matB^\top\vecx = (\matB^\top\vecx)^\top \matB^\top\vecx = \norm{\matB^\top\vecx}^2\ge 0`
Covariance matrices, Laplacians, kernel matrices
In low-rank approximation `\matY\matX\approx\matA`,
for PSD need `\matX=\matY^\top`, want `\matY\matY^\top \approx \matB\matB^\top`
Non-PSD inputs arise:
- Due to roundoff error
- Via multi-dimensional scaling for not-very-Euclidean distances
Our results do not require PSD inputs, unlike many methods
Low-rank PSD approximation
Given `\matA\in\R^{n\times n}` and integer `k`, let
$$\matA_{k,+}\equiv \matY\matY^\top, \text{ where } \matY \equiv \argmin_{\matY\in\R^{n\times k}} \norm{\matA - \matY\matY^\top}_F^2$$
Let `\matA^\text{sym} \equiv (\matA+\matA^\top)/2`
Then `\matA_{k,+} = \matA^\text{sym}_{k,+}`
`\matA = \matA^\text{sym} + \matA^\text{asym}`, where `\matA^\text{asym} \equiv (\matA-\matA^\top)/2`
When `\matB` is symmetric,
$$\norm{\matB-\matA}_F^2 = \norm{\matB - \matA^\text{sym}}_F^2 + \norm{\matA^\text{asym}}_F^2$$
using `\tr(\matB - \matA^\text{sym})\matA^\text{asym} = 0` and Pythagorean theorem
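For completeness, the trace term vanishes because `\matB-\matA^\text{sym}` is symmetric and `\matA^\text{asym}` is antisymmetric:
$$\tr\bigl((\matB - \matA^\text{sym})\matA^\text{asym}\bigr) = \tr\bigl((\matA^\text{asym})^\top(\matB - \matA^\text{sym})^\top\bigr) = -\tr\bigl(\matA^\text{asym}(\matB - \matA^\text{sym})\bigr) = -\tr\bigl((\matB - \matA^\text{sym})\matA^\text{asym}\bigr),$$
so the trace equals its own negation and is `0`.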
Note that `\nnz{\matA^\text{sym}} \le 2\,\nnz{\matA}`
What is `\matA_{k,+}`?
Suppose symmetric `\matA` has eigen-decomposition `\matA=\matU\matD\matU^\top`
(`\matU` is orthogonal, `\matD` is diagonal)
Then `\matA_{k,+} = \matU\matD_{k,+}\matU^\top`
Using `\norm{\matU^\top \matB}_F = \norm{\matB\matU}_F = \norm{\matB}_F` for all `\matB` and all orthogonal `\matU`
`\matD_{k,+}` is diagonal, nonzero entries are top `k` nonnegative entries of `\matD`
So computing such top `k` eigenvalues/vectors suffices
`\qquad` (but might be expensive)
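A minimal numpy sketch of this recipe (illustrative only; it uses a full eigendecomposition, which, as noted, might be expensive):

```python
import numpy as np

def psd_rank_k(A, k):
    """Best rank-k PSD approximation A_{k,+}, via a full eigendecomposition
    of A^sym = (A + A^T)/2."""
    w, U = np.linalg.eigh((A + A.T) / 2)     # A^sym = U diag(w) U^T
    keep = np.argsort(w)[::-1][:k]           # indices of the top-k eigenvalues
    w_top = np.maximum(w[keep], 0)           # D_{k,+}: keep only nonnegative ones
    return (U[:, keep] * w_top) @ U[:, keep].T

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))            # neither symmetric nor PSD
print(np.linalg.norm(A - psd_rank_k(A, 3), 'fro'))
```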
Our results, 1/2
Given `\matA\in\R^{n\times n}`, integer `k`, and `\eps\gt 0`, in
$$O(\nnz{\matA} + n\poly(k/\eps))$$
time, we find:
a rank-`k` PSD matrix `\tmatA_{k,+}` such that
$$\norm{\matA-\tmatA_{k,+}}_F \le (1+\eps)\norm{\matA - \matA_{k,+} }_F$$
In comparison:
- [KMT]
- Nyström method (uniform sampling) is very fast, but needs more assumptions for quality guarantees
`\qquad` e.g. incoherence
- [GM]
- Similar speed, but quality bound is
$$\norm{\matA - \matA'}_F \le \norm{\matA - \matA_{k,+}}_F + \eps\norm{\matA-\matA_{k,+}}_*$$
where `\norm{\cdot}_*` is the nuclear (trace) norm
- [WLZ]
- Running time `\Omega(n^2k/\eps)` and approximation has rank `\Omega(k/\eps)`
Our results, 2/2
Given symmetric `\matA\in\R^{n\times n}`, integer `k`, `\eps\gt 0`,
in
$$O(\nnz{\matA}\log(n) + n\poly(\log(n)k/\eps))$$
time, we find:
`\qquad` a matrix `\matC` with `O(k/\eps)` columns from `\matA`,
`\qquad` a rank-`k` matrix `\matW`
such that
$$\norm{\matA-\matC\matW\matC^\top}_F \le (1+\eps)\norm{\matA - \matA_{k,+} }_F$$
Related work: same as above
Sketching Matrices:
Oblivious Subspace Embeddings
Matrix `\matS\in\R^{m\times n}` is an `\eps`-embedding of set `P\subset\R^n`
if
$$\norm{\matS\vecy }= (1\pm\eps)\norm{\vecy }\text{ for all }\vecy\in P.$$
`\matS` will be called a "sketching matrix".
For `\matA\in\R^{n\times d}`,
a matrix `\matS ` is a subspace `\eps`-embedding
for `\matA ` if `\matS ` is an `\eps`-embedding for `\colspan(\matA )= \{\matA\vecx\mid\vecx\in\R^d\}`. That is,
$$\norm{\matS\matA\vecx }= (1\pm\eps)\norm{\matA\vecx }\text{ for all }\vecx\in\R^d.$$
`\matS\matA` will be called a "sketch".
An oblivious subspace embedding is:
- A probability distribution `\cal D` over matrices `\matS\in\R^{m\times n}`, so that
- For any unknown but fixed matrix `\matA`,
`\matS ` is a subspace `\eps`-embedding for `\matA `, with high-enough probability.
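As an illustration (not from the slides), one standard sparse construction in the CountSketch family; applying it touches each nonzero of `\matA` once, so the cost is proportional to `\nnz{\matA}`. The function name and sizes here are arbitrary:

```python
import numpy as np

def countsketch(A, m, rng):
    """Apply a CountSketch-style sparse embedding S (m x n) to A (n x d):
    each row of A is added, with a random sign, to one random row of SA."""
    n = A.shape[0]
    h = rng.integers(0, m, size=n)           # hash each row to a bucket
    s = rng.choice([-1.0, 1.0], size=n)      # random sign per row
    SA = np.zeros((m, A.shape[1]))
    np.add.at(SA, h, s[:, None] * A)         # SA[h[i]] += s[i] * A[i]
    return SA

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 20))
SA = countsketch(A, m=400, rng=rng)          # 400 rows instead of 10000
x = rng.standard_normal(20)
print(np.linalg.norm(A @ x), np.linalg.norm(SA @ x))   # roughly equal norms
```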
Approach to (non-PSD) low-rank approximation
Using thin sketching matrices `\matR, \matS,\matS_\ell, \matS_r`:
$$\begin{align*}
&\min\ \norm{\matA - \matY \matX }_F \\
&\qquad\qquad\Bigg\downarrow \\
&\min_{\rank(\matZ)=k}\ \norm{\matA - \matA\matR \,\matZ\, \matS\matA}_F \\
&\qquad\qquad\Bigg\downarrow \\
&\min_{\rank(\matZ)=k}\ \norm{ \matS_\ell (\matA - \matA\matR \,\matZ\, \matS\matA) \matS_r}_F
\end{align*}$$
To get symmetry, we need `\matR=\matS^\top`,
but this breaks the analysis.
We need a better understanding of what sketches can do.
Sketch approximation guarantees
The approximation properties of sketches have long been used:
Let `\matP` project onto the rowspace of `\matS\matA`
For a large enough sketching matrix `\matS`, with constant prob.,
`\norm{\matA-\matA\matP}_F^2\le (1+\eps)\norm{\matA-\matA_k}_F^2`
Note: `\matP = (\matS\matA)^+\matS\matA`,
where `\matB^+` is the Moore-Penrose pseudo-inverse of `\matB`
Here we use a sharper bound:
Spectral-Frobenius approx `\SF(\eps,k)`
Projection `\matP` has the `\SF(\eps,k)` property when
`\norm{\matA-\matA\matP}_2^2\le \frac{\eps}{k}\norm{\matA-\matA_k}_F^2`
Spectral norm `\norm{\matA}_2`
`\norm{\matA}_2 \equiv \sup_{\vecx} \norm{\matA\vecx}/\norm{\vecx}`
`\SF(\eps,k)` is stronger
`\SF(\eps,k)\implies` Frobenius
If `\matP` is `\SF(\eps,k)`,
then
`\norm{\matA-\matA\matP}_F^2\le (1+O(\eps))\norm{\matA-\matA_k}_F^2`
By the Pythagorean theorem,
$$\begin{align*}
\norm{\matA(\Iden-\matP)}_F^2
& = \norm{(\matA-\matA_k)(\Iden - \matP)}_F^2 + \norm{\matA_k(\Iden - \matP)}_F^2
\\ & \le \norm{\matA-\matA_k}_F^2 + \norm{\matA_k(\Iden - \matP)}_F^2
\end{align*}$$
and
$$
\begin{align*}
\norm{\matA_k(\Iden - \matP)}_F^2
& \le k\,\norm{\matA_k(\Iden - \matP)}_2^2 & & \color{red}{\rank(\matA_k)=k}
\\ & \le k\,\norm{\matA(\Iden - \matP)}_2^2 & & \color{red}{\norm{\matA x}\ge\norm{\matA_k x}\text{ for all }x}
\\ & \le k\,\frac{\eps}{k} \norm{\matA - \matA_k}_F^2 & & \color{red}{\matP\text{ is }\SF(\eps,k)}
\\ & = \eps\,\norm{\matA - \matA_k}_F^2 &
\end{align*}
$$
So
$$\norm{\matA(\Iden-\matP)}_F^2\le \norm{\matA-\matA_k}_F^2 + \norm{\matA_k(\Iden - \matP)}_F^2 \le (1+\eps)\norm{\matA-\matA_k}_F^2$$
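A quick numerical sanity check (an illustration, not part of the talk): with an i.i.d. Gaussian sketch of a heuristically chosen size, the projection `\matP=(\matS\matA)^+\matS\matA` typically satisfies the Frobenius bound above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, eps = 500, 200, 5, 0.5

# A test matrix with decaying singular values, so the tail ||A - A_k||_F is nontrivial.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = (U * (1.0 / np.arange(1, d + 1))) @ V.T

m = int(2 * k / eps)                          # heuristic sketch size
S = rng.standard_normal((m, n)) / np.sqrt(m)  # i.i.d. Gaussian sketching matrix
P = np.linalg.pinv(S @ A) @ (S @ A)           # projection onto rowspan(SA)

s = np.linalg.svd(A, compute_uv=False)
tail = np.sum(s[k:] ** 2)                     # ||A - A_k||_F^2
err = np.linalg.norm(A - A @ P, 'fro') ** 2   # ||A - A P||_F^2
print(err / tail)                             # typically at most about 1 + eps
```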
`\SF(\eps,k)` projections give good solutions
`\SF(\eps,k)\implies` symmetric approx
If `\matP` is `\SF(\eps,k)`, then
$$\norm{\matA - \matP \matA_{k,+}\matP}_F^2 \le (1+O(\eps)) \norm{\matA - \matA_{k,+}}_F^2$$
So same upper bound holds for
$$\tX := \argmin_{\matX\text{ rank-k PSD}} \norm{\matA - \matP\matX\matP}_F^2$$
The proof uses the fact that,
for `\matA,\matB` symmetric, `(\matA-\matB)\matB = 0`, and `\matP` a projection,
$$\norm{\matA-\matP \matB \matP}_F^2 = \norm{\matA - \matB}_F^2 + \norm{\matB - \matP \matB \matP}_F^2 + 2\tr\bigl((\matA - \matB)(\Iden - \matP) \matB \matP\bigr)$$
(via standard facts)
Letting `\matB := \matA_{k,+}`, so that `(\matA - \matA_{k,+}) \matA_{k,+} = 0`,
$$\begin{align*}
\norm{\matA-\matP \matA_{k,+} \matP}_F^2 & = \norm{\matA - \matA_{k,+}}_F^2
\\ & + \norm{ \matA_{k,+} - \matP \matA_{k,+} \matP}_F^2
\\ & + 2\tr\bigl((\matA - \matA_{k,+})(\Iden - \matP) \matA_{k,+} \matP\bigr)
\end{align*}
$$
Latter two terms are small, via `\SF(\eps,k)`
Getting and using `\SF(\eps,k)` projections
Examples: `\SF(\eps,k)` projections via sketching
Via [CEMMP],
many "standard" sketching matrices `\matS`
yield `\SF(\eps,k)` projections `\matP_{\matS\matA} \equiv (\matS\matA)^+\matS\matA`:
- i.i.d. Gaussian
- OSEs (e.g., CountSketch): sketching takes `O(\nnz{\matA})`
- Subsampled Fast Hadamard Transform
So:
$$\argmin_{\matX\text{ rank-k PSD}} \norm{\matA - \matP_{\matS\matA}\matX\matP_{\matS\matA}}_F^2$$
is a good solution. Since
$$\matP_{\matS\matA}\matX\matP_{\matS\matA} = (\matS\matA)^\top\left[ (\matS\matA)^{+\top} \matX (\matS\matA)^+\right]\matS\matA,$$
and `\matZ = (\matS\matA)^{+\top} \matX (\matS\matA)^+` is rank-`k` PSD, it's enough to find
$$\argmin_{\matZ\text{ rank-k PSD}} \norm{\matA - \matA\matS^\top \matZ \matS\matA}_F^2$$
We can use `\matR=\matS^\top`.
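A minimal numpy sketch of this route (an illustration, not the paper's exact algorithm): a dense Gaussian sketch stands in for the fast sketches, `\matP_{\matS\matA}=\matQ\matQ^\top` for an orthonormal basis `\matQ` of the rowspan of `\matS\matA`, and the inner minimization is solved by taking the rank-`k` PSD part of `\matQ^\top\matA\matQ` (the further sketching used for speed is omitted):

```python
import numpy as np

def psd_rank_k_small(M, k):
    """Best rank-k PSD approximation of a small matrix (symmetrized first)."""
    w, U = np.linalg.eigh((M + M.T) / 2)
    keep = np.argsort(w)[::-1][:k]
    return (U[:, keep] * np.maximum(w[keep], 0)) @ U[:, keep].T

def sketch_psd_low_rank(A, k, m, rng):
    """Rank-k PSD approximation of the form P X P, with P projecting onto rowspan(SA).
    S is Gaussian here; a sparse OSE would bring the sketching cost to O(nnz(A))."""
    n = A.shape[0]
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    Q, _ = np.linalg.qr((S @ A).T)            # columns of Q span rowspan(SA)
    return Q @ psd_rank_k_small(Q.T @ A @ Q, k) @ Q.T

rng = np.random.default_rng(0)
B = rng.standard_normal((300, 40))
A = B @ B.T + 0.01 * rng.standard_normal((300, 300))   # a near-PSD input
A_hat = sketch_psd_low_rank(A, k=10, m=60, rng=rng)
print(np.linalg.norm(A - A_hat, 'fro'))
```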
Approach to low-rank approximation, for PSD
Almost as before:
$$\begin{align*}
&\min\ \norm{\matA - \matY \matY^\top }_F \\
&\qquad\qquad\Bigg\downarrow \\
&\min_{\rank(\matZ)=k}\ \norm{\matA - \matA\matS^\top \matZ\, \matS\matA }_F \\
&\qquad\qquad\Bigg\downarrow \\
&\min_{\rank(\matZ)=k}\ \norm{ \matS_\ell (\matA - \matA\matS^\top \matZ\, \matS\matA) \matS_r}_F
\end{align*}$$
However, the analysis needs `\matS_\ell \ne \matS_r`; what to do?
Approach to low-rank approximation, for PSD, revised
Express `\matA\matS^\top = \matU\matT`,
where `\matU` has nearly orthogonal columns, `\matT` is triangular
$$\begin{align*}
&\min_{\rank(\matZ)=k}\ \norm{\matA - \matU\matT \matZ \matT^\top\matU^\top}_F \\
&\qquad\qquad\Bigg\downarrow \\
&\min_{\rank(\matW)=k}\ \norm{ \matS_\ell (\matA - \matU \matW \matU^\top) \matS_r}_F
\end{align*}$$
Using the fact that `\matS_\ell\matU` and `\matU^\top \matS_r` are "almost" orthogonal,
this reduces to finding the best rank-`k` approx to
$$(\matS_\ell \matU)^+\matS_\ell\matA\matS_r(\matU^\top \matS_r)^+$$
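A minimal numpy sketch of this revised pipeline (an illustration under assumptions: sketch sizes are arbitrary, dense Gaussian sketches stand in for the fast ones, and the rank-`k` PSD part of the small matrix is taken since a PSD output is wanted):

```python
import numpy as np

def psd_rank_k_small(M, k):
    """Best rank-k PSD approximation of a small matrix (symmetrized first)."""
    w, U = np.linalg.eigh((M + M.T) / 2)
    keep = np.argsort(w)[::-1][:k]
    return (U[:, keep] * np.maximum(w[keep], 0)) @ U[:, keep].T

def psd_low_rank_revised(A, k, m, ml, mr, rng):
    """Illustrative pipeline: A S^T = U T (thin QR), then reduce with left/right
    sketches S_l, S_r and take the rank-k PSD part of the small matrix
    (S_l U)^+ (S_l A S_r) (U^T S_r)^+; return U W U^T."""
    n = A.shape[0]
    S  = rng.standard_normal((m, n)) / np.sqrt(m)    # stand-ins for fast sketches
    Sl = rng.standard_normal((ml, n)) / np.sqrt(ml)
    Sr = rng.standard_normal((n, mr)) / np.sqrt(mr)
    U, _ = np.linalg.qr(A @ S.T)                     # A S^T = U T; only U is needed
    M = np.linalg.pinv(Sl @ U) @ (Sl @ A @ Sr) @ np.linalg.pinv(U.T @ Sr)
    W = psd_rank_k_small(M, k)
    return U @ W @ U.T                               # rank-k, PSD

rng = np.random.default_rng(0)
B = rng.standard_normal((400, 30))
A = B @ B.T                                          # symmetric PSD test input
A_hat = psd_low_rank_revised(A, k=10, m=40, ml=80, mr=80, rng=rng)
print(np.linalg.norm(A - A_hat, 'fro'))
```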
Low-rank PSD column selection
- There are sampling matrices `\matS`,
selecting `O(k/\eps)` rows,
that are `\SF(\eps,k)`
- Via [CEMMP], taking some work
- Uses [BSS]
- All else as above
Also
- For (only) symmetric approximation, same runtime and guarantees
- A slightly different performance guarantee.
Better approx for decaying spectra
Let `t\equiv 2k/\eps`
A rank-`k` PSD matrix `\tA` can be found in
$$O(\nnz{\matA} + n\poly(k/\eps))$$
time with
$$\norm{\matA-\tA}_F^2 \le \norm{\matA-\matA_{k,+}}_F^2 + \norm{\matA_t - \matA_{t+k}}_F^2$$
Note that
$$\norm{\matA_t - \matA_{t+k}}_F^2 = \sum_{t < i\le t+k} \sigma_i^2(\matA) \le \frac{k}{t-k}\sum_{j > k} \sigma_j^2(\matA) \le\eps \norm{\matA-\matA_k}_F^2$$
since with `t = 2k/\eps`, `\frac{k}{t-k} = \frac{\eps}{2-\eps}\le\eps` for `\eps\le 1`
Conclusions
- First rank-`k` PSD approximation of an arbitrary matrix in `O(\nnz{\matA})` time
- Optimal `O(k/\eps)` rows for rank-`k` PSD subset selection, in `O(\nnz{\matA}\log n)+\ldots` time
- Other structured approximations?
- PSD-rank, one-bit, NMF,...