\(
\newcommand\norm[1]{|| #1 ||}
\newcommand\nnz[1]{{\textrm{nnz}(#1)}}
\newcommand\E{\mathbf{E}}
\newcommand\Dist{\mathbf{Dist}}
\newcommand\reals{{\mathrm{I\!R}}}
\newcommand\R{{\mathrm{I\!R}}}
\newcommand\poly{{\textrm{poly}}}
\newcommand\rank{{\textrm{rank}}}
\newcommand\colspan{{\textrm{range}}}
\newcommand\diam{{\textrm{diam}}}
\newcommand\polylog{{\textrm{polylog}}}
\newcommand\rowspan{{\textrm{rowspan}}}
\newcommand\colbasis{{\textrm{colbasis}}}
\newcommand\argmin{{\mathrm{argmin}}}
\newcommand\Err{{\mathtt{NE}}}
\newcommand\eps{{\epsilon}}
\newcommand\bfone{{\mathbf{1}}}
\newcommand\Var{{\mathbf{Var}}}
\newcommand\stdev[1]{{\mathbf{stdev(#1)}}}
\newcommand\cgf{{\mathbf{CGF}}}
\newcommand\Iden{{\mathbf{I}}}
\newcommand\Prob{{\mathbf{Pr}}}
\DeclareMathOperator\subg{\mathbf{subG}}
\DeclareMathOperator\dist{\mathrm{dist}}
\DeclareMathOperator\cost{\mathrm{cost}}
\newcommand\cN{{\cal N}}
\newcommand\cS{{\cal S}}
\newcommand\cD{{\cal D}}
\newcommand\cM{{\cal M}}
\newcommand\cE{{\cal E}}
\newcommand\cC{{\cal C}}
\newcommand\KJL{K_{\mathbb{JL}}}
\newcommand\tO{{\tilde O}}
\newcommand\tL{{\tilde L}}
\newcommand\tx{{\tilde{\mathbf{x}}}}
\newcommand\tX{{\tilde X}}
\newcommand\tY{{\tilde{\mathbf{Y}}}}
\newcommand\tA{{\tilde{\mathbf{A}}}}
\newcommand\bA{{\mathbf{A}}}
\newcommand\ba{{\mathbf{a}}}
\newcommand\bw{{\mathbf{w}}}
\newcommand\bW{{\mathbf{W}}}
\newcommand\bR{{\mathbf{R}}}
\newcommand\bB{{\mathbf{B}}}
\newcommand\bQ{{\mathbf{Q}}}
\newcommand\bZ{{\mathbf{Z}}}
\newcommand\bz{{\mathbf{z}}}
\newcommand\bU{{\mathbf{U}}}
\newcommand\bV{{\mathbf{V}}}
\newcommand\bX{{\mathbf{X}}}
\newcommand\bS{{\mathbf{S}}}
\newcommand\hS{{\hat S}}
\newcommand\hR{{\hat R}}
\newcommand\bs{{\mathbf{s}}}
\newcommand\bx{{\mathbf{x}}}
\newcommand\by{{\mathbf{y}}}
\newcommand\bb{{\mathbf{b}}}
\newcommand\bT{{\mathbf{T}}}
\newcommand\bP{{\mathbf{P}}}
\newcommand\bG{{\mathbf{G}}}
\newcommand\bY{{\mathbf{Y}}}
\newcommand\hY{\hat {\mathbf{Y}}}
\newcommand\bH{{\mathbf{H}}}
\newcommand\bD{{\mathbf{D}}}
\newcommand\be{{\mathbf{e}}}
\newcommand\br{{\mathbf{r}}}
\newcommand\bSig{{\mathbf{\Sigma}}}
\newcommand\bPi{{\mathbf{\Pi}}}
\newcommand\hphi{{\hat\phi}}
\newcommand\OPT{{\mathtt{OPT}}}
\)
Input Sparsity and Hardness for Robust Subspace Approximation
Ken Clarkson, David Woodruff
IBM Research, Almaden
Fitting Data: the Singular Value Decomposition
- Input `n` points `a_1,\ldots,a_n` in `d` dimensions
- Arranged as rows of `A\in\R^{n\times d}`
- Fit: a `k`-dimensional subspace `L^*` minimizing
$$\cost(A,L) \equiv \sum_{i\in [n]} \dist(a_i,L)^2$$
among all `k`-dimensional subspaces
- Solved by the SVD: `L^*` is the span of top `k` right singular vectors `V_k^\top`
- That is, there is a best rank-`k` `A_k=U_k\Sigma_k V_k^\top` with
`\cost(A,L^*) = \norm{A - A_k}_F^2 = \sum_{i\in [n]} \dist(a_i, a_iV_k V_k^\top)^2`
- Fast approximation algorithms:
can find `L'` with `\dim(L') = k` and `\cost(A,L')\le (1+\eps)\cost(A,L^*)`
in
$$O(\nnz{A} + (n+d)\poly(k/\eps))$$
time [CW13]; see also [NN13], [MM13]
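A minimal numpy sketch of the SVD baseline above (not the input-sparsity algorithm): the dimensions and noise level are illustrative choices, and the final line checks that the squared-distance cost matches the tail singular values.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 30, 5
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.1 * rng.standard_normal((n, d))          # low-rank signal plus noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k].T                                    # d x k, top-k right singular vectors

# cost(A, L*) = ||A - A_k||_F^2 = sum_i dist(a_i, a_i Vk Vk^T)^2
residual = A - A @ Vk @ Vk.T
cost = np.sum(np.linalg.norm(residual, axis=1) ** 2)
print(cost, np.sum(s[k:] ** 2))                  # the two values agree
```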
SVD is Not Very Robust
- Compared here against the subspace minimizing the sum of (unsquared) Euclidean distances
- A single point can exercise a lot of "leverage"
`M`-estimators for Regression
- Regression `M`-estimators: `\min_x \sum_i M(r_i)`,
for some function `M`,
where `r\equiv Ax-b`
- `M(z) = z^2`: `\ell_2^2` least-squares
`\min_x \norm{Ax-b}_2^2`
- `M(z) = |z|`: `\ell_1` regression
- `M(z) = \begin{cases}
z^2/2\tau & |z| \le \tau
\\ |z| - \tau/2 & |z| \ge \tau
\end{cases}`
- ...the Huber estimator [H64]
- Robust, yet smooth
- Tukey: even more robust
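For concreteness, a hedged sketch of Huber regression via iteratively reweighted least squares (IRLS), a standard heuristic for M-estimators rather than anything from this talk; `tau`, the iteration count, and the toy data are arbitrary choices.
```python
import numpy as np

def huber(z, tau):
    z = np.abs(z)
    return np.where(z <= tau, z**2 / (2 * tau), z - tau / 2)

def huber_irls(A, b, tau=1.0, iters=50):
    """Heuristically minimize sum_i M(r_i) for the Huber M, with r = Ax - b."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]         # least-squares start
    for _ in range(iters):
        r = A @ x - b
        # IRLS weight w_i = M'(r_i)/r_i: 1/tau inside, 1/|r_i| outside
        w = np.where(np.abs(r) <= tau, 1.0 / tau,
                     1.0 / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)
        x = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
b = A @ np.ones(5) + 0.05 * rng.standard_normal(200)
b[:10] += 50.0                                       # gross outliers
print(huber_irls(A, b))                              # close to the all-ones vector
```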
Robust Low-Rank Approximation
- What's a robust analog of `\norm{\cdot}_F^2`?
- Often want rotational invariance:
for all rotations `Q`, $$\cost(AQ, LQ) = \cost(A,L)$$
`\implies` independent of choice of basis.
But also
`\implies` can't use the cost `\sum_{i\in[n],j\in[d]} |W_{ij}|` for `W=A_k-A`
- Could use `\sum_{i\in[n]} M(\dist(a_i, L))`, e.g. `M(z)=z^p`
- Rotationally invariant, more robust when `p\lt 2`
- Proposed in ML and applications [DZHZ06][BWSDY09]
- For `V` an orthonormal basis for `L`, `\cost(A,L) = \cost(A,V) \equiv \sum_{i\in [n]} M(\dist(a_i, a_i V V^\top))` (see the sketch below)
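A minimal sketch of this rotationally invariant cost for `M(z)=z^p`; `V` is assumed to have orthonormal columns spanning `L`.
```python
import numpy as np

def robust_cost(A, V, p=1.0):
    """sum_i M(dist(a_i, L)) with M(z) = z^p; V: d x k orthonormal basis of L."""
    residual = A - A @ V @ V.T        # each a_i minus its projection onto L
    dists = np.linalg.norm(residual, axis=1)
    return np.sum(dists ** p)
```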
Prior theoretical work
Fix `\eps>0`, `p\in [1,\infty)`, `M(z) = z^p`
$$\OPT \equiv \min_{\dim{L}=k} \cost(A,L)$$
Subspace `\tL` is a `(1+\eps)`-approximation
if
- `\dim(\tL) = k`
- `\cost(A,\tL)\le (1+\eps)\OPT`
Bi-criteria: allow `\dim(\tL)\gt k`, say `\dim(\tL) = \poly(k/\eps)`
Weak coreset: `L'` with `L'\supset \tL` and `\dim(L')` small (maybe `\poly(k/\eps)`)
- [SV07] `(1+\eps)`-approximation in `nd\exp(\poly(k/\eps))` time
- [DTV10] `\sqrt{2}\gamma_p`-approximation, for `\gamma_p` a function of `p`
- [DV07] Weak coreset
- [DV07] Bi-criteria
- [GRSW12] For `p\gt 2`, NP-hard to get `\gamma_p`-approx
Our results: hardness
- For `p\in [1,2)`, NP-hard to find `(1+O(1/d))`-approximation
- That is, no `\poly(ndk/\eps)` algorithm unless `\mathrm{P}=\mathrm{NP}`
- With prior work, `\implies` singularity at `p=2`:
NP-hard unless `p=2+o(1)`
- Open: hardness when `\eps` is fixed
Our results: algorithms for `M(z) = z^p`
- For `p\in[1,2)`, an algorithm that with constant probability returns a `(1+\eps)`-approx in
$$O(\nnz{A} + (n+d)\poly(k/\eps) + \exp(\poly(k/\eps)))$$
time.
- "Constant probability" means, say, 9/10
- `\Omega(\nnz{A})` is unavoidable for `\eps<\infty`
- Hardness results suggest `\exp(\poly(k/\eps))` may be unavoidable
Our results: nice `M`-estimators
- Results for functions `M(z)` that are nice:
- Even: `M(z)=M(-z)`
- Monotone: `|a|\gt |b| \implies M(a)\gt M(b)`
- Bounded growth: there is `C_M` so that for `a\ge b\gt 0`,
$$C_M\frac{a}{b} \le \frac{M(a)}{M(b)} \le \frac{a^2}{b^2}$$
- Root-subadditive:
$$\sqrt{M(a+b)} \le \sqrt{M(a)} + \sqrt{M(b)}$$
- `M` functions in the literature are nice, or nearly so
- For vector `x`, `\norm{x}_M \equiv \sqrt{\sum_{i\in [n]} M(x_i)}` is almost a norm
- All properties except:
scale invariance: `\norm{\kappa x} = |\kappa|\cdot\norm{x}`
- But have
scale insensitive: `\sqrt{\kappa C_M} \norm{x}_M \le \norm{\kappa x}_M \le \kappa\norm{x}_M`, for `\kappa\ge 1`
- Can use `\eps`-net/`\eps`-cover arguments
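A small sketch that spot-checks the bounded-growth and root-subadditivity conditions numerically for the Huber function (with `tau = 1`, an arbitrary choice); the printed value is an empirical lower bound on `C_M`.
```python
import numpy as np

def M(z, tau=1.0):
    z = np.abs(z)
    return np.where(z <= tau, z**2 / (2 * tau), z - tau / 2)

rng = np.random.default_rng(1)
a = rng.uniform(1e-3, 10, 100000)
b = rng.uniform(1e-3, 10, 100000)
hi, lo = np.maximum(a, b), np.minimum(a, b)

growth = M(hi) / M(lo)
assert np.all(growth <= (hi / lo) ** 2 + 1e-9)          # upper growth bound
print("empirical C_M:", (growth / (hi / lo)).min())     # lower growth bound
assert np.all(np.sqrt(M(a + b)) <= np.sqrt(M(a)) + np.sqrt(M(b)) + 1e-9)
```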
Our results: algorithms for nice `M`-estimators
For a value `K=(\log n)^{O(\log k)}`:
- Reduce to a problem of size `\poly(K/\eps)` in time
$$O(\nnz{A}) + (n+d)\poly(K/\eps)$$
- Bi-criteria: find `K`-approx solution of dimension `\poly(k\log n)` in time
$$O(\nnz{A}) + (n+d)\poly(k)$$
- Weak(er) coreset: find `L` of dim `\poly(K/\eps)` that contains a `(1+\eps)`-approximation
Techniques: Overall Strategy
We use a series of row and column reductions that preserve approximate solutions
- Find a weak coreset and its subspace `L_\eps`
- Working within `L_\eps`, reduce to a "small" problem
- For `M(z)=z^p`: "small" = $\poly(k/\eps)\times \poly(k/\eps)$
- Solve small problem with algorithm taking $\exp$-time
- [BPR96], e.g.: solves systems of polynomial inequalities;
given:
- `c` polynomial inequalities
- of degree `q`
- in `z` variables
takes `\poly(cq)^z` time.
- Or, use some nice heuristic
Techniques
Outline of remaining talk:
- Row and column reduction methods for regression
- Application of those methods to low-rank approximation
- Focus on `M(z)=|z|`
- Most techniques also apply to nice `M(z)`, and to `p\ge 2`
- Let `\norm{W}_v` be the sum of the Euclidean norms of the rows:
$$\norm{W}_v\equiv \norm{W_{1,*}}_2 + \norm{W_{2,*}}_2 + \ldots + \norm{W_{n,*}}_2$$
- a.k.a. the `L_{2,1}` or `R^1` norm
- So `\cost(A,L) = \sum_{i\in [n]} \dist(a_i, a_iVV^\top)^1 = \norm{A-AVV^\top}_v`
- `L=\rowspan(V^\top)`, rows of `V^\top` are orthonormal
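A minimal sketch of `\norm{\cdot}_v` and the resulting `p=1` cost; `V` is again assumed to have orthonormal columns.
```python
import numpy as np

def v_norm(W):
    """Sum of Euclidean norms of the rows of W (the L_{2,1} norm)."""
    return np.sum(np.linalg.norm(W, axis=1))

def cost_l1(A, V):
    """cost(A, L) = ||A - A V V^T||_v for an orthonormal basis V of L."""
    return v_norm(A - A @ V @ V.T)
```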
Row and Column Reductions
The algorithm uses:
- Sampling matrices `S`, `S_\eps`, and `\hS`, with few rows
- Sketching matrices `R` and `\hR`, with few columns
- Use `S` to sample rows: `SA` has few rows
- Use `R` to compress rows: `AR` has few columns
- Each sampling and sketching matrix has different properties
The algorithm
Explanation to follow!
- Find a weak coreset and its subspace `L_\eps`
- Find bi-criteria `O(1)`-approximation
- Construct `R` and compute `AR`
- Construct `S` using leverage scores of `AR`
- Construct `S_\eps` using `\dist(a_i, \rowspan(SA))`
- `L_\eps\equiv \rowspan(S_\eps A)` is the weak coreset
- Working within `L_\eps`, reduce to a "small" problem
- Construct basis `U^\top` for `L_\eps`
- Construct `\hR`, and construct `\hS` from `AU` and `A\hR`
- Small problem is `\min_{\rank(Y)=k} \norm{\hS A U Y U^\top \hR - \hS A\hR}_v`
- Solve small problem with algorithm taking $\exp$-time
Reducing Rows: Sampling the rows
- `\norm{W}_v` is a sum, estimate with a sample
- Use weighted sampling
- Pick row `W_{i,*}` independently with probability `q_i`
- Weight by `1/q_i`
- Express as `S\in\R^{m\times n}`
- Each row of `S` is a scaled multiple of basis vector `e_i`
- Each row of `SW` is a scaled row of `W`
- `m` is a random variable, `\E[m] = \sum_{i\in [n]} q_i`
- For given `W`:
- `\E[\norm{SW}_v] = \norm{W}_v`, by construction
- `\norm{SW}_v=(1\pm\eps)\norm{W}_v` with high probability, for the right `q_i` and `\E[m]` large enough
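A hedged sketch of such a sampling matrix `S`: each row of `S` is a scaled standard basis vector, and the scaling by `1/q_i` makes `\norm{SW}_v` unbiased; choosing good `q_i` is the real work and is not shown here.
```python
import numpy as np

def sampling_matrix(q, rng):
    """S with rows (1/q_i) e_i^T for each kept i; q: probabilities in (0, 1]."""
    keep = rng.random(q.shape[0]) < q
    rows = np.flatnonzero(keep)
    S = np.zeros((rows.size, q.shape[0]))
    S[np.arange(rows.size), rows] = 1.0 / q[rows]
    return S

# E[||S W||_v] = sum_i q_i * (1/q_i) * ||W_{i,*}||_2 = ||W||_v
```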
Sampling rows for Regression
With the right `q_i` and sample size, with probability `9/10`:
- For all `Y`, `\norm{SAY}_v = (1\pm\eps)\norm{AY}_v`
- A kind of subspace embedding:
embedding `\{AY\mid Y\in\R^{d\times d'}\}\subset\R^{n\times d'}`
in `\R^{m\times d'}`
- Or approximate regression:
For `\OPT = \min_Y \norm{AY-B}_v`,
`Y^*_S \equiv \argmin_{Y}\norm{S(AY-B)}_v` has
`\norm{AY^*_S-B}_v \le (1+\eps)\OPT`
- Multiple-response regression, `B` is a matrix not a vector
- Follows from a lopsided embedding:
`\norm{S(AY-B)}_v\ge (1-\eps)\norm{AY-B}_v` for all `Y`, and
`\norm{S(AY^*-B)}_v\le (1+\eps)\norm{AY^*-B}_v` for `Y^*` the minimizer
- Approximate regression for sampling and `\norm{\cdot}_v`:
- `q_i\propto` leverage scores: large constant `\eps`, `m=\poly(d)`
- `q_i\propto` residuals: can get small `\eps`
- residual of `a_i` = its cost under a `C`-approx solution
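For the leverage-score case, a small sketch of how the `q_i` could be computed from a thin QR factorization of the tall matrix `AR` (this assumes full column rank); `t` is a caller-chosen oversampling parameter, not a value from the talk.
```python
import numpy as np

def leverage_scores(B):
    """Row leverage scores of a tall, full-column-rank B via reduced QR."""
    Q, _ = np.linalg.qr(B)                  # B = Q R_, Q has orthonormal columns
    return np.sum(Q ** 2, axis=1)           # l_i = ||Q_{i,*}||_2^2

def leverage_probs(B, t):
    """q_i proportional to leverage scores, clipped to 1; t controls E[m]."""
    l = leverage_scores(B)
    return np.minimum(1.0, t * l / l.sum())
```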
Reducing Columns: Sketching the rows
- `\norm{W}_v` summands are `\ell_2` norms of rows
- Compress to rows with about the same norm
- As in JL, random projections
- Express as `R\in\R^{d\times m}`
- Sketched version of matrix `W` is `WR`
- Each row of `WR` is a compressed row of `W`
- e.g., `R` with i.i.d. `N(0,1/m)` entries: `\norm{W_{i,*} R}_2 = (1\pm\eps)\norm{W_{i,*}}_2`
- With high probability, if `m\approx\log(n)/\eps^2`
- For given `W`:
- `\E[\norm{WR}_v] = \norm{W}_v`, by construction
- `\norm{WR}_v=(1\pm\eps)\norm{W}_v`, w.h.p, for enough columns
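A small numerical sketch of the Gaussian case: `m` is set to roughly `\log(n)/\eps^2` (the constant is illustrative), and the printed ratios should cluster around 1.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 1000, 2000, 0.5
m = int(np.ceil(4 * np.log(n) / eps ** 2))        # constant 4 is an arbitrary choice

W = rng.standard_normal((n, d))
R = rng.standard_normal((d, m)) / np.sqrt(m)      # i.i.d. N(0, 1/m) entries
ratios = np.linalg.norm(W @ R, axis=1) / np.linalg.norm(W, axis=1)
print(ratios.min(), ratios.max())                 # concentrated near 1
```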
Sketching rows for regression
With the right sketching method and dimension, with probability `9/10`:
- For all `X`, `\norm{XAR}_v \approx \norm{XA}_v`
- Subspace embedding (here, not new)
- Approximate regression:
For `\OPT = \min_X \norm{XA-B}_v`,
`X^*_R \equiv \argmin_{X}\norm{(XA-B)R}_v` has
`\norm{X^*_R A-B}_v \le (1+\eps)\OPT`
- When `R` has `\poly(\rank(A)/\eps)` columns
(need here `\rank(A)\ll d`)
- New structural result: formerly only for `\norm{\cdot}_F`
- Helpful properties:
- Many constructions are oblivious: don't look at input data
- Can include constraints:
- For given set of matrices `\cC`, such as `\cC=\{X\mid \rank(X)=k\}`
- `X^*_R \equiv \argmin_{X\color{red}{\in\cC}}\norm{(XA-B)R}_v`
good for
`\min_{X\color{red}{\in\cC}} \norm{XA-B}_v`
- Speed!
Faster sketching: sparse embeddings
- Sketching matrix `R` can be sparse embedding
- Adaptation of
CountSketch
from streaming algorithms
- `R` looks like:
$\qquad\qquad\left[\begin{matrix}
0 & -1 & 0
\\ +1 & 0 & 0
\\ 0 & 0 & +1
\\ 0 & -1 & 0
\\ 0 & +1 & 0
\\ 0 & 0 & +1
\\ -1 & 0 & 0
\end{matrix}
\right]$
- One random `\pm 1` per row, column randomly chosen
- Each column of `WR` is sum of `\pm W_{*,j}`
- Takes `O(\nnz{W})` time to compute `WR`
- [CW13] introduced sparse embeddings for low-rank approx, used here too
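A hedged sketch of such a sparse embedding, built with scipy so that applying it to a sparse `W` really does take time about `\nnz{W}`; the dimensions below are placeholders.
```python
import numpy as np
from scipy import sparse

def sparse_embedding(d, m, rng):
    """R in R^{d x m}: one random +/-1 per row, in a uniformly random column."""
    cols = rng.integers(0, m, size=d)
    signs = rng.choice([-1.0, 1.0], size=d)
    return sparse.csr_matrix((signs, (np.arange(d), cols)), shape=(d, m))

rng = np.random.default_rng(0)
R = sparse_embedding(d=10_000, m=50, rng=rng)
W = sparse.random(2_000, 10_000, density=1e-3, random_state=0, format="csr")
WR = W @ R            # each column of WR is a signed sum of columns of W
```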
Using the techniques: column reduction
- Suppose `A_k= \argmin_{\rank(Z)=k} \norm{Z-A}_v`,
`\OPT\equiv \norm{A_k-A}_v`
- The minimizer of `\min_X \norm{XA_k-A}_v` is `X^*=I_n`
- Suppose `R` yields approximate regression for
`\min_X \norm{XA_k - A}_v`
- Need `\poly(k/\eps)` columns
- `R` is oblivious to the unknown `A_k`
- Let `X_R^* = \argmin_X \norm{(XA_k-A)R}_v`. We have:
- `\norm{X_R^* A_k-A}_v \le (1+\eps)\OPT` (with const. prob.)
- Row `i` of `X_R^*` solves `\min_{x\in\R^n} \norm{x^\top A_kR - A_{i,*}R}_2`
- That is, `X^*_R = AR(A_kR)^+`
- So:
`\begin{align*}
\min_{\rank(Y)=k}\norm{ARY-A}_v
& \le \norm{AR(A_kR)^+A_k-A}_v
\\ & = \norm{X_R^* A_k - A}_v
\\ & \le (1 + \eps)\OPT
\end{align*}`
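A numeric sanity check of this chain of inequalities, hedged: `A_k` below is the Frobenius-optimal rank-`k` approximation, used only as a convenient rank-`k` matrix since the `\norm{\cdot}_v`-optimal one has no closed form; the point is that `Y = (A_kR)^+A_k` is feasible for the sketched problem, so the sketched minimum is at most the first printed value.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, m = 100, 60, 4, 30
A = rng.standard_normal((n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = (U[:, :k] * s[:k]) @ Vt[:k]             # a convenient rank-k stand-in for A_k

def v_norm(W):
    return np.sum(np.linalg.norm(W, axis=1))

R = rng.standard_normal((d, m)) / np.sqrt(m)
Y = np.linalg.pinv(Ak @ R) @ Ak              # rank <= k, hence feasible
print(v_norm(A @ R @ Y - A))                 # = ||X_R^* A_k - A||_v
print(v_norm(Ak - A))                        # compare with the unsketched cost
```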
Using the techniques: row reduction
`R` approximates regression for `A_k, A`
`\implies`
`\min_{\rank(Y)=k}\norm{ARY-A}_v \le (1+\eps)\OPT`
- Too slow to solve for `Y` directly
- Apply leverage score sampling to thin matrix `AR`
- But that only gives a constant factor approximation
- So `Y_S^* \equiv \argmin_{\rank(Y)=k} \norm{S(ARY-A)}_v`
has
`\norm{ARY_S^*-A}_v \le C \min_{\rank(Y)=k}\norm{ARY-A}_v`
- Observation: WLOG the rows of `Y_S^*` lie in `\rowspan(SA)`
- Let $P_{SA}$ be the matrix projecting onto `\rowspan(SA)`
- Replacing `SARY_S^*` with `SARY_S^*P_{SA}` can only move it closer to `SA`, row by row
- So `ARY_S^*` is a `C`-approx solution in `\rowspan(SA)`
`S` approximates regression for `AR, A`
`\implies`
`L_{SA}\equiv\rowspan(SA)` is a good `C`-approx subspace
Sampling
Our story thus far:
- Find a weak coreset and its subspace `L_\eps`
- Find bi-criteria `O(1)`-approximation:
- Construct `R` and compute `AR`
- Construct `S` using leverage scores of `AR`
- Construct `S_\eps` using `\dist(a_i, \rowspan(SA))`
- `L_\eps\equiv \rowspan(S_\eps A)` is the weak coreset
- Working within `L_\eps`, reduce to a "small" problem
- Solve small problem with algorithm taking $\exp$-time
Constructing a weak coreset
- Use residual sampling:
`S_\eps` samples rows with probability `q_i\propto \dist(a_i, L_{SA})`
- `S_\eps A` is a weak coreset:
`X_{S_\eps}^* = \argmin_{\rank(X)=k}\norm{S_\eps(AX-A)}_v` with
`\norm{AX_{S_\eps}^* - A}_v \le (1+\eps)\min_{\rank(X)=k}\norm{AX-A}_v`
- New and improved:
follows [DV07], but non-adaptive: each sample is chosen independently
- More specialized than multiple-response regression
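A sketch of the residual-sampling probabilities behind `S_\eps`, assuming an orthonormal basis `V_SA` of `\rowspan(SA)` is available; `t` (the target expected sample size, `\poly(k/\eps)` in the talk) is left to the caller, and rows are then sampled and rescaled exactly as in the earlier row-sampling sketch.
```python
import numpy as np

def residual_probs(A, V_SA, t):
    """q_i proportional to dist(a_i, rowspan(SA)), scaled by t and clipped to 1."""
    dist = np.linalg.norm(A - A @ V_SA @ V_SA.T, axis=1)
    return np.minimum(1.0, t * dist / dist.sum())
```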
Step 2: Working within `L_\eps`, reduce to a "small" problem
- Let `U^\top` have orthonormal rows, and `\rowspan(U^\top)=\rowspan(S_\eps A)`
- Projection of row vector `a`
onto `k`-dim `L_k\subset\rowspan(U^\top)` is
`aUXU^\top`, for a particular `X` with `\rank(X)=k`
- That is, we want to solve `\min_{\rank(X)=k} \norm{AUXU^\top -A}_v`
- We again sketch on the right, and sample on the left:
`\tX \equiv \argmin_{\rank(X)=k} \norm{SAUXU^\top R - SAR}_v`
has `\norm{AU\tX U^\top -A}_v \le (1+\eps)\min_{\rank(X)=k}\norm{AUXU^\top - A}_v`
- Use `R` that approximates regression for `U^\top`, `A`
- With constraint that `Y=AUX` for some rank-`k` `X`
- `S` from leverage score sampling of `[AU\, AR]`
- Both tall, thin matrices
- Use subspace embedding with quality `\eps`
Conclusions
- First provably good input-sparsity-time algorithm for robust low-rank approximation
- Some reductions for nice `M`-estimators
- (Sampling is harder to apply)
- Showed NP-hardness for `\sum_i \dist(a_i, L)^p` with `p\in [1,2)`
- Also (not mentioned so far)
- Prior work: approximate regression for Huber estimator [CW15]
- Used techniques here to extend to nice `M`-estimators
Thank you for your attention!