\(
\newcommand\norm[1]{|| #1 ||}
\newcommand\nnz[1]{{\textrm{nnz}(#1)}}
\newcommand\E{\mathbf{E}}
\newcommand\Dist{\mathbf{Dist}}
\newcommand\reals{{\mathrm{I\!R}}}
\newcommand\R{{\mathrm{I\!R}}}
\newcommand\poly{{\textrm{poly}}}
\newcommand\rank{{\textrm{rank}}}
\newcommand\colspan{{\textrm{range}}}
\newcommand\diam{{\textrm{diam}}}
\newcommand\polylog{{\textrm{polylog}}}
\newcommand\rowspan{{\textrm{rowspan}}}
\newcommand\colbasis{{\textrm{colbasis}}}
\newcommand\argmin{{\mathrm{argmin}}}
\newcommand\Err{{\mathtt{NE}}}
\newcommand\eps{{\epsilon}}
\newcommand\bfone{{\mathbf{1}}}
\newcommand\Var{{\mathbf{Var}}}
\newcommand\stdev[1]{{\mathbf{stdev(#1)}}}
\newcommand\cgf{{\mathbf{CGF}}}
\newcommand\Iden{{\mathbf{I}}}
\newcommand\Prob{{\mathbf{Pr}}}
\DeclareMathOperator\subg{\mathbf{subG}}
\DeclareMathOperator\dist{\mathrm{dist}}
\DeclareMathOperator\cost{\mathrm{cost}}
\newcommand\cN{{\cal N}}
\newcommand\cS{{\cal S}}
\newcommand\cD{{\cal D}}
\newcommand\cM{{\cal M}}
\newcommand\cE{{\cal E}}
\newcommand\cC{{\cal C}}
\newcommand\KJL{K_{\mathbb{JL}}}
\newcommand\tO{{\tilde O}}
\newcommand\tL{{\tilde L}}
\newcommand\tx{{\tilde{\mathbf{x}}}}
\newcommand\tX{{\tilde X}}
\newcommand\tY{{\tilde{\mathbf{Y}}}}
\newcommand\tA{{\tilde{\mathbf{A}}}}
\newcommand\bA{{\mathbf{A}}}
\newcommand\ba{{\mathbf{a}}}
\newcommand\bw{{\mathbf{w}}}
\newcommand\bW{{\mathbf{W}}}
\newcommand\bR{{\mathbf{R}}}
\newcommand\bB{{\mathbf{B}}}
\newcommand\bQ{{\mathbf{Q}}}
\newcommand\bZ{{\mathbf{Z}}}
\newcommand\bz{{\mathbf{z}}}
\newcommand\bU{{\mathbf{U}}}
\newcommand\bV{{\mathbf{V}}}
\newcommand\bX{{\mathbf{X}}}
\newcommand\bS{{\mathbf{S}}}
\newcommand\hS{{\hat S}}
\newcommand\hR{{\hat R}}
\newcommand\bs{{\mathbf{s}}}
\newcommand\bx{{\mathbf{x}}}
\newcommand\by{{\mathbf{y}}}
\newcommand\bb{{\mathbf{b}}}
\newcommand\bT{{\mathbf{T}}}
\newcommand\bP{{\mathbf{P}}}
\newcommand\bG{{\mathbf{G}}}
\newcommand\bY{{\mathbf{Y}}}
\newcommand\hY{\hat {\mathbf{Y}}}
\newcommand\bH{{\mathbf{H}}}
\newcommand\bD{{\mathbf{D}}}
\newcommand\be{{\mathbf{e}}}
\newcommand\br{{\mathbf{r}}}
\newcommand\bSig{{\mathbf{\Sigma}}}
\newcommand\bPi{{\mathbf{\Pi}}}
\newcommand\hphi{{\hat\phi}}
\newcommand\OPT{{\mathtt{OPT}}}
\)
Input Sparsity and Hardness for Robust Subspace Approximation
Ken Clarkson, David Woodruff
IBM Research, Almaden
Fitting Data: the Singular Value Decomposition
- Input `n` points `a_1,\ldots,a_n` in `d` dimensions
- Arranged as rows of `A\in\R^{n\times d}`
- Fit: a `k`-dimensional subspace `L^*` minimizing
$$\cost(A,L) \equiv \sum_{i\in [n]} \dist(a_i,L)^2$$
among all `k`-dimensional subspaces
- Solved by the SVD: `L^*` is the span of top `k` right singular vectors `V_k^\top`
- That is, there is a best rank-`k` `A_k=U_k\Sigma_k V_k^\top` with
`\cost(A,L^*) = \norm{A - A_k}_F^2 = \sum_{i\in [n]} \dist(a_i, a_iV_k V_k^\top)^2`
- Fast approximation algorithms:
can find `L'` with `\dim(L') = k` and `\cost(A,L')\le (1+\eps)\cost(A,L^*)`
in
$$O(\nnz{A} + (n+d)\poly(k/\eps))$$
time [CW13]; see also [NN13], [MM13]
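A minimal numpy sketch of the SVD baseline above (not the input-sparsity algorithm): the dimensions and noise level are illustrative choices, and the final line checks that the squared-distance cost matches the tail singular values.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 30, 5
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.1 * rng.standard_normal((n, d))          # low-rank signal plus noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k].T                                    # d x k, top-k right singular vectors

# cost(A, L*) = ||A - A_k||_F^2 = sum_i dist(a_i, a_i Vk Vk^T)^2
residual = A - A @ Vk @ Vk.T
cost = np.sum(np.linalg.norm(residual, axis=1) ** 2)
print(cost, np.sum(s[k:] ** 2))                  # the two values agree
```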
SVD is Not Very Robust
- Compared here against the subspace minimizing the sum of (unsquared) Euclidean distances
- A single point can exercise a lot of "leverage"
`M`-estimators for Regression
- Regression `M`-estimators: `\min_x \sum_i M(r_i)`,
for some function `M`,
where `r\equiv Ax-b`
- `M(z) = z^2`: `\ell_2^2` least-squares
`\min_x \norm{Ax-b}_2^2`
- `M(z) = |z|`: `\ell_1` regression
- `M(z) = \begin{cases}
z^2/2\tau & |z| \le \tau
\\ |z| - \tau/2 & |z| \ge \tau
\end{cases}`
- ...the Huber estimator [H64]
- Robust, yet smooth
- Tukey: even more robust
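For concreteness, a hedged sketch of Huber regression via iteratively reweighted least squares (IRLS), a standard heuristic for M-estimators rather than anything from this talk; `tau`, the iteration count, and the toy data are arbitrary choices.
```python
import numpy as np

def huber(z, tau):
    z = np.abs(z)
    return np.where(z <= tau, z**2 / (2 * tau), z - tau / 2)

def huber_irls(A, b, tau=1.0, iters=50):
    """Heuristically minimize sum_i M(r_i) for the Huber M, with r = Ax - b."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]         # least-squares start
    for _ in range(iters):
        r = A @ x - b
        # IRLS weight w_i = M'(r_i)/r_i: 1/tau inside, 1/|r_i| outside
        w = np.where(np.abs(r) <= tau, 1.0 / tau,
                     1.0 / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)
        x = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
b = A @ np.ones(5) + 0.05 * rng.standard_normal(200)
b[:10] += 50.0                                       # gross outliers
print(huber_irls(A, b))                              # close to the all-ones vector
```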
Robust Low-Rank Approximation
- What's a robust analog of `\norm{\cdot}_F^2`?
- Often want rotational invariance:
for all rotations `Q`, $$\cost(AQ, LQ) = \cost(A,L)$$
`\implies` independent of choice of basis.
But also
`\implies` can't use the cost `\sum_{i\in[n],j\in[d]} |W_{ij}|` for `W=A_k-A`
- Could use `\sum_{i\in[n]} M(\dist(a_i, L))`, e.g. `M(z)=z^p`
- Rotationally invariant, more robust when `p\lt 2`
- Proposed in ML and applications [DZHZ06][BWSDY09]
- For `V` an orthonormal basis for `L`, `\cost(A,L) = \cost(A,V) \equiv \sum_{i\in [n]} M(\dist(a_i, a_i V V^\top))` (see the sketch below)
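A minimal sketch of this rotationally invariant cost for `M(z)=z^p`; `V` is assumed to have orthonormal columns spanning `L`.
```python
import numpy as np

def robust_cost(A, V, p=1.0):
    """sum_i M(dist(a_i, L)) with M(z) = z^p; V: d x k orthonormal basis of L."""
    residual = A - A @ V @ V.T        # each a_i minus its projection onto L
    dists = np.linalg.norm(residual, axis=1)
    return np.sum(dists ** p)
```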
Prior theoretical work
Fix `\eps>0`, `p\in [1,\infty)`, `M(z) = z^p`
$$\OPT \equiv \min_{\dim{L}=k} \cost(A,L)$$
Subspace `\tL` is a `(1+\eps)`-approximation
if
- `\dim(\tL) = k`
- `\cost(A,\tL)\le (1+\eps)\OPT`
Bi-criteria: allow `\dim(\tL)\gt k`, say `\dim(\tL) = \poly(k/\eps)`
Weak coreset: `L'` with `L'\supset \tL` and `\dim(L')` small (maybe `\poly(k/\eps)`)
- [SV07] `(1+\eps)`-approximation in `nd\exp(\poly(k/\eps))` time
- [DTV10] `\sqrt{2}\gamma_p`-approximation, for `\gamma_p` a function of `p`
- [DV07] Weak coreset
- [DV07] Bi-criteria
- [GRSW12] For `p\gt 2`, NP-hard to get `\gamma_p`-approx
Our results: hardness
- For `p\in [1,2)`, NP-hard to find `(1+O(1/d))`-approximation
- That is, no `\poly(ndk/\eps)` algorithm unless `\mathrm{P}=\mathrm{NP}`
- With prior work, `\implies` singularity at `p=2`:
NP-hard unless `p=2+o(1)`
- Open: hardness when `\eps` is fixed
Our results: algorithms for `M(z) = z^p`
- For `p\in[1,2)`, an algorithm that with constant probability returns a `(1+\eps)`-approx in
$$O(\nnz{A} + (n+d)\poly(k/\eps) + \exp(\poly(k/\eps)))$$
time.
- "Constant probability" means, say, 9/10
- `\Omega(\nnz{A})` is unavoidable for `\eps<\infty`
- Hardness results suggest `\exp(\poly(k/\eps))` may be unavoidable
Our results: nice `M`-estimators
- Results for functions `M(z)` that are nice:
- Even: `M(z)=M(-z)`
- Monotone: `|a|\gt |b| \implies M(a)\gt M(b)`
- Bounded growth: there is `C_M` so that for `a\ge b\gt 0`,
$$C_M\frac{a}{b} \le \frac{M(a)}{M(b)} \le \frac{a^2}{b^2}$$
- Root-subadditive:
$$\sqrt{M(a+b)} \le \sqrt{M(a)} + \sqrt{M(b)}$$
- `M` functions in the literature are nice, or nearly so
- For vector `x`, `\norm{x}_M \equiv \sqrt{\sum_{i\in [n]} M(x_i)}` is almost a norm
- All properties except:
scale invariance: `\norm{\kappa x} = |\kappa|\cdot\norm{x}`
- But have
scale insensitive: `\sqrt{\kappa C_M} \norm{x}_M \le \norm{\kappa x}_M \le \kappa\norm{x}_M`, for `\kappa\ge 1`
- Can use `\eps`-net/`\eps`-cover arguments
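A small sketch that spot-checks the bounded-growth and root-subadditivity conditions numerically for the Huber function (with `tau = 1`, an arbitrary choice); the printed value is an empirical lower bound on `C_M`.
```python
import numpy as np

def M(z, tau=1.0):
    z = np.abs(z)
    return np.where(z <= tau, z**2 / (2 * tau), z - tau / 2)

rng = np.random.default_rng(1)
a = rng.uniform(1e-3, 10, 100000)
b = rng.uniform(1e-3, 10, 100000)
hi, lo = np.maximum(a, b), np.minimum(a, b)

growth = M(hi) / M(lo)
assert np.all(growth <= (hi / lo) ** 2 + 1e-9)          # upper growth bound
print("empirical C_M:", (growth / (hi / lo)).min())     # lower growth bound
assert np.all(np.sqrt(M(a + b)) <= np.sqrt(M(a)) + np.sqrt(M(b)) + 1e-9)
```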
Our results: algorithms for nice `M`-estimators
For a value `K=(\log n)^{O(\log k)}`:
- Reduce to a problem of size `\poly(K/\eps)` in time
$$O(\nnz{A}) + (n+d)\poly(K/\eps)$$
- Bi-criteria: find `K`-approx solution of dimension `\poly(k\log n)` in time
$$O(\nnz{A}) + (n+d)\poly(k)$$
- Weak(er) coreset: find `L` of dim `\poly(K/\eps)` that contains a `(1+\eps)`-approximation
Techniques: Overall Strategy
We use a series of row and column reductions that preserve approximate solutions
- Find a weak coreset and its subspace `L_\eps`
- Working within `L_\eps`, reduce to a "small" problem
- For `M(z)=z^p`: "small" = $\poly(k/\eps)\times \poly(k/\eps)$
- Solve small problem with algorithm taking $\exp$-time
- [BPR96], e.g.: solves systems of polynomial inequalities;
given:
- `c` polynomial inequalities
- of degree `q`
- in `z` variables
takes `\poly(cq)^z` time.
- Or, use some nice heuristic
Techniques
Outline of remaining talk:
- Row and column reduction methods for regression
- Application of those methods to low-rank approximation
- Focus on `M(z)=|z|`
- Most techniques also apply to nice `M(z)`, and to `p\ge 2`
- Let `\norm{W}_v` be the sum of the Euclidean norms of the rows:
$$\norm{W}_v\equiv \norm{W_{1,*}}_2 + \norm{W_{2,*}}_2 + \ldots + \norm{W_{n,*}}_2$$
- a.k.a. the `L_{2,1}` or `R^1` norm
- So `\cost(A,L) = \sum_{i\in [n]} \dist(a_i, a_iVV^\top)^1 = \norm{A-AVV^\top}_v`
- `L=\rowspan(V^\top)`, rows of `V^\top` are orthonormal
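A minimal sketch of `\norm{\cdot}_v` and the resulting `p=1` cost; `V` is again assumed to have orthonormal columns.
```python
import numpy as np

def v_norm(W):
    """Sum of Euclidean norms of the rows of W (the L_{2,1} norm)."""
    return np.sum(np.linalg.norm(W, axis=1))

def cost_l1(A, V):
    """cost(A, L) = ||A - A V V^T||_v for an orthonormal basis V of L."""
    return v_norm(A - A @ V @ V.T)
```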
Row and Column Reductions
The algorithm uses:
- Sampling matrices `S`, `S_\eps`, and `\hS`, with few rows
- Sketching matrices `R` and `\hR`, with few columns
- Use `S` to sample rows: `SA` has few rows
- Use `R` to compress rows: `AR` has few columns
- Each sampling and sketching matrix has different properties
The algorithm
Explanation to follow!
- Find a weak coreset and its subspace `L_\eps`
- Find bi-criteria `O(1)`-approximation
- Construct `R` and compute `AR`
- Construct `S` using leverage scores of `AR`
- Construct `S_\eps` using `\dist(a_i, \rowspan(SA))`
- `L_\eps\equiv \rowspan(S_\eps A)` is the weak coreset
- Working within `L_\eps`, reduce to a "small" problem
- Construct basis `U^\top` for `L_\eps`
- Construct `\hR`, and construct `\hS` from `AU` and `A\hR`
- Small problem is `\min_{\rank(Y)=k} \norm{\hS A U Y U^\top \hR - \hS A\hR}_v`
- Solve small problem with algorithm taking $\exp$-time
Reducing Rows: Sampling the rows
- `\norm{W}_v` is a sum, estimate with a sample
- Use weighted sampling
- Pick row `W_{i,*}` independently with probability `q_i`
- Weight by `1/q_i`
- Express as `S\in\R^{m\times n}`
- Each row of `S` is a scaled multiple of basis vector `e_i`
- Each row of `SW` is a scaled row of `W`
- `m` is a random variable, `\E[m] = \sum_{i\in [n]} q_i`
- For given `W`:
- `\E[\norm{SW}_v] = \norm{W}_v`, by construction
- `\norm{SW}_v=(1\pm\eps)\norm{W}_v` with high probability, for the right `q_i` and `\E[m]` large enough
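A hedged sketch of such a sampling matrix `S`: each row of `S` is a scaled standard basis vector, and the scaling by `1/q_i` makes `\norm{SW}_v` unbiased; choosing good `q_i` is the real work and is not shown here.
```python
import numpy as np

def sampling_matrix(q, rng):
    """S with rows (1/q_i) e_i^T for each kept i; q: probabilities in (0, 1]."""
    keep = rng.random(q.shape[0]) < q
    rows = np.flatnonzero(keep)
    S = np.zeros((rows.size, q.shape[0]))
    S[np.arange(rows.size), rows] = 1.0 / q[rows]
    return S

# E[||S W||_v] = sum_i q_i * (1/q_i) * ||W_{i,*}||_2 = ||W||_v
```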
Sampling rows for Regression
With the right `q_i` and sample size, with probability `9/10`:
- For all `Y`, `\norm{SAY}_v = (1\pm\eps)\norm{AY}_v`
- A kind of subspace embedding:
embedding `\{AY\mid Y\in\R^{d\times d'}\}\subset\R^{n\times d'}`
in `\R^{m\times d'}`
- Or approximate regression:
For `\OPT = \min_Y \norm{AY-B}_v`,
`Y^*_S \equiv \argmin_{Y}\norm{S(AY-B)}_v` has
`\norm{AY^*_S-B}_v \le (1+\eps)\OPT`
- Multiple-response regression, `B` is a matrix not a vector
- Follows from a lopsided embedding:
`\norm{S(AY-B)}_v\ge (1-\eps)\norm{AY-B}_v` for all `Y`, and
`\norm{S(AY^*-B)}_v\le (1+\eps)\norm{AY^*-B}_v` for `Y^*` the minimizer
- Approximate regression for sampling and `\norm{\cdot}_v`:
- `q_i\propto` leverage scores: large constant `\eps`, `m=\poly(d)`
- `q_i\propto` residuals: can get small `\eps`
- residual of `a_i` = its cost under a `C`-approx solution
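For the leverage-score case, a small sketch of how the `q_i` could be computed from a thin QR factorization of the tall matrix `AR` (this assumes full column rank); `t` is a caller-chosen oversampling parameter, not a value from the talk.
```python
import numpy as np

def leverage_scores(B):
    """Row leverage scores of a tall, full-column-rank B via reduced QR."""
    Q, _ = np.linalg.qr(B)                  # B = Q R_, Q has orthonormal columns
    return np.sum(Q ** 2, axis=1)           # l_i = ||Q_{i,*}||_2^2

def leverage_probs(B, t):
    """q_i proportional to leverage scores, clipped to 1; t controls E[m]."""
    l = leverage_scores(B)
    return np.minimum(1.0, t * l / l.sum())
```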
Reducing Columns: Sketching the rows
- `\norm{W}_v` summands are `\ell_2` norms of rows
- Compress to rows with about the same norm
- As in JL, random projections
- Express as `R\in\R^{d\times m}`
- Sketched version of matrix `W` is `WR`
- Each row of `WR` is a compressed row of `W`
- e.g., `R` with i.i.d. `N(0,1/m)` entries: `\norm{W_{i,*} R}_2 = (1\pm\eps)\norm{W_{i,*}}_2`
- With high probability, if `m\approx\log(n)/\eps^2`
- For given `W`:
- `\E[\norm{WR}_v] = \norm{W}_v`, by construction
- `\norm{WR}_v=(1\pm\eps)\norm{W}_v`, w.h.p, for enough columns
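A small numerical sketch of the Gaussian case: `m` is set to roughly `\log(n)/\eps^2` (the constant is illustrative), and the printed ratios should cluster around 1.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 1000, 2000, 0.5
m = int(np.ceil(4 * np.log(n) / eps ** 2))        # constant 4 is an arbitrary choice

W = rng.standard_normal((n, d))
R = rng.standard_normal((d, m)) / np.sqrt(m)      # i.i.d. N(0, 1/m) entries
ratios = np.linalg.norm(W @ R, axis=1) / np.linalg.norm(W, axis=1)
print(ratios.min(), ratios.max())                 # concentrated near 1
```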
Sketching rows for regression
With the right sketching method and dimension, with probability `9/10`:
- For all `X`, `\norm{XAR}_v \approx \norm{XA}_v`
- Subspace embedding (here, not new)
- Approximate regression:
For `\OPT = \min_X \norm{XA-B}_v`,
`X^*_R \equiv \argmin_{X}\norm{(XA-B)R}_v` has
`\norm{X^*_R A-B}_v \le (1+\eps)\OPT`
- When `R` has `\poly(\rank(A)/\eps)` columns
(need here `\rank(A)\ll d`)
- New structural result: formerly only for `\norm{\cdot}_F`
- Helpful properties:
- Many constructions are oblivious: don't look at input data
- Can include constraints:
- For given set of matrices `\cC`, such as `\cC=\{X\mid \rank(X)=k\}`
- `X^*_R \equiv \argmin_{X\color{red}{\in\cC}}\norm{(XA-B)R}_v`
good for
`\min_{X\color{red}{\in\cC}} \norm{XA-B}_v`
- Speed!
Faster sketching: sparse embeddings
- Sketching matrix `R` can be sparse embedding
- Adaptation of
CountSketch
from streaming algorithms
- `R` looks like:
$\qquad\qquad\left[\begin{matrix}
0 & -1 & 0
\\ +1 & 0 & 0
\\ 0 & 0 & +1
\\ 0 & -1 & 0
\\ 0 & +1 & 0
\\ 0 & 0 & +1
\\ -1 & 0 & 0
\end{matrix}
\right]$
- One random `\pm 1` per row, column randomly chosen
- Each column of `WR` is sum of `\pm W_{*,j}`
- Takes `O(\nnz{W})` time to compute `WR`
- [CW13] introduced sparse embeddings for low-rank approx, used here too
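A hedged sketch of such a sparse embedding, built with scipy so that applying it to a sparse `W` really does take time about `\nnz{W}`; the dimensions below are placeholders.
```python
import numpy as np
from scipy import sparse

def sparse_embedding(d, m, rng):
    """R in R^{d x m}: one random +/-1 per row, in a uniformly random column."""
    cols = rng.integers(0, m, size=d)
    signs = rng.choice([-1.0, 1.0], size=d)
    return sparse.csr_matrix((signs, (np.arange(d), cols)), shape=(d, m))

rng = np.random.default_rng(0)
R = sparse_embedding(d=10_000, m=50, rng=rng)
W = sparse.random(2_000, 10_000, density=1e-3, random_state=0, format="csr")
WR = W @ R            # each column of WR is a signed sum of columns of W
```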
Using the techniques: column reduction
- Suppose `A_k= \argmin_{\rank(Z)=k} \norm{Z-A}_v`,
`\OPT\equiv \norm{A_k-A}_v`
- The minimizer of `\min_X \norm{XA_k-A}_v` is `X^*=I_n`
- Suppose `R` yields approximate regression for
`\min_X \norm{XA_k - A}_v`
- Need `\poly(k/\eps)` columns
- `R` is oblivious to the unknown `A_k`
- Let `X_R^* = \argmin_X \norm{(XA_k-A)R}_v`. We have:
- `\norm{X_R^* A_k-A}_v \le (1+\eps)\OPT` (with const. prob.)
- Row `i` of `X_R^*` solves `\min_{x\in\R^n} \norm{x^\top A_kR - A_{i,*}R}_2`
- That is, `X^*_R = AR(A_kR)^+`
- So:
`\begin{align*}
\min_{\rank(Y)=k}\norm{ARY-A}_v
& \le \norm{AR(A_kR)^+A_k-A}_v
\\ & = \norm{X_R^* A_k - A}_v
\\ & \le (1 + \eps)\OPT
\end{align*}`
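A numeric sanity check of this chain of inequalities, hedged: `A_k` below is the Frobenius-optimal rank-`k` approximation, used only as a convenient rank-`k` matrix since the `\norm{\cdot}_v`-optimal one has no closed form; the point is that `Y = (A_kR)^+A_k` is feasible for the sketched problem, so the sketched minimum is at most the first printed value.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, m = 100, 60, 4, 30
A = rng.standard_normal((n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = (U[:, :k] * s[:k]) @ Vt[:k]             # a convenient rank-k stand-in for A_k

def v_norm(W):
    return np.sum(np.linalg.norm(W, axis=1))

R = rng.standard_normal((d, m)) / np.sqrt(m)
Y = np.linalg.pinv(Ak @ R) @ Ak              # rank <= k, hence feasible
print(v_norm(A @ R @ Y - A))                 # = ||X_R^* A_k - A||_v
print(v_norm(Ak - A))                        # compare with the unsketched cost
```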
Using the techniques: row reduction
`R` approximates regression for `A_k, A`
`\implies`
`\min_{\rank(Y)=k}\norm{ARY-A}_v \le (1+\eps)\OPT`
- Too slow to solve for `Y` directly
- Apply leverage score sampling to thin matrix `AR`
- But that only gives a constant factor approximation
- So `Y_S^* \equiv \argmin_{\rank(Y)=k} \norm{S(ARY-A)}_v`
has
`\norm{ARY_S^*-A}_v \le C \min_{\rank(Y)=k}\norm{ARY-A}_v`
- Observation: WLOG the rows of `Y_S^*` lie in `\rowspan(SA)`
- Let $P_{SA}$ be the matrix projecting onto `\rowspan(SA)`
- Replacing `SARY_S^*` with `SARY_S^*P_{SA}` can only move it closer to `SA`, row by row
- So `ARY_S^*` is a `C`-approx solution in `\rowspan(SA)`
`S` approximates regression for `AR, A`
`\implies`
`L_{SA}\equiv\rowspan(SA)` is a good `C`-approx subspace
Sampling
Our story thus far:
- Find a weak coreset and its subspace `L_\eps`
- Find bi-criteria `O(1)`-approximation:
- Construct `R` and compute `AR`
- Construct `S` using leverage scores of `AR`
- Construct `S_\eps` using `\dist(a_i, \rowspan(SA))`
- `L_\eps\equiv \rowspan(S_\eps A)` is the weak coreset
- Working within `L_\eps`, reduce to a "small" problem
- Solve small problem with algorithm taking $\exp$-time
Constructing a weak coreset
- Use residual sampling:
`S_\eps` samples rows with probability `q_i\propto \dist(a_i, L_{SA})`
- `S_\eps A` is a weak coreset:
`X_{S_\eps}^* = \argmin_{\rank(X)=k}\norm{S_\eps(AX-A)}_v` with
`\norm{AX_{S_\eps}^* - A}_v \le (1+\eps)\min_{\rank(X)=k}\norm{AX-A}_v`
- New and improved:
follows [DV07], but non-adaptive: each sample is chosen independently
- More specialized than multiple-response regression
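A sketch of the residual-sampling probabilities behind `S_\eps`, assuming an orthonormal basis `V_SA` of `\rowspan(SA)` is available; `t` (the target expected sample size, `\poly(k/\eps)` in the talk) is left to the caller, and rows are then sampled and rescaled exactly as in the earlier row-sampling sketch.
```python
import numpy as np

def residual_probs(A, V_SA, t):
    """q_i proportional to dist(a_i, rowspan(SA)), scaled by t and clipped to 1."""
    dist = np.linalg.norm(A - A @ V_SA @ V_SA.T, axis=1)
    return np.minimum(1.0, t * dist / dist.sum())
```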
Step 2: Working within `L_\eps`, reduce to a "small" problem
- Let `U^\top` have orthonormal rows, and `\rowspan(U^\top)=\rowspan(S_\eps A)`
- Projection of row vector `a`
onto `k`-dim `L_k\subset\rowspan(U^\top)` is
`aUXU^\top`, for a particular `X` with `\rank(X)=k`
- That is, we want to solve `\min_{\rank(X)=k} \norm{AUXU^\top -A}_v`
- We again sketch on the right, and sample on the left:
`\tX \equiv \argmin_{\rank(X)=k} \norm{SAUXU^\top R - SAR}_v`
has `\norm{AU\tX U^\top -A}_v \le (1+\eps)\min_{\rank(X)=k}\norm{AUXU^\top - A}_v`
- Use `R` that approximates regression for `U^\top`, `A`
- With constraint that `Y=AUX` for some rank-`k` `X`
- `S` from leverage score sampling of `[AU\, AR]`
- Both tall, thin matrices
- Use subspace embedding with quality `\eps`
Conclusions
- First provably good input-sparsity-time algorithm for robust low-rank approximation
- Some reductions for nice `M`-estimators
- (Sampling is harder to apply)
- Showed NP-hardness for `\sum_i \dist(a_i, L)^p` with `p\in [1,2)`
- Also (not mentioned so far)
- Prior work: approximate regression for Huber estimator [CW15]
- Used techniques here to extend to nice `M`-estimators
Thank you for your attention!