\( \newcommand\norm[1]{|| #1 ||} \newcommand\nnz[1]{{\textrm{nnz}(#1)}} \newcommand\E{\mathbf{E}} \newcommand\Dist{\mathbf{Dist}} \newcommand\reals{{\mathrm{I\!R}}} \newcommand\R{{\mathrm{I\!R}}} \newcommand\poly{{\textrm{poly}}} \newcommand\rank{{\textrm{rank}}} \newcommand\colspan{{\textrm{range}}} \newcommand\diam{{\textrm{diam}}} \newcommand\polylog{{\textrm{polylog}}} \newcommand\rowspan{{\textrm{rowspan}}} \newcommand\colbasis{{\textrm{colbasis}}} \newcommand\argmin{{\mathrm{argmin}}} \newcommand\Err{{\mathtt{NE}}} \newcommand\eps{{\epsilon}} \newcommand\bfone{{\mathbf{1}}} \newcommand\Var{{\mathbf{Var}}} \newcommand\stdev[1]{{\mathbf{stdev(#1)}}} \newcommand\cgf{{\mathbf{CGF}}} \newcommand\bfone{{\mathbb{1}}} \newcommand\Iden{{\mathbf{I}}} \newcommand\Prob{{\mathbf{Pr}}} \DeclareMathOperator\subg{\mathbf{subG}} \DeclareMathOperator\dist{\mathrm{dist}} \DeclareMathOperator\cost{\mathrm{cost}} \newcommand\cN{{\cal N}} \newcommand\cS{{\cal S}} \newcommand\cD{{\cal D}} \newcommand\cM{{\cal M}} \newcommand\cE{{\cal E}} \newcommand\cC{{\cal C}} \newcommand\KJL{K_{\mathbb{JL}}} \newcommand\tO{{\tilde O}} \newcommand\tL{{\tilde L}} \newcommand\tx{{\tilde{\mathbf{x}}}} \newcommand\tX{{\tilde X}} \newcommand\tY{{\tilde{\mathbf{Y}}}} \newcommand\tA{{\tilde{\mathbf{A}}}} \newcommand\bA{{\mathbf{A}}} \newcommand\ba{{\mathbf{a}}} \newcommand\bw{{\mathbf{w}}} \newcommand\bW{{\mathbf{W}}} \newcommand\bR{{\mathbf{R}}} \newcommand\bB{{\mathbf{B}}} \newcommand\bQ{{\mathbf{Q}}} \newcommand\bZ{{\mathbf{Z}}} \newcommand\bz{{\mathbf{z}}} \newcommand\bU{{\mathbf{U}}} \newcommand\bV{{\mathbf{V}}} \newcommand\bX{{\mathbf{X}}} \newcommand\bS{{\mathbf{S}}} \newcommand\hS{{\hat S}} \newcommand\hR{{\hat R}} \newcommand\bs{{\mathbf{s}}} \newcommand\bx{{\mathbf{x}}} \newcommand\by{{\mathbf{y}}} \newcommand\bb{{\mathbf{b}}} \newcommand\bT{{\mathbf{T}}} \newcommand\bP{{\mathbf{P}}} \newcommand\bG{{\mathbf{G}}} \newcommand\bY{{\mathbf{Y}}} \newcommand\hY{\hat {\mathbf{Y}}} \newcommand\bH{{\mathbf{H}}} \newcommand\bD{{\mathbf{D}}} \newcommand\be{{\mathbf{e}}} \newcommand\br{{\mathbf{r}}} \newcommand\bSig{{\mathbf{\Sigma}}} \newcommand\bPi{{\mathbf{\Pi}}} \newcommand\hphi{{\hat\phi}} \newcommand\OPT{{\mathtt{OPT}}} \)

Input Sparsity and Hardness for Robust Subspace Approximation

Ken Clarkson, David Woodruff
IBM Research, Almaden



Fitting Data: the Singular Value Decomposition

  • Input `n` points `a_1,\ldots,a_n` in `d` dimensions
  • Arranged as rows of `A\in\R^{n\times d}`
  • Fit: a `k`-dimensional subspace `L^*` minimizing $$\cost(A,L) \equiv \sum_{i\in [n]} \dist(a_i,L)^2$$ among all `k`-dimensional subspaces
  • Solved by the SVD: `L^*` is the span of top `k` right singular vectors `V_k^\top`
    • That is, there is a best rank-`k` `A_k=U_k\Sigma_k V_k^\top` with `\cost(A,L^*) = \norm{A - A_k}_F^2 = \sum_{i\in [n]} \dist(a_i, a_iV_k V_k^\top)^2`
  • Fast approximation algorithms: can find `L'` with `\dim(L') = k` and `\cost(A,L')\le (1+\eps)\cost(A,L^*)` in $$O(\nnz{A} + (n+d)\poly(k/\eps))$$ time. [CW13], see also [NN13],[MM13]
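
For concreteness, a minimal numpy sketch (not the [CW13] algorithm, just the exact SVD baseline described above) that computes the best rank-`k` fit and its cost; the data sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 3                               # illustrative sizes, not from the talk
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.1 * rng.standard_normal((n, d))            # approximately rank-k data plus noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k].T                                      # top-k right singular vectors: basis for L*
A_k = A @ Vk @ Vk.T                                # = U_k Sigma_k V_k^T, the best rank-k fit
cost = np.linalg.norm(A - A_k, "fro") ** 2         # cost(A, L*) = ||A - A_k||_F^2
print(cost, (np.linalg.norm(A - A_k, axis=1) ** 2).sum())   # same value, written as a sum of squared distances
```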

SVD is Not Very Robust

  • Here the SVD fit (that is, truncated SVD) is compared against minimizing the sum of (unsquared) Euclidean distances
  • A single outlying point can exert a lot of "leverage" on the SVD fit

`M`-estimators for Regression

  • Regression `M`-estimators: `\min_x \sum_i M(r_i)`,
    for some function `M`,
    where `r\equiv Ax-b`
    • `M(z) = z^2`: `\ell_2^2` least-squares
      `\min_x \norm{Ax-b}_2^2`
    • `M(z) = |z|`: `\ell_1` regression
    • `M(z) = \begin{cases} z^2/2\tau & |z| \le \tau \\ |z| - \tau/2& |z|\ge \tau \end{cases}`
      • ...the Huber estimator [H64]
      • Robust, yet smooth
    • Tukey: even more robust
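
A small Python sketch of these loss functions; the default `tau` and the Tukey biweight form are standard choices assumed here, not taken from the talk.

```python
import numpy as np

def l2(z):                      # M(z) = z^2: least squares
    return z ** 2

def l1(z):                      # M(z) = |z|: l1 regression
    return np.abs(z)

def huber(z, tau=1.0):          # quadratic near 0, linear in the tails: robust yet smooth
    z = np.abs(z)
    return np.where(z <= tau, z ** 2 / (2 * tau), z - tau / 2)

def tukey(z, tau=1.0):          # Tukey's biweight: constant beyond tau, even more robust
    z = np.minimum(np.abs(z) / tau, 1.0)
    return (tau ** 2 / 6) * (1 - (1 - z ** 2) ** 3)

r = np.linspace(-3, 3, 7)       # residuals r = Ax - b
for M in (l2, l1, huber, tukey):
    print(M.__name__, np.round(M(r), 3))    # the M-estimator objective is sum_i M(r_i)
```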

Robust Low-Rank Approximation

  • What's a robust analog of `\norm{\cdot}_F^2`?
  • Often want rotational invariance:
    for all rotations `Q`, $$\cost(AQ, LQ) = \cost(A,L)$$
    `\implies` independent of choice of basis.
    But also `\implies` can't use the cost `\sum_{i\in[n],j\in[d]} |W_{ij}|` for `W=A_k-A`
  • Could use `\sum_{i\in[n]} M(\dist(a_i, L))`, e.g. `M(z)=z^p`
    • Rotationally invariant, more robust when `p\lt 2`
    • Proposed in ML and applications [DZHZ06][BWSDY09]
  • For `V` a basis for `L`, `\cost(A,L) = \cost(A,V) \equiv \sum_{i\in [n]} M(\dist(a_i, a_i V V^\top))`
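
A direct numpy rendering of this cost, assuming `V` has orthonormal columns spanning `L`; the choices of `M` below are just for illustration.

```python
import numpy as np

def robust_cost(A, V, M):
    """sum_i M(dist(a_i, L)), where L is the column span of V and V has orthonormal columns."""
    residual = A - A @ V @ V.T                       # a_i - a_i V V^T for each row a_i
    return M(np.linalg.norm(residual, axis=1)).sum()

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
V, _ = np.linalg.qr(rng.standard_normal((20, 3)))    # a random 3-dimensional subspace L
print(robust_cost(A, V, M=np.abs))                   # p = 1
print(robust_cost(A, V, M=lambda z: z ** 2))         # p = 2: the Frobenius / SVD cost
```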

Prior theoretical work

Fix `\eps>0`, `p\in [1,\infty)`, `M(z) = z^p`

$$\OPT \equiv \min_{\dim{L}=k} \cost(A,L)$$ Subspace `\tL` is a `(1+\eps)`-approximation if
  • `\dim(\tL) = k`
  • `\cost(A,\tL)\le (1+\eps)\OPT`

Bi-criteria: allow `\dim(\tL)\gt k`, say `\dim(\tL) = \poly(k/\eps)`
Weak coreset: `L'` with `L'\supset \tL` and `\dim(L')` small (maybe `\poly(k/\eps)`)

  • [SV07] `(1+\eps)`-approximation in `nd\exp(\poly(k/\eps))` time
  • [DTV10] `\sqrt{2}\gamma_p`-approximation, for `\gamma_p` a function of `p`
  • [DV07] Weak coreset
  • [DV07] Bi-criteria
  • [GRSW12] For `p\gt 2`, NP-hard to get `\gamma_p`-approx

Our results: hardness

  • For `p\in [1,2)`, NP-hard to find `(1+O(1/d))`-approximation
  • That is, no `\poly(ndk/\eps)`-time algorithm unless `\mathrm{P}=\mathrm{NP}`
  • With prior work, `\implies` singularity at `p=2`:
    NP-hard unless `p=2+o(1)`
  • Open: hardness when `\eps` is fixed

Our results: algorithms for `M(z) = z^p`

  • For `p\in[1,2)`, an algorithm that with constant probability returns a `(1+\eps)`-approx in $$O(\nnz{A} + (n+d)\poly(k/\eps) + \exp(\poly(k/\eps)))$$ time.
    • "Constant probability" means, say, 9/10
  • `\Omega(\nnz{A})` is unavoidable for `\eps<\infty`
  • Hardness results suggest `\exp(\poly(k/\eps))` may be unavoidable

Our results: nice `M`-estimators

  • Results for functions `M(z)` that are nice:
    • Even: `M(z)=M(-z)`
    • Monotone: `|a|\gt |b| \implies M(a)\gt M(b)`
    • Bounded growth: there is `C_M` so that for `a\ge b\gt 0`, $$C_M\frac{a}{b} \le \frac{M(a)}{M(b)} \le \frac{a^2}{b^2}$$
    • Root-subadditive: $$\sqrt{M(a+b)} \le \sqrt{M(a)} + \sqrt{M(b)}$$
  • Most $M()$ functions in the literature are nice, or nearly so
  • For vector `x`, `\norm{x}_M \equiv \sqrt{\sum_{i\in [n]} M(x_i)}` is almost a norm
    • All norm properties except
      scale invariance: `\norm{\kappa x} = |\kappa|\cdot\norm{x}`
    • But it is scale insensitive:
      `\sqrt{\kappa C_M} \norm{x}_M \le \norm{\kappa x}_M \le \kappa\norm{x}_M`, for `\kappa\ge 1`
    • Can use `\eps`-net/`\eps`-cover arguments
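
As a sanity check (mine, not part of the talk), the niceness conditions can be tested numerically; the sketch below suggests the Huber estimator satisfies bounded growth with `C_M = 1` and root-subadditivity.

```python
import numpy as np

def huber(z, tau=1.0):
    z = np.abs(z)
    return np.where(z <= tau, z ** 2 / (2 * tau), z - tau / 2)

rng = np.random.default_rng(0)
a = rng.uniform(0.01, 10.0, 100_000)
b = rng.uniform(0.01, 10.0, 100_000)
a, b = np.maximum(a, b), np.minimum(a, b)        # ensure a >= b > 0

ratio = huber(a) / huber(b)
print("bounded growth (C_M = 1):",
      bool(np.all(ratio + 1e-12 >= a / b)), bool(np.all(ratio <= (a / b) ** 2 + 1e-12)))
print("root-subadditive:",
      bool(np.all(np.sqrt(huber(a + b)) <= np.sqrt(huber(a)) + np.sqrt(huber(b)) + 1e-12)))
```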

Our results:
algorithms for nice $M$-estimators

For a value `K=(\log n)^{O(\log k)}`:

  • Reduce to a problem of size `\poly(K/\eps)` in time $$O(\nnz{A}) + (n+d)\poly(K/\eps)$$
  • Bi-criteria: find `K`-approx solution of dimension `\poly(k\log n)` in time $$O(\nnz{A}) + (n+d)\poly(k)$$
  • Weak(er) coreset: find `L` of dim `\poly(K/\eps)` that contains a `(1+\eps)`-approximation

Techniques: Overall Strategy

We use a series of row and column reductions that preserve approximate solutions

  1. Find a weak coreset and its subspace `L_\eps`
  2. Working within `L_\eps`, reduce to a "small" problem
    • For `M(z)=z^p`: "small" = $\poly(k/\eps)\times \poly(k/\eps)$
  3. Solve small problem with algorithm taking $\exp$-time
    • [BPR96], e.g., solves systems of polynomial inequalities;
      given:
      • `c` polynomial inequalities
      • of degree `q`
      • in `z` variables,
      it takes `\poly(cq)^z` time.
    • Or, use some nice heuristic

Techniques

Outline of remaining talk:

  • Row and column reduction methods for regression
  • Application of those methods to low-rank approximation


  • Focus on `M(z)=|z|`
    • Most techniques also apply to nice `M(z)`, and to `p\ge 2`
  • Let `\norm{W}_v` be the sum of the Euclidean norms of the rows: $$\norm{W}_v\equiv \norm{W_{1,*}}_2 + \norm{W_{2,*}}_2 + \ldots + \norm{W_{n,*}}_2$$
    • a.k.a. the `L_{2,1}` or `R^1` norm
  • So `\cost(A,L) = \sum_{i\in [n]} \dist(a_i, a_iVV^\top)^1 = \norm{A-AVV^\top}_v`
    • `L=\rowspan(V^\top)`, rows of `V^\top` are orthonormal
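
A short numpy sketch of `\norm{\cdot}_v`, also checking the earlier rotational-invariance point: right-multiplying by an orthogonal matrix preserves row norms, hence `\norm{\cdot}_v`, while the entrywise `\ell_1` cost is not preserved.

```python
import numpy as np

def v_norm(W):
    """||W||_v: sum of the Euclidean norms of the rows (the L_{2,1} / R^1 norm)."""
    return np.linalg.norm(W, axis=1).sum()

rng = np.random.default_rng(0)
W = rng.standard_normal((50, 8))
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))      # a random orthogonal Q (rotation/reflection)
print(v_norm(W), v_norm(W @ Q))                       # equal up to roundoff: rotationally invariant
print(np.abs(W).sum(), np.abs(W @ Q).sum())           # entrywise l1 changes under rotation
```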

Row and Column Reductions

The algorithm uses:

  • Sampling matrices `S`, `S_\eps`, and `\hS`, with few rows
  • Sketching matrices `R` and `\hR`, with few columns
  • Use `S` to sample rows: `SA` has few rows
  • Use `R` to compress rows: `AR` has few columns
  • Each sampling and sketching matrix has different properties

The algorithm

Explanation to follow!

  1. Find a weak coreset and its subspace `L_\eps`
    1. Find bi-criteria `O(1)`-approximation
      • Construct `R` and compute `AR`
      • Construct `S` using leverage scores of `AR`
    2. Construct `S_\eps` using `\dist(a_i, \rowspan(SA))`
      • `L_\eps\equiv \rowspan(S_\eps A)` is the weak coreset
  2. Working within `L_\eps`, reduce to a "small" problem
    • Construct basis `U^\top` for `L_\eps`
    • Construct `\hR`, and construct `\hS` from `AU` and `A\hR`
    • Small problem is `\min_{\rank(Y)=k} \norm{\hS A U Y U^\top \hR - \hS A\hR}_v`
  3. Solve small problem with algorithm taking $\exp$-time

Reducing Rows: Sampling the rows

  • `\norm{W}_v` is a sum, estimate with a sample
  • Use weighted sampling
    • Pick row `W_{i,*}` independently with probability `q_i`
    • Weight by `1/q_i`
  • Express as `S\in\R^{m\times n}`
    • Each row of `S` is a scaled multiple of basis vector `e_i`
    • Each row of `SW` is a scaled row of `W`
    • `m` is a random variable, `\E[m] = \sum_{i\in [n]} q_i`
  • For given `W`:
    • `\E[\norm{SW}_v] = \norm{W}_v`, by construction
    • `\norm{SW}_v=(1\pm\eps)\norm{W}_v`, with high probability
      • For the right `q_i`
      • `\E[m]` large enough
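
A minimal sketch of this weighted row sampling and its unbiasedness; the probabilities here are simply proportional to row norms, just to illustrate the mechanism (the algorithm's actual `q_i` come from leverage scores or residuals, discussed next).

```python
import numpy as np

def v_norm(W):
    return np.linalg.norm(W, axis=1).sum()

def sample_rows(W, q, rng):
    """Rows of SW: keep row i independently with probability q[i], rescaled by 1/q[i]."""
    keep = rng.random(len(q)) < q
    return W[keep] / q[keep][:, None]

rng = np.random.default_rng(0)
W = rng.standard_normal((2000, 10))
row_norms = np.linalg.norm(W, axis=1)
q = np.minimum(1.0, 200 * row_norms / row_norms.sum())   # E[m] = sum_i q_i, about 200 here

estimates = [v_norm(sample_rows(W, q, rng)) for _ in range(50)]
print(v_norm(W), float(np.mean(estimates)))              # E[||SW||_v] = ||W||_v
```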

Sampling rows for Regression

With the right `q_i` and sample size, with probability `9/10`:

  • For all `Y`, `\norm{SAY}_v = (1\pm\eps)\norm{AY}_v`
    • A kind of subspace embedding:
      embedding `\{AY\mid Y\in\R^{d\times d'}\}\subset\R^{n\times d'}` in `\R^{m\times d'}`
  • Or approximate regression:
    For `\OPT = \min_Y \norm{AY-B}_v`,
    `Y^*_S \equiv \argmin_{Y}\norm{S(AY-B)}_v` has
    `\norm{AY^*_S-B}_v \le (1+\eps)\OPT`
    • Multiple-response regression, `B` is a matrix not a vector
    • Follows from lopsided embedding:
      `\norm{S(AY-B)}_v\ge (1-\eps)\norm{AY-B}_v` for all `Y`, and `\norm{S(AY^*-B)}_v\le (1+\eps)\norm{AY^*-B}_v`
  • Approximate regression for sampling and `\norm{\cdot}_v`:
    • `q_i\propto` leverage scores: large constant `\eps`, `m=\poly(d)`
    • `q_i\propto` residuals: can get small `\eps`
      • Residuals: cost of `a_i` for a `C`-approx solution
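
An illustration of the leverage-score sampling probabilities, assuming the thin matrix has full column rank; `leverage_scores` and the target sample size are illustrative names and choices, and the paper's constants differ.

```python
import numpy as np

def leverage_scores(B):
    """l2 leverage scores of a tall, full-column-rank B: squared row norms of an orthonormal basis for range(B)."""
    Q, _ = np.linalg.qr(B)
    return (Q ** 2).sum(axis=1)                  # l_i = ||Q_{i,*}||_2^2; the scores sum to rank(B)

rng = np.random.default_rng(0)
AR = rng.standard_normal((1000, 8))              # stand-in for the thin sketched matrix AR
lev = leverage_scores(AR)
target = 100                                     # desired expected sample size (illustrative)
q = np.minimum(1.0, target * lev / lev.sum())    # q_i proportional to leverage scores
print(float(lev.sum()), float(q.sum()))          # lev sums to 8 = rank(AR); E[#sampled rows] = q.sum()
```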

Reducing Columns: Sketching the rows

  • `\norm{W}_v` summands are `\ell_2` norms of rows
  • Compress to rows with about the same norm
    • As in JL, random projections
  • Express as `R\in\R^{d\times m}`
    • Sketched version of matrix `W` is `WR`
    • Each row of `WR` is a compressed row of `W`
  • e.g., `R` has i.i.d. Gaussian entries (scaled by `1/\sqrt{m}`), and `\norm{W_{i,*} R}_2 = (1\pm\eps)\norm{W_{i,*}}_2`
    • With high probability, if `m\approx\log(n)/\eps^2`
  • For given `W`:
    • `\E[\norm{WR}_v] = \norm{W}_v`, by construction
    • `\norm{WR}_v=(1\pm\eps)\norm{W}_v`, w.h.p., for enough columns
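
A quick numpy check of this row-norm preservation with a scaled Gaussian `R`; the sketch size uses a modest constant factor on top of `\log(n)/\eps^2` so the demo is reliable (that constant, and the sizes, are assumptions of this sketch, not from the talk).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 2000, 200, 0.25
m = int(np.ceil(4 * np.log(n) / eps ** 2))      # ~ log(n)/eps^2 columns, with a small constant factor

W = rng.standard_normal((n, d))
R = rng.standard_normal((d, m)) / np.sqrt(m)    # i.i.d. Gaussians, scaled so E[||w R||_2^2] = ||w||_2^2

ratios = np.linalg.norm(W @ R, axis=1) / np.linalg.norm(W, axis=1)
print(float(ratios.min()), float(ratios.max())) # every row norm preserved to within about 1 +/- eps
```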

Sketching rows for regression

With the right sketching method and dimension, with probability `9/10`:

  • For all `X`, `\norm{XAR}_v \approx \norm{XA}_v`
    • Subspace embedding (here, not new)
  • Approximate regression:
    For `\OPT = \min_X \norm{XA-B}_v`,
    `X^*_R \equiv \argmin_{X}\norm{(XA-B)R}_v` has
    `\norm{X^*_R A-B}_v \le (1+\eps)\OPT`
    • When `R` has `\poly(\rank(A)/\eps)` columns
      (need here `\rank(A)\ll d`)
    • New structural result: formerly only for `\norm{\cdot}_F`
  • Helpful properties:
    • Many constructions are oblivious: don't look at input data
    • Can include constraints:
      • For given set of matrices `\cC`, such as `\cC=\{X\mid \rank(X)=k\}`
      • `X^*_R \equiv \argmin_{X\color{red}{\in\cC}}\norm{(XA-B)R}_v`
        good for `\min_{X\color{red}{\in\cC}} \norm{XA-B}_v`
    • Speed!

Faster sketching: sparse embeddings

  • Sketching matrix `R` can be sparse embedding
  • Adaptation of CountSketch from streaming algorithms
  • `R` looks like:
    $\qquad\qquad\left[\begin{matrix} 0 & -1 & 0 \\ +1 & 0 & 0 \\ 0 & 0 & +1 \\ 0 & -1 & 0 \\ 0 & +1 & 0 \\ 0 & 0 & +1 \\ -1 & 0 & 0 \end{matrix} \right]$
  • One random `\pm 1` per row, column randomly chosen
  • Each column of `WR` is sum of `\pm W_{*,j}`
  • Takes `O(\nnz{W})` time to compute `WR`
  • [CW13] introduced sparse embeddings for low-rank approx, used here too
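
A sketch of the construction described above using scipy.sparse, so applying `R` touches each nonzero of `W` once; `sparse_embedding` and the sizes are illustrative choices of this sketch.

```python
import numpy as np
from scipy import sparse

def sparse_embedding(d, m, rng):
    """R in R^{d x m}: one random +/-1 per row, placed in a uniformly random column."""
    cols = rng.integers(0, m, size=d)
    signs = rng.choice([-1.0, 1.0], size=d)
    return sparse.csr_matrix((signs, (np.arange(d), cols)), shape=(d, m))

rng = np.random.default_rng(0)
n, d, m = 1000, 500, 50                            # illustrative sizes
W = sparse.random(n, d, density=0.01, random_state=1, format="csr")
R = sparse_embedding(d, m, rng)
WR = W @ R                                         # each column of WR is a signed sum of columns of W
print(WR.shape, W.nnz, WR.nnz)                     # computing WR costs O(nnz(W))
```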

Using the techniques: column reduction

  • Suppose `A_k= \argmin_{\rank(Z)=k} \norm{Z-A}_v`, `\OPT\equiv \norm{A_k-A}_v`
  • The minimizer of `\min_X \norm{XA_k-A}_v` is `X^*=I_n`
  • Suppose `R` yields approximate regression for
    `\min_X \norm{XA_k - A}_v`
    • Need `\poly(k/\eps)` columns
    • `R` is oblivious to the unknown `A_k`
  • Let `X_R^* = \argmin_X \norm{(XA_k-A)R}_v`. We have:
    • `\norm{X_R^* A_k-A}_v \le (1+\eps)\OPT` (with const. prob.)
    • Row `i` of `X_R^*` solves `\min_{x\in\R^n} \norm{x^\top A_kR - A_{i,*}R}_2`
    • That is, `X^*_R = AR(A_kR)^+`
  • So:
    `\begin{align*} \min_{\rank(Y)=k}\norm{ARY-A}_v & \le \norm{AR(A_kR)^+A_k-A}_v \\ & = \norm{X_R^* A_k - A}_v \\ & \le (1 + \eps)\OPT \end{align*}`
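
A numerical illustration of this chain of inequalities, using a dense Gaussian `R` for simplicity and the Frobenius-optimal `A_k` from the SVD as a stand-in for the (hard to compute) `v`-norm-optimal rank-`k` fit; the sizes and number of sketch columns are assumptions of this sketch.

```python
import numpy as np

def v_norm(W):
    return np.linalg.norm(W, axis=1).sum()

rng = np.random.default_rng(0)
n, d, k, m = 200, 50, 5, 60                        # m = number of sketch columns (illustrative)
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.1 * rng.standard_normal((n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]                  # SVD stand-in for the v-norm-optimal A_k

R = rng.standard_normal((d, m)) / np.sqrt(m)       # dense Gaussian sketch, for simplicity
X_R = (A @ R) @ np.linalg.pinv(A_k @ R)            # X_R^* = AR (A_k R)^+, row-wise sketched regression
print("||A_k - A||_v            :", v_norm(A_k - A))
print("||AR(A_k R)^+ A_k - A||_v:", v_norm(X_R @ A_k - A))   # close to the line above
```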

Using the techniques: row reduction

`R` approximates regression for `A_k, A`
`\implies`
`\min_{\rank(Y)=k}\norm{ARY-A}_v \le (1+\eps)\OPT`
  • Too slow to solve for `Y` directly
  • Apply leverage score sampling to thin matrix `AR`
    • But that only gives a constant factor approximation
  • So `Y_S^* \equiv \argmin_{\rank(Y)=k} \norm{S(ARY-A)}_v` has `\norm{ARY_S^*-A}_v \le C \min_{\rank(Y)=k}\norm{ARY-A}_v`
  • Observation: `Y_S^*` is in the rowspan of `SA`
    • Let $P_{SA}$ be the matrix projecting onto `\rowspan(SA)`
    • Projecting: replacing `SAR Y_S^*` by `SAR Y_S^* P_{SA}` can only move its rows closer to those of `SA`
  • So `ARY_S^*` is a `C`-approx solution within `\rowspan(SA)`
`S` approximates regression for `AR, A`
`\implies`
`L_{SA}\equiv\rowspan(SA)` is a good `C`-approx subspace

Sampling

Our story thus far:

  1. Find a weak coreset and its subspace `L_\eps`
    1. Find bi-criteria `O(1)`-approximation:
      • Construct `R` and compute `AR`
      • Construct `S` using leverage scores of `AR`
    2. Construct `S_\eps` using `\dist(a_i, \rowspan(SA))`
      • `L_\eps\equiv \rowspan(S_\eps A)` is the weak coreset
  2. Working within `L_\eps`, reduce to a "small" problem
  3. Solve small problem with algorithm taking $\exp$-time

Constructing a weak coreset

  • Use residual sampling:
    `S_\eps` samples rows with probability `q_i\propto \dist(a_i, L_{SA})`
  • `S_\eps A` is a weak coreset:
    `X_{S_\eps}^* = \argmin_{\rank(X)=k}\norm{S_\eps(AX-A)}_v` with `\norm{AX_{S_\eps}^* - A}_v \le (1+\eps)\min_{\rank(X)=k}\norm{AX-A}_v`
  • New and improved:
    follows [DV07], except not adaptive: each sample is chosen independently
  • More specialized than multiple-response regression

Step 2: Working within `L_\eps`, reduce to a "small" problem

  • Let `U^\top` have orthonormal rows, and `\rowspan(U^\top)=\rowspan(S_\eps A)`
  • Projection of row vector `a`
    onto `k`-dim `L_k\subset\rowspan(U^\top)` is
    `aUXU^\top`, for a particular `X` with `\rank(X)=k`
  • That is, we want to solve `\min_{\rank(X)=k} \norm{AUXU^\top -A}_v`
  • We again sketch on the right, and sample on the left: `\tX \equiv \argmin_{\rank(X)=k} \norm{SAUXU^\top R - SAR}_v`
    has `\norm{AU\tX U^\top -A}_v \le (1+\eps)\min_{\rank(X)=k}\norm{AUXU^\top - A}_v`
    • Use `R` that approximates regression for `U^\top`, `A`
      • With constraint that `Y=AUX` for some rank-`k` `X`
    • `S` from leverage score sampling of `[AU\, AR]`
      • Both tall, thin matrices
      • Use subspace embedding with quality `\eps`

Conclusions

  • First provably good input-sparsity-time algorithm for robust low-rank approximation
  • Some reductions for nice `M`-estimators
    • (Sampling is harder to apply)
  • Showed NP-hardness for `\sum_i \dist(a_i, L)^p` with `p\in [1,2)`
  • Also (not mentioned so far)
    • Prior work: approximate regression for Huber estimator [CW15]
    • Used techniques here to extend to nice `M`-estimators
Thank you for your attention!