\( \newcommand\norm[1]{|| #1 ||} \newcommand\nnz[1]{{\textrm{nnz}(#1)}} \newcommand\E{\mathbf{E}} \newcommand\Dist{\mathbf{Dist}} \newcommand\reals{{\mathrm{I\!R}}} \newcommand\R{{\mathrm{I\!R}}} \newcommand\poly{{\textrm{poly}}} \newcommand\rank{{\textrm{rank}}} \newcommand\sr{{\textrm{sr}}} \newcommand\colspan{{\textrm{im}}} \newcommand\diam{{\textrm{diam}}} \newcommand\polylog{{\textrm{polylog}}} \newcommand\rowspan{{\textrm{rowspan}}} \newcommand\colbasis{{\textrm{colbasis}}} \newcommand\argmin{{\mathrm{argmin}}} \newcommand\Err{{\mathtt{NE}}} \newcommand\eps{{\epsilon}} \newcommand\bfone{{\mathbf{1}}} \newcommand\Var{{\mathbf{Var}}} \newcommand\stdev[1]{{\mathbf{stdev(#1)}}} \newcommand\cgf{{\mathbf{CGF}}} \newcommand\bfone{{\mathbb{1}}} \newcommand\Iden{{\mathbf{I}}} \newcommand\Prob{{\mathbf{Pr}}} \DeclareMathOperator\subg{\mathbf{subG}} \DeclareMathOperator\dist{\mathrm{dist}} \DeclareMathOperator\cost{\mathrm{cost}} \newcommand\cN{{\cal N}} \newcommand\cS{{\cal S}} \newcommand\cD{{\cal D}} \newcommand\cM{{\cal M}} \newcommand\cE{{\cal E}} \newcommand\cu{{\cal U}} \newcommand\cC{{\cal C}} \newcommand\KJL{K_{\mathbb{JL}}} \newcommand\tO{{\tilde O}} \newcommand\tL{{\tilde L}} \newcommand\maybemathbf[1]{\mathbf{#1}} \newcommand\matA{\maybemathbf{A}} \newcommand\matB{\maybemathbf{B}} \newcommand\matC{\maybemathbf{C}} \newcommand\matD{\maybemathbf{D}} \newcommand\matSigma{\maybemathbf{\Sigma}} \newcommand\matU{\maybemathbf{U}} \newcommand\matV{\maybemathbf{V}} \newcommand\matE{\maybemathbf{E}} \newcommand\matzero{\maybemathbf{0}} \newcommand\matW{\maybemathbf{W}} \newcommand\matZ{\maybemathbf{Z}} \newcommand\matY{\maybemathbf{Y}} \newcommand\matX{\maybemathbf{X}} \newcommand\matR{\maybemathbf{R}} \newcommand\matS{\maybemathbf{S}} \newcommand\matT{\maybemathbf{T}} \newcommand\matN{\maybemathbf{N}} \newcommand\matP{\maybemathbf{P}} \newcommand\vecx{\maybemathbf{x}} \newcommand\vecb{\maybemathbf{b}} \newcommand\hvecb{\hat{\maybemathbf{b}}} \newcommand\vecy{\maybemathbf{y}} \newcommand\vecr{\maybemathbf{r}} \newcommand\vecw{\maybemathbf{w}} \newcommand\tmatW{\tilde{\maybemathbf{W}}} \newcommand\tmatU{\tilde{\maybemathbf{U}}} \newcommand\tmatV{\tilde{\maybemathbf{V}}} \newcommand\tmatZ{\tilde{\maybemathbf{Z}}} \newcommand\hmatZ{\hat{\maybemathbf{Z}}} \newcommand\tmatY{\tilde{\maybemathbf{Y}}} \newcommand\tmatX{\tilde{\maybemathbf{X}}} \newcommand\tvecx{\tilde{\maybemathbf{x}}} \newcommand\hmatS{{ \maybemathbf{\hat S}}} \newcommand\hmatR{{ \maybemathbf{\hat R}}} \newcommand\hmatA{{ \maybemathbf{\hat A}}} \newcommand\twomat[2]{\left[\begin{smallmatrix} #1 \\ #2 \end{smallmatrix} \right] } \newcommand\twomatr[2]{\left[\begin{smallmatrix} #1 & #2 \end{smallmatrix} \right] } \newcommand\tx{{\tilde{\mathbf{x}}}} \newcommand\tX{{\tilde{\mathbf{X}}}} \newcommand\tY{{\tilde{\mathbf{Y}}}} \newcommand\tA{{\tilde{\mathbf{A}}}} \newcommand\tW{{\tilde{\mathbf{W}}}} \newcommand\tZ{{\tilde{\mathbf{Z}}}} \newcommand\hS{{\hat {\mathbf{S}}}} \newcommand\hR{{\hat {\mathbf{R}}}} \newcommand\hA{{\hat {\mathbf{A}}}} \newcommand\hY{\hat {\mathbf{Y}}} \newcommand\bA{{\mathbf{A}}} \newcommand\bN{{\mathbf{N}}} \newcommand\ba{{\mathbf{a}}} \newcommand\bw{{\mathbf{w}}} \newcommand\W{{\mathbf{W}}} \newcommand\bB{{\mathbf{B}}} \newcommand\bQ{{\mathbf{Q}}} \newcommand\bZ{{\mathbf{Z}}} \newcommand\bz{{\mathbf{z}}} \newcommand\bU{{\mathbf{U}}} \newcommand\bV{{\mathbf{V}}} \newcommand\bX{{\mathbf{X}}} \newcommand\bS{{\mathbf{S}}} \newcommand\bs{{\mathbf{s}}} \newcommand\bx{{\mathbf{x}}} \newcommand\by{{\mathbf{y}}} 
\newcommand\bb{{\mathbf{b}}} \newcommand\bT{{\mathbf{T}}} \newcommand\bP{{\mathbf{P}}} \newcommand\bG{{\mathbf{G}}} \newcommand\bY{{\mathbf{Y}}} \newcommand\bH{{\mathbf{H}}} \newcommand\bD{{\mathbf{D}}} \newcommand\be{{\mathbf{e}}} \newcommand\br{{\mathbf{r}}} \newcommand\Sig{{\mathbf{\Sigma}}} \newcommand\bLam{{\mathbf{\Lambda}}} \newcommand\bPi{{\mathbf{\Pi}}} \newcommand\hphi{{\hat\phi}} \newcommand\OPT{{\mathtt{OPT}}} \newcommand\sd{{\mathtt{sd}}} \newcommand\words{{\mathtt{words}}} \newcommand\bp{{\mathbf{p}}} \newcommand\bR{{\mathbf{R}}} \newcommand\bW{{\mathbf{W}}} \)

Dimensionality Reduction for Tukey Regression


Ken Clarkson, IBM Research
Ruosong Wang, CMU
David Woodruff, CMU

Regression

`\matA,\vecb,n,d,\matA_i` `\matA` is an `n\times d` matrix
`\vecb\in\R^n` is a column vector
`\matA_i` is row `i` of `\matA`
`\nnz{\matA}` is the Number of NonZero entries of `\matA`

`(\matA_i, b_i)` for `i \in [n]` is an input set of points



(General) regression Given function `M:\R\rightarrow\R`,
find $$\min_{\vecx\in\R^d} \sum_i M(\matA_i \vecx - b_i)$$ Generically, an `M`-estimator.
`\ell_p` regression
When:
  • `M(z)=z^2`, least-squares regression [G95][L05]
  • `M(z)=|z|^p` for some `p\ge 1`, `\ell_p` regression
    `\qquad` e.g., `p=1`, a.k.a. least absolute deviations

Near-linear algorithms (old set of examples)

  • Rank: `\tO(\nnz{\matA})` [CKL12] (cf. rank condensers [GR08])
    • `\rank(\matA)\equiv \dim \colspan(\matA)`
  • Regression
    • `\tO(nd) + \tO(d^3)\poly(1/\eps)` [S06][DMMS10]
    • `\tO(\nnz{\matA}) + \tO(d^3)\poly(1/\eps)` [CW13][NN13][MM13]
    • `\tO(\nnz{\matA})\log(1/\eps) + \tO(d^3)\poly(\log(1/\eps))` [AMT10]++, [MMS12]++
    • `\tO(\nnz{\matA} + d^3\poly(1/\eps))` [LMP13]
  • Low-rank approximation
    • `\tO(\nnz{\matA}) + O(n+d)\poly(k/\eps) + \poly(k/\eps)` [CW13][NN13][MM13]
  • Leading eigenvalues
    • `\tO(\nnz{\matA}) + \poly(k/\eps)` [AN13]++
  • Canonical correlation analysis
    • `\tO(\nnz{\matA}+\nnz{\matB}) + \tO(d^3)\poly(1/\eps)` [ABTZ13]++

Excluded: results that are near-linear only for constant `\eps`, that use fast matrix multiplication, or that give non-relative error
No dependence on condition number, incoherence, or other `\matA` numerical properties
May require large `\nnz{\matA}\gg d` and/or not-too-small `\eps` to be "near-linear"

Least-squares is not very robust

  • Green point is a single outlier
  • Red line is `\ell_2` (least squares) fit
  • Green line is `\ell_1` fit

Other `M`-estimators

For given `\tau`,
`M(z) = \begin{cases} z^2/2\tau & |z| \le \tau \\ |z| - \tau/2& \text{otherwise} \end{cases}`
  • [H64]
  • Robust, yet smooth: better statistical efficiency



For given `\tau`,
`M(z) = \begin{cases} \frac{\tau^2}{6} (1-[1-\left(\frac{z}{\tau}\right)^2]^3) & |z| \leq \tau \\ \frac{\tau^2}{6} & \text{otherwise}.\end{cases}`
  • Even more robust
  • `M(z)=\Theta(z^2)` for small `z`
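For reference, a minimal NumPy sketch of these two losses, using the scalings shown above (the function names are ours):

```python
import numpy as np

def huber(z, tau):
    """Huber loss: quadratic for |z| <= tau, linear beyond."""
    z = np.abs(z)
    return np.where(z <= tau, z**2 / (2 * tau), z - tau / 2)

def tukey_bisquare(z, tau):
    """Tukey bisquare loss: ~z^2 near 0, flat at tau^2/6 for |z| >= tau."""
    z = np.abs(z)
    return np.where(z <= tau,
                    (tau**2 / 6) * (1 - (1 - (z / tau)**2)**3),
                    tau**2 / 6)

z = np.linspace(-3, 3, 7)
print(huber(z, 1.0))
print(tukey_bisquare(z, 1.0))
```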

`M`-norms and `\eps`-approximations

residual, `\norm{\cdot}_M` We will call `\vecr = \matA\vecx - \vecb` the residual for given `\vecx`
Let `\norm{\vecr}_M \equiv \left[\sum_i M(\vecr_i)\right]^{1/2}`
  • e.g., for least squares, the Euclidean norm of `\vecr`
  • The square root gives subadditivity for `M(\cdot)` of interest
  • Not quite a norm on `\vecr` : no scale invariance

`\eps`-approximation, `\tvecx`, `\vecx^*` For error parameter `\eps`, want a small relative-error solution:
`\tvecx` such that $$\norm{\matA\tvecx - \vecb}_M \le (1+\epsilon) \min_{\vecx\in\R^d} \norm{\matA\vecx - \vecb}_M = (1+\epsilon)\norm{\matA\vecx^* - \vecb}_M$$ Here `\vecx^*\equiv \argmin_{\vecx\in\reals^d} \norm{\matA\vecx - \vecb}_M`

For example, Tukey regression, with `M(z)` the bisquare function
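A small sketch of `\norm{\cdot}_M` in code, assuming the bisquare above (repeated so the block stands alone); the random `\matA,\vecb` are only there to exercise the function:

```python
import numpy as np

def tukey_bisquare(z, tau):
    z = np.abs(z)
    return np.where(z <= tau,
                    (tau**2 / 6) * (1 - (1 - (z / tau)**2)**3),
                    tau**2 / 6)

def norm_M(r, tau=1.0, M=tukey_bisquare):
    """||r||_M = (sum_i M(r_i))^{1/2}."""
    return np.sqrt(np.sum(M(r, tau)))

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)
x = np.zeros(5)
print(norm_M(A @ x - b))   # M-cost of the residual for this x
```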

Rough outline

  • Techniques
    • Sketching and solving
    • Embeddings
    • Sparse embeddings
    • Embeddings from leverage score sampling
  • Tukey regression

Sketch and Solve

Generic scheme using sketching:

Generate
`\quad` sketching matrix `\matS\in\R^{m\times n}`,
`\quad` weight vector `\vecw`
return `\tvecx := \argmin_{\vecx\in\R^d} \norm{\matS(\matA\vecx-\vecb)}_{M,\vecw}`
Here `\norm{\vecr}_{M,\vecw} \equiv \left[\sum_i w_i M(\vecr_i)\right]^{1/2}`, for vector `\vecw`

TBD: how to solve the sketched version.
That is, this is (only) a dimensionality reduction,
but it can yield an algorithm



Figure is not to scale: typically `m = \poly(d/\eps)`, of interest when `n\gg d`
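A hedged instance of the scheme for the least-squares case (where the sketched problem can be solved directly); the dense Gaussian `\matS` and unit weights are illustrative choices, not the sketches used later:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10_000, 20, 400            # n >> d, sketch dimension m = poly(d/eps)
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Sketching matrix: a Gaussian matrix, chosen for simplicity
S = rng.standard_normal((m, n)) / np.sqrt(m)

# Solve the sketched problem: argmin_x ||S(Ax - b)||
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

# Relative cost in the *original* space; should be close to 1
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))
```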

Sketching Matrices:
Oblivious Subspace Embeddings (Euclidean)

embeddings, `\ell_2` Matrix `\matS\in\R^{m\times n}` is an `\eps`-embedding of set `P\subset\R^n` if $$\norm{\matS\vecy }= (1\pm\eps)\norm{\vecy }\text{ for all }\vecy\in P.$$ `\matS` will be called a "sketching matrix".
For `\matA\in\R^{n\times d}`,
a matrix `\matS ` is a subspace `\eps`-embedding for `\matA ` if `\matS ` is an `\eps`-embedding for `\colspan(\matA )= \{\matA\vecx\mid\vecx\in\R^d\}`. That is, $$\norm{\matS\matA\vecx }= (1\pm\eps)\norm{\matA\vecx }\text{ for all }\vecx\in\R^d.$$ `\matS\matA` will be called a "sketch".

An oblivious subspace embedding is:
  • A probability distribution `\cal D` over matrices `\matS\in\R^{m\times n}`, so that
  • For any unknown but fixed matrix `\matA`,
    `\matS ` is a subspace `\eps`-embedding for `\matA `, with high-enough probability.

Subspace embeddings for least-squares regression

Suppose `\matS` is a subspace `\eps`-embedding for `\colspan([\matA\ \vecb])`.
Letting $$\begin{align*} \tvecx\equiv & \argmin_{\vecx\in\R^d}\norm{\matS (\matA\vecx -\vecb)} \\ \vecx^*\equiv & \argmin_{\vecx\in\R^d}\norm{\matA\vecx -\vecb} \end{align*}$$ when `\eps\le 1/3` $$ \norm{\matA\tvecx-\vecb} \le (1+3\eps)\norm{\matA\vecx^*-\vecb} $$
For `\vecy=\left[\begin{smallmatrix}\vecx\\ -1\end{smallmatrix}\right]`, `\vecx\in\R^d`, $$\norm{\matS (\matA\vecx -\vecb)} = \norm{\matS[\matA\ \vecb]\vecy} = (1\pm\eps)\norm{[\matA\ \vecb]\vecy} = (1\pm\eps)\norm{\matA\vecx -\vecb}$$ So
$$ \begin{align*} \norm{\matA\tvecx-\vecb} & \le\frac{\norm{\matS (\matA\tvecx-\vecb)}}{1-\eps} \le\frac{\norm{\matS (\matA\vecx^*-\vecb)}} {1-\eps} \le\norm{\matA\vecx^*-\vecb}\frac{1+\eps}{1-\eps}, \end{align*}$$
and so for `\eps\le 1/3`, `\norm{\matA\tvecx-\vecb}\le\norm{\matA\vecx^*-\vecb}(1+3\eps)`.
Solving least-squares in "sketch space"
yields a good solution in the original space.

Embeddings, from points to nets to subspaces

Why should there be subspace embeddings?

  • Show `\eps`-embedding for `P=\{\vecy\}` with very high probability
  • `\implies` embedding for `\eps`-net of unit sphere of subspace
    • Union bound
  • `\implies` embedding for subspace

Good sketching matrices

Criteria, for given `\eps`:
  • How fast can `\matS` be computed?
  • How fast can `\matS\matA` be computed?
  • What is the sketching dimension `m`, `\matS\in\R^{m\times n}`?


  • time `\tO(nd)`, `m = \tO(d/\eps^2)`:
    `\quad` Fast JL, Subsampled Randomized Hadamard [S06][DMMS10]
  • `\tO(\nnz{\matA})`, `\tO(d^2/\eps^2)`:
    `\quad` Countsketch, a.k.a. sparse embeddings [CW13][NN13][MM13]
    `\quad` cf. "feature hashing", maybe "random indexing"
  • `\tO(\nnz{\matA}) +\tO(d^3)`, `\tO(d/\eps^2)`:
    `\quad` Two-stage
  • `\tO(\nnz{\matA}f(\kappa))`, `\tO(d^{1+\kappa}/\eps^2)`:
    `\quad` OSNAPs [NN13]
  • `O(\nnz{\matA}\log(d)/\eps)`, `O(d\log(d)/\eps^2)`:
    `\quad` [Coh15]
The bottom line: can quickly remove dependence on `n`

Sparse Embeddings

  • Adaptation of CountSketch from streaming algorithms
  • `\bS ` looks like:
    $\qquad\qquad\left[\begin{matrix} 0 & -1 & 0 & 0 & +1 & 0 \\ +1 & 0 & 0 & +1 & 0 & 0 \\ 0 & 0 & +1 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \right]$
  • One random `\pm 1` per column
  • Row `\bA_{i}` of `\bA ` contributes `\pm\bA_{i}` to one row of `\bS\bA `
  • `m= \tO(d^2/\eps^2)`
  • Sometimes have `s>1` entries per column
    • Better quality, slower to apply
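A minimal CountSketch in NumPy, with `s=1`: one hashed bucket and one random sign per row of `\bA`, applied without ever materializing `\bS` (with sparse input, looping over nonzeros gives the `O(\nnz{\matA})` time):

```python
import numpy as np

def countsketch(A, m, rng):
    """Compute S @ A for a CountSketch S in R^{m x n}, without forming S."""
    n = A.shape[0]
    bucket = rng.integers(0, m, size=n)       # which row of S is nonzero in column i
    sign = rng.choice([-1.0, 1.0], size=n)    # the random +-1 in that position
    SA = np.zeros((m, A.shape[1]))
    np.add.at(SA, bucket, sign[:, None] * A)  # row A_i contributes +-A_i to one row of SA
    return SA

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 10))
SA = countsketch(A, m=400, rng=rng)
x = rng.standard_normal(10)
print(np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))   # roughly 1
```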

Sparse Embeddings, More

  • For `\by = \bA\bx`, each row of sparse embedding `\bS `:
    • Collects a subset of entries of `\by ` (into a "bucket")
    • Applies random signs
    • Adds
  • Result is `\E[\norm{\bS\by}^2] = \norm{\by }_2^2`:
    • Each bucket squared `\approx` squared norm of its entries
    • Splitting into buckets separates big entries
      • Smaller variance
    • Even with `n` buckets, don't always get `\norm{\by }_2^2` exactly
      • (Entry `\rightarrow` Bucket is not one-to-one)
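A quick Monte Carlo check of the unbiasedness claim `\E[\norm{\bS\by}^2] = \norm{\by }_2^2`; the bucket count and number of trials are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, trials = 1000, 50, 2000
y = rng.standard_normal(n)

estimates = []
for _ in range(trials):
    bucket = rng.integers(0, m, size=n)
    sign = rng.choice([-1.0, 1.0], size=n)
    Sy = np.zeros(m)
    np.add.at(Sy, bucket, sign * y)       # each bucket holds a signed sum of its entries
    estimates.append(np.sum(Sy**2))

print(np.mean(estimates), np.sum(y**2))   # close on average; single trials fluctuate
```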

Sampling rows of a matrix

A row sampling matrix
has rows that are multiples of the natural basis vectors `\be_i^\top`

Here, sampling `\bA\in\R^{n\times d}`

If `\bS\in\R^{m\times n}` is a row sampling matrix,
each row of `\bS\bA` is a multiple of a row of `\bA`:

$\small\qquad\left[\begin{matrix} 0 & S_{12} & 0 & 0 & 0 & 0 \\ S_{21} & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & S_{33} & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & S_{46} \end{matrix} \right] \left[\begin{matrix} \bA_{1*}\\ \bA_{2*}\\ \vdots\\ \bA_{n*} \end{matrix} \right] = \left[\begin{matrix} S_{12}\bA_{2*}\\ S_{21}\bA_{1*}\\ S_{33}\bA_{3*}\\ S_{46}\bA_{6*} \end{matrix}\right]$

NB: $\matS$ requires $O(m)$ space

Are there sampling matrix distributions on `\bS`,
so that w.h.p. `\norm{\bS\bA\bx} \approx \norm{\bA \bx}` for all `\bx\in\R^d`?

So `\bS` is an embedding of the subspace $$\colspan(\bA)\equiv\{\bA\bx \mid \bx\in\R^d\}$$ into `\R^m`

Weighted sampling

A general construction:
For some `\bp\in [0,1]^n`,
independently for each `i\in [n]`,
with probability `p_i`, include `\be_i^\top/\sqrt{p_i}` as a row of `\bS`

For given `\bx`, `\E[\norm{\bS\bA\bx}^2] = \norm{\bA\bx}^2`.
$$\E[\norm{\bS\bA\bx}^2] = \sum_i p_i (\frac{1}{\sqrt{p_i}}\be_i^\top\bA\bx)^2 = \sum_i (\bA_{i}\bx)^2 = \norm{\bA\bx}^2.$$

Want to choose `\bp` so that:

  1. `\E[m] = \sum_i p_i`, is not too big
    `\quad` (`m\equiv` number of rows of `\bS`)
  2. `\bS` is a subspace embedding
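A hedged sketch of this construction (Bernoulli row sampling with `1/\sqrt{p_i}` rescaling); the uniform `\bp` below is only a placeholder:

```python
import numpy as np

def row_sample(A, p, rng):
    """Keep row i with prob. p[i], rescaled by 1/sqrt(p[i]); returns S @ A without forming S."""
    keep = rng.random(A.shape[0]) < p
    return A[keep] / np.sqrt(p[keep])[:, None]

rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 8))
p = np.full(A.shape[0], 0.05)            # uniform probabilities, only as a placeholder
SA = row_sample(A, p, rng)
x = rng.standard_normal(8)
print(np.linalg.norm(SA @ x)**2, np.linalg.norm(A @ x)**2)   # equal in expectation
```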

Variance reduction

Uniform sampling: `p_i\propto\frac{1}{n}` for `i\in [n]` might miss big rows

Length-squared (row) sampling: `p_i\propto \norm{\bA_{i}}^2` is better,
but for some $\bx$ maybe `\bA_{i}\bx` is relatively large even if `\norm{\bA_{i}}` is small `\qquad` e.g. `\bA_{*,1} = \eps \be^{(n)}_1, \bx = \be^{(d)}_1`


That is (under either scheme), `\bA_{i}\bx` is large while the other entries of $\bA\bx$ are small
`\implies` large variance, maybe wrong rank


Want `(\frac{1}{\sqrt{p_i}}\bA_{i}\bx)^2` to be "well-behaved" for all `i` and `\bx`

Leverage scores

Suppose `\bA = \bU\bR`, where `\bU^\top\bU = \Iden`.
Then `\sup_{\bx} \frac{(\bA_{i}\bx)^2 }{ \norm{\bA\bx}^2} \le \norm{\bU_{i}}^2`
\begin{align*} \sup_{\bx} \frac{(\bA_{i}\bx)^2 }{ \norm{\bA\bx}^2} = \sup_{\bx} \frac{(\bU_{i}\bR\bx)^2 }{ \norm{\bU\bR\bx}^2 } \le \sup_{\by} \frac{(\bU_{i}\by)^2 }{ \norm{\bU\by}^2 } = \sup_{\by} \frac{(\bU_{i}\by)^2 }{ \norm{\by}^2 } = \norm{\bU_{i}}^2 \end{align*}

leverage scores `\ell_i` Given `\bA `, with `\bU ` an orthonormal basis for `\colspan(\bA )`, the `i`'th leverage score of `\bA ` is `\ell_i \equiv \norm{\bU_{i}}^2 = \norm{\be_i^\top\bU }^2`.

For the `p_i=\norm{\bU_{i}}^2`,
$$\qquad \sum_i p_i = \sum_i \norm{\bU_{i}}^2 = \sum_{i,j} U_{ij}^2 = \sum_j\norm{\bU_{*j}}^2 = d$$ and all summands in
$$\frac{\norm{\bS\bA\bx}^2}{\norm{\bA\bx}^2} = \sum_i \delta_i \frac{1}{p_i}\frac{(\bA_{i}\bx)^2}{\norm{\bA\bx}^2}$$ are at most one
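Exact leverage scores, computed directly from an orthonormal basis (here via a thin QR, assuming `\bA` has full column rank), with `\sum_i \ell_i = d` as a check:

```python
import numpy as np

def leverage_scores(A):
    """l_i = ||U_i||^2 for U an orthonormal basis of colspan(A)."""
    Q, _ = np.linalg.qr(A)                # thin QR; assumes full column rank
    return np.sum(Q**2, axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
ell = leverage_scores(A)
print(ell.sum())                          # = d = 10, up to roundoff
```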

Computing leverage scores

Roughly: use a sketch to get an `\approx`-orthonormal basis, then use its row norms

Given

  • Input matrix `\bA `
  • Oblivious subspace `\eps`-embedding matrix `\bS\in\R^{m\times n}` for `\bA `
  • embedding (JL) matrix `\mathbf{\Pi}\in\R^{d\times m'}`,
    so `\norm{\bx\mathbf{\Pi}}= (1\pm\eps)\norm{\bx }` for `n` row vectors `\bx `
    • `m'=O(\eps^{-2}\log n)`

The algorithm:

`\bW = \bS *\bA `;// compute sketch
`[\bQ,\bR ] = \mathtt{qr}(\bW )`; // compute change-of-basis for `\bS\bA `
`\bZ = \bA *(\bR^{-1}*\mathbf{\Pi})`;// compute sketch of `\bA\bR^{-1}`
return `\mathtt{dot}(\bZ,\bZ,2)`; // return vector of squared row norms
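A hedged NumPy rendering of this algorithm; here a dense Gaussian matrix stands in for both `\bS` and `\mathbf{\Pi}` (the CountSketch from earlier would make the first step `O(\nnz{\matA})`):

```python
import numpy as np

def approx_leverage_scores(A, m, m_prime, rng):
    """Approximate row leverage scores of A, following the sketch above."""
    n, d = A.shape
    S = rng.standard_normal((m, n)) / np.sqrt(m)                # subspace embedding (illustrative)
    Pi = rng.standard_normal((d, m_prime)) / np.sqrt(m_prime)   # JL matrix
    W = S @ A                              # compute sketch
    _, R = np.linalg.qr(W)                 # change of basis for S A
    Z = A @ np.linalg.solve(R, Pi)         # sketch of A R^{-1}
    return np.sum(Z**2, axis=1)            # squared row norms

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 10))
print(approx_leverage_scores(A, m=200, m_prime=50, rng=rng)[:5])
```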

Tukey regression (back to)

Our results [CWW19] generalize beyond Tukey;
but we do need some conditions

Restrictions on `M`
  • even: `M(z) = M(-z)`
  • monotone nondecreasing in `|z|`
  • for `|a|\ge |b|`, $$\frac{M(a)}{M(b)} \le \frac{a^2}{b^2}$$
  • There are $L_M, U_M$ so that for `|a| \le \tau`, $$L_M \le M(a)/a^2 \le U_M$$
  • Mostly flat: `M(a) = \tau^2` for `|a|\ge\tau`

(Need not be convex or smooth; can replace `2` by `p\ge 1`)

Restrictions on input and output
  • Relevant solutions have norm at most `n^{O(d^2)}`
    • Including constant-factor approx. solutions
  • Columns of $\matA$ and $\vecb$ have norms in `n^{O(d)}`
  • `\tau=\Omega(1/n^{O(d)})`
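As a numerical spot-check (not a proof), the bisquare with `\tau=1` satisfies the two ratio conditions above on a grid; the bounds `L_M, U_M` come out near `1/6` and `1/2`:

```python
import numpy as np

tau = 1.0
M = lambda z: np.where(np.abs(z) <= tau,
                       (tau**2 / 6) * (1 - (1 - (z / tau)**2)**3),
                       tau**2 / 6)

a = np.linspace(1e-3, 3, 500)
ratio = M(a) / a**2
print(ratio[a <= tau].min(), ratio[a <= tau].max())      # ~1/6 and ~1/2 on (0, tau]

# M(a)/M(b) <= a^2/b^2 whenever |a| >= |b|, checked pairwise on the grid
b = a[:, None]
mask = a[None, :] >= b
print(np.all((M(a)[None, :] / M(b))[mask] <= (a[None, :]**2 / b**2)[mask] + 1e-12))
```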

Prior results

The goal: quickly remove dependence on `n`
Prior work: `M`-estimators When `M(\cdot)` is nice,
sketch-and-solve allows dimensionality reduction:
  • Via sampling for:
    `\quad` Huber [CW15a]
    `\quad` All nice [CW15b]
  • Via oblivious `M`-sketches, fixed large-enough `\eps` [CW15a]

("Nice" a bit stronger than above; will define "`M`-sketch")


These yield (possibly after recursion) algorithms that take `\qquad O(\nnz{\matA}\log n) + \poly(\eps^{-1} d\log n)` time,
when the sketched problem is tractable

Did not allow `M(z)` to be flat:
must have `M(a)\ge C_M |a|` for some `C_M > 0`,
running time proportional to `1/C_M`

New result

Unresolved prior question:
Is the linear growth lower bound (`C_M \gt 0`) always needed?

Here: for Tukey, and above I/O assumptions, the answer is no

Dim. reduction for Tukey Under the assumptions, for the Tukey `M`-estimator, there is a distribution on sketching matrices
`\matS\in\R^{m\times n}` with `m=\poly(d\log n)`, and a constant `C`, such that the sketch-and-solve solution `\tvecx` satisfies
`\quad \norm{\matA\tvecx-\vecb}_M\le C\norm{\matA\vecx^* - \vecb}_M`.
Here `\matS\matA` takes `O(\nnz{\matA})` time to compute.

And there is a recursive reduction, based on sampling, giving an `\eps`-approximation.

This matters because:
  • Tukey is more robust, more nonconvex
  • Suggests simpler, more general conditions are possible

I/O restrictions and `\eps`-nets

Our analysis follows (roughly) that standard outline

  • For a small-enough `\eps`-net `\cN\subset \{\matA\vecx-\vecb\mid \vecx\in\R^d\}`,
    show bounds on
    • Contraction: `\norm{\matS\vecy}_{M,\vecw}\ge \norm{\vecy}_M` for all `\vecy\in \cN`
      `\qquad` via analysis of a single vector `\vecy`, then a union bound
    • Dilation: `\norm{\matS(\matA\vecx^*-\vecb)}_{M,\vecw} \le C\norm{\matA\vecx^*-\vecb}_M`.
  • Extend contraction from `\cN` to all relevant `\vecy`

We need to bound the net size `|\cN|`

Before: used linear lower bound on growth: still can pack and cover

Here: use I/O restrictions

To be discussed:
What is `\matS`?
Why is `\norm{\matS\vecy}_{M,\vecw}\approx \norm{\vecy}_M`?

`M`-sketches (for nice `M`): `L_0` sampling + Countsketch

`\matS \gets \left[ \begin{matrix}\matS_0\matT_0 \\ \matS_1\matT_1 \\\vdots\\ \matS_{\log n}\matT_{\log n}\end{matrix}\right]`
  • Combine sampling and sparse embedding
    • [VZ12] earthmover, [CW15a] `M`-estimators
  • Each `\matS_h` is a sparse embedding (Countsketch) matrix
  • Each `\matT_h` uniformly samples about `\alpha^h n` entries,
    uses weights `w_i\propto 1/\alpha^h`
    • For parameter `\alpha`
  • Result: `\norm{\matS\vecy}_{M,\vecw} \approx \norm{\vecy}_M`


Roughly:
`\matT_h\vecy` sparse enough `\implies` its entries are isolated in `\matS_h\matT_h\vecy`
`\quad` one entry per bucket `\implies \norm{\matS_h\matT_h\vecy}_M\approx \alpha^h\norm{\vecy}_M`
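A hedged construction of the stacked sketch: level `h` subsamples rows at rate `\alpha^h`, CountSketches the survivors, and records weights `1/\alpha^h`; the level count, `\alpha`, and bucket counts below are illustrative, not the tuned parameters of [CW15a]/[CWW19]:

```python
import numpy as np

def m_sketch(A, m_per_level, alpha, levels, rng):
    """Stack S_h T_h A for h = 0..levels, with weight 1/alpha^h for rows at level h."""
    n, d = A.shape
    blocks, weights = [], []
    for h in range(levels + 1):
        keep = rng.random(n) < alpha**h                 # T_h: uniform subsample at rate alpha^h
        Th_A = A[keep]
        bucket = rng.integers(0, m_per_level, size=Th_A.shape[0])
        sign = rng.choice([-1.0, 1.0], size=Th_A.shape[0])
        block = np.zeros((m_per_level, d))
        np.add.at(block, bucket, sign[:, None] * Th_A)  # S_h: CountSketch of the sample
        blocks.append(block)
        weights.append(np.full(m_per_level, 1.0 / alpha**h))
    return np.vstack(blocks), np.concatenate(weights)

rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 5))
SA, w = m_sketch(A, m_per_level=100, alpha=0.5, levels=5, rng=rng)
print(SA.shape, w.shape)
```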

Structural result

Analyses both for sketching and for sampling
use a structural condition on `\colspan([\matA\ \vecb])`
Roughly the following:

Distribution of outliers Suppose `\tau=1`. Given `\alpha > 0`,
there is a set `{\cal I}` of indices with `|{\cal I}|=O(d\alpha)` such that:
if `\vecr=\matA\vecx - \vecb` has `\norm{\vecr}_M^2\le \alpha`,
then all outliers of `\vecr` (`r_i\ge 1`) are in `{\cal I}`.
The leverage scores upper-bound the entries of `\vecr` relative to `\norm{\vecr}`, including outliers;
not too many leverage scores can be large;
put `i` with large leverage score into `\cal I`, repeat with what's left

This allows analysis for "big" entries to be restricted to these `\cal I`

Structural result, 2/2

Large `r_i\implies` large `\ell_i` For `\vecr` with `\norm{\vecr}^2_M\le \alpha`, largest `r_i \ge 1` has `\ell_i\ge 1/\alpha`
  • At most `\alpha` heavy coords (`r_i\ge 1`)
  • For `r_{i^*}` heaviest, `\norm{\vecr}^2\le \alpha + \alpha r_{i^*}^2`
    • Light coords contribute to `\norm{\vecr}^2` about what they contribute to `\norm{\vecr}^2_M`, so at most `\alpha` in total
    • Each heavy coord contributes at most `r_{i^*}^2`, by choice of `i^*`, and there are at most `\alpha` of them
  • So leverage score `\ell_{i^*} \ge r_{i^*}^2/\norm{\vecr}^2 \ge r_{i^*}^2/(\alpha + \alpha r_{i^*}^2)\ge 1/\alpha`

Construction is:

  • Put all `i` with `\ell_i\ge 1/\alpha` into `\cal I`, repeat
  • `\cal I` has all heavy coords after `\alpha` iterations
  • Add at most `d\alpha` each time, since `\sum_i \ell_i \le d`


Tighter result with faster algorithm:

Split into `\alpha` random subsets
Use per-subset leverage scores
Put all `i` with `\ell_i` large into `\cal I`
Repeat about `\log(d\alpha)` times
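A hedged sketch of the basic construction (the first variant above): repeatedly compute leverage scores of the remaining rows, move every index with `\ell_i\ge 1/\alpha` into `\cal I`, and recurse on what is left; the iteration cap and the toy input are ours:

```python
import numpy as np

def outlier_index_set(A, alpha, max_iters=None):
    """Iteratively peel off rows whose leverage score is at least 1/alpha."""
    remaining = np.arange(A.shape[0])
    I = []
    iters = int(np.ceil(alpha)) if max_iters is None else max_iters
    for _ in range(iters):
        Q, _ = np.linalg.qr(A[remaining])          # leverage scores of the remaining rows
        ell = np.sum(Q**2, axis=1)
        big = ell >= 1.0 / alpha
        if not big.any():
            break
        I.extend(remaining[big].tolist())
        remaining = remaining[~big]
    return np.array(I, dtype=int)

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 6))
A[:3] *= 100.0                                      # plant a few high-leverage rows
print(outlier_index_set(A, alpha=4.0))              # should contain indices 0, 1, 2
```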

Experiments

  • All inputs `n=10000`
  • Used LinvPY both for the full dataset and for the sampled, weighted version
  • "with outliers": `5\%` of responses huge
Num  Dataset                                        `d`    `\tau`
1    Random Gaussian                                 20      10
2    Random Gaussian (with outliers)                 20      10
3    Facebook Comment Volume                         53      10
4    Facebook Comment Volume (with outliers)         53     100
5    Appliances energy prediction                    25    1000
6    Appliances energy prediction (with outliers)    25     100
7    CT Slice Localization                          384     100
8    CT Slice Localization (with outliers)          384    1000

Conclusion

  • Also (as mentioned) hardness results
  • Other applications of structural result?
  • Thank you for your attention!