\(
\newcommand\norm[1]{|| #1 ||}
\newcommand\nnz[1]{{\textrm{nnz}(#1)}}
\newcommand\E{\mathbf{E}}
\newcommand\Dist{\mathbf{Dist}}
\newcommand\reals{{\mathrm{I\!R}}}
\newcommand\R{{\mathrm{I\!R}}}
\newcommand\poly{{\textrm{poly}}}
\newcommand\rank{{\textrm{rank}}}
\newcommand\sr{{\textrm{sr}}}
\newcommand\colspan{{\textrm{im}}}
\newcommand\diam{{\textrm{diam}}}
\newcommand\polylog{{\textrm{polylog}}}
\newcommand\rowspan{{\textrm{rowspan}}}
\newcommand\colbasis{{\textrm{colbasis}}}
\newcommand\argmin{{\mathrm{argmin}}}
\newcommand\Err{{\mathtt{NE}}}
\newcommand\eps{{\epsilon}}
\newcommand\bfone{{\mathbf{1}}}
\newcommand\Var{{\mathbf{Var}}}
\newcommand\stdev[1]{{\mathbf{stdev(#1)}}}
\newcommand\cgf{{\mathbf{CGF}}}
\newcommand\Iden{{\mathbf{I}}}
\newcommand\Prob{{\mathbf{Pr}}}
\DeclareMathOperator\subg{\mathbf{subG}}
\DeclareMathOperator\dist{\mathrm{dist}}
\DeclareMathOperator\cost{\mathrm{cost}}
\newcommand\cN{{\cal N}}
\newcommand\cS{{\cal S}}
\newcommand\cD{{\cal D}}
\newcommand\cM{{\cal M}}
\newcommand\cE{{\cal E}}
\newcommand\cu{{\cal U}}
\newcommand\cC{{\cal C}}
\newcommand\KJL{K_{\mathbb{JL}}}
\newcommand\tO{{\tilde O}}
\newcommand\tL{{\tilde L}}
\newcommand\maybemathbf[1]{\mathbf{#1}}
\newcommand\matA{\maybemathbf{A}}
\newcommand\matB{\maybemathbf{B}}
\newcommand\matC{\maybemathbf{C}}
\newcommand\matD{\maybemathbf{D}}
\newcommand\matSigma{\maybemathbf{\Sigma}}
\newcommand\matU{\maybemathbf{U}}
\newcommand\matV{\maybemathbf{V}}
\newcommand\matE{\maybemathbf{E}}
\newcommand\matzero{\maybemathbf{0}}
\newcommand\matW{\maybemathbf{W}}
\newcommand\matZ{\maybemathbf{Z}}
\newcommand\matY{\maybemathbf{Y}}
\newcommand\matX{\maybemathbf{X}}
\newcommand\matR{\maybemathbf{R}}
\newcommand\matS{\maybemathbf{S}}
\newcommand\matT{\maybemathbf{T}}
\newcommand\matN{\maybemathbf{N}}
\newcommand\matP{\maybemathbf{P}}
\newcommand\vecx{\maybemathbf{x}}
\newcommand\vecb{\maybemathbf{b}}
\newcommand\hvecb{\hat{\maybemathbf{b}}}
\newcommand\vecy{\maybemathbf{y}}
\newcommand\vecr{\maybemathbf{r}}
\newcommand\vecw{\maybemathbf{w}}
\newcommand\tmatW{\tilde{\maybemathbf{W}}}
\newcommand\tmatU{\tilde{\maybemathbf{U}}}
\newcommand\tmatV{\tilde{\maybemathbf{V}}}
\newcommand\tmatZ{\tilde{\maybemathbf{Z}}}
\newcommand\hmatZ{\hat{\maybemathbf{Z}}}
\newcommand\tmatY{\tilde{\maybemathbf{Y}}}
\newcommand\tmatX{\tilde{\maybemathbf{X}}}
\newcommand\tvecx{\tilde{\maybemathbf{x}}}
\newcommand\hmatS{{ \maybemathbf{\hat S}}}
\newcommand\hmatR{{ \maybemathbf{\hat R}}}
\newcommand\hmatA{{ \maybemathbf{\hat A}}}
\newcommand\twomat[2]{\left[\begin{smallmatrix} #1 \\ #2 \end{smallmatrix} \right] }
\newcommand\twomatr[2]{\left[\begin{smallmatrix} #1 & #2 \end{smallmatrix} \right] }
\newcommand\tx{{\tilde{\mathbf{x}}}}
\newcommand\tX{{\tilde{\mathbf{X}}}}
\newcommand\tY{{\tilde{\mathbf{Y}}}}
\newcommand\tA{{\tilde{\mathbf{A}}}}
\newcommand\tW{{\tilde{\mathbf{W}}}}
\newcommand\tZ{{\tilde{\mathbf{Z}}}}
\newcommand\hS{{\hat {\mathbf{S}}}}
\newcommand\hR{{\hat {\mathbf{R}}}}
\newcommand\hA{{\hat {\mathbf{A}}}}
\newcommand\hY{\hat {\mathbf{Y}}}
\newcommand\bA{{\mathbf{A}}}
\newcommand\bN{{\mathbf{N}}}
\newcommand\ba{{\mathbf{a}}}
\newcommand\bw{{\mathbf{w}}}
\newcommand\W{{\mathbf{W}}}
\newcommand\bB{{\mathbf{B}}}
\newcommand\bQ{{\mathbf{Q}}}
\newcommand\bZ{{\mathbf{Z}}}
\newcommand\bz{{\mathbf{z}}}
\newcommand\bU{{\mathbf{U}}}
\newcommand\bV{{\mathbf{V}}}
\newcommand\bX{{\mathbf{X}}}
\newcommand\bS{{\mathbf{S}}}
\newcommand\bs{{\mathbf{s}}}
\newcommand\bx{{\mathbf{x}}}
\newcommand\by{{\mathbf{y}}}
\newcommand\bb{{\mathbf{b}}}
\newcommand\bT{{\mathbf{T}}}
\newcommand\bP{{\mathbf{P}}}
\newcommand\bG{{\mathbf{G}}}
\newcommand\bY{{\mathbf{Y}}}
\newcommand\bH{{\mathbf{H}}}
\newcommand\bD{{\mathbf{D}}}
\newcommand\be{{\mathbf{e}}}
\newcommand\br{{\mathbf{r}}}
\newcommand\Sig{{\mathbf{\Sigma}}}
\newcommand\bLam{{\mathbf{\Lambda}}}
\newcommand\bPi{{\mathbf{\Pi}}}
\newcommand\hphi{{\hat\phi}}
\newcommand\OPT{{\mathtt{OPT}}}
\newcommand\sd{{\mathtt{sd}}}
\newcommand\words{{\mathtt{words}}}
\newcommand\bp{{\mathbf{p}}}
\newcommand\bR{{\mathbf{R}}}
\newcommand\bW{{\mathbf{W}}}
\)
Dimensionality Reduction for Tukey Regression
|
Ken Clarkson, IBM Research
Ruosong Wang, CMU
David Woodruff, CMU
|
Regression
`\matA,\vecb,n,d,\matA_i`
`\matA` is an `n\times d` matrix
`\vecb\in\R^n` is a column vector
`\matA_i` is row `i` of `\matA`
`\nnz{\matA}` is the Number of NonZero entries of `\matA`
`(\matA_i, b_i)` for `i \in [n]` is an input set of points
(General) regression
Given function `M:\R\rightarrow\R`,
find
$$\min_{\vecx\in\R^d} \sum_i M(\matA_i \vecx - b_i)$$
Generically, an `M`-estimator.
`\ell_p` regression
When:
- `M(z)=z^2`, least-squares regression [G95][L05]
- `M(z)=|z|^p` for some `p\ge 1`, `\ell_p` regression
`\qquad` e.g., `p=1`, a.k.a. least absolute deviations
Near-linear algorithms (old set of examples)
- Rank: `\tO(\nnz{\matA})` [CKL12] (cf. rank condensers [GR08])
- `\rank(\matA)\equiv \dim \colspan(\matA)`
- Regression
- `\tO(nd) + \tO(d^3)\poly(1/\eps)` [S06][DMMS10]
- `\tO(\nnz{\matA}) + \tO(d^3)\poly(1/\eps)` [CW13][NN13][MM13]
- `\tO(\nnz{\matA})\log(1/\eps) + \tO(d^3)\poly(\log(1/\eps))` [AMT10]++, [MMS12]++
- `\tO(\nnz{\matA} + d^3\poly(1/\eps))` [LMP13]
- Low-rank approximation
- `\tO(\nnz{\matA}) + O(n+d)\poly(k/\eps) + \poly(k/\eps)` [CW13][NN13][MM13]
- Leading eigenvalues
- `\tO(\nnz{\matA}) + \poly(k/\eps)` [AN13]++
- Canonical correlation analysis
- `\tO(\nnz{\matA}+\nnz{\matB}) + \tO(d^3)\poly(1/\eps)` [ABTZ13]++
Excluded: results that are near-linear only for constant `\eps`, use fast matrix multiplication, or give non-relative error
No dependence on condition number, incoherence, or other numerical properties of `\matA`
May require large `\nnz{\matA}\gg d` and/or not-too-small `\eps` to be "near-linear"
Least-squares is not very robust
- Green point is a single outlier
- Red line is `\ell_2` (least squares) fit
- Green line is `\ell_1` fit
Other `M`-estimators
For given `\tau`,
`M(z) = \begin{cases}
z^2/2\tau & |z| \le \tau
\\ |z| - \tau/2& \text{otherwise}
\end{cases}`
- Huber's estimator [H64]
- Robust, yet smooth: better statistical efficiency
For given `\tau`,
`M(z) = \begin{cases}
\frac{\tau^2}{6} (1-[1-\left(\frac{z}{\tau}\right)^2]^3) & |z| \leq \tau \\
\frac{\tau^2}{6} & \text{otherwise}.\end{cases}`
- Tukey's bisquare: even more robust
- `M(z)=\Theta(z^2)` for small `z`
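A small NumPy sketch of the two losses on this slide (the scaled Huber and the Tukey bisquare); the function names and vectorized form are illustrative, not from the paper.

```python
import numpy as np

def huber(z, tau):
    """Scaled Huber loss from the slide: z^2 / (2 tau) for |z| <= tau, else |z| - tau/2."""
    z = np.abs(z)
    return np.where(z <= tau, z**2 / (2 * tau), z - tau / 2)

def tukey_bisquare(z, tau):
    """Tukey bisquare: (tau^2/6)(1 - (1 - (z/tau)^2)^3) for |z| <= tau, else tau^2/6."""
    z = np.abs(z)
    return np.where(z <= tau,
                    (tau**2 / 6) * (1 - (1 - (z / tau) ** 2) ** 3),
                    tau**2 / 6)
```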
`M`-norms and `\eps`-approximations
residual, `\norm{\cdot}_M`
We will call `\vecr = \matA\vecx - \vecb` the
residual for given `\vecx`
Let `\norm{\vecr}_M \equiv \left[\sum_i M(\vecr_i)\right]^{1/2}`
- e.g., for least squares, the Euclidean norm of `\vecr`
- The square root gives subadditivity for `M(\cdot)` of interest
- Not quite a norm on `\vecr` : no scale invariance
`\eps`-approximation, `\tvecx`, `\vecx^*`
For error parameter `\eps`, want a small relative-error solution:
`\tvecx` such that
$$\norm{\matA\tvecx - \vecb}_M \le (1+\epsilon) \min_{\vecx\in\R^d} \norm{\matA\vecx - \vecb}_M = (1+\epsilon)\norm{\matA\vecx^* - \vecb}_M$$
Here `\vecx^*\equiv \argmin_{\vecx\in\reals^d} \norm{\matA\vecx - \vecb}_M`
For example, Tukey regression, with `M(z)` the bisquare function
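A minimal sketch of `\norm{\cdot}_M` and the `\eps`-approximation criterion just defined; `M` is passed in as a vectorized loss (e.g., the hypothetical `tukey_bisquare` above).

```python
import numpy as np

def m_norm(r, M):
    """||r||_M = sqrt(sum_i M(r_i)), as defined above."""
    return np.sqrt(np.sum(M(r)))

def is_eps_approx(A, b, x_tilde, x_star, M, eps):
    """Check ||A x_tilde - b||_M <= (1 + eps) ||A x_star - b||_M."""
    return m_norm(A @ x_tilde - b, M) <= (1 + eps) * m_norm(A @ x_star - b, M)
```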
Rough outline
- Techniques
- Sketching and solving
- Embeddings
- Sparse embeddings
- Embeddings from leverage score sampling
- Tukey regression
Sketch and Solve
Generic scheme using sketching:
Generate
`\quad` sketching matrix `\matS\in\R^{m\times n}`,
`\quad` weight vector `\vecw`
return `\tvecx := \argmin_{\vecx\in\R^d} \norm{\matS(\matA\vecx-\vecb)}_{M,\vecw}`
Here `\norm{\vecr}_{M,\vecw} \equiv \left[\sum_i w_i M(\vecr_i)\right]^{1/2}`, for vector `\vecw`
Still TBD: solving the sketched version.
That is, this is (only) a dimensionality reduction;
but it can yield an algorithm
Figure is not to scale: typically `m = \poly(d/\eps)`, of interest when `n\gg d`
Sketching Matrices:
Oblivious Subspace Embeddings (Euclidean)
embeddings, `\ell_2`
Matrix `\matS\in\R^{m\times n}` is an `\eps`-embedding of set `P\subset\R^n`
if
$$\norm{\matS\vecy }= (1\pm\eps)\norm{\vecy }\text{ for all }\vecy\in P.$$
`\matS` will be called a "sketching matrix".
For `\matA\in\R^{n\times d}`,
a matrix `\matS ` is a subspace `\eps`-embedding
for `\matA ` if `\matS ` is an `\eps`-embedding for `\colspan(\matA )= \{\matA\vecx\mid\vecx\in\R^d\}`. That is,
$$\norm{\matS\matA\vecx }= (1\pm\eps)\norm{\matA\vecx }\text{ for all }\vecx\in\R^d.$$
`\matS\matA` will be called a "sketch".
An
oblivious subspace embedding is:
- A probability distribution `\cal D` over matrices `\matS\in\R^{m\times n}`, so that
- For any unknown but fixed matrix `\matA`,
`\matS ` is a subspace `\eps`-embedding for `\matA `, with high-enough probability.
Subspace embeddings for least-squares regression
Suppose `\matS` is a subspace `\eps`-embedding for `\colspan([\matA\ \vecb])`.
Letting
$$\begin{align*}
\tvecx\equiv & \argmin_{\vecx\in\R^d}\norm{\matS (\matA\vecx -\vecb)}
\\ \vecx^*\equiv & \argmin_{\vecx\in\R^d}\norm{\matA\vecx -\vecb}
\end{align*}$$
when `\eps\le 1/3`
$$ \norm{\matA\tvecx-\vecb} \le (1+3\eps)\norm{\matA\vecx^*-\vecb} $$
For `\vecy=\left[\begin{smallmatrix}\vecx\\ -1\end{smallmatrix}\right]`, `\vecx\in\R^d`,
$$\norm{\matS (\matA\vecx -\vecb)}
= \norm{\matS[\matA\ \vecb]\vecy} = (1\pm\eps)\norm{[\matA\ \vecb]\vecy}
= (1\pm\eps)\norm{\matA\vecx -\vecb}$$
So, using the embedding property and the optimality of `\tvecx` for the sketched problem,
$$
\begin{align*}
\norm{\matA\tvecx-\vecb}
& \le\frac{\norm{\matS (\matA\tvecx-\vecb)}}{1-\eps}
\le\frac{\norm{\matS (\matA\vecx^*-\vecb)}} {1-\eps}
\le\norm{\matA\vecx^*-\vecb}\frac{1+\eps}{1-\eps},
\end{align*}$$
and so for `\eps\le 1/3`, `\norm{\matA\tvecx-\vecb}\le\norm{\matA\vecx^*-\vecb}(1+3\eps)`.
Solving least-squares in "sketch space"
yields a good solution in the original space.
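As a quick numerical illustration of the guarantee above, a minimal sketch-and-solve for least squares with a dense Gaussian matrix standing in for the oblivious subspace embedding; the sketch type and sizes are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10_000, 20, 400            # sketch size m = poly(d/eps), independent of n

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

S = rng.standard_normal((m, n)) / np.sqrt(m)            # Gaussian sketch: E[||Sy||^2] = ||y||^2

x_star  = np.linalg.lstsq(A, b, rcond=None)[0]          # solve in the original space
x_tilde = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]  # solve in sketch space

ratio = np.linalg.norm(A @ x_tilde - b) / np.linalg.norm(A @ x_star - b)
print(ratio)   # close to 1, as the (1 + 3 eps) bound predicts
```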
Embeddings, from points to nets to subspaces
Why should there be subspace embeddings?
- Show `\eps`-embedding for `P=\{\vecy\}` with very high probability
- `\implies` embedding for `\eps`-net of unit sphere of subspace
- `\implies` embedding for subspace
Good sketching matrices
Criteria, for given `\eps`:
- How fast can `\matS` be computed?
- How fast can `\matS\matA` be computed?
- What is the sketching dimension `m`, `\matS\in\R^{m\times n}`?
- time `\tO(nd)`, `m = \tO(d/\eps^2)`:
`\quad` Fast JL, Subsampled Randomized Hadamard [S06][DMMS10]
- `\tO(\nnz{\matA})`, `\tO(d^2/\eps^2)`:
`\quad` Countsketch, a.k.a. sparse embeddings [CW13][NN13][MM13]
`\quad` cf. "feature hashing", maybe "random indexing"
- `\tO(\nnz{\matA}) +\tO(d^3)`, `\tO(d/\eps^2)`:
`\quad` Two-stage
- `\tO(\nnz{\matA}f(\kappa))`, `\tO(d^{1+\kappa}/\eps^2)`:
`\quad`
OSNAPs [NN13]
- `O(\nnz{\matA}\log(d)/\eps)`, `O(d\log(d)/\eps^2)`:
`\quad`
[Coh15]
The bottom line: can quickly remove dependence on `n`
Sparse Embeddings
- Adaptation of
CountSketch
from streaming algorithms
- `\bS ` looks like:
$\qquad\qquad\left[\begin{matrix}
0 & -1 & 0 & 0 & +1 & 0
\\ +1 & 0 & 0 & +1 & 0 & 0
\\ 0 & 0 & +1 & 0 & 0 & -1
\\ 0 & 0 & 0 & 0 & 0 & 0
\end{matrix}
\right]$
- One random `\pm 1` per column
- Row `\bA_{i}` of `\bA ` contributes `\pm\bA_{i}` to one row of `\bS\bA `
- `m= \tO(d^2/\eps^2)`
- Sometimes have `s>1` entries per column
- Better quality, slower to apply
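A minimal sparse-embedding (CountSketch) construction in SciPy, with one random `\pm 1` per column as described above; the sizes and the `4 d^2` bucket count are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

def countsketch(m, n, rng):
    """Sparse embedding: one random +-1 per column, in a uniformly random row (bucket)."""
    rows = rng.integers(0, m, size=n)        # hash each of the n coordinates to a bucket
    signs = rng.choice([-1.0, 1.0], size=n)  # random sign per coordinate
    return csr_matrix((signs, (rows, np.arange(n))), shape=(m, n))

rng = np.random.default_rng(0)
n, d = 10_000, 20
A = rng.standard_normal((n, d))
S = countsketch(m=4 * d * d, n=n, rng=rng)   # m = O(d^2) buckets
SA = S @ A                                   # one bucket update per row of A: nnz(A) time

x = rng.standard_normal(d)
print(np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))  # typically close to 1
```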
Sparse Embeddings, More
- For `\by = \bA\bx`, each row of sparse embedding `\bS `:
- Collects a subset of entries of `\by ` (into a "bucket")
- Applies random signs
- Adds
- Result is `\E[\norm{\bS\by}^2] = \norm{\by }_2^2`:
- In expectation, each squared bucket value equals the squared norm of its entries
- Splitting into buckets separates big entries
- Even with `n` buckets, don't always get `\norm{\by }_2^2` exactly
- (The entry `\rightarrow` bucket map is not one-to-one)
Sampling rows of a matrix
A row sampling matrix
has rows that are multiples of the natural basis vectors `\be_i^\top`
Here, sampling `\bA\in\R^{n\times d}`
If `\bS\in\R^{m\times n}` is a row sampling matrix,
each row of `\bS\bA` is a multiple of a row of `\bA`:
$\small\qquad\left[\begin{matrix}
0 & S_{12} & 0 & 0 & 0 & 0
\\ S_{21} & 0 & 0 & 0 & 0 & 0
\\ 0 & 0 & S_{33} & 0 & 0 & 0
\\ 0 & 0 & 0 & 0 & 0 & S_{46}
\end{matrix}
\right]
\left[\begin{matrix}
\bA_{1*}\\
\bA_{2*}\\
\vdots\\
\bA_{n*}
\end{matrix}
\right]
= \left[\begin{matrix}
S_{12}\bA_{2*}\\
S_{21}\bA_{1*}\\
S_{33}\bA_{3*}\\
S_{46}\bA_{6*}
\end{matrix}\right]$
NB: $\matS$ requires $O(m)$ space
Are there sampling matrix distributions on `\bS`,
so that w.h.p.
`\norm{\bS\bA\bx} \approx \norm{\bA \bx}` for all `\bx\in\R^d`?
So `\bS` is an embedding of the subspace $$\colspan(\bA)\equiv\{\bA\bx \mid \bx\in\R^d\}$$ into `\R^m`
Weighted sampling
A general construction:
For some `\bp\in [0,1]^n`,
independently for each `i\in [n]`,
with probability `p_i`, include `\be_i^\top/\sqrt{p_i}` as a row of `\bS`
For given `\bx`, `\E[\norm{\bS\bA\bx}^2] = \norm{\bA\bx}^2`.
$$\E[\norm{\bS\bA\bx}^2] = \sum_i p_i (\frac{1}{\sqrt{p_i}}\be_i^\top\bA\bx)^2 = \sum_i (\bA_{i}\bx)^2 = \norm{\bA\bx}^2.$$
Want to choose `\bp` so that:
- `\E[m] = \sum_i p_i`, is not too big
`\quad` (`m\equiv` number of rows of `\bS`)
- `\bS` is a subspace embedding
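A minimal sketch of the weighted row-sampling construction above: keep row `i` with probability `p_i`, rescaled by `1/\sqrt{p_i}`. The probabilities are an input; the uniform choice below is only for illustration.

```python
import numpy as np

def row_sample(A, p, rng):
    """Return S A for the sampling matrix S above: keep row i w.p. p[i], scaled by 1/sqrt(p[i])."""
    keep = rng.random(A.shape[0]) < p
    return A[keep] / np.sqrt(p[keep])[:, None]

rng = np.random.default_rng(0)
A = rng.standard_normal((5_000, 10))
p = np.full(A.shape[0], 0.05)          # uniform probabilities, for illustration only
x = rng.standard_normal(10)
SA = row_sample(A, p, rng)
print(np.linalg.norm(SA @ x)**2 / np.linalg.norm(A @ x)**2)   # near 1: unbiased, as above
```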
Variance reduction
Uniform sampling, `p_i\propto\frac{1}{n}` for `i\in [n]`, might miss big rows
Length-squared (row) sampling, `p_i\propto \norm{\bA_{i}}^2`, is better,
but for some $\bx$, `\bA_{i}\bx` may be relatively large even when `\norm{\bA_{i}}` is small
`\qquad` e.g. `\bA_{*,1} = \eps \be^{(n)}_1,\ \bx = \be^{(d)}_1`
That is, in either scheme, `\bA_{i}\bx` can be large while the other entries of $\bA\bx$ are small and row `i` is unlikely to be sampled
`\implies` large variance, maybe wrong rank
Want `(\frac{1}{\sqrt{p_i}}\bA_{i}\bx)^2` to be "well-behaved"
for all `i` and `\bx`
Leverage scores
Suppose `\bA = \bU\bR`, where `\bU^\top\bU = \Iden`.
Then `\sup_{\bx} \frac{(\bA_{i}\bx)^2 }{ \norm{\bA\bx}^2} \le \norm{\bU_{i}}^2`
$$\begin{align*}
\sup_{\bx} \frac{(\bA_{i}\bx)^2 }{ \norm{\bA\bx}^2}
= \sup_{\bx} \frac{(\bU_{i}\bR\bx)^2 }{ \norm{\bU\bR\bx}^2 }
\le \sup_{\by} \frac{(\bU_{i}\by)^2 }{ \norm{\bU\by}^2 }
= \sup_{\by} \frac{(\bU_{i}\by)^2 }{ \norm{\by}^2 }
= \norm{\bU_{i}}^2
\end{align*}$$
leverage scores `\ell_i`
Given `\bA `, with `\bU ` an orthonormal basis for `\colspan(\bA )`,
the `i`'th leverage score of `\bA ` is
`\ell_i \equiv \norm{\bU_{i}}^2 = \norm{\be_i^\top\bU }^2`.
For the choice `p_i=\norm{\bU_{i}}^2`,
$$\qquad \sum_i p_i = \sum_i \norm{\bU_{i}}^2 = \sum_{i,j} U_{ij}^2 = \sum_j\norm{\bU_{*j}}^2 = d$$
and all summands in
$$\frac{\norm{\bS\bA\bx}^2}{\norm{\bA\bx}^2} = \sum_i \delta_i \frac{1}{p_i}\frac{(\bA_{i}\bx)^2}{\norm{\bA\bx}^2}$$
are at most one (here `\delta_i\in\{0,1\}` indicates whether row `i` was sampled)
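For reference, exact leverage scores straight from the definition, via a thin QR factorization; the check that they sum to `d` matches the identity above.

```python
import numpy as np

def leverage_scores(A):
    """l_i = ||U_i||^2, with U an orthonormal basis for colspan(A) (thin QR)."""
    Q, _ = np.linalg.qr(A)
    return np.sum(Q**2, axis=1)

A = np.random.default_rng(0).standard_normal((1_000, 15))
ell = leverage_scores(A)
print(ell.sum())          # = d = 15, matching sum_i l_i = d above
print(ell.max() <= 1.0)   # each leverage score is at most 1
```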
Computing leverage scores
Roughly: use a sketch to get an `\approx` orthonormal basis, then use its row norms
Given
- Input matrix `\bA `
- Oblivious subspace `\eps`-embedding matrix `\bS\in\R^{m\times n}` for `\bA `
- embedding (JL) matrix `\mathbf{\Pi}\in\R^{d\times m'}`,
so `\norm{\bx\mathbf{\Pi}}= (1\pm\eps)\norm{\bx }` for `n` row vectors `\bx `
The algorithm:
`\bW = \bS *\bA `; | // compute sketch |
`[\bQ,\bR ] = \mathtt{qr}(\bW )`; | // compute change-of-basis for `\bS\bA ` |
`\bZ = \bA *(\bR^{-1}*\mathbf{\Pi})`; | // compute sketch of `\bA\bR^{-1}` |
return `\mathtt{dot}(\bZ,\bZ,2)`; | // return vector of squared row norms |
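A NumPy rendering of the pseudocode above. For simplicity it uses dense Gaussian matrices for both `\bS` and `\mathbf{\Pi}`; the `\nnz{\matA}`-time version would use a sparse embedding for `\bS`. Sizes are illustrative.

```python
import numpy as np

def approx_leverage_scores(A, m, m_prime, rng):
    """Sketch-based leverage score approximation, following the pseudocode above."""
    n, d = A.shape
    S = rng.standard_normal((m, n)) / np.sqrt(m)               # stand-in subspace embedding
    W = S @ A                                                  # compute sketch
    _, R = np.linalg.qr(W)                                     # change-of-basis for S A
    Pi = rng.standard_normal((d, m_prime)) / np.sqrt(m_prime)  # JL matrix
    Z = A @ np.linalg.solve(R, Pi)                             # sketch of A R^{-1}
    return np.sum(Z**2, axis=1)                                # vector of squared row norms

rng = np.random.default_rng(0)
A = rng.standard_normal((5_000, 10))
print(approx_leverage_scores(A, m=200, m_prime=50, rng=rng).sum())   # roughly d = 10
```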
Tukey regression (back to)
Our results [CWW19] generalize beyond Tukey;
but we do need some conditions
Restrictions on `M`
- even: `M(z) = M(-z)`
- monotone nondecreasing in `|z|`
- for `|a|\ge |b|`,
$$\frac{M(a)}{M(b)} \le \frac{a^2}{b^2}$$
- There are $L_M, U_M$ so that for `|a| \le \tau`,
$$L_M \le M(a)/a^2 \le U_M$$
- Mostly flat: `M(a) = \tau^2` for `|a|\ge\tau`
(Need not be convex or smooth; can replace `2` by `p\ge 1`)
Restrictions on input and output
- Relevant solutions have norm at most `n^{O(d^2)}`
- Including constant-factor approx. solutions
- Columns of $\matA$ and $\vecb$ have norms in `n^{O(d)}`
- `\tau=\Omega(1/n^{O(d)})`
Prior results
The goal: quickly remove dependence on `n`
Prior work: `M`-estimators
When `M(\cdot)` is
nice,
sketch-and-solve allows dimensionality reduction:
- Via sampling for:
`\quad` Huber [CW15a]
`\quad` All nice [CW15b]
- Via oblivious `M`-sketches, fixed large-enough `\eps` [CW15a]
("Nice" a bit stronger than above; will define "`M`-sketch")
These yield (possibly after recursion), algorithms that take
`\qquad O(\nnz{\matA}\log n) + \poly(\eps^{-1} d\log n)` time,
when the sketched problem is tractable
These did not allow `M(z)` to be flat:
they require `M(a)\ge C_M |a|` for some `C_M > 0`,
with running time proportional to `1/C_M`
New result
Unresolved prior question:
Is the linear growth lower bound (`C_M \gt 0`) always needed?
Here: for Tukey, and above I/O assumptions, the answer is no
Dim. reduction for Tukey
Under the assumptions, for the Tukey `M`-estimator,
there is a distribution on sketching matrices
`\matS\in\R^{m\times n}` with `m=\poly(d\log n)`, and a constant `C`, such that
`\quad \norm{\matA\tvecx-\vecb}_M\le C\norm{\matA\vecx^* - \vecb}_M`.
Here `\matS\matA` takes `O(\nnz{\matA})` time to compute.
And there is a recursive reduction, based on sampling, giving
an `\eps`-approximation.
This matters because:
- Tukey is more robust, more nonconvex
- Suggests simpler, more general conditions are possible
I/O restrictions and `\eps`-nets
Our analysis roughly follows the standard `\eps`-net outline
- For a small-enough `\eps`-net `\cN\subset \{\matA\vecx-\vecb\mid \vecx\in\R^d\}`,
show bounds on
- Contraction: `\norm{\matS\vecy}_{M,\vecw}\ge \norm{\vecy}_M` for all `\vecy\in \cN`
`\qquad` via analysis of a single vector `\vecy`, then a union bound
- Dilation: `\norm{\matS(\matA\vecx^*-\vecb)}_{M,\vecw} \le C\norm{\matA\vecx^*-\vecb}_M`.
- Extend contraction from `\cN` to all relevant `\vecy`
We need to bound the net size `|\cN|`
Before: used the linear lower bound on growth; packing and covering arguments still go through
Here: use I/O restrictions
To be discussed:
What is `\matS`?
Why is `\norm{\matS\vecy}_{M,\vecw}\approx \norm{\vecy}_M`?
`M`-sketches (for nice `M`): `L_0` sampling + Countsketch
`\matS \gets \left[ \begin{matrix}\matS_0\matT_0 \\ \matS_1\matT_1 \\\vdots\\ \matS_{\log n}\matT_{\log n}\end{matrix}\right]`
- Combine sampling and sparse embedding
- [VZ12] earthmover, [CW15a] `M`-estimators
- Each `\matS_h` is a sparse embedding (Countsketch) matrix
- Each `\matT_h` uniformly samples about `\alpha^h n` entries,
uses weights `w_i\propto 1/\alpha^h`
- Result: `\norm{\matS\vecy}_{M,\vecw} \approx \norm{\vecy}_M`
Roughly:
`\matT_h\vecy` sparse enough `\implies` its entries are isolated in `\matS_h\matT_h\vecy`
`\quad` one entry per bucket
`\implies \norm{\matS_h\matT_h\vecy}_M\approx \alpha^h\norm{\vecy}_M`
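A deliberately literal, schematic rendering of the `M`-sketch described above: at level `h`, subsample at rate `\alpha^h`, hash the kept entries into signed buckets, and weight that level's `M`-contribution by `1/\alpha^h`. The level and bucket counts, `\alpha`, and the exact weighting in [VZ12]/[CW15a] differ; this is only for intuition, not the paper's construction.

```python
import numpy as np

def m_sketch_norm(y, M, alpha=0.5, n_buckets=200, rng=None):
    """Schematic M-sketch estimate of ||y||_M, per the description above:
    level h keeps each entry w.p. alpha**h, hashes kept entries into signed
    buckets (CountSketch), and weights that level's M-contribution by 1/alpha**h."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(y)
    n_levels = int(np.ceil(np.log(n) / np.log(1.0 / alpha))) + 1
    total = 0.0
    for h in range(n_levels):
        kept = np.flatnonzero(rng.random(n) < alpha**h)        # T_h: uniform subsample
        buckets = np.zeros(n_buckets)
        signs = rng.choice([-1.0, 1.0], size=kept.size)
        np.add.at(buckets, rng.integers(0, n_buckets, size=kept.size), signs * y[kept])
        total += (1.0 / alpha**h) * np.sum(M(buckets))         # weighted level contribution
    return np.sqrt(total)

# e.g. m_sketch_norm(y, lambda z: tukey_bisquare(z, 1.0)), reusing the earlier hypothetical loss
```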
Structural result
Analyses both for sketching and for sampling
use a structural condition on `\colspan([\matA\ \vecb])`
Roughly the following:
Distribution of outliers
Suppose `\tau=1`. Given `\alpha > 0`,
there is a set `{\cal I}` of indices with `|{\cal I}|=O(d\alpha)` such that:
if `\vecr=\matA\vecx - \vecb` has `\norm{\vecr}_M^2\le \alpha`,
then all outliers of `\vecr` (`|r_i|\ge 1`) are in `{\cal I}`.
The leverage scores upper-bound the (normalized) entries of `\vecr`, including outliers;
not too many leverage scores can be large;
put each `i` with a large leverage score into `\cal I`, then repeat on what's left
This allows the analysis of "big" entries to be restricted to `\cal I`
Structural result, 2/2
Large `|r_i|\implies` large `\ell_i`
For `\vecr` with `\norm{\vecr}^2_M\le \alpha`,
the coordinate with largest `|r_i| \ge 1` has `\ell_i\ge 1/\alpha`
- At most `\alpha` heavy coords (`|r_i|\ge 1`), since each contributes `1` to `\norm{\vecr}^2_M`
- For `r_{i^*}` the heaviest, `\norm{\vecr}^2\le \alpha + \alpha r_{i^*}^2`:
- Light coords contribute (up to constants) the same to `\norm{\vecr}^2` as to `\norm{\vecr}^2_M`, so at most `\alpha` total
- Heavy coords contribute at most `r_{i^*}^2` each, by choice of `i^*`
- So leverage score `\ell_{i^*} \ge r_{i^*}^2/\norm{\vecr}^2 \ge r_{i^*}^2/(\alpha + \alpha r_{i^*}^2)\ge 1/\alpha`
Construction is:
- Put all `i` with `\ell_i\ge 1/\alpha` into `\cal I`, repeat
- `\cal I` has all heavy coords after `\alpha` iterations
- Add at most `d\alpha` each time, since `\sum_i \ell_i \le d`
Tighter result with faster algorithm:
- Split into `\alpha` random subsets
- Use per-subset leverage scores
- Put all `i` with `\ell_i` large into `\cal I`
- Repeat about `\log(d\alpha)` times
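A small sketch of the basic construction above: repeatedly collect rows whose leverage score is at least `1/\alpha`, then recompute on what remains. It uses exact leverage scores for clarity; the faster variant would use the random-split, per-subset scores.

```python
import numpy as np

def outlier_index_set(A, alpha, n_rounds):
    """Collect indices whose leverage score, recomputed each round, is at least 1/alpha."""
    remaining = np.arange(A.shape[0])
    I = []
    for _ in range(n_rounds):
        if remaining.size == 0:
            break
        Q, _ = np.linalg.qr(A[remaining])
        lev = np.sum(Q**2, axis=1)                 # leverage scores of the remaining rows
        big = lev >= 1.0 / alpha                   # at most d * alpha of these per round
        I.extend(remaining[big].tolist())
        remaining = remaining[~big]
    return np.array(I, dtype=int)
```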
Experiments
- All inputs `n=10000`
- Used LinvPY to solve both the full problem and the sampled, weighted version
- "with outliers": `5\%` of responses huge
Num | Dataset | `d` | `\tau` |
1 | Random Gaussian | 20 | 10 |
2 | Random Gaussian (with outliers) | 20 | 10 |
3 | Facebook Comment Volume | 53 | 10 |
4 | Facebook Comment Volume (with outliers) | 53 | 100 |
5 | Appliances energy prediction | 25 | 1000 |
6 | Appliances energy prediction (with outliers) | 25 | 100 |
7 | CT Slice Localization | 384 | 100 |
8 | CT Slice Localization (with outliers) | 384 | 1000 |
Conclusion
- Also in [CWW19]: hardness results (not covered in detail here)
- Other applications of structural result?
- Thank you for your attention!