Sketching and Sampling for `M`-estimators

Sampling and Sketching for `M`-estimators

Ken Clarkson
IBM Research, Almaden

Joint with David Woodruff

Least-squares regression

Input is `n` points in `d+1` dimensions
Points have the form `(A_i,b_i)`, where:
- `A` is an `n \times d` matrix
- `A_i\in \reals^d` is row `i` of a matrix `A`
- `b\in\reals^n` is a column vector
Regression [G95][L05]: find $$ \min_{x\in\reals^d}\norm{Ax-b}_2 $$
- `r\equiv Ax-b` is the residual
- `\norm{r}_2 = \sqrt{\sum_i r_i^2}`, the length of `r`

Least-squares is not very robust

As compared to `\ell_1` regression, minimizing sum of distances

`M`-estimators

`M`-estimators: min `\norm{r}_G\equiv \sum_i G(r_i)`,
for some function `G`
- `G(z) = z^2`: `\ell_2^2` least-squares
- `G(z) = |z|`: `\ell_1` regression
- `G(z) = \begin{cases} z^2/2\tau & z \le \tau \\ |z| - \tau/2& z\ge \tau \end{cases}`
  - ...the Huber estimator [H64]
  - Robust, yet smooth
- Tukey: even more robust

Results

`\nnz{A} \equiv ` number of non-zeros of `A`
`\epsilon \equiv` error parameter

Seek `\hat x` with small relative error: $$\norm{A\hat x - b}_G \le (1+\epsilon)\min_{x\in\reals^d} \norm{Ax - b}_G$$

Huber:
`\qquad O(\nnz{A}\log n))+\poly(\eps^{-1}d \log n)` time
for any given `\epsilon>0`
For nice functions `G(z)`, and fixed large enough `\eps`:
- Reduce problem size: compress `A` and `b`, the same for all `G`
- Find approximate solution in

"Nice" `G`

Functions `G(x)` that are

even: `G(x) = G(-x)`
monotone nondecreasing in `|x|`
for `|a|\ge |b|`, $$\frac{|a|}{|b|}C_G \le \frac{G(a)}{G(b)} \le \frac{a^2}{b^2}$$ for some `C_G > 0`
Upper bound required for `\norm{\,}_G` to be sketchable [BO10]

Need not be:

Convex
Smooth

Tukey is not nice, but a close variant is

( `\norm{\,}_G` and norms )

Despite the notation, `\norm{\,}_G` is not a norm
But:
- For many `G` proposed, `\sqrt{\norm{\,}_G}` satisfies the triangle inequality
- For nice `G` and `\kappa\ge 1`, $$\sqrt{\norm{x}_G}\sqrt{\kappa C_M} \le \sqrt{\norm{\kappa x}_G}\le \kappa \sqrt{\norm{x}_G}$$
- That is, `\sqrt{\norm{\,}_G}` is "scale insensitive"
  - vs. e.g. `\norm{\kappa x}_2 = |\kappa|\norm{x}_2`

Sketching `A`

We apply sketching to regression (including here also row sampling)

Sketch of `A` is `SA`, sketch of `b` is `Sb`
`S` is a matrix with `m\ll n` rows
`m=O(d)` or `O(d^2)`, usually

`SA` reduces the number of points (rows)
- `A\hat S` reduces the dimension (number of columns), for appropriate `\hat S`

Using `SA` for regression

The key property: `\norm{SAx} \approx \norm{Ax} \text{ for all } x`

`S` is an embedding, preserving lengths
This is basically the same as $$ \norm{S(Ax-b)} \approx \norm{Ax-b} \text{ for all } x $$
If `\tilde x` is such that
`\qquad` `\norm{SA\tilde x - Sb} = \norm{S(A\tilde x-b)}\text{ is minimum}`
then
`\qquad` `\norm{A\tilde x - b}\text{ is small.}`

For regression, only need "lopsided" embedding: $$\norm{SAx} \ge (1-\eps) \norm{Ax} \text{ for all } x$$ and $$\norm{SAx^*} \le (1+\eps)\norm{Ax^*}$$

Sketching (in general)

A random distribution over `S`, so that for given `A`, embedding holds w.h.p.
For example:
- (Non-uniform) row samples
- JL, Fast JL
- All entries of `S` i.i.d.
Here, for "nice" `M`-estimators, `S` such that:
- `S` and `SA` can be computed in `O(\nnz{A}\log n)` time
  - vs. `O(mnd)` or `O(dn\log n)`
Typical proof technique:
- Show that `\norm{SAx}\approx \norm{Ax}` for fixed `x` with high probability
- Apply union bound for `x\in {\cal N}` an `\epsilon`-net

Outline

Huber: we use `S` that implements row sampling
Sketching for nice `\norm{\,}_G`: we extend Sparse Embeddings (good for `\ell_2`) with a family of uniform samples

Sampling for embedding

Uniform sampling of rows by itself does not give an embedding
Need non-uniform sampling: pick row `i` ind. with probability `q_i`
- Apply weight `1/q_i` appropriately
- Number of rows picked is random
- Row of `S` is a multiple of `e_i`
For `\ell_p`, there are `q_i` so that this yields an embedding
- `\sum_i q_i = \E[\text{num rows}] = \poly(d/\eps)`
- For `\ell_2`, from the leverage scores
- For general `p`, `p`'th power of norm of row `i` of well-conditioned basis for `A`
- For any `p`, find `q_i` in `O(\nnz{A}\log n) + \poly(d/\eps)` time [CW13]

Sampling for Huber embedding

Huber is roughly `\ell_2^2` for small values, `\ell_1` for large
We show that a combination of `q_i`s for `\ell_2` and `\ell_1` yields useful sampling probabilities
But: `\sum_i q_i = \E[\text{num rows}] = \Theta^*(\sqrt{n})`
- We recurse `n\rightarrow\sqrt{n}\rightarrow n^{1/4}\rightarrow\cdots`
- Until general convex programming can be applied

Sketching for nice `\norm{\,}_G`: sparse embeddings

A construction that yields an embedding for `\ell_2`
Adaptation of CountSketch from streaming algorithms
- [CW13,NN13,MM13]
`S` looks like:
$\qquad\qquad\left[\begin{matrix} 0 & -1 & 0 & 0 & +1 & 0 \\ +1 & 0 & 0 & +1 & 0 & 0 \\ 0 & 0 & +1 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \right]$
- One random `\pm 1` per column
- Row `a_{i:}` of `A` contributes `\pm a_{i:}` to one row of `SA`
`m=\tO(d^2)`

Sparse embeddings, more

For `y=Ax`, each row of sparse embedding `S`:
- Collects a subset of entries of `y` (into a "bucket")
- Applies random signs
- Adds
Result is `\norm{Sy}^2 \approx \norm{y}_2^2`:
- Each bucket squared `\approx` squared norm of its entries
- Splitting into buckets separates big entries
  - Smaller variance

Sketching `M`-estimators

How to sketch so that `\norm{Sy}_G \approx \norm{y}_G`?
Suppose `G(x) = |x|`, so `\norm{y}_G= \sum_i |y_i| = \norm{y}_1`
Consider two one-row sketches:
- Uniform sampling: pick random `i^*`, `S_{1i} \gets\begin{cases} n & i=i^*\\0 & i\ne i^*\end{cases}`
- `\ell_2` sketching: `\hat S = (\pm 1,\pm 1,\ldots,\pm 1)`
`\norm{Sy}_G = G(Sy) = |Sy|`, since `S` has one row

Uniform vs. `\ell_2` sketches, for `\ell_1`

	Sampling `S`	`\ell_2` Sketching `\hat S`
General `y`	`\E[\|Sy\|] = \norm{y}_1`	`\E[\|\hat Sy\|] \approx \norm{y}_2`
`e_1=(1,0,\ldots,0)`	`\Var[\|Se_1\|] = n-1`	`\|\hat Se_1\| = 1 =\norm{e_1}_1`
`\bfone=(1,1,\ldots 1)/n`	`\|S\bfone\| = 1`	`\E[\|\hat S\bfone\|] \approx 1/\sqrt{n} \ll 1 = \norm{\bfone}_1`

`M`-sketches

`S \gets \left[ \begin{matrix}S_0T_0 \\ S_1T_1 \\\vdots\\ S_{\log n}T_{\log n}\end{matrix}\right]`

Combine sampling and sparse embedding
- [VZ12] earthmover, [CW15] `M`-estimators
Each `S_h` is a sparse embedding matrix
Each `T_h` samples about `n/b^h` entries, then multiplies by `b^h`
- For parameter `b`
Result: `\norm{SAx}_{w, G} \approx \norm{Ax}_G`
- A lopsided embedding for constant approximation with constant prob.
- Where `\norm{y}_{w,G} = \sum_i w_i G(y_i)` for weight vector `w`
- Claim true for any given `G`

`M`-sketch analysis

For given `y` with `\norm{y}_G=1`, define weight classes
`W_j \equiv \{ G(y_i) \mid 2^{-j-1}\le G(y_i)\le 2^{-j} \}`
For suitable `b` and other parameters, there is `h_j` such that with high enough probability:
- For `h\approx h_j`, the entries of `T_h y` from `W_j` are in separate CountSketch buckets
  - Contributing about `\norm{W_j}_G` to `\norm{y}_G`
  - And for `|W_j|` large, this concentrates
- For `h \gg h_j`, there are no entries in `T_hy` from `W_j`
- For `h \ll h_j`, the entries of `T_h y` from `W_j` appear many-per-bucket
  - And contribute roughly `G(\norm{W_j}_2)` to `\norm{y}_G`
If `G` is fast growing, need to "clip" small buckets (Ky Fan vector norm)

A Matlab Version

Just to show it's not very complicated:

   function [ S ] = M_sketch(n, d, b)
    S.num_buckets = 1+floor(b*d*(1+log(n)));
    S.level = 1 + random('geo', 1-1/b, n);
    S.sign_flip = randi(2, n)*2 - 3;
    S.hash = randi(S.num_buckets, n);
   end
   
   function [ Sy ] = M_sketch_apply( S, y)
    Sy = zeros(max(level), S.num_buckets);
    for i=1:length(y)
        li = S.level(i);
	hi = S.hash(i);
        Sy(li, hi) = Sy(li, hi) + y(i)*S.sign_flip(i);
    end;
   end
   
   function [ est ] = M_sketch_eval( Sy, b, G )
    est = 0;
    for i=1:size(Sy,1)
	est = est + (b^(i-1))*sum(G(Sy(i,:)));
   end

Experiments

For `\bfone=(1,1,\ldots 1)/n` and `\bfone+10e_1`:
- Estimate `\norm{y}` using sketch
- Compare to actual norm
`n=1e6`
Sketch dim `\approx` 1e4

Vector	`\ell_1`	`\ell_2^2`	Huber
`\bfone=(1,1,\ldots 1)/n`	1	`1/n`	`1/n`
`\bfone + 10e_1`	`11`	`\approx 100`	`\approx 10`

Results

Histogram of `\log_2(\text{Est/Act})`
`\bfone` above, `\bfone+10e_1` below

Thanks for your attention!

	Sampling `S`	`\ell_2` Sketching `\hat S`
General `y`	`\E[\|Sy\|] = \norm{y}_1`	`\E[\|\hat Sy\|] \approx \norm{y}_2`
`e_1=(1,0,\ldots,0)`	`\Var[\|Se_1\|] = n-1`	`\|\hat Se_1\| = 1 =\norm{e_1}_1`
`\bfone=(1,1,\ldots 1)/n`	`\|S\bfone\| = 1`	`\E[\|\hat S\bfone\|] \approx 1/\sqrt{n} \ll 1 = \norm{\bfone}_1`

Sampling and Sketching for `M`-estimators

Ken Clarkson IBM Research, Almaden

Joint with David Woodruff

Least-squares regression

Least-squares is not very robust

`M`-estimators

Results

"Nice" `G`

( `\norm{\,}_G` and norms )

Sketching `A`

Using `SA` for regression

Sketching (in general)

Outline

Sampling for embedding

Sampling for Huber embedding

Sketching for nice `\norm{\,}_G`: sparse embeddings

Sparse embeddings, more

Sketching `M`-estimators

Uniform vs. `\ell_2` sketches, for `\ell_1`

`M`-sketches

`M`-sketch analysis

A Matlab Version

Experiments

Results

Ken Clarkson
IBM Research, Almaden