Low-rank Approximation and Regression in Input Sparsity Time

Low Rank Approximation and Regression in Input Sparsity Time Combinatorial Algorithms for Numerical Linear Algebra

Ken Clarkson
David Woodruff

IBM Research - Almaden
Elevator Version
- Near-linear-time numerical linear algebra
- Adaptation of CountSketch from streaming, for sketching
- Exploit bounds on coordinates of vectors of input matrix columnspace
  - From statistical leverage scores
  - Look at "big" and "little" coordinates separately
Results
- Let
  - `A` be an `n\times d` matrix
  - `\nnz{A}\equiv` number of nonzero entries of `A` (its sparsity)
  - Error parameter `\epsilon`
- We show:
  - Least-squares regression in `O(\nnz{A}\log(n/\epsilon)) + \tO(d^3)`
  - Rank-`k` approximation in `O(\nnz{A})+\tO(nk^2\epsilon^{-4})`
    - Something like a truncated SVD, Lanczos is `\approx k*\nnz{A}`
  - Leverage scores in `\nnz{A}\log n + \tilde O(d^3)` for constant `\epsilon`
- No dependence on matrix condition, eigenvalue separation, etc.
- Regression also in `O(\nnz{A})+\tO(d^3\epsilon^{-2})` time
  - Also: constrained, multiple-response, `\ell_1`
Sketching `A`
- Main technique: sketching
- Sketches are compressed versions of the input
  - Sketch of `A` is `SA`, where `S` is a matrix with few rows
Embedding via Sketches
- `S` should have the norm-preserving property:
  `\norm{SAx} \approx_\epsilon \norm{Ax}` for all `x`
  - Meaning `\norm{SAx} \le (1+\epsilon)\norm{Ax}` and vice versa
- That is, `S` embeds `C(A)\equiv \{Ax\mid x\in\reals^d\}` in `C(SA)`
- Almost all results here follow from that property
Sketches and Regression
- Regression: `\min_x\norm{Ax-b}`
- Norm-preservation is basically the same as
  `\norm{S(Ax-b)} \approx \norm{Ax-b}` for all `x`
- If `\tilde x` solves the regression problem
  
  `\min_x \norm{S(Ax-b)}`
  
  then
  
  `\quad\quad\norm{A\tilde x - b}`
  
  is small, within `1+\epsilon` of opt
- Exact in "sketch space" `\rightarrow` good approximation in original space
Sketching Matrices
- Random combinations of rows:
  - Johnson-Lindenstrauss (JL): Classic, Fast, Fast Sparse
  - Entries are i.i.d. `\pm 1`, or i.i.d. Gaussian
- These constructions, and ours, are oblivious to `A`
- Fastest are `\tO(nd)` or `O(\nnz{A}/\epsilon)`
- Samples of rows: `S` can prescribe sampling of rows of `A`
  - For best embedding properties, sample based on leverage scores of `A`
    - Not oblivious
Our sketching matrix
- ...looks like:
  $\qquad\qquad\left[\begin{matrix} 0 & -1 & 0 & 0 & +1 & 0 \\ +1 & 0 & 0 & +1 & 0 & 0 \\ 0 & 0 & +1 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \right]$
- Hash (random) function `h:[n] \mapsto [m]`
  - `[n]\equiv 1\ldots n`
- `s_{h(i),i} \equiv z_i \sim \pm 1` equally likely
- Row `a_{i:}` of `A` contributes `\pm a_{i:}` to one row of `SA`
- This is (more or less) CountSketch, from streaming algorithms
Properties of Our Sketching Matrix
- `SA` takes `\nnz{A}` time to compute
- `S` and `SA` have `m=\tO(d^2\epsilon^{-2})` rows
  - Can reduce to `\tO(d\epsilon^{-2})` using Fast JL (or sampled randomized Hadamard transform) in `\tO(d^3\epsilon^{-2})`
- `S` is norm-preserving for `C(A)`
  - With constant probability
- `S` preserves matrix product `(SB)^\top SA \approx B^\top A`
- All results follow from these properties
  - (Up to rounding error)
Preserving Norms
- Embedding, again:
  `\norm{Sy} \approx \norm{y}` for all `y\in C(A)\equiv \{Ax\mid x\in\reals^d\}`
- Strategy (as usual) is to show that embedding holds for:
  1. Fixed `y\in C(A)` with very high probability
  2. An `\epsilon_0`-cover of unit vectors in `C(A)`
  3. `C(A)`
- We focus on (1)
- Can assume `\norm{y}=1`
Preserving norms of `y`
- Entry `j` of `Sy` is `\sum_{h(i)=j} z_i y_i`
  - Signed sum of entries `y_i` falling in bucket `j`
- `(Sy)_j^2 \approx ` sum of squares of entries in bucket
  - ~~Khintchine's~~ ~~Kintchin's~~ ~~Khinchin's~~ Хи́нчин's inequality
- So `y_i` contributes about `y_i^2` to `\norm{Sy}^2=\sum_j (Sy)_j^2`
- So `\norm{Sy}^2 \approx \norm{y}^2`
  - With not-good-enough probability
The Virtues of Obliviousness, and Leverage Scores
- For best results, exploit properties common to all `y\in C(A)`
- Pick any particular `U` so that `C(U)=C(A)`
  - `S` is an embedding for `A` `\iff` embedding for `U`
- `S` is oblivious, so pick any convenient `U`
- Pick `U` to be an orthogonal basis for `C(A)`
  - So columns of `U` are unit vectors and orthogonal
- Leverage score `u_i` of row `a_{i:}` is `u_i \equiv \norm{u_{i:}}^2`
  - Any such `U` gives the same leverage scores
  - `\sum_i u_i = \sum_{i,j} u_{i,j}^2 = d`, so "average" `u_i` is `d/n`
  - (Commonly for matrix completion: all `u_i` assumed small)
Leverage Scores as Bounds on `y_i`
- `\norm{SUx}` is easier to analyze than `\norm{SAx}`
- For unit `y\in C(A)`, that is, `y=Ux` for unit `x`:
  - `y_i^2 = (u_{i:}x)^2 \le \norm{u_{i:}}^2\norm{x}^2\le u_i`
- Much of the analysis only needs to consider leverage scores
- We show that `S` is "good" for given leverage scores with constant probability
  - For various values of "good"
High and Low Leverage Scores
- The idea: split leverage scores into groups of "High" and "Low"
- Assume WLOG that `u_i\ge u_{i'}` for `i\le i'`
- High: `H\equiv 1\ldots s`, where `s\equiv \sqrt{m/10}`, indexing the `s` highest scores
- Low: `L\equiv s+1\ldots n`
- Now consider `y^H, y^L\in\reals^n` such that:
  - `y=y^H+y^L`
  - `y^H_i = 0`, `i > s`
  - `y^L_i = 0`, `i \le s`
- Analyze each part of
  $\norm{S(y^H+y^L)}^2= \norm{Sy^H}^2 + \norm{Sy^L}^2 + 2(Sy^L)\cdot (Sy^H)$
Error for `y^H`
- Since `s^2\le m/10`, all in `H` are in separate buckets (no collisions)
  - No birthday paradox events
  - With constant probability
- Assuming this event, `\norm{Sy^H} = \norm{y^H}`
Error for `y^L`
- We use `u_i \le d/s = O(d/\sqrt{m})` for `i\ge s`
  - From `\sum_i u_i = d`
- Hence all squared entries of `y^L` are `O(d/\sqrt{m})`
- There is `m=\poly(d/\epsilon)` so that `\norm{Sy^L}^2 \approx_\epsilon \norm{y^L}^2`
  - From [DKS]
- There is a tighter analysis, using a bound on the max weight in buckets
  `W\equiv \max_j\sum_{\substack{i>s\\h(i)=j}} y_i^2\le \max_j\sum_{\substack{i>s\\ h(i)=j}} \color{green}{u_i}`
  - `W = O(d/m + (d/s)\log(m))`
  - With constant probability
  - From Bernstein's inequality, union bound over buckets
  - Assume this event
Mixed error
- `(Sy^L)\cdot (Sy^H)` is small: assuming "no birthdays", its second moment is the expectation of:
  \begin{align} ((Sy^L)\cdot (Sy^H))^2 & = \left[\sum_{j\in [m], i'\le s, i>s} y_iz_i y_{i'}z_{i'}\Ibr{h(i)=h(i') = j}\right]^2 \\ & = [\sum_{i>s} y_i v_i z_i]^2 \end{align} where `v_i\equiv y_{i'}z_{i'}`, for `i'` the unique `i'\le s` with `h(i')=h(i)`
- By Khintchine, this is bounded by
  `C_1 \sum_{i > s} y_i^2 v_i^2 = C_1 \sum_{i' \le s} y_{i'}^2 \sum_{h(i)=h(i')} y_i^2 \le C_1 W\sum_{i' < s} y_{i'}^2`
  which is at most `C_1 W`
Mixed error, enough already
- So the second moment of `(Sy^L)\cdot (Sy^H)` is at most `C_1 W`
- Similarly, the `p`'th root of the `2p`'th moment is `C_p W`
  - `C_p=O(p)`
- `\implies` high-probability bound on mixed error
- When `m` and `s` are large enough, `W\le \epsilon`
- Putting together High, Low, and Mixed, have `\norm{Sy}^2\approx_\epsilon 1=\norm{y}^2`
Big, Medium, and Little Leverage Scores
- Condition of having no High collisions is very restrictive
- Makes our bound `m=O(d^4\poly(1/\epsilon))`
- Our best analysis, to get `\tO(d^2\epsilon^{-2})`: subdivide High `u_i` into groups of near-equal magnitude
  - Within factor of two
  - Analyze within-group and across-group error
Within-Group Error, Analysis Ingredients
- For given `\beta`, consider indices `G` of `u_i\in [\beta,2\beta]`
- Let `G'\subset G` denote indices `i\in G` with `h(i)=h(i')` for some `i'\in G`
  - Errors come only from collisions
- For unit `y\in C(A)`, let `y^G` have `y^G_i \equiv y_i`, for `i\in [G']`, and zero o.w.
- We know that:
  - `(y^G_i)^2 \le u_i\le 2\beta`
  - `\beta |G|\le d`
- We also show that `\norm{y^G}^2\le O((\beta + 1/|G| + \sqrt{d/m}+ |G|/m))`, up to logs
- These bounds allow error for `G` to be bounded
(Even More Matrix Concentration Goodness)
Let $\ell$ be an integer parameter. For $i\in [\ell]$, let $H_i\in\reals^{d\times d}$ be independent symmetric zero-mean random matrices. Suppose \begin{align}\rho_i^2 &\equiv \norm{\E[H_i H_i]}_2 \textrm{ and }\\ M &\equiv \max_{i\in [\ell]} \norm{H_i}_2.\end{align} Then for any $\tau > 0$, \[ \log \Pr\left[\left\|\sum_{i\in [\ell]} H_i\right\|_2 > \tau\right] \le \log 2d - \frac{\tau^2/2}{\sum_{i \in [\ell]}\rho_i^2 + M\tau/3}. \]
- Apply to `U_G^\top U_G`, where `U_G\equiv` rows of `U` indexed by `G'`
- `\implies` bound on squared spectral norm of `U_G`
Sketching for Low Rank Approximation
- Obliviousness also helps for low-rank approximaton
- Results for regression imply that for column-sketching matrix `R` and some rank-`k` `X`, $$\norm{ARX - A}_F \le (1+\epsilon)\norm{A_k-A}_F,$$ where `A_k` is the best rank-`k` approximation to `A`
- Proof uses `R^\top` as a subspace embedding for (unknown) `A_k`
- Use row-sketching to get approximation version of such `X`
Concluding Remarks
- Also:
  - Affine Embeddings: `\norm{S(AX-B)} \approx \norm{AX-B}`
  - High probability embedding in `\nnz(A)\log(n)` time
- In followup work by others:
  - `m=O(d^2\epsilon^{-2})` in `O(\nnz{A})` time, and
    `m=O(d^{1+\gamma}\epsilon^{-2})` in `O(\nnz{A} f(1/\gamma))` time [NN]
    - As mentioned, we also get to `m'=\tO(d)`, with additive $\tilde O(d^3)$
  - Additive term optimized for regression [MP]
  - Implemented and compared to Fast JL for SVM [PBMD]
  - Implemented and studied for low-rank approximation [CW]
- More virtues of obliviousness
  - Suppose `A=\sum_i A^{(i)}`, with `A^{(i)}` on processor `i`
  - Compute and/or use `SA^{(i)}` on processor `i`

Low Rank Approximation and Regression in Input Sparsity Time Combinatorial Algorithms for Numerical Linear Algebra

Ken ClarksonDavid Woodruff IBM Research - Almaden

Elevator Version

Results

Sketching `A`

Embedding via Sketches

Sketches and Regression

Sketching Matrices

Our sketching matrix

Properties of Our Sketching Matrix

Preserving Norms

Preserving norms of `y`

The Virtues of Obliviousness, and Leverage Scores

Leverage Scores as Bounds on `y_i`

High and Low Leverage Scores

Error for `y^H`

Error for `y^L`

Mixed error

Mixed error, enough already

Big, Medium, and Little Leverage Scores

Within-Group Error, Analysis Ingredients

(Even More Matrix Concentration Goodness)

Sketching for Low Rank Approximation

Concluding Remarks

Ken Clarkson
David Woodruff

IBM Research - Almaden