2. Elevator Version

• Near-linear-time numerical linear algebra
• Adaptation of CountSketch from streaming, for sketching
• Exploit bounds on coordinates of vectors in the input matrix's column space
• From statistical leverage scores
• Look at "big" and "little" coordinates separately
3. Results

• Let
• $A$ be an $n\times d$ matrix
• $\nnz{A}\equiv$ number of nonzero entries of $A$ (its sparsity)
• Error parameter $\epsilon$
• We show:
• Least-squares regression in $O(\nnz{A}\log(n/\epsilon)) + \tO(d^3)$
• Rank-$k$ approximation in $O(\nnz{A})+\tO(nk^2\epsilon^{-4})$
• Output is something like a truncated SVD; for comparison, Lanczos takes $\approx k\cdot\nnz{A}$
• Leverage scores in $O(\nnz{A}\log n) + \tO(d^3)$ for constant $\epsilon$
• No dependence on matrix condition, eigenvalue separation, etc.
• Regression also in $O(\nnz{A})+\tO(d^3\epsilon^{-2})$ time
• Also: constrained, multiple-response, $\ell_1$
4. Sketching $A$

• Main technique: sketching
• Sketches are compressed versions of the input
• Sketch of $A$ is $SA$, where $S$ is a matrix with few rows
5. Embedding via Sketches

• $S$ should have the norm-preserving property:
$\norm{SAx} \approx_\epsilon \norm{Ax}$ for all $x$
• Meaning $\norm{SAx} \le (1+\epsilon)\norm{Ax}$ and $\norm{Ax} \le (1+\epsilon)\norm{SAx}$
• That is, $S$ embeds $C(A)\equiv \{Ax\mid x\in\reals^d\}$ in $C(SA)$
• Almost all results here follow from that property
6. Sketches and Regression

• Regression: $\min_x\norm{Ax-b}$
• Norm-preservation (applied to the column space of $[A\ b]$) is basically the same as
$\norm{S(Ax-b)} \approx \norm{Ax-b}$ for all $x$
• If $\tilde x$ solves the regression problem

$\min_x \norm{S(Ax-b)}$

then

$\quad\quad\norm{A\tilde x - b}$

is small: within a factor of $1+\epsilon$ of optimal
• Exact in "sketch space" $\rightarrow$ good approximation in original space
7. Sketching Matrices

• Random combinations of rows:
• Johnson-Lindenstrauss (JL): Classic, Fast, Fast Sparse
• Entries are i.i.d. $\pm 1$, or i.i.d. Gaussian
• These constructions, and ours, are oblivious to $A$
• Fastest are $\tO(nd)$ or $O(\nnz{A}/\epsilon)$
• Samples of rows: $S$ can prescribe sampling of rows of $A$
• For best embedding properties, sample based on leverage scores of $A$
• Not oblivious
8. Our Sketching Matrix

• ...looks like:
$\qquad\qquad\left[\begin{matrix} 0 & -1 & 0 & 0 & +1 & 0 \\ +1 & 0 & 0 & +1 & 0 & 0 \\ 0 & 0 & +1 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \right]$
• Hash (random) function $h:[n] \to [m]$
• $[n]\equiv 1\ldots n$
• $s_{h(i),i} \equiv z_i \sim \pm 1$ equally likely
• Row $a_{i:}$ of $A$ contributes $\pm a_{i:}$ to one row of $SA$
• This is (more or less) CountSketch, from streaming algorithms
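• A minimal NumPy/SciPy sketch of this construction (an illustration, not the paper's code); `m`, the number of buckets, is an assumed parameter:

```python
import numpy as np
from scipy.sparse import csr_matrix

def countsketch_apply(A, m, rng=np.random.default_rng(0)):
    """Build S with S[h(i), i] = z_i and return SA.
    Each row a_{i:} of A is added, with sign z_i, to row h(i) of SA."""
    n = A.shape[0]
    h = rng.integers(0, m, size=n)          # hash h: [n] -> [m]
    z = rng.choice([-1.0, 1.0], size=n)     # signs z_i, +-1 equally likely
    S = csr_matrix((z, (h, np.arange(n))), shape=(m, n))
    return S @ A                            # one update per nonzero of A

# Quick look at the norm-preserving property on one vector in C(A).
rng = np.random.default_rng(1)
A = rng.standard_normal((20000, 10))
y = A @ rng.standard_normal(10)
y /= np.linalg.norm(y)
print(np.linalg.norm(countsketch_apply(y.reshape(-1, 1), m=2000)))  # close to 1
```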
9. Properties of Our Sketching Matrix

• $SA$ takes $\nnz{A}$ time to compute
• $S$ and $SA$ have $m=\tO(d^2\epsilon^{-2})$ rows
• Can reduce to $\tO(d\epsilon^{-2})$ rows using Fast JL (or the subsampled randomized Hadamard transform), in $\tO(d^3\epsilon^{-2})$ additional time
• $S$ is norm-preserving for $C(A)$
• With constant probability
• $S$ preserves matrix product $(SB)^\top SA \approx B^\top A$
• All results follow from these properties
• (Up to rounding error)
10. Preserving Norms

• Embedding, again:
$\norm{Sy} \approx \norm{y}$ for all $y\in C(A)\equiv \{Ax\mid x\in\reals^d\}$
• Strategy (as usual) is to show that embedding holds for:
1. Fixed $y\in C(A)$ with very high probability
2. An $\epsilon_0$-cover of unit vectors in $C(A)$
3. $C(A)$
• We focus on (1)
• Can assume $\norm{y}=1$
11. Preserving Norms of $y$

• Entry $j$ of $Sy$ is $\sum_{h(i)=j} z_i y_i$
• Signed sum of entries $y_i$ falling in bucket $j$
• $(Sy)_j^2 \approx$ sum of squares of entries in bucket
• Khintchine's (a.k.a. Khinchin's, Хи́нчин's) inequality
• So $y_i$ contributes about $y_i^2$ to $\norm{Sy}^2=\sum_j (Sy)_j^2$
• So $\norm{Sy}^2 \approx \norm{y}^2$
• With not-good-enough probability
12. The Virtues of Obliviousness, and Leverage Scores

• For best results, exploit properties common to all $y\in C(A)$
• Pick any particular $U$ so that $C(U)=C(A)$
• $S$ is an embedding for $A$ $\iff$ embedding for $U$
• $S$ is oblivious, so pick any convenient $U$
• Pick $U$ to be an orthonormal basis for $C(A)$
• So columns of $U$ are unit vectors and orthogonal
• Leverage score $u_i$ of row $a_{i:}$ is $u_i \equiv \norm{u_{i:}}^2$
• Any such $U$ gives the same leverage scores
• $\sum_i u_i = \sum_{i,j} u_{i,j}^2 = d$, so "average" $u_i$ is $d/n$
• (Commonly for matrix completion: all $u_i$ assumed small)
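• For reference, the definition as (slow, exact) code; computing these scores quickly is one of the results above:

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores u_i = ||row i of U||^2, for U an orthonormal
    basis of C(A) (here from a thin QR); any such U gives the same scores."""
    U, _ = np.linalg.qr(A)            # columns of U: orthonormal basis for C(A)
    return np.sum(U * U, axis=1)      # these sum to d when A has full column rank
```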
13. Leverage Scores as Bounds on $y_i$

• $\norm{SUx}$ is easier to analyze than $\norm{SAx}$
• For unit $y\in C(A)$, that is, $y=Ux$ for unit $x$:
• $y_i^2 = (u_{i:}x)^2 \le \norm{u_{i:}}^2\norm{x}^2\le u_i$
• Much of the analysis only needs to consider leverage scores
• We show that $S$ is "good" for given leverage scores with constant probability
• For various values of "good"
14. High and Low Leverage Scores

• The idea: split leverage scores into groups of "High" and "Low"
• Assume WLOG that $u_i\ge u_{i'}$ for $i\le i'$
• High: $H\equiv 1\ldots s$, where $s\equiv \sqrt{m/10}$, indexing the $s$ highest scores
• Low: $L\equiv s+1\ldots n$
• Now consider $y^H, y^L\in\reals^n$ such that:
• $y=y^H+y^L$
• $y^H_i = 0$, $i > s$
• $y^L_i = 0$, $i \le s$
• Analyze each part of
$\norm{S(y^H+y^L)}^2= \norm{Sy^H}^2 + \norm{Sy^L}^2 + 2(Sy^L)\cdot (Sy^H)$
15. Error for $y^H$

• Since $s^2\le m/10$, all in $H$ are in separate buckets (no collisions)
• With constant probability
• Assuming this event, $\norm{Sy^H} = \norm{y^H}$
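• Concretely (a back-of-the-envelope bound, assuming $h$ is at least pairwise independent): by a union bound over the $\binom{s}{2}$ pairs in $H$,
$\qquad\Pr[\exists\, i\ne i'\in H:\; h(i)=h(i')] \le \binom{s}{2}\frac{1}{m} \le \frac{s^2}{2m} = \frac{1}{20}$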
16. Error for $y^L$

• We use $u_i \le d/s = O(d/\sqrt{m})$ for $i\ge s$
• From $\sum_i u_i = d$
• Hence all squared entries of $y^L$ are $O(d/\sqrt{m})$
• There is $m=\poly(d/\epsilon)$ so that $\norm{Sy^L}^2 \approx_\epsilon \norm{y^L}^2$
• From [DKS]
• There is a tighter analysis, using a bound on the max weight in buckets
$W\equiv \max_j\sum_{\substack{i>s\\h(i)=j}} y_i^2\le \max_j\sum_{\substack{i>s\\ h(i)=j}} \color{green}{u_i}$
• $W = O(d/m + (d/s)\log(m))$
• With constant probability
• From Bernstein's inequality, union bound over buckets
• Assume this event
17. Mixed Error

• $(Sy^L)\cdot (Sy^H)$ is small: assuming "no birthdays", its second moment is the expectation of:
\begin{align} ((Sy^L)\cdot (Sy^H))^2 & = \left[\sum_{j\in [m], i'\le s, i>s} y_iz_i y_{i'}z_{i'}\Ibr{h(i)=h(i') = j}\right]^2 \\ & = \left[\sum_{i>s} y_i v_i z_i\right]^2 \end{align} where $v_i\equiv y_{i'}z_{i'}$ for the unique $i'\le s$ with $h(i')=h(i)$ (and $v_i\equiv 0$ if no such $i'$ exists)
• By Khintchine, this is bounded by
$C_1 \sum_{i > s} y_i^2 v_i^2 = C_1 \sum_{i' \le s} y_{i'}^2 \sum_{\substack{i>s\\ h(i)=h(i')}} y_i^2 \le C_1 W\sum_{i' \le s} y_{i'}^2$
which is at most $C_1 W$

• So the second moment of $(Sy^L)\cdot (Sy^H)$ is at most $C_1 W$
• Similarly, the $p$'th root of the $2p$'th moment is at most $C_p W$
• $C_p=O(p)$
• $\implies$ high-probability bound on mixed error
• When $m$ and $s$ are large enough, $W\le \epsilon$
• Putting together High, Low, and Mixed, have $\norm{Sy}^2\approx_\epsilon 1=\norm{y}^2$
19. Big, Medium, and Little Leverage Scores

• Condition of having no High collisions is very restrictive
• Makes our bound $m=O(d^4\poly(1/\epsilon))$
• Our best analysis, to get $\tO(d^2\epsilon^{-2})$: subdivide High $u_i$ into groups of near-equal magnitude
• Within factor of two
• Analyze within-group and across-group error
20. Within-Group Error, Analysis Ingredients

• For given $\beta$, consider indices $G$ of $u_i\in [\beta,2\beta]$
• Let $G'\subset G$ denote indices $i\in G$ with $h(i)=h(i')$ for some $i'\in G$, $i'\ne i$
• Errors come only from collisions
• For unit $y\in C(A)$, let $y^G$ have $y^G_i \equiv y_i$ for $i\in G'$, and zero otherwise
• We know that:
• $(y^G_i)^2 \le u_i\le 2\beta$
• $\beta |G|\le d$
• We also show that $\norm{y^G}^2\le O(\beta + 1/|G| + \sqrt{d/m}+ |G|/m)$, up to logs
• Using matrix Bernstein, via Recht
• These bounds allow error for $G$ to be bounded
21. (Even More Matrix Concentration Goodness)

Let $\ell$ be an integer parameter. For $i\in [\ell]$, let $H_i\in\reals^{d\times d}$ be independent symmetric zero-mean random matrices. Suppose \begin{align}\rho_i^2 &\equiv \norm{\E[H_i H_i]}_2 \textrm{ and }\\ M &\equiv \max_{i\in [\ell]} \norm{H_i}_2.\end{align} Then for any $\tau > 0$, $\log \Pr\left[\left\|\sum_{i\in [\ell]} H_i\right\|_2 > \tau\right] \le \log 2d - \frac{\tau^2/2}{\sum_{i \in [\ell]}\rho_i^2 + M\tau/3}.$
• Apply to $U_G^\top U_G$, where $U_G\equiv$ rows of $U$ indexed by $G'$
• $\implies$ bound on squared spectral norm of $U_G$
22. Sketching for Low Rank Approximation

• Obliviousness also helps for low-rank approximation
• Results for regression imply that for column-sketching matrix $R$ and some rank-$k$ $X$, $$\norm{ARX - A}_F \le (1+\epsilon)\norm{A_k-A}_F,$$ where $A_k$ is the best rank-$k$ approximation to $A$
• Proof uses $R^\top$ as a subspace embedding for (unknown) $A_k$
• Use row-sketching to get an approximate version of such an $X$ (a toy version is sketched below)
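• A hypothetical NumPy illustration of that sketch-then-project recipe (the exact algorithm and parameter choices in the paper differ); `c` is an assumed sketch-size parameter:

```python
import numpy as np

def sketched_rank_k(A, k, c, rng=np.random.default_rng(0)):
    """Sketch the columns of A with a random-sign R, then take the best
    rank-k matrix inside the column space of AR (via projection + SVD)."""
    d = A.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, c)) / np.sqrt(c)   # column sketch with c columns
    Q, _ = np.linalg.qr(A @ R)                  # orthonormal basis for C(AR)
    B = Q.T @ A                                 # project A onto that subspace
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ (U[:, :k] * s[:k]) @ Vt[:k, :]   # rank-k approximation to A
```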
23. Concluding Remarks

• Also:
• Affine Embeddings: $\norm{S(AX-B)} \approx \norm{AX-B}$
• High-probability embedding in $O(\nnz{A}\log n)$ time
• In followup work by others:
• $m=O(d^2\epsilon^{-2})$ in $O(\nnz{A})$ time, and
$m=O(d^{1+\gamma}\epsilon^{-2})$ in $O(\nnz{A} f(1/\gamma))$ time [NN]
• As mentioned, we also get to $m'=\tO(d)$, with additive $\tO(d^3)$
• Additive term optimized for regression [MP]
• Implemented and compared to Fast JL for SVM [PBMD]
• Implemented and studied for low-rank approximation [CW]
• More virtues of obliviousness
• Suppose $A=\sum_i A^{(i)}$, with $A^{(i)}$ on processor $i$
• Compute and/or use $SA^{(i)}$ on processor $i$
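• A toy check of the linearity that makes this work, with a dense sign matrix standing in for $S$:

```python
import numpy as np

# Obliviousness + linearity: S(A1 + A2) = S A1 + S A2, so each processor
# can sketch only its own summand A^(i), and the small sketches are summed.
rng = np.random.default_rng(0)
A1, A2 = rng.standard_normal((1000, 10)), rng.standard_normal((1000, 10))
S = rng.choice([-1.0, 1.0], size=(50, 1000))   # any fixed (oblivious) sketch
assert np.allclose(S @ (A1 + A2), S @ A1 + S @ A2)
```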