
Numerical Linear Algebra in the Streaming Model: Upper Bounds
Ken Clarkson
IBM Almaden
joint with David Woodruff

The Problems
Given `n times d` matrix `A`, `n times d'` matrix `B`, integer `k`, estimators for:
 The matrix product `A^TB`
 The matrix `X^**` minimizing `||AX - B||`
 A slightly generalized version of least-squares regression
 The matrix `A_k` of rank `k` minimizing `||A - A_k||`
 Rank `k` means: the matrix can be expressed as `CD^T` where `C` and `D` have `k` columns
 The rank of `A`

General Properties of Our Algorithms
 Matrix norm here is always Frobenius: root of sum of squares
 Make one pass over the matrix entries, in any order
 Maintain compressed versions of matrices,
with `O(d+d')`, `O(d^2)`, `O(k(n+d))`, or `O(k^2)` entries
 That is, `o(N)`, where `N=nd` or `N=nc`, where `c := d+d'`
 Do `O(1)` work per entry in maintaining the sketches
 Compute output results using the sketches
 Have provable error bounds, with high probability
 For some cases, sketches cannot be smaller
 When `A` and `B` have appropriate-sized integer entries

Matrix Compression Methods
In a line of similar efforts...
 Elementwise sampling [AM01][AHK06]
 Sketching/Random Projection: maintain a small number of
random linear combinations of rows or columns [S06]
 Row/column sampling: pick small random subsets of the
rows, columns, or both [DK01][DKM04]
 Sample probability based on Euclidean norm of row or column
 In general, needs two passes
 Or even: probability based on norm of vector in SVD
 Whole row or column samples are good "examples", and may preserve sparsity
Here: sketching

Outline
 Matrix Product
 The algorithm
 The bounds, and relation to Johnson-Lindenstrauss
 Previous work
 Outline of analysis
 Regression
 Low-rank approximation
 Outline of analysis, using regression results
 (Rank estimation omitted)
 Uses matrices `C` and `D` each with `k` rows
so that the rank of `CAD^T` is likely at least `k` if the rank of `A` is at least `k`

Approximate Matrix Product
 `A` and `B` have `n` rows, we want to estimate `A^TB`
 Let `S` be an `n times m` sign matrix
 A.K.A. Rademacher or Bernoulli
 Each entry is `+1` or `-1` with probability `1//2`
 `m = O(1)`, to be specified
 Independent entries, for now
 Our estimate of `A^TB` is `A^TSS^TB//m`
 That is, sketches are `S^TA` and `S^TB`
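As a concrete illustration, here is a minimal numpy sketch of the estimator (not from the talk; the dimensions and the value of `m` are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dp, m = 1000, 5, 4, 2000  # m larger than theory needs, for a visibly tight estimate

A = rng.standard_normal((n, d))
B = rng.standard_normal((n, dp))

# Sign (Rademacher) matrix: each entry +1 or -1 with probability 1/2
S = rng.choice([-1.0, 1.0], size=(n, m))

# The sketches: S^T A is m x d, S^T B is m x d'
SA, SB = S.T @ A, S.T @ B

# Estimate of A^T B, computed from the sketches alone
est = SA.T @ SB / m
err = np.linalg.norm(est - A.T @ B)  # Frobenius norm of the error Lambda
```

The sketches have `m(d+d')` entries, independent of `n`, which is the whole point.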

Streaming Matrix Updates and Pass Efficiency
 Needing space only for `S^TA` and `S^TB` yields an algorithm in the streaming setting
 Suppose the matrix entries are given as a sequence of updates to `A` or `B`
 An update specifies `i`, `j`, `v`, and `A` or `B`, so that `a_{ij} := a_{ij} + v`, or sim. for `B`
 As in the turnstile streaming model
 Even for `A` and `B` fixed in memory,
the fewer passes over the data, the better

Algorithm Bounds
 As `A` and `B` stream by, maintain `S^TA` and `S^TB`
 For update `i`,`j`, `v` for `A`, add `v [s_{ i : }]^T` to the `j`'th column of current `S^TA`
 Time is `O(m)` per update, since `s_{i: }` has `m` entries
 Space is `O(mc)` for `S^TA` and `S^TB`
 `O(m)` space for `S`, as `S` entries need only be `O(log(1//delta))`-wise
independent
 When desired, compute `[A^TS ][S^TB]//m`
 Time for product of `d times m` with `m times d'` is `O(m dd') = O(m c^2)`, `c:=d+d'`
 As a streaming algorithm:
 Maintaining `N=nc` values using `o(N)` time and `o(N)` space
 But: compute time is `O(mdd')`, not `o(N)`
 In strictest sense, not streaming, but takes only one pass
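The update rule above can be checked directly; a small numpy sketch (with hypothetical dimensions) that processes turnstile updates one at a time:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 200, 3, 50

S = rng.choice([-1.0, 1.0], size=(n, m))  # in practice, generated from a small seed
A = rng.standard_normal((n, d))

# Maintain S^T A under turnstile updates (i, j, v) meaning a_ij += v
sketch = np.zeros((m, d))
for i in range(n):
    for j in range(d):
        v = A[i, j]                  # here: one update per entry; any order works
        sketch[:, j] += v * S[i, :]  # O(m) work: add v times (row i of S)^T
```

After all updates, `sketch` equals `S^T A`, whatever order the updates arrived in.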

Why this works: the sign matrix `S`
 Suppose `x` and `y` are independent Rademacher random values
 Each takes the values `+1` and `-1` with equal prob.
 Then `x^2 = y^2 = 1`, and `bb E[x] = bb E [x^{2p+1}] = bb E[xy] = 0`
 Suppose `x` is a sign vector
 Each entry of `x` is an independent Rademacher random value
 Then `bb E[x]=0` and the outer product `bb E[x x^T]=I`
 Suppose `S` is a sign matrix with `m` columns:
 Each entry of `S` is an independent Rademacher value
 `SS^T` is the sum of the outer products of the `m` column vectors `s_{ : i}` of `S`
 `bb E[SS^T]//m = bb E[sum_i s_{ : i} s_{ : i}^T]//m = (mI)//m = I`
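The identity `bb E[SS^T]//m = I` is easy to check empirically; in this small numpy experiment (dimensions arbitrary), the diagonal of `SS^T//m` is exactly 1 and the off-diagonal entries concentrate near 0 at rate `1//sqrt m`:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 200_000
S = rng.choice([-1.0, 1.0], size=(n, m))

avg = S @ S.T / m  # empirical version of E[S S^T] / m
# Diagonal entries are exactly 1, since each entry squared is 1
# Off-diagonals are means of m independent +/-1 products: std 1/sqrt(m)
```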

Expected Error and a Tail Estimate
 From `bb E[SS^T]//m =I` and linearity of expectation,
`bb E[A^TSS^TB//m] = A^T bb E[SS^T] B//m = A^TB`
 So in expectation, sketch product is a good estimate of the product
 This is true also with high probability
 That is, for `delta,epsilon>0`, there is `m = O(log(1//delta)epsilon^{-2})` so that
`Prob { ||Lambda|| > epsilon ||A|| ||B|| } le delta`
 Here `Lambda` is the error `A^TSS^TB//m - A^TB`
 ...and again `||A|| := [ sum_{i,j} a_{ij}^2] ^{1//2}`
 This tail estimate seems to be new
 True also when entries of `S` are `O(log(1//delta))`-wise independent

Relation to JohnsonLindenstrauss
 For `B=A=b`,
the `n`-vector `b` `->` the `m`-vector `S^Tb`
 The tail estimate says that w.h.p.,
`| b^TSS^Tb // m - b^Tb |
= | ||S^Tb||^2//m - ||b||^2 | le epsilon ||b||^2`
 That is, the length of `b` is approximately preserved by `hat b := S^Tb`
 This is (pretty much) the celebrated JohnsonLindenstrauss Lemma
 (Use a sign matrix rather than the original random rotation)
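In code, the JL-style guarantee reads as follows (a numpy sketch with arbitrary sizes): the squared length of the `m`-vector `S^Tb`, divided by `m`, is close to the squared length of `b`.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 10_000, 5_000
b = rng.standard_normal(n)

S = rng.choice([-1.0, 1.0], size=(n, m))
b_hat = S.T @ b             # the compressed m-vector

est = b_hat @ b_hat / m     # ||S^T b||^2 / m, estimates ||b||^2
rel_err = abs(est - b @ b) / (b @ b)
```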

JL `=>` Matrix Product Estimate
 The JL Lemma itself implies a weaker form of the matrix product result
 For column vectors `a` and `b`, if
`hat a`, `hat b`, and `hat a + hat b` have about the same length as
`a`, `b`, and `a+b`,
 Then `hat a cdot hat b approx a cdot b`, with error
about `epsilon ||a|| ||b||`
 Apply JL to all `a_{ :i}`, `b_{ :j}`, and `a_{ :i} + b_{ :j}`
 Total failure probability is `O(c^2 delta)`,
where again `c:=d+d'`
 For large enough `m = O(log(c) log(1//delta)//epsilon^2)`,
we have that every dot product `(S^Ta_{ :i}) cdot S^Tb_{ :j}` is a good
estimate of `a_{ :i} cdot b_{ :j}`

JL and Matrix Product
 So: JL implies error bound for every entry of `A^TSS^TB`
 Not just the Frobenius norm
 At the cost of a factor of `log c` in `m`

Related Work
This JLbased algorithm is due to Sarlós [S06], who gave two algorithms for product:
 In one pass, but with an additional `log c` factor, using JL
 In two passes, using a bound on `bb E[||Lambda||^2]`
 But needing limited randomness: for each column, four-wise independence
 Here: `O(log(1//delta))`-wise independence among all entries of `S` is adequate
 The two-pass algorithm is similar in resource bounds to earlier sampling-based
algorithms
 Our proofs are descendants of [S06], which stand on [DKM*]

Lower Bound on Space
 Squeezing out the `log c` factor in the sketch size
is maybe not so interesting
 Except: a space lower bound of `Omega(c epsilon^{-2} log(nc))` bits holds for
any one-pass algorithm,
for failure probability `delta le 1//4`, when entries are `O(log(nc))`-bit integers
 For large enough `n` and `c`
 Defer lower bound discussion to "part two"
 Result here has:
 Fewest passes (one)
 Least space for one pass
 High probability bounds
 Simpler than previous for high probability
 Most general streaming model

A Moment Bound Implies the Tail Estimate
 The tail estimate, implying that the sketches are good w.h.p.,
follows from a bound on the moments of the error
 For a random variable `Y`, let `bb E_p[Y]` denote `[bb E[Y^p]]^{1//p}`
 For any `p`,
`bb E_p[||Lambda||^2] le C p ||A||^2 ||B||^2 // m `,
for a constant `C`, where (again) `Lambda := A^TSS^TB//m - A^TB`
 Or, `bb E_p[||Lambda||] le C sqrt p ||A|| ||B|| // sqrt m `
 The bound `O(sqrt p)` as `p -> infty`
implies that `||Lambda||` is subgaussian:
the tail of its distribution is bounded by that of a Gaussian
 Or: apply the Markov inequality to `||Lambda||^p`, use `p approx log(1//delta)`

The Moment Bound, Roughly
To bound `bb E [||Lambda||^{2p}]`:
 Multiply out its definition
 Apply linearity of expectation
 The resulting sum has the form
`sum ["terms dependent on "`A`" and "`B`] quad bb E[s_{i_1j_1} s_{i_2 j_2}....]`
 Since `bb E[s^k] =0` for Rademacher `s` and odd `k`
many summands are zero
 Conditions on subscripts that imply `bb E[s_{i_1 j_1}...]` terms are nonzero, also imply conditions
on the datadependent parts, implying that the sum can be bounded

Regression
 The problem again: `min_X ||AX - B||^2`
 `X^**` minimizing this has `X^** = A^{-} B`,
where `A^{-}` is the pseudoinverse of `A`
 The algorithm is:
 Maintain `S^TA` and `S^TB`
 Return `hat X` solving `min_X ||S^T(AX - B)||`
 Main claim: if `A` has rank `k`,
there is `m=O(k epsilon^{-1} log(1//delta))` so that
with probability at least `1-delta`
`||A hat X - B|| le (1 + epsilon) ||AX^** - B||`
 That is, relative error for `hat X` is small
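A minimal sketch-and-solve example in numpy (sizes and noise level are invented for the demo, not from the talk): solve the small sketched problem and compare its residual to the exact least-squares residual.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 5000, 5, 500

A = rng.standard_normal((n, d))
B = A @ rng.standard_normal((d, 2)) + 0.1 * rng.standard_normal((n, 2))

S = rng.choice([-1.0, 1.0], size=(n, m))

X_star, *_ = np.linalg.lstsq(A, B, rcond=None)              # exact minimizer
X_hat, *_ = np.linalg.lstsq(S.T @ A, S.T @ B, rcond=None)   # sketched minimizer

opt = np.linalg.norm(A @ X_star - B)  # best possible residual
app = np.linalg.norm(A @ X_hat - B)   # residual of the sketched solution
```

The sketched problem has only `m` rows instead of `n`, yet its solution's residual on the full problem is nearly optimal.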

Regression Analysis
 Why should `hat X` be so good?
 `S^T` approximately preserves the norm of `AX - B`, for fixed `X`
 If this worked for all `X`, we're done
 `S^T` must preserve norms even for `hat X`, chosen using `S`
 The main idea: to show that `hat X` is good,
reduce to showing that `A(X^** - hat X)` is small
 Using normal equations of exact problem
 Then, use rank `k le d` of `A`

Regression Analysis, cont.
 `A` has rank `k`, so `A = CD^T` for `C` and `D` with `k` columns
 All columns of `A(X^** - hat X)` are in the
columnspace of `C`
 `equiv` the `k`dimensional space of linear combinations of the columns of `C`
 Each such column has the form `Cy` for a vector `y`
 Fact (Subspace JL): for `m=O(k epsilon^{-1} log(1//delta))`,
`S^T` approximately preserves lengths of all vectors in a `k`space
 ...including columns of `A(X^** - hat X)`
 So, `||S^T (A hat X - B )||` small `=> ||A hat X - B||` is small

Best LowRank Approximation
 For any matrix `A` and integer `k`,
there is a matrix `A_k` of rank `k` that is closest to `A` among all matrices of rank `k`
 Since rank of `A_k` is `k`, it is the product `CD^T` of two `k`column matrices `C` and `D`
 (`A_k` can be found from the SVD (singular value decomposition) `A = U Sigma V^T`,
taking `C` and `D` from the first `k` columns of `U` and of `V Sigma`)
 This is a good compression of `A`
 If entries of `A` are noisy measurements,
often the noise is "compressed out" in this way
 LSI, PCA, Eigen*, recommender systems, clustering,...

Best LowRank Approximation and `S^TA`
 The sketch `S^TA` holds a lot of information about `A`
 In particular, there is a rank `k` matrix `hat A_k` in the rowspace of `S^TA` nearly
as close to `A` as the closest rank `k` matrix `A_k`
 The rowspace of `S^TA` is the set of linear combinations of its rows
 That is, `||A - hat A_k|| le (1+epsilon)||A - A_k||`

LowRank Approximation : Using Regression
 Why is there such an `hat A_k`?
Apply the regression results with `A -> A_k`, `B -> A`
 The `hat X` minimizing `||S^T(A_k X - A)||`
has `||A_k hat X - A|| le (1 + epsilon) ||A_k X^** - A||`
 But here `X^** = I`, and `hat X = (S^TA_k)^{-}S^TA`
 So, the matrix `A_k hat X = A_k(S^TA_k)^{-}S^TA`:
 Has rank `k`, since the rank of the product is the min of the ranks
 Is in the rowspace of `S^TA`
 Is within `1+epsilon` of the smallest distance of any rank `k` matrix

Best LowRank Approximation:
Two Pass Algorithm
 We can't use `A_k(S^TA_k)^{-}S^TA` without finding `A_k` first
 Instead:
 Maintain `S^TA`
 Project: find the closest matrix `hat A` to `A` in the rowspace of `S^TA`
 Approximate: find the best rank `k` approximation to `hat A`
 But, this does two passes over `A`
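The two-pass procedure can be sketched in numpy as follows (dimensions, noise level, and the rank gap are invented for the demo): pass one forms `S^TA`; pass two projects `A` onto the rowspace of `S^TA` and truncates to rank `k`.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k, m = 300, 200, 3, 60

# A = rank-k signal plus small noise
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
A += 0.01 * rng.standard_normal((n, d))

S = rng.choice([-1.0, 1.0], size=(n, m))
SA = S.T @ A                            # pass 1: the sketch

# Pass 2: project A onto the rowspace of S^T A ...
Q, _ = np.linalg.qr(SA.T)               # orthonormal basis (d x m) of that rowspace
A_proj = A @ Q                          # coordinates of each row in the basis

# ... then take the best rank-k approximation of the projection
U, s, Vt = np.linalg.svd(A_proj, full_matrices=False)
A_hat = (U[:, :k] * s[:k]) @ Vt[:k] @ Q.T

# Compare with the true best rank-k approximation A_k from the full SVD
U2, s2, Vt2 = np.linalg.svd(A, full_matrices=False)
A_k = (U2[:, :k] * s2[:k]) @ Vt2[:k]

best = np.linalg.norm(A - A_k)
ours = np.linalg.norm(A - A_hat)
```

`A_hat` has rank `k` and its rows lie in the rowspace of `S^TA`; its distance to `A` is close to the optimum `||A - A_k||`.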

Nearly Best NearlyLowRank Approximation
 Suppose `R` is a `d times m` sign matrix (recall `A` is `n times d`)
 By regression results transposed, the columnspace of `AR` contains
a nearly best rank-`k` approximation to `A`
 That is, `hat X` minimizing `||AR X - A||` has
`||AR hat X - A|| le (1+epsilon) ||A - A_k||`
 Apply regression results with `A -> AR` and `B -> A`,
and `X'` minimizing `||S^T(ARX - A)||`
 We have `X' = (S^TAR)^{-}S^TA`, which has
`||AR X' - A|| le (1+epsilon)||AR hat X - A|| le (1+epsilon)^2||A - A_k||`
 Since `AR` has rank `k epsilon^{-1}`,
`S` must be `n times m'`, with `m'=k epsilon^{-2}`

Nearly Best NearlyLowRank Algorithm
 An algorithm: maintain `AR` and `S^TA`, return
`ARX' = AR (S^TAR)^{-} S^TA`
 Rank is `k//epsilon`
 Distance to `A` is `(1+epsilon)A  A_k`
 This approximation to `A` is interesting in its own right
 No SVD required, only the pseudoinverse of a matrix of constant size
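A numpy sketch of this one-pass approximation (all sizes are hypothetical choices for the demo): both factors come from the sketches `AR` and `S^TA`, and the only decomposition needed is the pseudoinverse of the small `m' times m` matrix `S^TAR`.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k = 300, 200, 3
m = 40      # columns of R: rank of the approximation (~ k/epsilon)
mp = 120    # columns of S (~ k/epsilon^2, larger than m)

A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
A += 0.01 * rng.standard_normal((n, d))   # rank-k signal plus small noise

R = rng.choice([-1.0, 1.0], size=(d, m))
S = rng.choice([-1.0, 1.0], size=(n, mp))

AR, SA = A @ R, S.T @ A                   # the two sketches, one pass over A

# AR (S^T A R)^+ S^T A: rank <= m, computed from the sketches alone
approx = AR @ np.linalg.pinv(S.T @ AR) @ SA

# Compare to the distance achieved by the best rank-k matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]
best = np.linalg.norm(A - A_k)
ours = np.linalg.norm(A - approx)
```

Note that `approx` has rank up to `m`, not `k`; the payoff is that its distance to `A` is comparable to that of the best rank-`k` matrix, with no SVD of anything large.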

Nearly Best LowRank Approximation
Still haven't found a good rank `k` matrix
 To do this, we find the
best rank-`k` approximation to
`AR(S^TAR)^{-} S^TA` in the columnspace of `AR`
 Uses sketches `AR` and `S^TA`
that are bigger than our lower bounds require, w.r.t. `epsilon`
 We have to make `m` and `m'` bigger to prove same approximation bound
 When `A` is given a column at a time, or a row at a time, we can do better

Concluding Remarks
 Space bounds are tight for product, regression
 Space bounds are not tight w.r.t. `epsilon` for lowrank approximation
 Upper bounds are at fault, probably
 We have better upper bounds for restricted cases
 The entrywise `r`-norm of the error matrix `Lambda` can also be bounded
 This implies a bound on `||Lambda||_{"max"}` in terms of `||A||_{1->2}` and `||B||_{1->2}`
 Other projection matrices besides sign matrices?
 For what other problems is the full power of the JL transform not needed?