Add p^\top b to both sides: \norm{Ap}^2 \color{green}{+ p^\top b} \ge 2x^\top Ap - \norm{x}^2 \color{green}{+ p^\top b}
Taking \min on both sides: \color{green}{\min_{p\in \Delta}}\norm{Ap}^2 + p^\top b \ge \color{green}{\min_{p\in \Delta}}2x^\top Ap - \norm{x}^2 + p^\top b
Since this holds for all x, we have: \min_{p\in \Delta} \norm{Ap}^2 + p^\top b
\ge \color{green}{\max_{x\in\reals^d}} \min_{p\in \Delta} 2x^\top Ap - \norm{x}^2 + p^\top b
By Observation (2): \min_{p\in \Delta} \norm{Ap}^2 + p^\top b
\ge \max_{x\in\reals^d} \color{green}{\min_{i\in [n]}} 2x^\top \color{green}{A_i} - \norm{x}^2 + \color{green}{b_i}
This is the weak duality relation between two optimization problems
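A small numerical check of this relation (an illustrative sketch, not part of the argument; it uses NumPy, random data, and Dirichlet samples to approximate the primal minimum):

# Sanity check of weak duality: for any x, the dual value min_i 2 x^T A_i - ||x||^2 + b_i
# is at most the primal optimum min_{p in simplex} ||Ap||^2 + p^T b.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 8
A = rng.standard_normal((d, n))              # columns A_i are the points
b = rng.standard_normal(n)

def primal(p):
    return np.linalg.norm(A @ p) ** 2 + p @ b

def dual(x):
    return np.min(2 * (x @ A) - np.linalg.norm(x) ** 2 + b)

# Approximate the primal minimum by sampling the simplex; this over-estimates the
# true minimum, so the inequality below is implied by weak duality.
primal_est = min(primal(p) for p in rng.dirichlet(np.ones(n), size=20000))

for x in rng.standard_normal((1000, d)):
    assert dual(x) <= primal_est + 1e-9
print("weak duality held for all sampled x")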
\norm{v}\le 1, so in particular each estimate of x^\top A_i has magnitude at most 1
Tail Estimate from Bernstein
Theorem. Let Z_1,\ldots,Z_m be independent random variables with |Z_k|\le 1 for all k\in[m].
For \epsilon, \delta>0, there is m\approx \log(1/\delta)/\epsilon^2
so that with probability at least 1-\delta,
\left|\frac{1}{m}\sum_{k\in [m]} Z_k - \frac{1}{m}\sum_{k\in [m]} \E[Z_k]\, \right| \le \epsilon.
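A quick simulation of this statement (a sketch; the constant 3 in the choice of m is an ad-hoc assumption, not from the theorem):

# With m ~ log(1/delta)/eps^2 bounded samples, the empirical mean is within eps of
# the true mean except with probability about delta.
import numpy as np

rng = np.random.default_rng(1)
eps, delta = 0.1, 0.01
m = int(3 * np.log(1 / delta) / eps ** 2)    # constant 3 chosen ad hoc

true_mean = 0.25
trials, failures = 2000, 0
for _ in range(trials):
    # Z_k bounded in [-1, 1] (here uniform on [-0.5, 1.0]) with mean true_mean.
    Z = true_mean + rng.uniform(-0.75, 0.75, size=m)
    failures += abs(Z.mean() - true_mean) > eps
print(f"m = {m}, empirical failure rate = {failures / trials:.4f} (target ~ {delta})")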
Applying Sampling and Bernstein
Algorithm:
Set m=O(\log(n/\delta))/\epsilon^2
Use m/T\approx O(\log n)/\epsilon independent estimators for each x_t^\top A
In O(\nnz{A}) time, pre-compute row norms \norm{A_{j:}} and column norms \norm{A_i}
Each iteration, in O(d) time, generate sampling probabilities \rho(j) and a sampling data structure (see the sketch after this list)
Analysis:
Apply Bernstein to sampling estimates of \sum_{t\in [T]} \norm{x_t - A_i}^2 in FW
The regret-based analysis goes through as above, with an additional error of O(\epsilon)
O(n/\epsilon + d) work per iteration, or O(\nnz{A} + n/\epsilon^2 + d/\epsilon) overall
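A minimal sketch of the O(d) sampling data structure mentioned in the algorithm: cumulative probabilities built in O(d), with each sample drawn by binary search. Taking \rho(j) proportional to x_t(j)^2 here is an assumption for illustration (it is the \ell_2 distribution introduced on the next slides):

# Build, in O(d), a structure for drawing j ~ rho; each draw is O(log d) by
# inverse-CDF binary search over the cumulative sums.
import numpy as np

rng = np.random.default_rng(2)
d = 10000
x_t = rng.standard_normal(d)

rho = x_t ** 2 / np.dot(x_t, x_t)            # assumed: rho(j) proportional to x_t(j)^2
cdf = np.cumsum(rho)                         # O(d) preprocessing

def sample_indices(m):
    u = rng.random(m)
    return np.minimum(np.searchsorted(cdf, u), d - 1)

print("sampled rows:", sample_indices(5))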
\ell_2 Sampling
We can avoid pre-computing the row norms
For perceptrons, column norms aren't needed either
So total work is sublinear (!)
There are matching lower bounds for sublinear algorithms, both for MEB and for perceptrons
\ell_2 Sampling
Pick j with probability \rho(j) proportional to \color{green}{x(j)}^2
Return estimate row vector v\leftarrow \textrm{Clip}(\frac{1}{\rho(j)} x(j) A_{j:}, 2/\epsilon)
\textrm{Clip}(\cdot,c) means, per coordinate, the nearest value of magnitude \le c (here c=2/\epsilon)
No row norms needed, but the unclipped estimate is unbounded: x(j)/\rho(j) can be very large
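A minimal NumPy sketch of this estimator, averaging m independent single-sample estimates of the row vector x^\top A with coordinatewise clipping at 2/\epsilon (the function name and data are illustrative):

# ell_2 sampling estimate of x^T A: pick j ~ x(j)^2, take (x(j)/rho(j)) A_{j:},
# clip each coordinate at 2/eps, and average m independent such estimates.
import numpy as np

rng = np.random.default_rng(3)
d, n, eps, m = 500, 200, 0.1, 2000
A = rng.standard_normal((d, n)) / np.sqrt(d)   # columns A_i roughly unit norm
x = rng.standard_normal(d) / np.sqrt(d)        # roughly unit norm

def l2_sample_estimate(x, A, eps, m):
    rho = x ** 2 / np.dot(x, x)
    js = rng.choice(len(x), size=m, p=rho)
    V = (x[js] / rho[js])[:, None] * A[js, :]  # m single-sample row-vector estimates
    V = np.clip(V, -2 / eps, 2 / eps)          # Clip(., 2/eps), per coordinate
    return V.mean(axis=0)

est = l2_sample_estimate(x, A, eps, m)
print("max coordinate error:", float(np.max(np.abs(est - x @ A))))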
Clipped Bernstein
Theorem. Let Z_1,\ldots,Z_m be independent random variables with |\E[Z_k]|\le 1 and \Var[Z_k]\le 1 for all k\in[m].
For \epsilon, \delta>0, there is m\approx \log(1/\delta)/\epsilon^2
so that with probability at least 1-\delta,
\left|\frac{1}{m}\sum_{k\in [m]} \textrm{Clip}(Z_k,{2/\epsilon}) - \frac{1}{m}\sum_{k\in [m]} \E[Z_k]\, \right| \le \epsilon.
Other Concentrated Estimators Using Only Variance Bounds
Another estimator with similar bounds:
Split the Z_k into \log(1/\delta) groups of 1/\epsilon^2 each
Take the median of the per-group averages (sketched below)
Long known, e.g. in the streaming literature
Both this median-of-averages and the average-of-clipped estimator are non-linear, which is surely inevitable
It is also possible to get a concentrated estimator when the variance is bounded but unknown (Catoni)
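A sketch of the median-of-averages estimator just described (illustrative; only a variance bound on Z is used, matching the setting above):

# Split the samples into ~log(1/delta) groups of ~1/eps^2 each, average within
# each group, and return the median of the group averages.
import numpy as np

rng = np.random.default_rng(4)
eps, delta = 0.1, 0.01
groups = int(np.ceil(np.log(1 / delta)))
per_group = int(np.ceil(1 / eps ** 2))

def median_of_means(Z):
    Z = Z[: groups * per_group].reshape(groups, per_group)
    return float(np.median(Z.mean(axis=1)))

# Z has mean 1 and variance 1; no |Z| <= 1 bound is assumed.
Z = 1.0 + rng.standard_normal(groups * per_group)
est = median_of_means(Z)
print("estimate:", est, "error:", abs(est - 1.0))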
\ell_2 Sampling as Low-Rent Feature Selection
\ell_2 sampling estimates x^\top A in O(n+d) time
With variance at most \norm{A_i}^2\norm{x}^2 for a single estimate of x^\top A_i
Using a linear classifier requires dot products (with normal vector)
Given vector x, sample so that dot products with a sequence of vectors can be estimated quickly
For a total of q query vectors, use m=O(\epsilon^{-2}\log(q/\delta)) samples
With probability at least 1-\delta, all dot products satisfy the error bounds
Where each y^\top x is estimated as the average of m clipped estimators (see the sketch below)
Storage of x is as for JL/sketching, but classification is faster
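A sketch of this use of \ell_2 sampling: sample x once, store the m pairs (index, x(j)/\rho(j)), and answer each later dot-product query in O(m) time by averaging clipped single-sample estimates. Variable names and the correlated query vector are illustrative assumptions:

# One sample set of x serves many query vectors y; each query costs O(m), not O(n).
import numpy as np

rng = np.random.default_rng(5)
n, m, eps = 100000, 4000, 0.1
x = rng.standard_normal(n); x /= np.linalg.norm(x)

rho = x ** 2 / np.dot(x, x)                   # ell_2 distribution for x
js = rng.choice(n, size=m, p=rho)
weights = x[js] / rho[js]                     # the stored sketch of x: (js, weights)

def dot_estimate(y):
    # Average of m clipped single-sample estimates of y^T x.
    return float(np.mean(np.clip(y[js] * weights, -2 / eps, 2 / eps)))

w = rng.standard_normal(n); w /= np.linalg.norm(w)
y = 0.6 * x + 0.8 * w                          # a query vector correlated with x
print("estimate:", dot_estimate(y), "true:", float(y @ x))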
Kernelization
In kernel methods, the input A_i are lifted to points \Phi(A_i)
\Phi is non-linear
For some popular \Phi():
Dot products \Phi(A_i)^\top \Phi(A_j) can be computed from
Dot products A_i^\top A_j
But also [CHW]:
Good enough estimates of \Phi(A_i)^\top\Phi(A_j) can be computed from
Good enough estimates of A_i^\top A_j
E.g.: for the polynomial kernel (x^\top y)^g, use the product of g independent estimates of x^\top y (sketched below)
Increase in running time is additive O(1/\epsilon^5)
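A sketch of the polynomial-kernel example: a product of g independent single-sample estimates of x^\top y has expectation (x^\top y)^g, by independence. Clipping and the exact error accounting of [CHW] are omitted; the data and constants are illustrative:

# Each kernel estimate multiplies g independent ell_2-sampling estimates of x^T y.
import numpy as np

rng = np.random.default_rng(6)
n, g, reps = 1000, 3, 100000
x = rng.standard_normal(n); x /= np.linalg.norm(x)
w = rng.standard_normal(n); w /= np.linalg.norm(w)
y = 0.8 * x + 0.6 * w                          # make x^T y sizable so the demo is visible
rho = x ** 2 / np.dot(x, x)

js = rng.choice(n, size=(reps, g), p=rho)      # reps independent groups of g indices
factors = y[js] * x[js] / rho[js]              # single-sample estimates of x^T y
print("mean estimate:", float(factors.prod(axis=1).mean()),
      "true:", float((x @ y) ** g))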
Matrix-Matrix Multiplication via Sampling
For A as always and B\in \reals^{d\times n'}, B^\top A = \sum_{j\in[d]} B_{j:}^\top A_{j:}
Each summand B_{j:}^\top A_{j:} is n'\times n, just like B^\top A
Pick j with prob. \rho(j) proportional to \norm{B_{j:}}\cdot\norm{A_{j:}}
Return estimate matrix V, where V \equiv \frac{1}{\rho(j)} B_{j:}^\top A_{j:}
\E[V] = B^\top A
\norm{V} = 1, where \norm{V} is the spectral norm of V
\norm{V}\equiv \max_{x\ne 0} \norm{Vx}/\norm{x}
Matrix Bernstein
Bernstein applies per-element, but also: Matrix Bernstein applies
For V_k, k\in [m] independent estimators, the error spectral norm
\norm{B^\top A - \frac{1}{m}\sum_{k\in [m]} V_k}
is likely to be small
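A numerical sketch of this estimator and its spectral-norm error (illustrative; m is chosen ad hoc rather than from the Matrix Bernstein bound, and B is taken correlated with A so the product has nontrivial norm):

# Estimate B^T A by sampling rows j with rho(j) ~ ||B_{j:}|| ||A_{j:}|| and averaging
# the rank-one estimators (1/rho(j)) B_{j:}^T A_{j:}; report the spectral-norm error.
import numpy as np

rng = np.random.default_rng(7)
d, n, n2, m = 2000, 50, 40, 50000
A = rng.standard_normal((d, n)) / np.sqrt(d)   # columns roughly unit norm
B = A[:, :n2]                                  # a correlated B, so ||B^T A|| is not tiny

rho = np.linalg.norm(B, axis=1) * np.linalg.norm(A, axis=1)
rho /= rho.sum()
js = rng.choice(d, size=m, p=rho)

est = (B[js].T * (1.0 / (m * rho[js]))) @ A[js]
err = np.linalg.norm(B.T @ A - est, ord=2)
print("spectral error:", err, " ||B^T A||:", np.linalg.norm(B.T @ A, ord=2))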
Isotropic Sparsification
A similar estimator can be applied to QQ^\top
Q is d\times n
The rows of Q are orthonormal, so QQ^\top = I_d
Estimator V here has the form Q_{:j}Q_{:j}^\top/\rho(j)
Matrix Bernstein implies that for some m=O((d \log d)/\epsilon^2),
\norm{I - \frac{1}{m}\sum_{k\in [m]} V_k} \le \epsilon
Another way to write this: QDQ^\top = I + E, with \norm{E} \le \epsilon
D is an n\times n diagonal matrix with m nonzero entries
Isotropic Sparsification
That is: there is a "sparse" version Q D^{1/2} of Q
Few columns
Columns are scaled to unit vectors
[Q D^{1/2}] [Q D^{1/2}]^\top = QDQ^\top \approx I
Or: a de-correlated, zero-mean point cloud has a small subset that [after normalization] is nearly de-correlated, nearly zero-mean
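A sketch of isotropic sparsification, assuming \rho(j) proportional to \norm{Q_{:j}}^2 (the specialization of the matrix-product sampling above to B^\top = A = Q; the constant in m is ad hoc):

# Sample columns of Q (orthonormal rows, Q Q^T = I_d) with rho(j) ~ ||Q_{:j}||^2 and
# check || I - Q D Q^T ||, where D_jj = (# times j picked) / (m rho(j)).
import numpy as np

rng = np.random.default_rng(8)
d, n, eps = 30, 5000, 0.15
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
Q = Q.T                                        # d x n with orthonormal rows

col_norms_sq = (Q ** 2).sum(axis=0)
rho = col_norms_sq / col_norms_sq.sum()        # assumed sampling distribution
m = int(10 * d * np.log(d) / eps ** 2)         # ad-hoc stand-in for O(d log d / eps^2)
js = rng.choice(n, size=m, p=rho)

D_diag = np.bincount(js, minlength=n) / (m * rho)
QDQ = (Q * D_diag) @ Q.T
print("|| I - Q D Q^T || =", float(np.linalg.norm(np.eye(d) - QDQ, ord=2)),
      " target eps =", eps)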
Spectral Sparsification
A has the form RQ
Q is d\times n and has orthonormal rows
R is triangular
(Gram-Schmidt on rows)
For x\in\reals^d, we can write \norm{A^\top x}^2 as
x^\top A A^\top x = x^\top RQ Q^\top R^\top x
= x^\top R [QDQ^\top - E]R^\top x
= x^\top R [QDQ^\top] R^\top x - x^\top RER^\top x
= x^\top R QDQ^\top R^\top x + \gamma \norm{A^\top x}^2
Where |\gamma| \le \epsilon, since \norm{E}\le\epsilon and \norm{R^\top x} = \norm{Q^\top R^\top x} = \norm{A^\top x}
So: for all x, \norm{D^{1/2}A^\top x}^2 is a good relative error estimator of \norm{A^\top x}^2
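Continuing the previous sketch: factor A^\top = Q_0 R_0 (so A = RQ with Q = Q_0^\top), sample D as in the isotropic step, and check the relative error of \norm{D^{1/2}A^\top x}^2 on random directions x. The sampling distribution and constants are the same illustrative assumptions as above:

# Spectral sparsification check: ||D^{1/2} A^T x||^2 vs ||A^T x||^2 over random x.
import numpy as np

rng = np.random.default_rng(9)
d, n, eps = 20, 4000, 0.15
A = rng.standard_normal((d, n)) * rng.uniform(0.1, 3.0, size=(d, 1))  # uneven row scales

Q0, R0 = np.linalg.qr(A.T)                     # A^T = Q0 R0, so A = R Q with Q = Q0^T
Q = Q0.T

col_norms_sq = (Q ** 2).sum(axis=0)
rho = col_norms_sq / col_norms_sq.sum()
m = int(10 * d * np.log(d) / eps ** 2)
js = rng.choice(n, size=m, p=rho)
D_diag = np.bincount(js, minlength=n) / (m * rho)

worst = 0.0
for x in rng.standard_normal((200, d)):
    true = np.linalg.norm(A.T @ x) ** 2
    approx = np.linalg.norm(np.sqrt(D_diag) * (A.T @ x)) ** 2
    worst = max(worst, abs(approx - true) / true)
print("worst relative error over sampled x:", worst, " target eps =", eps)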