2. # Some Geometric Optimization Problems

• Throughout, $n$ points in $d$ dimensions
• Or: $A$ is an $n \times d$ matrix
• Row $i$ of $A$ is $A_i$
• Perceptrons: points are red or blue, split colors by a hyperplane
• A flavor of linear programming
• Minimum Enclosing Ball (MEB): find the smallest ball containing the points
• A flavor of quadratic programming
3. # Our Results

• Sublinear-time approximation algorithms, and matching lower bounds
• Matching up to polylog factors
• Additive error $\epsilon > 0$
• Perceptron: $\tilde{\Theta}(\epsilon^{-2}(n+d))$, vs. classical $O(\epsilon^{-2}\nnz{A})$
• $\nnz{A}=$ number of non-zero entries of $A$
• When $\nnz{A}=nd$, a big improvement
• MEB: $\tilde{\Theta}(\epsilon^{-2}n + \epsilon^{-1}d)$, vs. $O(\epsilon^{-1/2}\nnz{A})$
• Upper bound needs norms of points to be known
• Lower bound when center must be in convex hull of input
• The previous algorithm appears in an upcoming SODA paper
• Plus: $L_2$-SVM, $L_1$-SVM (to appear), kernelized (SVM), streaming, parallel, QP in simplex,...
4. # MEB Formulation

• One formulation is $$\min_{x\in\reals^d} R_A(x) = \min_{x\in\reals^d} \max_{i\in [n]} || x - A_i^\top||^2$$
• Find the center $x$ that minimizes the maximum squared distance
• $[n] \equiv \{1,2,\ldots, n\}$
• Equivalently, $$\min_{x\in\reals^d} \max_{\color{green}{p\in\Delta_n}} \color{green}{\sum_{i\in [n]} p(i)}|| x - A_i^\top||^2$$
• $\Delta_n \equiv$ the unit simplex $\equiv \{p\in\reals^n | \sum_i p(i) = 1, p(i)\ge 0\}$
• Since maximizing $p$ puts all its weight on the farthest point
5. # Max $p$ vs. Min $x$

For rounds $t=1,\ldots,T$, two players respond to each other.
• The "max player" tries to find the maximizing $p_t\in\Delta_n$
• For given $x$, a short-sighted response $p$ has some $p(i)=1$
• Less short-sighted: $p_t(i) = p_{t-1}(i)\exp(\eta || x_{t-1} - A_i^\top||^2)$
• For suitable value $\eta$
• ...and also normalizing so that $\sum_i p_t(i) = 1$
• That is, a Multiplicative Weights (MW) update
• The "min player" tries to find the minimizing $x_t$
• For given $p$, the best $x$ is $A^\top p$, the weighted centroid of the input points
• Less myopically, update $x_t$ to $x_{t+1}$ by a convex combination of $x_t$ and $A^\top p_{t+1}$
• That is, an Online Gradient Descent (OGD) update
6. # A Primal-Dual Algorithm

The resulting algorithm, in pseudo-code:
$p_0\leftarrow (\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n})$
$x_0 \leftarrow A^\top p_0$
for $t=1,2,\ldots ,T$:
for each $i$: $p_t(i) \leftarrow p_{t-1}(i)\exp(\eta||x_{t-1} - A_i^\top||^2)$
$p_t\leftarrow p_t/||p_t||_1$
$x_t \leftarrow (1-\frac{1}{t})x_{t-1} + \frac{1}{t}A^\top p_t$
// So $x_t = \frac{1}{t}\sum_{\tau\in [t]} A^\top p_{\tau}$
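• A minimal numpy sketch of the pseudocode above; the function name and the values of $\eta$ and $T$ are illustrative placeholders, not the tuned choices from the analysis:

```python
import numpy as np

def meb_primal_dual(A, T=200, eta=1.0):
    """Deterministic primal-dual MEB sketch following the pseudocode above.

    A is the n x d matrix of points; eta and T are illustrative, not tuned."""
    n, d = A.shape
    sq_norms = (A * A).sum(axis=1)            # ||A_i||^2, assumed known
    p = np.full(n, 1.0 / n)                   # p_0: uniform over the points
    x = A.T @ p                               # x_0: the centroid
    iterates = []
    for t in range(1, T + 1):
        # ||x_{t-1} - A_i||^2 = ||x||^2 + ||A_i||^2 - 2 A_i x, via one mat-vec A @ x
        dists = x @ x + sq_norms - 2.0 * (A @ x)
        # Multiplicative Weights update (shift by the max for numerical stability;
        # the common factor cancels in the normalization)
        p = p * np.exp(eta * (dists - dists.max()))
        p /= p.sum()                          # renormalize onto the simplex
        x = (1.0 - 1.0 / t) * x + (1.0 / t) * (A.T @ p)   # OGD-style convex combination
        iterates.append(x)
    return np.mean(iterates, axis=0)          # \bar{x}: average of the iterates
```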
7. # The Primal-Dual Algorithm is Not Too Bad

• Claim:  for $T=\tilde{O}(1/\epsilon^2)$, $\bar{x} = \sum_{\tau\in [T]} x_\tau/T$ has $$R_A(\bar{x}) \le R_A(x^*) + \epsilon.$$
• Time per iteration is $O(\nnz{A})=O(nd)$
• Except for computing $A^\top p_t$ and the squared distances $||x_t - A_i^\top||^2$, work is $O(n+d)$
• Since $||x_t - A_i^\top||^2 = ||x_t||^2 + ||A_i||^2 - 2A_i x_t$, need basically $Ax_t$
• We're assuming all $||A_i||$ are known
8. # Randomizing the Primal-Dual Algorithm

• We can use random estimates so that one iteration takes $O(n+d)$ total time
• To update $x_t$, pick $i_t$ with prob. $p_t(i)$, and use $A_{i_t}^\top$ instead of $A^\top p_t$
• $\E[A_{i_t}^\top] = A^\top p_t$, so update is the same in expectation
• To update $p_t(i)$ to $p_{t+1}(i)$, estimate $||x_t - A_i^\top||^2$ by:
• Pick $j_t = j$ with probability $x_t(j)^2$ (assuming $||x_t||=1$)
• Use $z_t(i) \equiv A_i(j_t)/x_t(j_t)$ as an estimate of $A_i x_t$
• $\E[z_t(i)] = \sum_j [A_i(j)/x_t(j)]\, x_t(j)^2 = \sum_j A_i(j) x_t(j) = A_i x_t$
• Estimate $||x_t - A_i^\top||^2 \approx v_t(i) \equiv ||x_t||^2 + ||A_i||^2 - 2z_t(i)$
• We call these estimation techniques $\ell_1$ and $\ell_2$ sampling, respectively
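• A small sketch of the two estimators, assuming $||x_t||=1$ (the function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_sample(p):
    """l1 sampling: return an index i_t drawn with probability p(i) (p on the simplex)."""
    return rng.choice(len(p), p=p)

def l2_estimate_dot(a_i, x):
    """l2 sampling: pick coordinate j_t with probability x(j)^2 (needs ||x|| = 1)
    and return a_i(j_t)/x(j_t), an unbiased estimate of the dot product A_i x."""
    j = rng.choice(len(x), p=x * x)
    return a_i[j] / x[j]

def estimate_sq_dist(a_i, sq_norm_a_i, x):
    """v_t(i): estimate of ||x - A_i||^2 from one l2 sample, given ||A_i||^2."""
    return x @ x + sq_norm_a_i - 2.0 * l2_estimate_dot(a_i, x)
```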
9. # The Randomized Primal-Dual Algorithm is Not Too Bad

• Claim:  for $T=\tilde{O}(1/\epsilon^2)$, the output vector $\bar{x} = \sum_{\tau\in [T]} x_\tau/T$ has $$R_A(\bar{x}) \le R_A(x^*) + \epsilon.$$
• NB: same as for deterministic version
• Time per iteration is $O(n+d)$
• Can be much faster than deterministic version
• But maybe there are big constant factors hiding in the $\tilde{O}()$?
10. # Randomized vs. Deterministic

• Noise added to distances to simulate $\ell_2$ sampling
• Color: large $\color{red}{p_t(i)}$, small $\color{blue}{p_t(i)}$, with HSL interpolation in between
• Black dot is center $x_t$ for randomized
• Black dot with red ring is center $x_t$ for det.
• After some iterations, similar behavior
11. # Randomized vs. Deterministic for "real" data

• Ratio of farthest point distances, randomized/deterministic
• Very preliminary
• Some data sets from MLComp, with or without a single additional added outlier
• Here also, randomized $\approx$ deterministic after 25 or so iterations
• To do: comparison with state of the art
12. # Kernelization

• In kernel methods, the input $A_i$ are lifted to points $\Phi(A_i)$
• $\Phi$ is non-linear
• For some popular $\Phi()$:
• Dot products $\Phi(A_i)\Phi(A_j)^\top$ can be computed from
• Dot products $A_iA_j^\top$
• We show:
• Good enough estimates of $\Phi(A_i)\Phi(A_j)^\top$ can be computed from
• Good enough estimates of $A_iA_j^\top$
• E.g.: for the polynomial kernel $(x^\top y)^g$, use $g$ independent estimates of $x^\top y$
• Increase in running time is additive $O(1/\epsilon^5)$
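• For example, a sketch of the polynomial-kernel estimate above, multiplying $g$ independent $\ell_2$-sampling estimates of $x^\top y$ (assumes $||x||=1$; unbiased since the factors are independent):

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_poly_kernel(x, y, g):
    """Estimate (x^T y)^g as the product of g independent l2-sampling
    estimates of x^T y; assumes ||x|| = 1 so that x(j)^2 is a distribution."""
    probs = x * x
    est = 1.0
    for _ in range(g):
        j = rng.choice(len(x), p=probs)   # fresh, independent sample per factor
        est *= y[j] / x[j]                # one unbiased estimate of x^T y
    return est
```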
13. # Remaining Questions

• How can the randomized algorithm possibly work?
• Analysis ingredients: regret bounds, concentration (tail estimates)
• Surely someone has done something very similar already?
• Related work, in particular using $\ell_1$ sampling
• What does this have to do with low-rank matrix approximation?
• $\ell_2$ sampling vs. sketching for matrix-product estimation
14. # Ingredients of the Analysis: Regret Bounds and Concentration

• Regret bounds for on-line convex optimization
• $x_t$ does nearly as well, overall, as the best single $x$, for minimizing average distances to the $A_{i_\tau}$
• $p_t$ does nearly as well, overall, as the best single $p$, for maximizing the averages of the distance estimates $v_t(i)$
• Concentration bounds for the estimators, such as:
$\sum_{t\in [T]} v_t(i)$ is close to its expectation $\sum_{t\in [T]} ||x_t - A_i^\top||^2$, for all $i\in [n]$
15. # Concentration Bounds

• The $\ell_2$-sampling estimator has variance $\norm{A_i}^2\norm{x_t}^2$
• Assume that all such variances are $\le 1$, that is, all $\norm{A_i}\le 1$
• I've been lying: we don't quite use the estimator $v_t$ as described
• We clip it to $\tilde{v}_t(i) \leftarrow \textrm{Clip}(v_t(i), 2/\epsilon)$, so that $|\tilde{v}_t(i)|\le 2/\epsilon$
• Where $\textrm{Clip}(x,a) \equiv \min\{a, \max\{-a, x\}\}$
• We have $| \E[v_t(i)] - \E[\tilde{v}_t(i)] | < \epsilon$
• (Using here $\E[v_t(i)]\le 2$)
16. # Concentrated Estimators via Variance Bounds

• $\sum_{t\in [T]} \tilde{v}_t(i) \approx \sum_{t\in [T]} \E[v_t(i)]$ via a Martingale version of:
• Theorem. Let $Y_1,\ldots,Y_T$ be independent random variables with $|\E[Y_i]|\le 1$ and $\Var[Y_i]\le 1$ for all $i$. For $\epsilon, \delta>0$, there is $T\approx \log(1/\delta)/\epsilon^2$ so that with probability at least $1-\delta$, $$| \sum_{t\in [T]} \textrm{Clip}(Y_t,2/\epsilon) - \sum_{t\in [T]} \E[Y_t]\, | \le T\epsilon.$$
• The proof follows from Bernstein's inequality and the claim about the expectation of the clipped version
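• An illustrative simulation of the theorem's clipped-average estimator; the distribution of $Y_t$ and the constants are made up for the example, and only $|\E[Y_t]|\le 1$ and $\Var[Y_t]\le 1$ matter:

```python
import numpy as np

rng = np.random.default_rng(2)

def clip(y, a):
    """Clip(y, a) = min(a, max(-a, y)), applied elementwise."""
    return np.minimum(a, np.maximum(-a, y))

eps, delta = 0.1, 1e-3
T = int(np.ceil(np.log(1.0 / delta) / eps**2))      # T ~ log(1/delta)/eps^2

# Heavy-tailed-ish Y_t with mean 0.5 and unit variance (rescaled Student t).
Y = 0.5 + rng.standard_t(df=5, size=T) / np.sqrt(5.0 / 3.0)

estimate = clip(Y, 2.0 / eps).mean()                 # average of the clipped samples
print(abs(estimate - 0.5) <= eps)                    # True with high probability
```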
17. # Digression: Other Concentrated Estimators from only Variance Bounds

• Another estimator with similar bounds:
• Split the $Y_i$ into $\log(1/\delta)$ groups of $1/\epsilon^2$ each
• Take the median of the averages of each group (sketched below)
• Long known, e.g. in the streaming literature
• Both this median-of-averages and our average-of-clipped estimator are non-linear, which is surely inevitable
• It is also possible to get a concentrated estimator when the variance is bounded but unknown (Catoni)
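• A sketch of the median-of-averages estimator just described (group sizes rounded up; assumes at least $\lceil\log(1/\delta)\rceil\cdot\lceil 1/\epsilon^2\rceil$ samples are supplied):

```python
import numpy as np

def median_of_averages(samples, eps, delta):
    """Split the samples into ~log(1/delta) groups of ~1/eps^2 each,
    average within each group, and return the median of those averages."""
    k = max(1, int(np.ceil(np.log(1.0 / delta))))   # number of groups
    m = max(1, int(np.ceil(1.0 / eps**2)))          # samples per group
    groups = np.asarray(samples)[: k * m].reshape(k, m)
    return float(np.median(groups.mean(axis=1)))
```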
18. # Related Work: Coresets, Low-Rank Approximation

• Coresets: small subsets of the input points that determine approximate solutions
• Given $\epsilon$-coreset $C$, all input is inside $(1+\epsilon)\textrm{MEB}(C)$
• The rows $A_{i_t}, t\in [T]$, comprise a coreset
• Together with the columns $A_{:j_t}, t\in [T]$, they specify our solution
• For classifier results as well as MEB
• Akin to rows and columns of low-rank approximation
• Vaguely, kind-of, a bit
19. # Related Work: Prior Primal-Dual Algorithms

• Our perceptron algorithm finds $$\min_{p\in\Delta_n} \max_{x\in\Ball_d} p^\top Ax$$
• Previous work [GK,KY] found sublinear approx. algorithms for $$\quad\quad\min_{p\in\Delta_n}\max_{x\in\color{green}{\Delta_d}} p^\top A x$$
• Optimal mixed strategies for a game
• Assumes a bound on $||A_i||_\infty$
• Better results: small relative error
• Easier(?) problem, because $\ell_1$ sampling is easier than $\ell_2$ sampling
• $||A_i||_\infty \le ||A_i||_2$, or $\Delta_d\subset\Ball_d$
• (The solution to $\max_{p\in\Ball_n}\max_{x\in\Ball_d} p^\top A x$ is $||A||$)
20. # $\ell_2$ Sampling vs. Sketches

• Another possible approach: sketching/JL/random projections
• Map each $A_i$ to $\tilde{A}_i\in\reals^m$ with $m=O(\epsilon^{-2}\log(n/\delta))$
• This can be done so that $A_i A_j^\top \approx \tilde{A}_i \tilde{A}_j^\top$ for all $i,j\in [n]$ (see the sketch below)
• Solve the problem in "sketch space", and then map back
• Sketching works here, but is slower and takes more space
• Are there other applications where $\ell_2$ sampling can be used instead of sketching?
• "Feature selection" for classifiers
• Matrix product
• Probably: nearest-neighbor search
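• A minimal sketch of the random-projection alternative mentioned above, using a dense Gaussian sketch (the constant in $m$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def jl_sketch(A, eps, delta):
    """Map each row A_i to a sketched row in R^m, with m = O(eps^-2 log(n/delta)),
    so that the dot products A_i A_j^T are approximately preserved."""
    n, d = A.shape
    m = int(np.ceil(4.0 * np.log(n / delta) / eps**2))
    S = rng.standard_normal((d, m)) / np.sqrt(m)    # Gaussian sketching matrix
    return A @ S                                     # n x m matrix of sketched rows
```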
21. # $\ell_2$ Sampling as Low-Rent Feature Selection

• Here we showed that $\ell_2$ sampling estimates $Ax$ in $O(n+d)$ time
• With variance $\norm{A_i}^2\norm{x}^2$ for estimate of $A_ix$
• By clipping and averaging, this gives a concentrated estimator
• Only an upper bound on the row norms of $A$ is needed
• This could be applied to linear classification
• Weight vector is $x$
• Rows of $A$ comprise the vectors to classify
• Sample $x$ beforehand, storage $O(\epsilon^{-2}\log(q/\delta))$
• $q$ bounds the number of rows of $A$
• Storage of $x$ as for sketching, but classification is faster
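• A sketch of the pre-sampling idea described above: sample the weight vector $x$ once, then estimate each margin $A_i x$ from the stored samples (assumes $||x||=1$; the sample count and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def presample_weights(x, eps, delta, q):
    """Sample O(eps^-2 log(q/delta)) coordinates of the unit-norm weight vector x,
    once, with probability x(j)^2; store the indices and sampled values."""
    k = int(np.ceil(4.0 * np.log(q / delta) / eps**2))
    idx = rng.choice(len(x), size=k, p=x * x)
    return idx, x[idx]

def estimate_margin(a_i, idx, x_vals):
    """Estimate A_i x as the average of single-coordinate estimates a_i(j)/x(j);
    the sign of the estimate gives the classification of row A_i."""
    return float(np.mean(a_i[idx] / x_vals))
```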
22. # Matrix Product via $\ell_2$ Sampling

• We can extend from vector $x$ to matrix $B$
• Pick row $i$ of $B$ with probability $Q_i\equiv \norm{B_i}^2/\norm{B}_F^2$
• Return the outer product $A_{:i}B_i/Q_i$ as an estimate of $AB$ (see the sketch below)
• Variance is about $\norm{A}_F^2\norm{B}_F^2$
• Can be repeated for concentrated estimator
• Needs upper bound on row norms of $A$
• Very similar to classical sampling of [DKM]
• (Also very similar to some sampling schemes for regression)
• But: nearly oblivious to $A$
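• A sketch of the sampled matrix product just described, averaging a few repetitions for concentration (num_samples is an illustrative knob):

```python
import numpy as np

rng = np.random.default_rng(5)

def sampled_matrix_product(A, B, num_samples=1):
    """Estimate AB by sampling rows of B with probability Q_i = ||B_i||^2/||B||_F^2
    and averaging the rescaled outer products A_{:i} B_i / Q_i."""
    Q = (B * B).sum(axis=1)
    Q = Q / Q.sum()                                  # Q_i = ||B_i||^2 / ||B||_F^2
    est = np.zeros((A.shape[0], B.shape[1]))
    for _ in range(num_samples):
        i = rng.choice(B.shape[0], p=Q)              # sample a row index of B
        est += np.outer(A[:, i], B[i, :]) / Q[i]     # rescaled rank-one estimate
    return est / num_samples
```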
23. # Concluding Remarks

• Bounds are tight for basic perceptron and MEB algorithms
• Kernel methods need improvement in the additive $O(\epsilon^{-5})$ term
• Stochastic gradient algorithms are of course also related
• Where else can $\ell_2$ sampling supersede sketching?
• Are there $\tilde{O}(\nnz{A})$ algorithms for these problems?