Nearest Neighbor Search and Metric Space Dimensions
Ken Clarkson
Bell Labs
Nearest Neighbor Search: the Problem
Given
- A set of sites (points) `S subset U`
- In metric space `(U,D)`
- Build a data structure so that:
Given a query point `q in U`, the closest site in `S` to `q` can be found quickly
Nearest Neighbor Search: Synonyms
- Considered for at least thirty years, called:
- Post Office problem (McNutt, '72)
- Best match file searching [BK73]
- Index for similarity search [HS03]
- Vector quantization encoder
- Fast nearest-neighbor classifier
Metric Space Dimension
- Roughly, a notion of how the "size" of `U` changes with measurement scale
- Intimately related to NN searching
- Some dimensional measures (e.g., Assouad) give provable upper bounds [C97], [KR02], [KL04], [HPM05]
- Empirically, can be used to predict NN search performance [BF98], [TFP03]
- NN search useful for estimating dimension
- Correlation dimension via batched NN queries
- Pointwise dim. is related to NN distance
- Renyi dimensions via extremal graphs
Outline
- Some basics about metric spaces: repair and construction
- Packings, coverings, nets, Gonzalez construction
- Dimensions: box, packing, Assouad
- Metric measure spaces
- Renyi and pointwise dimensions, doubling measures
- Approaches to NN searching, relation to doubling constant and measures
Metric Spaces: Definition and Repairs
A metric space `(U,D)` has `D(x,y) ge 0` and `D(x,x)=0` for all `x,y in U`,
and also:
- Isolation: `x ne y` implies `D(x,y) > 0`
- If not: pseudometric, fix with equivalence classes
- Symmetry: `D(x,y)=D(y,x)`
- If not: quasimetric; `hat D (x,y) := (D(x,y) + D(y,x))//2`
- Triangle Inequality: `D(x,z) le D(x,y) + D(y,z)`
- If not: semimetric; `hat D (x,y) := inf sum_i D(z_i, z_{i+1})`, `x=z_0`, `y=z_k`
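The two repairs above can be tried concretely; a minimal Python sketch, with `symmetrize` and `shortest_path_fix` as illustrative names not from the slides:

```python
def symmetrize(D):
    """Repair a quasimetric: hat_D(x, y) = (D(x, y) + D(y, x)) / 2."""
    return lambda x, y: (D(x, y) + D(y, x)) / 2.0

def shortest_path_fix(points, D):
    """Repair a semimetric on a finite set: replace D(x, y) by the
    infimum over chains x = z_0, ..., z_k = y of sum_i D(z_i, z_{i+1}).
    On a finite set this is all-pairs shortest paths (Floyd-Warshall)."""
    n = len(points)
    d = [[D(points[i], points[j]) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d
```

After the second repair, the matrix `d` satisfies the triangle inequality by construction.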
New Metrics from Old
Start with the uniform metric on a finite set, or `(RR, |x-y|)`;
Suppose `(U,D)`, and `(U_1,D_1)...(U_d,D_d)` are metric spaces.
- `L_p` products: `hat U:= U_1 xx U_2 xx cdots xx U_d`, with `hat D(x,y) :=` the `L_p` norm of `(D_1(x_1,y_1), ldots, D_d(x_d,y_d))`
- Strings over `U`
- Nonnegative combinations: `U_1=U_2= cdots =U_d`; given `alpha_1 ldots alpha_d ge 0`, `hat D(x,y) := sum_i alpha_i D_i(x,y)`
- Distance on subsets `A,B subset U`
- Hausdorff
- Given measure `mu`, distance `mu(A Delta B)`
New Metrics from Old: Transforms
- Given `f(z)` on `RR` with:
- `f(0)=0`
- `f` monotone increasing
- `f` concave
- Have:
- `hat D(x,y) := f(D(x,y))` also a metric
- For `0 < epsilon le 1`, `f(z) := z^epsilon`, the "snowflake"
- Alternate fix for semimetric
- `f(z) := z/(1+z)` : bounded space
- For `lambda > 0`, `f(z) := 1 - e^{: - lambda z:}` : Schoenberg transform
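These transforms are easy to experiment with; a small Python sketch (function names are mine), applying them to `|x-y|` on `RR`:

```python
import math

def transform(D, f):
    """If f(0) = 0 and f is increasing and concave, f o D is again a metric."""
    return lambda x, y: f(D(x, y))

def snowflake(eps):                      # requires 0 < eps <= 1
    return lambda z: z ** eps

def bounded(z):                          # maps distances into [0, 1)
    return z / (1.0 + z)

def schoenberg(lam):                     # 1 - e^{-lambda z}, lambda > 0
    return lambda z: 1.0 - math.exp(-lam * z)
```

For example, `transform(lambda x, y: abs(x - y), snowflake(0.5))` is the square-root metric on the line.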
The Biotope Transform
Given `a in U`, the
biotope or
Steinhaus transform
`hat D(x,y) := {: 2D(x,y):} / {:D(x,a) + D(y,a) + D(x,y):}`
yields a metric. (How did I not know this?)
For `D(A,B)=mu(A Delta B)` and `a=O/`, get
`hat D(A,B) = {:mu(A Delta B):} / {:mu(A uu B):}`
Generalizations? Replacing `D(x,a) + D(y,a)` by `min_{a in T} (D(x,a) + D(y,a))` seems to work,
for `T subset U`.
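A quick check of the transform and its Jaccard special case; a sketch (function names mine), with finite sets as points and `D(A,B) = |A Delta B|`:

```python
def biotope(D, a):
    """Steinhaus/biotope transform of metric D with base point a."""
    def hat(x, y):
        denom = D(x, a) + D(y, a) + D(x, y)
        return 0.0 if denom == 0 else 2.0 * D(x, y) / denom
    return hat

def jaccard(A, B):
    """Special case D(A, B) = |A Delta B| with base point the empty set:
    the Jaccard distance |A Delta B| / |A uu B|."""
    return 0.0 if not (A | B) else len(A ^ B) / len(A | B)
```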
Biotope Distance : a.k.a.
- Marczewski-Steinhaus [MS58] in ecology, 32 hits
- Tanimoto [RT60] in chem and genetics, 157 hits
- Jaccard [J01] in CS and genetics, 262 hits
- Set similarity in TCS [Cha02]
- Resemblance in TCS/Web [B97]
Packings, Coverings, Nets
Given `(U ,D)`, `epsilon > 0`, `P subset U` is an:
- `epsilon`-covering: `D(x,P) le epsilon` for all `x in U`
- `epsilon`-packing: `D(x,y) ge 2 epsilon` for all distinct `x,y in P`
- `epsilon`-net: `epsilon`-cover and `epsilon/2`-packing
- (Different from a Haussler/Welzl `epsilon`-net, which hits all ranges of large measure)
- Gonzalez construction:
- starting with `P = {x}` for some `x in U`, repeat:
- Add to `P` the point `y` farthest from `P`
- Until `P` is an `epsilon`-net
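A minimal Python sketch of the construction (names mine; a finite `points` list stands in for `U`):

```python
def gonzalez_net(points, D, eps):
    """Farthest-point construction of an eps-net: repeatedly add the
    point farthest from P, stopping once P is an eps-cover.  Every
    added point was > eps from P, so P is also an (eps/2)-packing."""
    P = [points[0]]
    dist = [D(p, P[0]) for p in points]           # distance from each point to P
    while True:
        i = max(range(len(points)), key=lambda j: dist[j])
        if dist[i] <= eps:                        # P now covers everything
            return P
        P.append(points[i])
        dist = [min(dist[j], D(points[j], points[i])) for j in range(len(points))]
```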
Gonzalez Construction Properties
- Optimal approximation algorithm, in a sense [G85][ST85]
- Used in building NN data structures [Bri95][WOj03][C03][HPM05]
- Bawden-Lajiness algorithm in comp. chem.
- Farthest Point Sampling in image proc. [ELPZ97]
- Not far from Chew's algorithm for building triangulations
Box Dimension
- Given `Z = (U,D)`, let `N (Z, epsilon)` be the smallest `epsilon`-net size for `Z`
- Suppose there is some `d` so that
`N (Z, epsilon) = {: {:1:} // {: epsilon^{:d+o(1):} :} :}`
as `epsilon -> 0`.
- Then `d` is `dim_B(Z)`, the box dimension of `Z`.
- Note that `{: {:1:} // {: epsilon^{:o(1):} :} :}` may not be `O(1)`
Box Dimension Equivalents
- Equivalently
`dim_B(Z) = lim_ {:epsilon -> 0 :} {: {: - log N (Z, epsilon) :} / {: log epsilon :} :}`
- Could also define using covering number `C (Z, epsilon)`
- Constant factor doesn't matter in the asymptotic expression
- `dim_B(Z)` is critical value of `t`-content
`lim_ {:epsilon -> 0 :} C (Z, epsilon) epsilon^t
= lim_ {:epsilon -> 0 :} {: epsilon^{:t - dim_B(Z) + o(1):} :}`
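As a numeric illustration (a sketch; the greedy-net helper and sample sizes are my choices), the slope of `log N` against `log(1//epsilon)` on a uniform sample of the unit square should come out near 2:

```python
import math, random

def net_size(points, D, eps):
    """Size of a greedy (Gonzalez-order) eps-net, standing in for N(Z, eps)."""
    P = [points[0]]
    dist = [D(p, P[0]) for p in points]
    while max(dist) > eps:
        i = max(range(len(points)), key=lambda j: dist[j])
        P.append(points[i])
        dist = [min(dist[j], D(points[j], points[i])) for j in range(len(points))]
    return len(P)

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(2000)]
D = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
e1, e2 = 0.2, 0.05
# slope of log N against log(1/eps) estimates dim_B (expect about 2 here)
dim_est = math.log(net_size(pts, D, e2) / net_size(pts, D, e1)) / math.log(e1 / e2)
```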
Hausdorff Dimension
- Here an `epsilon`-cover `mathcal E` is a collection of balls covering `Z`, all with `diam B le epsilon`
- Use `t`-content
`inf_{: mathcal E " an " epsilon"-cover" :} sum_{:B in mathcal E:} {:diam (B)^t:}`
- More balls than `C(Z,epsilon)`, but small ones count less
- Get Hausdorff dimension `dim_H(Z)`.
Assouad Dimension
- A stronger, more uniform condition:
- All balls `B(x,r)` have an `(epsilon r)`-net that isn't too big:
`{:sup_{:x in U, r>0:} C (B(x,r), epsilon r):} = 1 // {:epsilon^{:d+o(1):}:}`,
- `d` is the Assouad dimension, `dim_A(Z)`
- Have `dim_T(Z) le dim_H(Z) le dim_B(Z) le dim_A(Z)`
Doubling Constant
- The closely related doubling constant `doub_C(Z)` bounds packings of balls: every
`r//2`-packing `mathcal P` of `B(x,r)` has `{:|mathcal P|:} le doub_C(Z)`, for all `x in U`, `r > 0`.
- Several papers give approximation or expected-time algorithms assuming bounded `dim_A(Z)` [C97][KL04][HPM05]
Metric Measure Spaces
- Suppose there is also a measure `mu`, so have a metric measure space `(U,D,mu)`
- Can use empirical estimator `mu(A) approx |A cap S|//n`,
- where `S` is random sample of `n` sites, with distribution `mu`
Renyi dimension
Given `epsilon > 0`, let:
- `mu_epsilon(x) := mu(B(x,epsilon))`,
- and `|\|mu_epsilon|\| _v` be the `L_v` norm of `mu_epsilon` with respect to `mu`:
`{:|\|mu_epsilon|\|:}_v ^v := int mu_epsilon^v diffmu`.
- That is, if `X_1 ldots X_{v+1}` are drawn independently from `mu`, then `{:|\|mu_epsilon|\|:}_v^v` is the
probability that `X_2 ldots X_{v+1}` are all within `epsilon` of `X_1`.
Renyi dimension, correlation dimension
Given `epsilon > 0`, let:
- `{:|\|mu_epsilon|\|:}_1` is the correlation integral
- Its empirical estimator is the fraction of pairs of sites of `S` closer than `epsilon`
- Renyi dimension `dim_v(mu)` is `d` such that
`{:|\|mu_epsilon|\|:}_{v-1} = epsilon ^{:d+o(1):}`,
as `epsilon -> 0`.
- `dim_2(mu)` is the correlation dimension
Renyi dimension and NN Search
- Correlation dimension used in study of "strange attractors" of dynamical systems
- Computing estimator of correlation integral:
- batched fixed-radius query problem, a.k.a.
- spatial join
- In Euclidean space, fast estimates of integral can be done with bucketing [BF98]
- Dimension can also be estimated using `k`-NN distances [LB04]
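A direct empirical sketch (names and sample sizes mine): estimate the correlation integral at two scales and take the slope; for uniform points in the unit square the estimate should be near 2:

```python
import math, random

def correlation_integral(points, D, eps):
    """Fraction of ordered pairs of distinct sample points within eps."""
    n = len(points)
    close = sum(1 for i in range(n) for j in range(n)
                if i != j and D(points[i], points[j]) <= eps)
    return close / (n * (n - 1))

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(800)]
D = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
e1, e2 = 0.05, 0.2
# slope of log C(eps) against log eps estimates dim_2(mu)
dim2 = (math.log(correlation_integral(pts, D, e2) / correlation_integral(pts, D, e1))
        / math.log(e2 / e1))
```

The quadratic pair loop is the naive estimator; the bucketing of [BF98] speeds this up in Euclidean space.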
Renyi and Information Dimensions
- `{:|\|mu_epsilon|\|:}_0` can be defined as a limit (a geometric mean), giving `dim_1(mu)` as the `d` such that
`int log(mu_epsilon(y)) {:diffmu(y):} = (d + o(1)) log epsilon`
as `epsilon -> 0`
- This is the information dimension
Information and Pointwise Dimensions
- At `x in U`, the pointwise dimension `alpha _mu (x)` is the `d` so that:
`mu(B(x,epsilon)) = epsilon ^{:d+o(1):}`
as `epsilon -> 0`
- Equivalently,
`alpha _mu (x) = lim_{epsilon -> 0} {:log mu(B(x,epsilon)):} / {:log epsilon :}`
- Under mild conditions, `E[alpha_mu(x)] = dim_1(mu)`, where the expectation is with respect to `x~mu`
Pointwise Dimension and Others
- a.k.a. local dimension, Hölder exponent
- Roughly, bounds Hausdorff dimension of support of `mu`
- Multi-fractal analysis uses function `f_mu(hat alpha)`
- Hausdorff dimension of the set of `x` with `alpha_mu(x)=hat alpha`
- Can be computed from Renyi spectrum, the set of all values `dim_v(mu)`
- Also related to energy dimension
- Tao et al.: use estimates of pointwise dim. to predict NN search costs for nearby points
- Used in graphs for routing [GZ04]
Pointwise Dimension and NNs
- For:
- random sample `S` of `n` points drawn from `mu`,
- integer `k`,
- `delta_{:k:n:}(x) :=` distance from `x` to its `k`'th nearest neighbor in `S`,
- Have: [CD89]
`alpha_mu(x) = lim_{:n -> oo:} {:log(k//n):}/{:log delta_{:k:n:}(x) :}`
- Heuristically:
- choose `epsilon_k` such that `mu(B(x,epsilon_k)) = k//n`
- have `delta_{:k:n:}(x) approx epsilon_k`
- `{:k//n:} = mu(B(x,epsilon_k)) approx epsilon_k^{:alpha_mu(x):} approx delta_{:k:n:}(x)^{:alpha_mu(x):}`
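The heuristic above can be run directly; a sketch (names mine) on uniform points in the unit square. Note that convergence is slow: the constant `pi` in `mu(B(x,epsilon)) approx pi epsilon^2` sits in the `o(1)` term, so at moderate `n` the estimate noticeably undershoots the limit 2.

```python
import math, random

def pointwise_dim_estimate(x, sample, D, k):
    """Estimate alpha_mu(x) as log(k/n) / log(delta_{k:n}(x)),
    where delta_{k:n}(x) is the k-th nearest-neighbor distance [CD89]."""
    n = len(sample)
    delta_k = sorted(D(x, s) for s in sample)[k - 1]
    return math.log(k / n) / math.log(delta_k)

random.seed(2)
sample = [(random.random(), random.random()) for _ in range(5000)]
D = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
alpha = pointwise_dim_estimate((0.5, 0.5), sample, D, k=50)  # undershoots 2 at this n
```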
Extremal Graphs
- Let `G` be NN graph, MST, TSP, matching...
- Let `L(G, beta) := sum_{e "an edge of" G} length(e)^beta`
- In `d`-manifold, have
`d = lim_{:n -> oo:} {:log(1//n):}/{:log(L(G,1)//n):}`
- Matches previous formula for `G=` 1-NN graph
- Kozma et al.: `sup_{S subset U} L(T(S),t)` is a `t`-content yielding the (upper) box dim.
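A sketch of the `G =` MST case (Prim's algorithm; names mine): for `n` uniform points in the unit square the estimate approaches 2 from below, slowly.

```python
import math, random

def mst_length(points, D):
    """Total edge length of a minimum spanning tree (Prim, O(n^2))."""
    n = len(points)
    in_tree, best = [False] * n, [math.inf] * n
    best[0], total = 0.0, 0.0
    for _ in range(n):
        i = min((j for j in range(n) if not in_tree[j]), key=lambda j: best[j])
        in_tree[i] = True
        total += best[i]
        for j in range(n):
            if not in_tree[j]:
                best[j] = min(best[j], D(points[i], points[j]))
    return total

random.seed(3)
n = 500
pts = [(random.random(), random.random()) for _ in range(n)]
D = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
# dimension estimate d = log(1/n) / log(L(MST, 1)/n)
d_est = math.log(1 / n) / math.log(mst_length(pts, D) / n)
```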
Dimensions and NN Data Structures
`Z=(U,D,mu)` a metric measure space.
- Doubling constant `doub_C(Z)`: recall that every `r//2`-packing `mathcal P` of `B(x,r)` has `{:|mathcal P|:} le doub_C(Z)`
- A doubling measure, with exponent `doub_M(Z)`, has for all `x in U`, `r > 0`:
`mu(B(x,r)) le mu(B(x,r//2))2^{:doub_M(Z):}`
- The doubling measure condition is much stronger than a bounded doubling constant, and `doub_C(Z) le doub_M(Z)`
- Near-linear space/prep., `o(n^gamma)` query time:
- `doub_C(Z)` bounded, expected, exchangeable queries [C97]
- `doub_M(Z)` bounded, high prob. for given query [KR02]
- `doub_C(Z)` bounded, approx. [KL04]
Divide-and-Conquer
Several approaches can be sketched as follows:
- Find `P subset S`, `{:|P|:} = m`, and a ball `B_p` for each `p in P`, such that
- if `p` is the nearest site to query `q` in `P`, then the site of `S` nearest to `q` is in `S cap B_p`
- So: build the data structure recursively for each `S cap B_p`
- Answer a query by finding the nearest site in the root's `P`, then searching that child
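A structural sketch of this scheme (class and helper names mine; the cited approaches differ precisely in how `choose` picks `P` and the balls `B_p`). The placeholder `choose` below just takes the first `m` sites as `P` and Voronoi-assigns the rest, so the search is only approximate in general:

```python
class DCNode:
    """Divide-and-conquer NN search: recurse on S cap B_p per pivot p."""
    def __init__(self, sites, D, m, choose):
        self.sites, self.D, self.children = sites, D, None
        if len(sites) > m:
            balls = choose(sites, m, D)              # dict: pivot -> S cap B_p
            self.children = {p: DCNode(b, D, m, choose) for p, b in balls.items()}

    def query(self, q):
        if self.children is None:                    # leaf: brute force
            return min(self.sites, key=lambda s: self.D(q, s))
        p = min(self.children, key=lambda s: self.D(q, s))  # nearest pivot
        return self.children[p].query(q)

def choose(sites, m, D):
    """Placeholder pivot rule: first m sites as P, Voronoi assignment."""
    P = sites[:m]
    balls = {p: [] for p in P}
    for s in sites:
        balls[min(P, key=lambda p: D(s, p))].append(s)
    return balls
```

With this placeholder rule the answer is exact whenever the query is itself a site; the cited schemes choose `P` and `B_p` so that exact or approximate correctness holds for all queries.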
Divide-and-Conquer Approaches
- `P` is random, `B_p` prob. contains nearest, `{:|B_p|:} = O^**(n//m)`
- doubling constant, exchangeable, spread is in bound
- roughly [C97]
- `P` is random, `B_p` contains nearest with high prob., `{:|B_p|:} = O^**(n//m)`
- doubling measure, prob. per query
- Roughly [KR02]
- `P` is an `epsilon`-net, either `p` is approx NN, or `B_p` contains nearest; `B_p` small
- doubling constant; resulting bound includes spread
- Roughly [KL04]