- A set of sites (points) `S subset U`
- In metric space `(U,D)`
- Build a data structure so that:

Given a query point `q in U`, the closest site in `S` to `q` can be found quickly

- Considered for at least thirty years, called:
- *Post Office problem* (McNutt, '72)
- *Best match file searching* [BK73]
- Index for similarity search [HS03]
- Vector quantization encoder
- Fast nearest-neighbor classifier

- Roughly, a notion of how "size" of `U` changes with measurement scale
- Intimately related to NN searching
- Some dimensional measures (e.g., Assouad) give provable upper bounds [C97], [KR02], [KL04], [HPM05]
- Empirically, can be used to predict NN search performance [BF98], [TFP03]
- NN search useful for estimating dimension
- Correlation dimension via batched NN queries
- Pointwise dim. is related to NN distance
- Renyi dimensions via extremal graphs

- Some basics about metric spaces: repair and construction
- Packings, coverings, nets, Gonzalez construction
- Dimensions: box, packing, Assouad
- Metric measure spaces
- Renyi and pointwise dimensions, doubling measures
- Approaches to NN searching, relation to doubling constant and measures

- Isolation: `x ne y` implies `D(x,y) > 0`
- If not: *pseudometric*; fix with equivalence classes
- Symmetry: `D(x,y)=D(y,x)`
- If not: *quasimetric*; `hat D (x,y) := (D(x,y) + D(y,x))//2`
- Triangle Inequality: `D(x,z) le D(x,y) + D(y,z)`
- If not: *semimetric*; `hat D (x,y) := inf sum_i D(z_i, z_{i+1})`, `x=z_0`, `y=z_k`
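On a finite space, the semimetric repair above — the infimum over chains — is just all-pairs shortest paths. A minimal sketch (the 3-point space and the Floyd-Warshall choice are illustrative):

```python
# Repairing a finite semimetric: hat D(x,y) = inf over chains
# x = z_0, ..., z_k = y of sum_i D(z_i, z_{i+1}).  On a finite space this
# is all-pairs shortest paths, computed here with Floyd-Warshall.
def chain_repair(D):
    n = len(D)
    H = [row[:] for row in D]  # copy; H becomes the repaired distance
    for k in range(n):
        for i in range(n):
            for j in range(n):
                H[i][j] = min(H[i][j], H[i][k] + H[k][j])
    return H

# A symmetric "distance" violating the triangle inequality:
# D(0,2) = 10 > D(0,1) + D(1,2) = 2.
D = [[0, 1, 10],
     [1, 0, 1],
     [10, 1, 0]]
H = chain_repair(D)
# Now H[0][2] = 2 and every triangle inequality holds.
```

The repaired `H` can only shrink distances, and it is the largest metric dominated by `D`.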

Suppose `(U,D)`, and `(U_1,D_1)...(U_d,D_d)` are metric spaces.

- `L_p` products: `hat U := U_1 xx U_2 xx cdots xx U_d`, `hat D(x,y) := (sum_i D_i(x_i,y_i)^p)^{:1//p:}`
- Strings over `U`
- Nonnegative combinations: `U_1=U_2= cdots =U_d`, given `alpha_1 ldots alpha_d`, `hat D(x,y) := sum_i alpha_i D_i(x,y)`
- Distance on subsets `A,B subset U`
- Hausdorff
- Given measure `mu`, distance `mu(A Delta B)`

- Given `f(z)` on `RR` with:
- `f(0)=0`
- `f` monotone increasing
- `f` concave
- Have:
- `hat D(x,y) := f(D(x,y))` also a metric
- For `0 < epsilon le 1`, `f(z) := z^epsilon`, the "snowflake"
- Alternate fix for semimetric
- `f(z) := z/(1+z)` : bounded space
- For `lambda > 0`, `f(z) := 1 - e^{: - lambda z:}` : Schoenberg transform
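The concave-transform claim can be checked numerically: apply each `f` to Euclidean distances and verify the triangle inequality on random triples (the point set and the sampled parameter values are illustrative):

```python
import math, random

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(50)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Each f has f(0) = 0, is increasing and concave, so f(D) is again a metric.
transforms = {
    "snowflake":  lambda z: z ** 0.5,              # f(z) = z^epsilon, epsilon = 1/2
    "bounded":    lambda z: z / (1 + z),           # bounded space
    "schoenberg": lambda z: 1 - math.exp(-2 * z),  # lambda = 2
}

for name, f in transforms.items():
    for _ in range(2000):
        x, y, z = random.sample(pts, 3)
        # triangle inequality for the transformed distance
        assert f(dist(x, z)) <= f(dist(x, y)) + f(dist(y, z)) + 1e-12, name
```

The check passes because concavity with `f(0)=0` gives subadditivity, `f(a+b) le f(a)+f(b)`, and monotonicity then carries the original triangle inequality through `f`.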

`hat D(x,y) := {: 2D(x,y):} / {:D(x,a) + D(y,a) + D(x,y):}` yields a metric. (How did I not know this?)

For `D(A,B)=mu(A Delta B)` and `a=O/`, get

`hat D(A,B) = {:mu(A Delta B):} / {:mu(A uu B):}`

- Generalizations? Replacing `D(x,a) + D(y,a)` by `min_{:a in T:} (D(x,a) + D(y,a))` seems to work, for `T subset U`.
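For finite sets with counting measure, the formula above is the Jaccard distance `|A Delta B| // |A uu B|`; a quick numeric check of the triangle inequality on random sets (the universe and set sizes are illustrative):

```python
import random

def jaccard_dist(A, B):
    # Steinhaus distance with mu = counting measure and a = empty set:
    # hat D(A,B) = |A symmetric-difference B| / |A union B|
    if not A and not B:
        return 0.0
    return len(A ^ B) / len(A | B)

random.seed(1)
universe = range(30)
sets = [frozenset(random.sample(universe, random.randint(1, 20)))
        for _ in range(40)]

for _ in range(3000):
    A, B, C = random.sample(sets, 3)
    assert jaccard_dist(A, C) <= jaccard_dist(A, B) + jaccard_dist(B, C) + 1e-12
```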

- Marczewski-Steinhaus [MS58] in ecology, 32 hits
- Tanimoto [RT60] in chem and genetics, 157 hits
- Jaccard [J01] in CS and genetics, 262 hits
- Set similarity in TCS [Cha02]
- Resemblance in TCS/Web [B97]

- `epsilon`-covering: `D(x,P) le epsilon` for all `x in U`
- `epsilon`-packing: `D(x,y) ge 2 epsilon` for all distinct `x,y in P`
- `epsilon`-net: `epsilon`-cover and `epsilon/2`-packing
- (Haussler/Welzl `epsilon`-net hits all balls of large *volume*)
- Gonzalez construction:
- starting with `P = {x}` for some `x in U`, repeat:
- Add `y` to `P` that is farthest from `P`
- Until have `epsilon`-net

- Optimal approximation algorithm, in a sense [G85][ST85]
- Used in building NN data structures [Bri95][WOj03][C03][HPM05]
- Bawden-Lajiness algorithm in comp. chem.
- Farthest Point Sampling in image proc. [ELPZ97]
- Not far from Chew's algorithm for building triangulations
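The Gonzalez construction above, as a sketch (Euclidean points stand in for a general metric space; the grid and radius are illustrative):

```python
import math

def gonzalez_net(points, eps):
    # Start with P = {x}; repeatedly add the point farthest from P,
    # stopping once every point is within eps of P.  Each added point was
    # more than eps from all earlier ones, so P is an eps-cover whose
    # points are pairwise > eps apart, i.e. an eps-net in the sense above.
    P = [points[0]]
    # d_to_P[i] = current distance from points[i] to the set P
    d_to_P = [math.dist(p, P[0]) for p in points]
    while True:
        far_i = max(range(len(points)), key=lambda i: d_to_P[i])
        if d_to_P[far_i] <= eps:
            return P
        P.append(points[far_i])
        for i, p in enumerate(points):
            d_to_P[i] = min(d_to_P[i], math.dist(p, points[far_i]))

# illustrative point set: a 10 x 10 grid in the unit square
grid = [(i / 9, j / 9) for i in range(10) for j in range(10)]
net = gonzalez_net(grid, 0.25)
```

Each round updates all distances in O(n), so an m-point net costs O(nm) distance evaluations.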

- Given `Z = (U,D)`, let `N (Z, epsilon)` be `epsilon`-net size for `Z`
- Suppose there is some `d` so that
`N (Z, epsilon) = {: {:1:} // {: epsilon^{:d+o(1):} :} :}`

as `epsilon -> 0`.

- Then `d` is `dim_B(Z)`, the *box dimension* of `Z`.
- Note that `{: {:1:} // {: epsilon^{:o(1):} :} :}` may not be `O(1)`

- Equivalently
`dim_B(Z) = lim_ {:epsilon -> 0 :} {: {: - log N (Z, epsilon) :} / {: log epsilon :} :}`

- Could also define using covering number `C (Z, epsilon)`
- Constant factor doesn't matter in the asymptotic expression
- `dim_B(Z)` is the critical value of the *`t`-content*
`lim_ {:epsilon -> 0 :} C (Z, epsilon) epsilon^t = lim_ {:epsilon -> 0 :} {: epsilon^{:t - dim_B(Z) + o(1):} :}`
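A crude empirical version of the definition: compute greedy net sizes at several scales and read off the slope of `log N` against `log(1//epsilon)` (the planar grid and scales are illustrative; the slope should come out near 2):

```python
import math

def net_size(points, eps):
    # greedy: keep a point iff it is >= eps from everything kept so far;
    # the kept set is a cover and a packing, so its size is within
    # constant factors of N(Z, eps)
    net = []
    for p in points:
        if all(math.dist(p, q) >= eps for q in net):
            net.append(p)
    return len(net)

# illustrative space: a 64 x 64 grid sampling the unit square
grid = [(i / 63, j / 63) for i in range(64) for j in range(64)]
scales = [0.2, 0.1, 0.05]
sizes = [net_size(grid, e) for e in scales]

# least-squares slope of log N vs log(1/eps): the box-dimension estimate
xs = [math.log(1 / e) for e in scales]
ys = [math.log(s) for s in sizes]
xbar, ybar = sum(xs) / 3, sum(ys) / 3
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
```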

- `epsilon`-cover `mathcal E` is a collection of balls `B` all with `diam B le epsilon`
- Use `t`-content
`inf_{:mathcal E text{ an } epsilon text{-cover}:} sum_{:B in mathcal E:} {:diam (B)^t:}`

- More balls than `C(Z,epsilon)`, but small ones count less
- Get Hausdorff dimension `dim_H(Z)`.

- A stronger, more uniform condition:
- All balls `B(x,r)` have an `(epsilon r)`-net that isn't too big:
`{:sup_{:x in U text{ and } r>0:} C (B(x,r), epsilon r):} = 1 // {:epsilon^{:d+o(1):}:}`,

- `d` is the Assouad dimension, `dim_A(Z)`
- Have `dim_T(Z) le dim_H(Z) le dim_B(Z) le dim_A(Z)`

- Closely related: the *doubling constant* `doub_C(Z)` satisfies `mathcal P (B(x,r), r//2) le doub_C(Z)` for all `x in U`, `r > 0`, where `mathcal P` is the largest packing size
- Several papers give approximation or expected-time algorithms assuming bounded `dim_A(Z)` [C97][KL04][HPM05]

- Suppose there is also a measure `mu`, so have a metric measure space `(U,D,mu)`
- Can use empirical estimator `mu(A) approx |A cap S|//n`,
- where `S` is random sample of `n` sites, with distribution `mu`

- `mu_epsilon(x) := mu(B(x,epsilon))`,
- and let `{:||mu_epsilon||:}_v` be the `L_v` norm of `mu_epsilon` with respect to `mu`:
`{:||mu_epsilon||:}_v ^v := int mu_epsilon^v diffmu`.

- That is, if `X_1 ldots X_{v+1}` have distribution `mu`, then `{:||mu_epsilon||:}_v^v` is the probability that all are within `epsilon` of `X_1`.

- `{:||mu_epsilon||:}_1` is the *correlation integral*
- Empirical estimator is the number of pairs of sites of `S` closer than `epsilon`
- The *Renyi dimension* `dim_v(mu)` is the `d` such that
`{:||mu_epsilon||:}_{v-1} = epsilon ^{:d+o(1):}`,

as `epsilon -> 0`.
- `dim_2(mu)` is the *correlation dimension*

- Correlation dimension used in study of "strange attractors" of dynamical systems
- Computing estimator of correlation integral:
- batched fixed-radius query problem, a.k.a. *spatial join*
- In Euclidean space, fast estimates of integral can be done with bucketing [BF98]
- Dimension can also be estimated using `k`-NN distances [LB04]
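A direct sketch of the empirical estimator: count pairs closer than `epsilon` at two scales and take the slope (uniform points in the unit square are illustrative; their correlation dimension is 2):

```python
import math, random

random.seed(2)
n = 800
pts = [(random.random(), random.random()) for _ in range(n)]

def corr_integral(eps):
    # fraction of (unordered) pairs of sites closer than eps:
    # the empirical estimator of ||mu_eps||_1
    close = sum(1 for i in range(n) for j in range(i + 1, n)
                if math.dist(pts[i], pts[j]) < eps)
    return close / (n * (n - 1) / 2)

e1, e2 = 0.2, 0.05
# dim_2 estimate: slope of log C(eps) between the two scales
d2 = math.log(corr_integral(e1) / corr_integral(e2)) / math.log(e1 / e2)
```

The brute-force pair count is the batched fixed-radius query in its simplest form; bucketing or a spatial index would replace the double loop in practice.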

- `{:||mu_epsilon||:}_0` can be defined as a limit, giving `dim_1(mu)` as the `d` such that
`exp(int log(mu_epsilon(y)) {:diffmu(y):}) = epsilon ^{:d+o(1):}`

as `epsilon -> 0`
- This is the *information dimension*

- At `x in U`, the pointwise dimension `alpha _mu (x)` is the `d` so that:
`mu(B(x,epsilon)) = epsilon ^{:d+o(1):}`

as `epsilon -> 0`
- Equivalently,
`alpha _mu (x) = lim_{epsilon -> 0} {:log mu(B(x,epsilon)):} / {:log epsilon :}`

- Under mild conditions, `E[alpha_mu(x)] = dim_1(mu)`, where the expectation is with respect to `x~mu`

- a.k.a. *local dimension*, *Hoelder exponent*
- Roughly, bounds Hausdorff dimension of support of `mu`
- *Multi-fractal analysis* uses the function `f_mu(hat alpha)`:
- the Hausdorff dimension of the set of `x` with `alpha_mu(x)=hat alpha`
- Can be computed from the *Renyi spectrum*, the set of all values `dim_v(mu)`
- Also related to *energy* dimension
- Tao et al.: use estimates of pointwise dim. to predict NN search costs for nearby points
- Used in graphs for routing [GZ04]

- For:
- random sample `S`,
- integer `k`,
- `delta_{:k:n:}(x)=` `k`'th NN dist,
- Have: [CD89]
`alpha_mu(x) = lim_{:n -> oo:} {:log(k//n):}/{:log delta_{:k:n:}(x) :}`

- Heuristically:
- choose `epsilon_k` such that `mu(B(x,epsilon_k)) = k//n`
- have `delta_{:k:n:}(x) approx epsilon_k`
- `{:k//n:} = mu(B(x,epsilon_k)) approx epsilon_k^{:alpha_mu(x):} approx delta_{:k:n:}(x)^{:alpha_mu(x):}`
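A sketch of the heuristic above, using two values of `k` so that the unknown constant in `mu(B(x,epsilon)) ~ c epsilon^{:alpha:}` cancels (uniform planar data is illustrative; its pointwise dimension is 2 almost everywhere):

```python
import math, random

random.seed(3)
n = 2000
pts = [(random.random(), random.random()) for _ in range(n)]

def knn_dists(x, ks):
    # sorted distances from x to all sites; returns delta_{k:n}(x) for each k
    d = sorted(math.dist(x, p) for p in pts)
    return [d[k - 1] for k in ks]

# Two-scale estimate: alpha_mu(x) ~ log(k2/k1) / log(delta_{k2:n}/delta_{k1:n}).
# Averaging over several interior query points tames the noise.
k1, k2 = 10, 100
queries = [(random.uniform(0.2, 0.8), random.uniform(0.2, 0.8))
           for _ in range(20)]
ests = []
for x in queries:
    d1, d2 = knn_dists(x, [k1, k2])
    ests.append(math.log(k2 / k1) / math.log(d2 / d1))
est = sum(ests) / len(ests)
```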

- Let `G` be NN graph, MST, TSP, matching...
- Let `L(G, beta) := sum_{e "an edge of" G} length(e)^beta`
- In `d`-manifold, have `d = lim_{:n -> oo:} {:log(1//n):}/{log(L(G,1)//n):}`
- Matches previous formula for `G=` 1-NN graph
- Kozma et al.: `sup_{S subset U} L(T(S),t)` is a `t`-content yielding (upper) box dim.
- Where `T(S)` is the MST
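The extremal-graph formula can be tried with a minimum spanning tree built by Prim's algorithm (uniform planar points are illustrative; the estimate converges only slowly to the manifold dimension 2):

```python
import math, random

random.seed(4)
n = 500
pts = [(random.random(), random.random()) for _ in range(n)]

def mst_length(pts):
    # Prim's algorithm on the complete Euclidean graph; returns L(T(S), 1)
    m = len(pts)
    in_tree = [False] * m
    best = [math.inf] * m   # best[v] = distance from v to the current tree
    best[0] = 0.0
    total = 0.0
    for _ in range(m):
        u = min((i for i in range(m) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(m):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(pts[u], pts[v]))
    return total

L = mst_length(pts)
# d = lim log(1/n) / log(L(G,1)/n); at finite n this is only approximate
d_est = math.log(1 / n) / math.log(L / n)
```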

- `doub_C(Z)`: recall `mathcal P (B(x,r), r//2) le doub_C(Z)`
- doubling measure `doub_M(Z)` has, for all `x in U`, `r > 0`:
`mu(B(x,r)) le mu(B(x,r//2))2^{:doub_M(Z):}`

- The doubling measure condition is much stronger than the doubling constant condition, and `doub_C(Z) le doub_M(Z)`
- Near-linear space/prep., `o(n^gamma)` query time:
- `doub_C(Z)` bounded, expected, *exchangeable* queries [C97]
- `doub_M(Z)` bounded, high prob. for given query [KR02]
- `doub_C(Z)` bounded, approx. [KL04]

- Find `P subset S`, `{:|P|:} =m`, ball `B_p` for each `p in P`, such that
- if query `q` has `p` nearest in `P`, nearest to `q` is in `S cap B_p`
- So: build data structure recursively for each `S cap B_p`
- Answer query by finding nearest in `P`-set of root, then search that child

- `P` is random, `B_p` prob. contains nearest, `{:|B_p|:} = O^**(n//m)`
- doubling constant, exchangeable, *spread* is in bound
- roughly [C97]
- `P` is random, `B_p` contains nearest with high prob., `{:|B_p|:} = O^**(n//m)`
- doubling measure, prob. per query
- Roughly [KR02]
- `P` is an `epsilon`-net, either `p` is approx NN, or `B_p` contains nearest; `B_p` small
- doubling constant; resulting bound includes spread
- Roughly [KL04]
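A toy version of the recursive scheme above: `P` is a random sample, each site joins the child of its nearest `p`, and a query descends to the child of its nearest `p`. Without the enlarged balls `B_p` that give the cited schemes their guarantees, the answer is only heuristic (it is exact when the query is itself a site). All names and parameters here are illustrative.

```python
import math, random

class ToyNNTree:
    def __init__(self, sites, m=5, leaf_size=8):
        self.sites = sites
        if len(sites) <= leaf_size:
            self.children = None
            return
        # P: random sample of m pivots; each site joins its nearest pivot's
        # child (a crude stand-in for the ball B_p of the real schemes)
        self.pivots = random.sample(sites, m)
        buckets = {p: [] for p in self.pivots}
        for s in sites:
            buckets[min(self.pivots, key=lambda p: math.dist(s, p))].append(s)
        self.children = {p: ToyNNTree(b, m, leaf_size)
                         for p, b in buckets.items() if b}

    def query(self, q):
        # answer by finding nearest pivot, then searching only that child
        if self.children is None:
            return min(self.sites, key=lambda s: math.dist(q, s))
        p = min(self.pivots, key=lambda pp: math.dist(q, pp))
        return self.children[p].query(q)

random.seed(5)
sites = [(random.random(), random.random()) for _ in range(300)]
tree = ToyNNTree(sites)
```

Because a site and a query at the same point pick the same nearest pivot at every level, querying a site returns that site; arbitrary queries can land in the wrong child, which is exactly what the ball enlargement in [C97], [KR02], and [KL04] is designed to prevent.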