Metric Embedding

  • Metric Space: $(X,d),X$ is a set and $d$ is a distance on $X$
  • Embedding: $\phi:X\rightarrow Y$ on two metric space $(X,d_X),(Y,d_Y)$
    • Distortion $\alpha\geq 1$: $\forall x,y\in X,\frac{1}{\alpha}d(x,y)\leq d(\phi(x),\phi(y))\leq\alpha d(x,y)$
    • Dimension Reduction: $x_1,\cdots,x_n\in\mathbb{R}^d,y_1,\cdots,y_n\in\mathbb{R}^k$

JLT Embedding

  • Johnson-Lindenstrauss Theorem(JLT, 1984): it is always possble to embed $n$ points in arbitrary dimension to $O(\log n)$ dimension with constant distortion in Euclidian Space
    • $\forall0<\epsilon <1,\forall S\subset \mathbb{R}^{d},|S|=n,\exists k=O(\epsilon ^{-2}\log n),\phi :\mathbb{R} ^{d}\rightarrow \mathbb{R}^{k}$ such that $\forall x,y\in S,(1-\epsilon )|x-y|^{2}\leq |\phi (x)-\phi (y)|^{2}\leq (1+\epsilon )|x-y|^{2}$
    • (linear embedding): $\phi(x)=Ax$
    • (2016) Lower Bound: $\Omega(\epsilon^{-2}\log n)$
  • Contruction
    • projection onto uniform random $k$-dimensional subspace of $\mathbb{R}^d$ (1999)
    • random matrix with i.i.d. $\pm 1$ (2003)
    • random matrix with i.i.d. Gaussian entries (1998): $A\in\mathbb{R}^{k\times d}$ is drawn from the Gaussian distribution $\mathcal{N}(0,\frac{1}{k})$
      • To prove: $P(|| Au|^2_2-1|>\epsilon)<\frac{1}{n^3}$
      • $\lVert Au\rVert^2=\sum_{i=1}^kY_i^2,Y_i\sim\mathcal{N}(0,\frac{1}{k})$
      • Chernoff bound for $\chi^2$-distribution

Nearest Neighbor Search(NNS)

Problem

  • NNS
    • Data: $y_1,\cdots,y_n\in X$
    • Query: $x\in X$
    • Answer: $y_i$ closest to $x$
  • $c$-ANN: Approximate Nearest Neighbor
    • Answer: Find a $y_i$ such that $\text{dist}(x,y_i)\leq c\min_{1\leq j\leq n}\text{dist}(x,y_j)$
  • $(c,r)$-ANN: Approximate Near Neighbor:

$$\begin{cases}y_{i^}\in S,\text{dist}(x,y_{i^})\leq cr & \exists y_i\in S,\text{dist}(x,y_i)\leq r\newline \perp & \forall y_i\in S,\text{dist}(x,y_i)> r\newline \text{arbitrary} & o.w.\end{cases}$$

  • From $(c,r)$-ANN to $c$-ANN
    • Definition
      • $D_{\min}=\min\text{dist}(y_i,y_j)$
      • $D_{\max}=\max\text{dist}(y_i,y_j)$
      • $R=\frac{D_{\max}}{D_{\min}}$
    • $\forall r,\exists$ data structure for $(c,r)$-ANN
      • space $s$
      • answer time $t$
      • probability $1-\delta$
    • $\exists$ data structure for $r$-ANN
      • space $O(s\log_c R)$
      • answer time $O(t\log\log_c R)$
      • probability $1-O(\delta\log\log_c R)$

Deterministic

  • Dictionary data structure
    • k-d tree
    • Voronoi diagram
  • Curse of dimensionality
    • conjecture: NNS in high dimension requires either super-polynomial space or super-polynomial time

Dimension Reduction

  • Solve $(c,r)$-ANN in Hamming Space $x\in{0,1}^d,d»\log n$ w.h.p
  • Data Structure
    • random Boolean matrix $A_{k\times d}$ with i.i.d. entries $\in$ Bernoulli($p$)
      • $z_i=Ay_i\in{0,1}^k$ on finite field GF(2)
    • $s$-balls: $B_s(u)={y_i|\text{dist}(u,z_i)\leq s}$ for all $u\in{0,1}^k$ (打表)
    • space: $O(n2^k)$
  • Answer: any $y_i\in B_s(Ax)$ (no if none)
    • query time: $O(kd)+O(1)$
  • Decide $k,p,s$
    • $k=\frac{\ln n}{(\frac{1}{8}-2^{-(c+2)})^2}=O(\log n)$
    • $p=\frac{1}{2}-2^{-1-1/r}$
    • $s=(\frac{3}{8}-2^{-(c+2)})k$
    • space: $n^{O(1)}$
    • time: $O(d\log n)$

Locality Sensitive Hashing

  • $(r,cr,p,q)$-LSH: $h:X\rightarrow U$ satisfying $\forall x,y\in X$
    • $\text{dist}(x,y)\leq r\Rightarrow P(h(x)=h(y))\geq p$
    • $\text{dist}(x,y)>cr\Rightarrow P(h(x)=h(y))\leq q$
  • $\exists (r,cr,p,q)$-LSH $\Rightarrow\exists (r,cr,p^k,q^k)$-LSH
    • $g(x)=(h_1(x),\cdots,h_k(x))\in U^k$
    • $k=\log_{\frac{1}{q}}n,p^k={n^{-\frac{\log p}{\log q}}},q=\frac{1}{n}$
  • with $(r,cr,p^*,\frac{1}{n})$-LSH $g$
    • Algorithm 1:
      • space: $O(n)$
        • store $y_1,\cdots,y_n$ in nondecreasing order $g(y_i)$
      • time: $O(\log n)+O(1)$ in expectation
        • find all $y_i$ that $g(x)=g(y_i)$ and check $\text{dist}(x,y_i)$
      • real answer “no”: always correct
      • real answer not “no”: $\geq p^*$
    • Algorithm 2:
      • space: $O(\frac{n}{p^*})$
        • draw independent $g_1,\cdots,g_{\frac{1}{p^*}}$
        • store $y_1,\cdots,y_n$ in table-$j$ in nondecreasing order of $g_j(y_i)$
      • time: $O(\frac{\log n}{p^*})$
        • find $\leq\frac{10}{p^*}$ number of $y_i,\exists j,g_j(x)=g_j(y_i)$
      • real answer “no”: always correct
      • real answer not “no”: $>\frac{1}{2}$
  • Overall: solve $(c,r)$-ANN with space $O(n^{1+\frac{\log p}{\log q}})$, query time $O(n^{\frac{\log p}{\log q}}\log n)$ and one-sided error $<0.5$
  • Hamming space: $h(x)=x_i$ for uniform $i\in[d]$ is a $(r,cr,1-\frac{r}{d},1-\frac{cr}{d})$-LSH