Metric Embedding
- Metric Space: $(X,d),X$ is a set and $d$ is a distance on $X$
- Embedding: $\phi:X\rightarrow Y$ on two metric space $(X,d_X),(Y,d_Y)$
- Distortion $\alpha\geq 1$: $\forall x,y\in X,\frac{1}{\alpha}d(x,y)\leq d(\phi(x),\phi(y))\leq\alpha d(x,y)$
- Dimension Reduction: $x_1,\cdots,x_n\in\mathbb{R}^d,y_1,\cdots,y_n\in\mathbb{R}^k$
JLT Embedding
- Johnson-Lindenstrauss Theorem(JLT, 1984): it is always possble to embed $n$ points in arbitrary dimension to $O(\log n)$ dimension with constant distortion in Euclidian Space
- $\forall0<\epsilon <1,\forall S\subset \mathbb{R}^{d},|S|=n,\exists k=O(\epsilon ^{-2}\log n),\phi :\mathbb{R} ^{d}\rightarrow \mathbb{R}^{k}$ such that $\forall x,y\in S,(1-\epsilon )|x-y|^{2}\leq |\phi (x)-\phi (y)|^{2}\leq (1+\epsilon )|x-y|^{2}$
- (linear embedding): $\phi(x)=Ax$
- (2016) Lower Bound: $\Omega(\epsilon^{-2}\log n)$
- Contruction
- projection onto uniform random $k$-dimensional subspace of $\mathbb{R}^d$ (1999)
- random matrix with i.i.d. $\pm 1$ (2003)
- random matrix with i.i.d. Gaussian entries (1998): $A\in\mathbb{R}^{k\times d}$ is drawn from the Gaussian distribution $\mathcal{N}(0,\frac{1}{k})$
- To prove: $P(|| Au|^2_2-1|>\epsilon)<\frac{1}{n^3}$
- $\lVert Au\rVert^2=\sum_{i=1}^kY_i^2,Y_i\sim\mathcal{N}(0,\frac{1}{k})$
- Chernoff bound for $\chi^2$-distribution
Nearest Neighbor Search(NNS)
Problem
- NNS
- Data: $y_1,\cdots,y_n\in X$
- Query: $x\in X$
- Answer: $y_i$ closest to $x$
- $c$-ANN: Approximate Nearest Neighbor
- Answer: Find a $y_i$ such that $\text{dist}(x,y_i)\leq c\min_{1\leq j\leq n}\text{dist}(x,y_j)$
- $(c,r)$-ANN: Approximate Near Neighbor:
$$\begin{cases}y_{i^}\in S,\text{dist}(x,y_{i^})\leq cr & \exists y_i\in S,\text{dist}(x,y_i)\leq r\newline \perp & \forall y_i\in S,\text{dist}(x,y_i)> r\newline \text{arbitrary} & o.w.\end{cases}$$
- From $(c,r)$-ANN to $c$-ANN
- Definition
- $D_{\min}=\min\text{dist}(y_i,y_j)$
- $D_{\max}=\max\text{dist}(y_i,y_j)$
- $R=\frac{D_{\max}}{D_{\min}}$
- $\forall r,\exists$ data structure for $(c,r)$-ANN
- space $s$
- answer time $t$
- probability $1-\delta$
- $\exists$ data structure for $r$-ANN
- space $O(s\log_c R)$
- answer time $O(t\log\log_c R)$
- probability $1-O(\delta\log\log_c R)$
- Definition
Deterministic
- Dictionary data structure
- k-d tree
- Voronoi diagram
- Curse of dimensionality
- conjecture: NNS in high dimension requires either super-polynomial space or super-polynomial time
Dimension Reduction
- Solve $(c,r)$-ANN in Hamming Space $x\in{0,1}^d,d»\log n$ w.h.p
- Data Structure
- random Boolean matrix $A_{k\times d}$ with i.i.d. entries $\in$ Bernoulli($p$)
- $z_i=Ay_i\in{0,1}^k$ on finite field GF(2)
- $s$-balls: $B_s(u)={y_i|\text{dist}(u,z_i)\leq s}$ for all $u\in{0,1}^k$ (打表)
- space: $O(n2^k)$
- random Boolean matrix $A_{k\times d}$ with i.i.d. entries $\in$ Bernoulli($p$)
- Answer: any $y_i\in B_s(Ax)$ (no if none)
- query time: $O(kd)+O(1)$
- Decide $k,p,s$
- $k=\frac{\ln n}{(\frac{1}{8}-2^{-(c+2)})^2}=O(\log n)$
- $p=\frac{1}{2}-2^{-1-1/r}$
- $s=(\frac{3}{8}-2^{-(c+2)})k$
- space: $n^{O(1)}$
- time: $O(d\log n)$
Locality Sensitive Hashing
- $(r,cr,p,q)$-LSH: $h:X\rightarrow U$ satisfying $\forall x,y\in X$
- $\text{dist}(x,y)\leq r\Rightarrow P(h(x)=h(y))\geq p$
- $\text{dist}(x,y)>cr\Rightarrow P(h(x)=h(y))\leq q$
- $\exists (r,cr,p,q)$-LSH $\Rightarrow\exists (r,cr,p^k,q^k)$-LSH
- $g(x)=(h_1(x),\cdots,h_k(x))\in U^k$
- $k=\log_{\frac{1}{q}}n,p^k={n^{-\frac{\log p}{\log q}}},q=\frac{1}{n}$
- with $(r,cr,p^*,\frac{1}{n})$-LSH $g$
- Algorithm 1:
- space: $O(n)$
- store $y_1,\cdots,y_n$ in nondecreasing order $g(y_i)$
- time: $O(\log n)+O(1)$ in expectation
- find all $y_i$ that $g(x)=g(y_i)$ and check $\text{dist}(x,y_i)$
- real answer “no”: always correct
- real answer not “no”: $\geq p^*$
- space: $O(n)$
- Algorithm 2:
- space: $O(\frac{n}{p^*})$
- draw independent $g_1,\cdots,g_{\frac{1}{p^*}}$
- store $y_1,\cdots,y_n$ in table-$j$ in nondecreasing order of $g_j(y_i)$
- time: $O(\frac{\log n}{p^*})$
- find $\leq\frac{10}{p^*}$ number of $y_i,\exists j,g_j(x)=g_j(y_i)$
- real answer “no”: always correct
- real answer not “no”: $>\frac{1}{2}$
- space: $O(\frac{n}{p^*})$
- Algorithm 1:
- Overall: solve $(c,r)$-ANN with space $O(n^{1+\frac{\log p}{\log q}})$, query time $O(n^{\frac{\log p}{\log q}}\log n)$ and one-sided error $<0.5$
- Hamming space: $h(x)=x_i$ for uniform $i\in[d]$ is a $(r,cr,1-\frac{r}{d},1-\frac{cr}{d})$-LSH