Compressed cover tree

The compressed cover tree is a type of data structure in computer science that is specifically designed to facilitate the speed-up of a k-nearest neighbors algorithm in finite metric spaces.^[1] Compressed cover tree is a simplified version of explicit representation of cover tree that was motivated by past issues in proofs of time complexity results^[2] of cover tree. The compressed cover tree was specifically designed to achieve claimed time complexities of cover tree^[3] in a mathematically rigorous way.

Problem statement[edit]

In the modern formulation, the k-nearest neighbor problem is to find all $k\geq 1$ nearest neighbors in a given reference set R for all points from another given query set Q. Both sets belong to a common ambient space X with a distance metric d satisfying all metric axioms.

Definitions[edit]

Compressed cover tree[edit]

Let (R,d) be a finite metric space. A compressed cover tree ${\mathcal {T}}(R)$ has the vertex set R with a root $r\in R$ and a level function $l:R\rightarrow \mathbb {Z}$ satisfying the conditions below:

Root condition: the level of the root node r satisfies $l(r)\geq 1+\max \limits _{p\in R\setminus \{r\}}l(p)$
Covering condition: For every node $q\in R\setminus \{r\}$ , we select a unique parent p and a level l(q) such that $d(q,p)\leq 2^{l(q)+1}$ and $l(q)<l(p)$ this parent node pp has a single link to its child node q.

Separation condition: For $i\in \mathbb {Z}$ , the cover set $C_{i}=\{p\in R\mid l(p)\geq i\}$ has $d_{\min }(C_{i})=\min \limits _{p\in C_{i}}\min \limits _{q\in C_{i}\setminus \{p\}}d(p,q)>2^{i}$

Expansion constants[edit]

In a metric space, let ${\bar {B}}(p,t)$ be the closed ball with a center p and a radius $t\geq 0$ . The notation $|{\bar {B}}(p,t)|$ denotes the number (if finite) of points in the closed ball.

The expansion constant ^[3] $c(R)$ is the smallest $c(R)\geq 2$ such that $|{\bar {B}}(p,2t)|\leq c(R)\cdot |{\bar {B}}(p,t)|$ for any point $p\in R$ and $t\geq 0$ .

the new minimized expansion constant ^[1] $c_{m}$ is a discrete analog of the doubling dimension Navigating nets ^[4] $c_{m}(R)=\lim \limits _{\xi \rightarrow 0^{+}}\inf \limits _{R\subseteq A\subseteq X}\sup \limits _{p\in A,t>\xi }{\dfrac {|{\bar {B}}(p,2t)\cap A|}{|{\bar {B}}(p,t)\cap A|}}$ , where A is a locally finite set which covers R.

Note that $c_{m}(R)\leq c(R)$ for any finite metric space (R,d).

Aspect ratio[edit]

For any finite set R with a metric d, the diameter is $\mathrm {diam} (R)=\max _{p\in R}\max _{q\in R}d(p,q)$ . The aspect ratio is $\Delta (R)={\dfrac {\mathrm {diam} (R)}{d_{\min }(R)}}$ , where $d_{\min }(R)$ is the shortest distance between points of R.

Complexity[edit]

Insert[edit]

Although cover trees provide faster searches than the naive approach, this advantage must be weighed with the additional cost of maintaining the data structure. In a naive approach adding a new point to the dataset is trivial because order does not need to be preserved, but in a compressed cover tree it can be bounded

Using expansion constant: $O(c(R)^{10}\cdot \log |R|)$ .
Using minimized expansion constant / doubling dimension $O(c_{m}(R)^{8}\cdot \log \Delta (|R|))$ .

K-nearest neighborhood search[edit]

Let Q and R be finite subsets of a metric space (X,d). Once all points of R are inserted into a compressed cover tree ${\mathcal {T}}(R)$ it can be used for find-queries of the query point set Q. The following time complexities have been proven for finding the k-nearest neighbor of a query point $q\in Q$ in the reference set R:

Using expansion constant: $O{\Big (}c(R\cup \{q\})^{2}\cdot \log _{2}(k)\cdot {\big (}(c_{m}(R))^{10}\cdot \log _{2}(|R|)+c(R\cup \{q\})\cdot k{\big )}{\Big )}.$ .
Using minimized expansion constant / doubling dimension $O{\Big (}(c_{m}(R))^{10}\cdot \log _{2}(k)\cdot \log _{2}(\Delta (R))+|{\bar {B}}(q,5d_{k}(q,R))|\cdot \log _{2}(k){\Big )}$ , where $|{\bar {B}}(q,5d_{k}(q,R))|$ is a number of points inside a closed ball around q having a radius 5 times the distance of q to its k-nearest neighbor.

Space[edit]

The compressed cover tree constructed on finite metric space R requires O(|R|) space, during the construction and during the execution of the Find algorithm.

Compared to other similar data structures[edit]

Using doubling dimension as hidden factor[edit]

Tables below show time complexity estimates which use minimized expansion constant $c_{m}(R)$ or dimensionality constant $2^{\text{dim}}$ ^[4] related to doubling dimension. Note that $\Delta$ denotes the aspect ratio.

Results for building data structures

Name of datastructure, source	Claimed time complexity	Claimed space complexity	Proof of result
Navigating nets ^[4]	$O{\big (}2^{O({\text{dim}})}\cdot \|R\|\cdot \log(\Delta (R))\cdot \log(\log((\Delta (R))){\big )}$	$O(2^{O({\text{dim}})}\|R\|)$	Theorem 2.5 ^[4]
Compressed cover tree ^[1]	$O{\big (}c(R\cup \{q\})^{O(1)}\cdot \log(k)\cdot (\log(\|R\|)+k){\big )}$	$O(\|R\|)$	Theorem 3.6 ^[1]

Results for exact k-nearest neighbors of one query point $q\in Q$ in reference set R assuming that all data structures are already built. Below we denote the distance between a query point q and the reference set R as $d(q,R)$ and distance from a query point q to its k-nearest neighbor in set R as $d_{k}(q,R)$ :

Name of datastructure, source	Claimed time complexity	Claimed space complexity	Proof of result
Navigating nets ^[4]	$O{\big (}2^{O({\text{dim}})}\cdot \log(\Delta )+\|{\bar {B}}(q,O(d(q,R))\|{\big )}$	$O(2^{O({\text{dim}})}\|R\|)$	Proof outline in Theorem 2.3 ^[4]
Compressed cover tree ^[1]	$O{\big (}\log(k)\cdot (c_{m}(R)^{O(1)}\log(\|\Delta \|)+\|{\bar {B}}(q,O(d_{k}(q,R))\|){\big )}$	$O(\|R\|)$	Corollary 3.7 ^[1]

Using expansion constant as hidden factor[edit]

Tables below show time complexity estimates which use $c(R)$ or KR-type constant $2^{{\text{dim}}_{KR}}$ ^[4] as a hidden factor. Note that the dimensionality factor $2^{{\text{dim}}_{KR}}$ is equivalent to $c(R)^{O(1)}$