Safe Haskell | None |
---|---|
Language | Haskell2010 |
Random projection trees for approximate nearest neighbor search in high-dimensional vector spaces.
Introduction
Similarity search is a common problem in many fields (imaging, natural language processing, ..), and is often one building block of a larger data processing system.
There are many ways to embed data in a vector space such that similarity search can be recast as a geometrical nearest neighbor lookup.
In turn, the efficiency and effectiveness of querying such a vector database strongly depends on how internally the data index is represented, graphs and trees being two common approaches.
The naive, all-pairs exact search becomes impractical even at moderate data sizes, which motivated research into approximate indexing methods.
Overview
This library provides a tree-based approach to approximate nearest neighbor search. The database is recursively partitioned according to a series of random projections, and this partitioning is logically arranged as a tree which allows for rapid lookup.
Internally, a single random projection vector is sampled per tree level, as proposed in [1]. The projection vectors in turn can be sparse with a tunable sparsity parameter, which can help compressing the database at a small accuracy cost.
Retrieval accuracy can be improved by populating multiple trees (i.e. a random forest), and intersecting the results of the same query against each of them.
Quick Start
1) Build an index with forest
2) Lookup the \(k\) nearest neighbors to a query point with knn
3) The database can be serialised and restored with serialiseRPForest
and deserialiseRPForest
, respectively.
References
1) Hyvonen, V., et al., Fast Nearest Neighbor Search through Sparse Random Projections and Voting, https://www.cs.helsinki.fi/u/ttonteri/pub/bigdata2016.pdf
Synopsis
- forest :: (Monad m, Inner SVector v) => Word64 -> Int -> Int -> Int -> Int -> Double -> Int -> ConduitT () (Embed v Double x) m () -> m (RPForest Double (Vector (Embed v Double x)))
- knn :: (Ord p, Inner SVector v, Unbox d, Real d) => (u d -> v d -> p) -> Int -> RPForest d (Vector (Embed u d x)) -> v d -> Vector (p, Embed u d x)
- serialiseRPForest :: (Serialise d, Serialise a, Unbox d) => RPForest d a -> [ByteString]
- deserialiseRPForest :: (Serialise d, Serialise a, Unbox d) => [ByteString] -> Either String (RPForest d a)
- recallWith :: (Inner SVector v, Unbox d, Fractional a1, Ord d, Ord a2, Ord x, Ord (u d), Num d) => (u d -> v d -> a2) -> RPForest d (Vector (Embed u d x)) -> Int -> v d -> a1
- levels :: RPTree d a -> Int
- points :: Monoid m => RPTree d m -> m
- candidates :: (Inner SVector v, Unbox d, Ord d, Num d, Semigroup xs) => RPTree d xs -> v d -> xs
- data Embed v e a = Embed {}
- data RPTree d a
- type RPForest d a = IntMap (RPTree d a)
- data SVector a
- fromListSv :: Unbox a => Int -> [(Int, a)] -> SVector a
- fromVectorSv :: Int -> Vector (Int, a) -> SVector a
- data DVector a
- fromListDv :: Unbox a => [a] -> DVector a
- fromVectorDv :: Vector a -> DVector a
- class (Scale u, Scale v) => Inner u v where
- class Scale v where
- innerSS :: (Vector u (Int, a), Vector v (Int, a), Unbox a, Num a) => u (Int, a) -> v (Int, a) -> a
- innerSD :: (Num a, Vector u (Int, a), Vector v a, Unbox a) => u (Int, a) -> v a -> a
- innerDD :: (Vector v a, Num a) => v a -> v a -> a
- metricSSL2 :: (Floating a, Vector u a, Unbox a, Vector u (Int, a), Vector v (Int, a)) => u (Int, a) -> v (Int, a) -> a
- metricSDL2 :: (Floating a, Unbox a, Vector u (Int, a), Vector v a) => u (Int, a) -> v a -> a
- scaleS :: (Vector v (a, b), Num b) => b -> v (a, b) -> v (a, b)
- scaleD :: (Vector v b, Num b) => b -> v b -> v b
- draw :: (Show a, Boxed a, PrintfArg v) => RPTree v a -> IO ()
- writeCsv :: (Show a, Show b, Unbox a) => FilePath -> [(DVector a, b)] -> IO ()
- randSeed :: MonadIO m => m Word64
- data BenchConfig = BenchConfig {
- bcDescription :: String
- bcMaxTreeDepth :: Int
- bcMinLeafSize :: Int
- bcNumTrees :: Int
- bcChunkSize :: Int
- bcNZDensity :: Double
- bcVectorDim :: Int
- bcDataSize :: Int
- bcNumQueryPoints :: Int
- normalSparse2 :: Monad m => Double -> Int -> GenT m (SVector Double)
- liftC :: (Monad m, MonadTrans t) => ConduitT i o m r -> ConduitT i o (t m) r
- dataSource :: Monad m => Int -> GenT m a -> ConduitT i a (GenT m) ()
- datS :: Monad m => Int -> Int -> Double -> ConduitT i (SVector Double) (GenT m) ()
- datD :: Monad m => Int -> Int -> ConduitT i (DVector Double) (GenT m) ()
- sparse :: (Monad m, Unbox a) => Double -> Int -> GenT m a -> GenT m (SVector a)
- dense :: (Monad m, Vector Vector a) => Int -> GenT m a -> GenT m (DVector a)
- normal2 :: Monad m => GenT m (DVector Double)
Construction
:: (Monad m, Inner SVector v) | |
=> Word64 | random seed |
-> Int | max tree depth |
-> Int | min leaf size |
-> Int | number of trees |
-> Int | data chunk size |
-> Double | nonzero density of projection vectors |
-> Int | dimension of projection vectors |
-> ConduitT () (Embed v Double x) m () | data source |
-> m (RPForest Double (Vector (Embed v Double x))) |
Populate a forest from a data stream
Assumptions on the data source:
- non-empty : contains at least one value
- stationary : each chunk is representative of the whole dataset
- bounded : we wait until the end of the stream to produce a result
Query
:: (Ord p, Inner SVector v, Unbox d, Real d) | |
=> (u d -> v d -> p) | distance function |
-> Int | k neighbors |
-> RPForest d (Vector (Embed u d x)) | random projection forest |
-> v d | query point |
-> Vector (p, Embed u d x) | ordered in increasing distance order to the query point |
Look up the \(k\) nearest neighbors to a query point
The supplied distance function d
must satisfy the definition of a metric, i.e.
- identity of indiscernible elements : \( d(x, y) = 0 \leftrightarrow x \equiv y \)
- symmetry : \( d(x, y) = d(y, x) \)
- triangle inequality : \( d(x, y) + d(y, z) \geq d(x, z) \)
I/O
:: (Serialise d, Serialise a, Unbox d) | |
=> RPForest d a | |
-> [ByteString] | the order is undefined |
Serialise each tree in the RPForest
as a separate bytestring
:: (Serialise d, Serialise a, Unbox d) | |
=> [ByteString] | |
-> Either String (RPForest d a) | the order is undefined |
Deserialise each tree in the RPForest
from a separate bytestring and reconstruct
Validation
:: (Inner SVector v, Unbox d, Fractional a1, Ord d, Ord a2, Ord x, Ord (u d), Num d) | |
=> (u d -> v d -> a2) | |
-> RPForest d (Vector (Embed u d x)) | |
-> Int | k : number of nearest neighbors to consider |
-> v d | query point |
-> a1 |
average recall-at-k, computed over a set of trees
Access
Retrieve points nearest to the query
in case of a narrow margin, collect both branches of the tree
Types
Pairing of a data item with its vector embedding
The vector is used internally for indexing
Instances
Functor (Embed v e) Source # | |
(Eq a, Eq (v e)) => Eq (Embed v e a) Source # | |
(Ord a, Ord (v e)) => Ord (Embed v e a) Source # | |
Defined in Data.RPTree.Internal | |
(Show a, Show (v e)) => Show (Embed v e a) Source # | |
Generic (Embed v e a) Source # | |
(NFData (v e), NFData a) => NFData (Embed v e a) Source # | |
Defined in Data.RPTree.Internal | |
(Serialise (v e), Serialise a) => Serialise (Embed v e a) Source # | |
type Rep (Embed v e a) Source # | |
Defined in Data.RPTree.Internal type Rep (Embed v e a) = D1 ('MetaData "Embed" "Data.RPTree.Internal" "rp-tree-0.3.4-17bHLKh60iBTPbWVoghZw" 'False) (C1 ('MetaCons "Embed" 'PrefixI 'True) (S1 ('MetaSel ('Just "eEmbed") 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 (v e)) :*: S1 ('MetaSel ('Just "eData") 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 a))) |
RPTree
Random projection trees
The first type parameter corresponds to a floating point scalar value, the second is the type of the data collected at the leaves of the tree (e.g. lists of vectors)
We keep them separate to leverage the Functor instance for postprocessing and visualization
One projection vector per tree level (as suggested in https://www.cs.helsinki.fi/u/ttonteri/pub/bigdata2016.pdf )
Instances
Functor (RPTree d) Source # | |
Foldable (RPTree d) Source # | |
Defined in Data.RPTree.Internal fold :: Monoid m => RPTree d m -> m # foldMap :: Monoid m => (a -> m) -> RPTree d a -> m # foldMap' :: Monoid m => (a -> m) -> RPTree d a -> m # foldr :: (a -> b -> b) -> b -> RPTree d a -> b # foldr' :: (a -> b -> b) -> b -> RPTree d a -> b # foldl :: (b -> a -> b) -> b -> RPTree d a -> b # foldl' :: (b -> a -> b) -> b -> RPTree d a -> b # foldr1 :: (a -> a -> a) -> RPTree d a -> a # foldl1 :: (a -> a -> a) -> RPTree d a -> a # elem :: Eq a => a -> RPTree d a -> Bool # maximum :: Ord a => RPTree d a -> a # minimum :: Ord a => RPTree d a -> a # | |
Traversable (RPTree d) Source # | |
(Unbox d, Eq d, Eq a) => Eq (RPTree d a) Source # | |
(Unbox d, Show d, Show a) => Show (RPTree d a) Source # | |
Generic (RPTree d a) Source # | |
(NFData a, NFData d) => NFData (RPTree d a) Source # | |
Defined in Data.RPTree.Internal | |
(Serialise d, Serialise a, Unbox d) => Serialise (RPTree d a) Source # | |
type Rep (RPTree d a) Source # | |
Defined in Data.RPTree.Internal |
type RPForest d a = IntMap (RPTree d a) Source #
A random projection forest is an ordered set of RPTree
s
This supports efficient updates of the ensemble in the streaming/online setting.
Sparse vectors with unboxed components
Instances
Scale SVector Source # | |
Inner SVector Vector Source # | |
Defined in Data.RPTree.Internal | |
Inner SVector DVector Source # | |
Defined in Data.RPTree.Internal | |
Inner SVector SVector Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Eq a) => Eq (SVector a) Source # | |
(Unbox a, Ord a) => Ord (SVector a) Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Show a) => Show (SVector a) Source # | |
Generic (SVector a) Source # | |
NFData (SVector a) Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Serialise a) => Serialise (SVector a) Source # | |
type Rep (SVector a) Source # | |
Defined in Data.RPTree.Internal type Rep (SVector a) = D1 ('MetaData "SVector" "Data.RPTree.Internal" "rp-tree-0.3.4-17bHLKh60iBTPbWVoghZw" 'False) (C1 ('MetaCons "SV" 'PrefixI 'True) (S1 ('MetaSel ('Just "svDim") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 Int) :*: S1 ('MetaSel ('Just "svVec") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 (Vector (Int, a))))) |
fromListSv :: Unbox a => Int -> [(Int, a)] -> SVector a Source #
(Unsafe) Pack a SVector
from its vector dimension and components
Note : the relevant invariants are not checked :
- vector components are _assumed_ to be in increasing order
- vector dimension is larger than any component index
(Unsafe) Pack a SVector
from its vector dimension and components
Note : the relevant invariants are not checked :
- vector components are _assumed_ to be in increasing order
- vector dimension is larger than any component index
Dense vectors with unboxed components
Instances
Scale DVector Source # | |
Inner DVector DVector Source # | |
Defined in Data.RPTree.Internal | |
Inner SVector DVector Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Eq a) => Eq (DVector a) Source # | |
(Unbox a, Ord a) => Ord (DVector a) Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Show a) => Show (DVector a) Source # | |
Generic (DVector a) Source # | |
NFData (DVector a) Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Serialise a) => Serialise (DVector a) Source # | |
type Rep (DVector a) Source # | |
Defined in Data.RPTree.Internal |
fromListDv :: Unbox a => [a] -> DVector a Source #
fromVectorDv :: Vector a -> DVector a Source #
Vector space types
class (Scale u, Scale v) => Inner u v where Source #
Inner product spaces
This typeclass is provided as a convenience for library users to interface their own vector types.
inner :: (Unbox a, Num a) => u a -> v a -> a Source #
metricL2 :: (Unbox a, Floating a) => u a -> v a -> a Source #
helpers for implementing Inner instances
inner product
innerSS :: (Vector u (Int, a), Vector v (Int, a), Unbox a, Num a) => u (Int, a) -> v (Int, a) -> a Source #
sparse-sparse inner product
innerSD :: (Num a, Vector u (Int, a), Vector v a, Unbox a) => u (Int, a) -> v a -> a Source #
sparse-dense inner product
L2 distance
metricSSL2 :: (Floating a, Vector u a, Unbox a, Vector u (Int, a), Vector v (Int, a)) => u (Int, a) -> v (Int, a) -> a Source #
Vector distance induced by the L2 norm (sparse-sparse)
metricSDL2 :: (Floating a, Unbox a, Vector u (Int, a), Vector v a) => u (Int, a) -> v a -> a Source #
Vector distance induced by the L2 norm (sparse-dense)
Scale
Rendering
draw :: (Show a, Boxed a, PrintfArg v) => RPTree v a -> IO () Source #
Render a tree to stdout
Useful for debugging
This should be called only for small trees, otherwise the printed result quickly overflows the screen and becomes hard to read.
NB : prints distance information rounded to two decimal digits
CSV
Encode dataset as CSV and save into file
Testing
data BenchConfig Source #
BenchConfig | |
|
Instances
Show BenchConfig Source # | |
Defined in Data.RPTree.Internal.Testing showsPrec :: Int -> BenchConfig -> ShowS # show :: BenchConfig -> String # showList :: [BenchConfig] -> ShowS # |
Random generation
Conduit
:: Monad m | |
=> Int | number of vectors to generate |
-> GenT m a | random generator for the vector components |
-> ConduitT i a (GenT m) () |
Source of random data points
:: Monad m | |
=> Int | number of data points |
-> Int | vector dimension |
-> Double | nonzero density |
-> ConduitT i (SVector Double) (GenT m) () |
binary mixture of isotropic Gaussian rvs with sparse components
:: Monad m | |
=> Int | number of data points |
-> Int | vector dimension |
-> ConduitT i (DVector Double) (GenT m) () |
binary mixture of isotropic Gaussian rvs
Vector data
:: (Monad m, Unbox a) | |
=> Double | nonzero density |
-> Int | vector dimension |
-> GenT m a | random generator of vector components |
-> GenT m (SVector a) |
Generate a sparse random vector with a given nonzero density and components sampled from the supplied random generator