Copyright | Copyright © 2019 Lars Kuhtz <lakuhtz@gmail.com> |
---|---|
License | BSD3 |
Maintainer | Lars Kuhtz <lakuhtz@gmail.com> |
Stability | experimental |
Safe Haskell | None |
Language | Haskell2010 |
Haskell implementation of Cuckoo filters as described in
Cuckoo filters are a data structure for probabilistic set membership. They support insertion, deletion, and membership queries for set elements.
Membership queries may return false positive results. But queries don't return false negative results.
Unlike Bloom filters, Cuckoo filters maintain an upper bound on the false positive rate that is independent of the load of the filter. However, insertion of new elements in the filter can fail. For typical configurations this probability is very small for load factors smaller than 90 percent.
Example
{-# LANGUAGE DataKinds #-} {-# LANGUAGE TypeApplications #-} {-# LANGUAGE TypeFamilies #-} {-# OPTIONS_GHC -fno-warn-orphans #-} import Control.Monad (filterM) import Data.Cuckoo import Data.List ((\\)) -- Define CuckooFilterHash instance (this uses the default implementation) instance CuckooFilterHash Int main :: IO () main = do -- Create Filter for a minimum of 500000 entries f <- newCuckooFilter @4 @8 @Int 0 500000 -- Insert 450000 items failed <- filterM (fmap not . insert f) [0..500000-1] -- Query inserted items missing <- filterM (fmap not . member f) [0..500000-1] -- Test for false positives false <- filterM (member f) [500000..1000000 - 1] -- Report results putStrLn $ "failed inserts: " <> show (length failed) putStrLn $ "false positives: " <> show (length false) putStrLn $ "false positive rate (%): " <> show @Double (fromIntegral (length false) * 100 / 500000) putStrLn $ "missing (must be 0): " <> show (length $ missing \\ failed) -- Filter properties putStrLn $ "capacity: " <> show (capacityInItems f) putStrLn $ "size in allocated bytes: " <> show (sizeInAllocatedBytes f) -- computing the following is a bit slow c <- itemCount f putStrLn $ "item count: " <> show c lf <- loadFactor f putStrLn $ "load factor (%): " <> show lf putStrLn $ "bits per item: " <> show @Double (fromIntegral (sizeInAllocatedBytes f) * 8 / fromIntegral c)
Synopsis
- newtype Salt = Salt Int
- class CuckooFilterHash a where
- cuckooHash :: Salt -> a -> Word64
- cuckooFingerprint :: Salt -> a -> Word64
- sip :: Storable a => Int -> a -> Word64
- sip_bytes :: ByteArrayAccess a => Int -> a -> Word64
- fnv1a :: Storable a => Int -> a -> Word64
- fnv1a_bytes :: ByteArrayAccess a => Int -> a -> Word64
- data CuckooFilter s (b :: Nat) (f :: Nat) (a :: Type)
- type CuckooFilterIO b f a = CuckooFilter RealWorld b f a
- newCuckooFilter :: forall b f a m. KnownNat b => KnownNat f => PrimMonad m => Salt -> Natural -> m (CuckooFilter (PrimState m) b f a)
- insert :: forall b f a m. KnownNat f => KnownNat b => PrimMonad m => CuckooFilterHash a => CuckooFilter (PrimState m) b f a -> a -> m Bool
- member :: CuckooFilterHash a => PrimMonad m => KnownNat f => KnownNat b => CuckooFilter (PrimState m) b f a -> a -> m Bool
- delete :: CuckooFilterHash a => PrimMonad m => KnownNat f => KnownNat b => CuckooFilter (PrimState m) b f a -> a -> m Bool
- sizeInAllocatedBytes :: forall b f a s. KnownNat f => KnownNat b => CuckooFilter s b f a -> Int
- capacityInItems :: forall b f a s. KnownNat b => CuckooFilter s b f a -> Int
- itemCount :: forall b f a m. PrimMonad m => KnownNat b => KnownNat f => CuckooFilter (PrimState m) b f a -> m Int
- loadFactor :: forall b f a m. PrimMonad m => KnownNat b => KnownNat f => CuckooFilter (PrimState m) b f a -> m Double
- showFilter :: forall b f a. KnownNat f => KnownNat b => CuckooFilter RealWorld b f a -> IO [[String]]
- itemHashes :: forall b f a s. KnownNat f => CuckooFilterHash a => CuckooFilter s b f a -> a -> (Int, Int, Word64)
Hash Functions
Salt for hash computations.
class CuckooFilterHash a where Source #
Choosing good hash functions is imperative for a good performance of a cuckoo filter. The hash functions must be
- independent and
- provide good uniformity on the lower bits of the output.
The default implementations use sip hash for cuckooHash
and fnv1a
(64
bit) for cuckooFingerprint
and require an instance of Storable
.
>>>
instance CuckooFilterHash Int
The following example uses the hash functions that are provided in this
module to define an instance for ByteString
:
>>>
import qualified Data.ByteString as B
>>>
:{
instance CuckooFilterHash B.ByteString where cuckooHash (Salt s) a = fnv1a_bytes s a cuckooFingerprint (Salt s) a = sip_bytes s a {-# INLINE cuckooHash #-} {-# INLINE cuckooFingerprint #-} :}
Nothing
cuckooHash :: Salt -> a -> Word64 Source #
This function must provide good entropy on the lower \(2^b - 1\) bits of the result, where \(b\) is the number of buckets.
cuckooFingerprint :: Salt -> a -> Word64 Source #
This function must provide good entropy on the lower bits of the size of a fingerprint.
cuckooHash :: Storable a => Salt -> a -> Word64 Source #
This function must provide good entropy on the lower \(2^b - 1\) bits of the result, where \(b\) is the number of buckets.
cuckooFingerprint :: Storable a => Salt -> a -> Word64 Source #
This function must provide good entropy on the lower bits of the size of a fingerprint.
Hash functions
Computes a Sip hash for a value that has an Storable
instance.
The first argument is a salt value that is used to derive the key for the hash computation.
:: ByteArrayAccess a | |
=> Int | Salt |
-> a | Value that is hashes |
-> Word64 |
Computes a Sip hash for a value that is an instance of
ByteArrayAccess
.
The first argument is a salt value that is used to derive the key for the hash computation.
Computes a 64 bit Fnv1a hash for a value that has an Storable
instance.
The first argument is use as a salt.
:: ByteArrayAccess a | |
=> Int | Salt |
-> a | Value that is hashes |
-> Word64 |
Computes a 64 bit Fnv1a hash for a value that is an instance of
ByteArrayAccess
.
The first argument is use as a salt.
Cuckoo Filter
data CuckooFilter s (b :: Nat) (f :: Nat) (a :: Type) Source #
Cuckoo Filter with
- State token
s :: Type
, - bucket size
b :: Nat
, - fingerprint size
f :: Nat
, and - content type
a :: Type
.
The following constraints apply
- \(0 < f \leq 32\)
- \(0 < b\)
The implementation is not thread safe. For concurrent use the filter must be wrapped in a read-write lock.
type CuckooFilterIO b f a = CuckooFilter RealWorld b f a Source #
Cuckoo filter that can be used in the IO
monad.
:: KnownNat b | |
=> KnownNat f | |
=> PrimMonad m | |
=> Salt | Salt for the hash functions. |
-> Natural | Size. Must be at least 64. |
-> m (CuckooFilter (PrimState m) b f a) |
Create a new Cuckoo filter that has at least the given capacity.
Enabling the TypeApplications
language extension provides a convenient way
for passing the type parameters to the function.
>>>
:set -XTypeApplications -XDataKinds -XTypeFamilies
>>>
newCuckooFilter @4 @10 @Int 0 1000
The type parameters are
- bucket size
b :: Nat
, - fingerprint size
f :: Nat
, - content type
a :: Type
, and - Monad
m :: Type -> Type
,
The following constraints apply:
- \(0 < f \leq 32\),
- \(0 < b\), and
- \(64 \leq n\), where \(n\) is the requested size.
The false positive rate depends mostly on the value of f
. It is bounded
from above by \(\frac{2b}{2^f}\). In most cases 4
is a good choice for b
.
Actual performance depends on the choice of good hash functions that provide high uniformity on the lower bits.
The actual capacity may be much larger than what is requested, because the actual bucket count is a power of two.
>>>
f <- newCuckooFilter @4 @10 @Int 0 600
>>>
capacityInItems f
1024>>>
sizeInAllocatedBytes f
1284
Cuckoo Filter Operations
insert :: forall b f a m. KnownNat f => KnownNat b => PrimMonad m => CuckooFilterHash a => CuckooFilter (PrimState m) b f a -> a -> m Bool Source #
Insert an item into the filter and return whether the operation was successful. If insertion fails, the filter is unchanged. An item can be inserted more than once. The return value indicates whether insertion was successful. The operation can fail when the filter doesn't have enough space for the item.
This function is not thread safe. No concurrent writes or reads should occur while this function is executed. If this is needed a lock must be used.
This function is not exception safe. The filter must not be used any more after an asynchronous exception has been thrown during the computation of this function. If this function is used in the presence of asynchronous exceptions it should be apprioriately masked.
>>>
f <- newCuckooFilter @4 @10 @Int 0 1000
>>>
insert f 0
True>>>
insert f 0
True>>>
itemCount f
2
member :: CuckooFilterHash a => PrimMonad m => KnownNat f => KnownNat b => CuckooFilter (PrimState m) b f a -> a -> m Bool Source #
Test whether an item is in the set that is represented by the Cuckoo filter.
A negative result means that the item is definitively not in the set. A
positive result means that the item is most likely in the set. The rate of
false positives is bounded from above by \(\frac{2b}{2^f}\) where b
is the number
of items per bucket and f
is the size of a fingerprint in bits.
>>>
f <- newCuckooFilter @4 @10 @Int 0 1000
>>>
insert f 0
True>>>
member f 0
True>>>
member f 1
False
delete :: CuckooFilterHash a => PrimMonad m => KnownNat f => KnownNat b => CuckooFilter (PrimState m) b f a -> a -> m Bool Source #
Delete an items from the filter. An item that was inserted more than once can also be deleted more than once.
IMPORTANT An item must only be deleted if it was successfully added to the filter before (and hasn't been deleted since then).
Deleting an item that isn't in the filter can result in the filter returning false negative results.
This function is not thread safe. No concurrent writes must occur while this function is executed. If this is needed a lock must be used. Concurrent reads are fine.
>>>
f <- newCuckooFilter @4 @10 @Int 0 1000
>>>
insert f 0
True>>>
insert f 0
True>>>
itemCount f
2>>>
delete f 0
True>>>
itemCount f
1>>>
member f 0
True>>>
delete f 0
True>>>
itemCount f
0>>>
member f 0
False
Utils
sizeInAllocatedBytes :: forall b f a s. KnownNat f => KnownNat b => CuckooFilter s b f a -> Int Source #
The total number of bytes allocated for storing items in the filter.
capacityInItems :: forall b f a s. KnownNat b => CuckooFilter s b f a -> Int Source #
Total number of items that the filter can hold. In practice a load factor of ~95% of this number can be reached.
itemCount :: forall b f a m. PrimMonad m => KnownNat b => KnownNat f => CuckooFilter (PrimState m) b f a -> m Int Source #
Number of items currently stored in the filter.
Note that computing this number is expensive \(\mathcal{O}(n)\).
loadFactor :: forall b f a m. PrimMonad m => KnownNat b => KnownNat f => CuckooFilter (PrimState m) b f a -> m Double Source #
The current load factor of the filter in percent.
loadFactor f = 100 * itemCount f / capacityInItems
Note that computing this number is expensive \(\mathcal{O}(n)\).
Debugging Utils
showFilter :: forall b f a. KnownNat f => KnownNat b => CuckooFilter RealWorld b f a -> IO [[String]] Source #
Show the contents of the filter as a list of buckets with values show in hex. Used for debugging purposes.
itemHashes :: forall b f a s. KnownNat f => CuckooFilterHash a => CuckooFilter s b f a -> a -> (Int, Int, Word64) Source #
Returns the different hashes that are associated with an item in the filter. Used for debugging purposes.