Safe Haskell | None |
---|---|
Language | Haskell2010 |
An efficient implementation of the Aho-Corasick string matching algorithm. See http://web.stanford.edu/class/archive/cs/cs166/cs166.1166/lectures/02/Small02.pdf for a good explanation of the algorithm.
The memory layout of the automaton, and the function that steps it, were optimized to the point where string matching compiles roughly to a loop over the code units in the input text, that keeps track of the current state. Lookup of the next state is either just an array index (for the root state), or a linear scan through a small array (for non-root states). The pointer chases that are common for traversing Haskell data structures have been eliminated.
The construction of the automaton has not been optimized that much, because construction time is usually negligible in comparison to matching time. Therefore construction is a two-step process, where first we build the automaton as int maps, which are convenient for incremental construction. Afterwards we pack the automaton into unboxed vectors.
Synopsis
- data AcMachine v = AcMachine {
- machineValues :: !(Vector [v])
- machineTransitions :: !(Vector Transition)
- machineOffsets :: !(Vector Int)
- machineRootAsciiTransitions :: !(Vector Transition)
- build :: [([CodeUnit], v)] -> AcMachine v
- runText :: forall a v. a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a
- runLower :: forall a v. a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a
- debugBuildDot :: [[CodeUnit]] -> String
- data CaseSensitivity
- newtype CodeUnitIndex = CodeUnitIndex {
- codeUnitIndex :: Int
- data Match v = Match {
- matchPos :: !CodeUnitIndex
- matchValue :: v
- data Next a
Documentation
An Aho-Corasick automaton.
AcMachine | |
|
build :: [([CodeUnit], v)] -> AcMachine v Source #
Construct an Aho-Corasick automaton for the given needles.
Takes a list of code units rather than Text
, to allow mapping the code
units before construction, for example to lowercase individual code points,
rather than doing proper case folding (which might change the number of code
units).
debugBuildDot :: [[CodeUnit]] -> String Source #
Build the automaton, and format it as Graphviz Dot, for visual debugging.
data CaseSensitivity Source #
Instances
newtype CodeUnitIndex Source #
An index into the raw UTF-16 data of a Text
. This is not the code point
index as conventionally accepted by Text
, so we wrap it to avoid confusing
the two. Incorrect index manipulation can lead to surrogate pairs being
sliced, so manipulate indices with care. This type is also used for lengths.
Instances
Match | |
|