text-metrics-0.3.0: Calculate various string metrics efficiently

Copyright© 2016–2017 Mark Karpov
LicenseBSD 3 clause
MaintainerMark Karpov <markkarpov92@gmail.com>
Stabilityexperimental
Portabilityportable
Safe HaskellNone
LanguageHaskell2010

Data.Text.Metrics

Contents

Description

The module provides efficient implementations of various strings metric algorithms. It works with strict Text values.

Note: before version 0.3.0 the package used C implementations of the algorithms under the hood. Beginning from version 0.3.0, the implementations are written in Haskell while staying almost as fast, see:

https://markkarpov.com/post/migrating-text-metrics.html

Synopsis

Levenshtein variants

levenshtein :: Text -> Text -> Int Source #

Return Levenshtein distance between two Text values. Classic Levenshtein distance between two strings is the minimal number of operations necessary to transform one string into another. For Levenshtein distance allowed operations are: deletion, insertion, and substitution.

See also: https://en.wikipedia.org/wiki/Levenshtein_distance.

Heads up, before version 0.3.0 this function returned Natural.

levenshteinNorm :: Text -> Text -> Ratio Int Source #

Return normalized Levenshtein distance between two Text values. Result is a non-negative rational number (represented as Ratio Natural), where 0 signifies no similarity between the strings, while 1 means exact match.

See also: https://en.wikipedia.org/wiki/Levenshtein_distance.

Heads up, before version 0.3.0 this function returned Ratio Natural.

damerauLevenshtein :: Text -> Text -> Int Source #

Return Damerau-Levenshtein distance between two Text values. The function works like levenshtein, but the collection of allowed operations also includes transposition of two adjacent characters.

See also: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance.

Heads up, before version 0.3.0 this function returned Natural.

damerauLevenshteinNorm :: Text -> Text -> Ratio Int Source #

Return normalized Damerau-Levenshtein distance between two Text values. 0 signifies no similarity between the strings, while 1 means exact match.

See also: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance.

Heads up, before version 0.3.0 this function returned Ratio Natural.

Treating inputs like sets

overlap :: Text -> Text -> Ratio Int Source #

Return overlap coefficient for two Text values. Returned value is in the range from 0 (no similarity) to 1 (exact match). Return 1 if both Text values are empty.

See also: https://en.wikipedia.org/wiki/Overlap_coefficient.

Since: 0.3.0

jaccard :: Text -> Text -> Ratio Int Source #

Return Jaccard similarity coefficient for two Text values. Returned value is in the range from 0 (no similarity) to 1 (exact match). Return 1 if both

See also: https://en.wikipedia.org/wiki/Jaccard_index

Since: 0.3.0

Other

hamming :: Text -> Text -> Maybe Int Source #

O(n) Return Hamming distance between two Text values. Hamming distance is defined as the number of positions at which the corresponding symbols are different. The input Text values should be of equal length or Nothing will be returned.

See also: https://en.wikipedia.org/wiki/Hamming_distance.

Heads up, before version 0.3.0 this function returned Maybe Natural.

jaro :: Text -> Text -> Ratio Int Source #

Return Jaro distance between two Text values. Returned value is in the range from 0 (no similarity) to 1 (exact match).

While the algorithm is pretty clear for artificial examples (like those from the linked Wikipedia article), for arbitrary strings, it may be hard to decide which of two strings should be considered as one having “reference” order of characters (order of matching characters in an essential part of the definition of the algorithm). This makes us consider the first string the “reference” string (with correct order of characters). Thus generally,

jaro a b ≠ jaro b a

This asymmetry can be found in all implementations of the algorithm on the internet, AFAIK.

See also: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Heads up, before version 0.3.0 this function returned Ratio Natural.

Since: 0.2.0

jaroWinkler :: Text -> Text -> Ratio Int Source #

Return Jaro-Winkler distance between two Text values. Returned value is in range from 0 (no similarity) to 1 (exact match).

See also: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Heads up, before version 0.3.0 this function returned Ratio Natural.

Since: 0.2.0