Safe Haskell | None |
---|---|
Language | Haskell2010 |
This module offers regexes, combinators, and operations to work with the
Text
type from the text
package.
Synopsis
- data RE c a
- data TextToken
- type REText = RE TextToken
- token :: (Char -> Maybe a) -> REText a
- satisfy :: (Char -> Bool) -> REText Char
- char :: Char -> REText Char
- charIgnoreCase :: Char -> REText Char
- anyChar :: REText Char
- oneOf :: CharSet -> REText Char
- text :: Text -> REText Text
- textIgnoreCase :: Text -> REText Text
- manyText :: REText Text
- someText :: REText Text
- manyTextMin :: REText Text
- someTextMin :: REText Text
- manyTextOf :: CharSet -> REText Text
- someTextOf :: CharSet -> REText Text
- manyTextOfMin :: CharSet -> REText Text
- someTextOfMin :: CharSet -> REText Text
- naturalDec :: REText Natural
- integerDec :: REText a -> REText Integer
- naturalHex :: REText Natural
- integerHex :: REText a -> REText Integer
- wordRangeDec :: (Word, Word) -> REText Word
- intRangeDec :: REText a -> (Int, Int) -> REText Int
- wordRangeHex :: (Word, Word) -> REText Word
- intRangeHex :: REText a -> (Int, Int) -> REText Int
- wordDecN :: Int -> REText Word
- wordHexN :: Int -> REText Word
- foldlMany :: (b -> a -> b) -> b -> RE c a -> RE c b
- foldlManyMin :: (b -> a -> b) -> b -> RE c a -> RE c b
- toMatch :: REText a -> REText Text
- withMatch :: REText a -> REText (Text, a)
- data Many a
- manyr :: RE c a -> RE c (Many a)
- optionalMin :: RE c a -> RE c (Maybe a)
- someMin :: RE c a -> RE c [a]
- manyMin :: RE c a -> RE c [a]
- atLeast :: Int -> RE c a -> RE c [a]
- atMost :: Int -> RE c a -> RE c [a]
- betweenCount :: (Int, Int) -> RE c a -> RE c [a]
- atLeastMin :: Int -> RE c a -> RE c [a]
- atMostMin :: Int -> RE c a -> RE c [a]
- betweenCountMin :: (Int, Int) -> RE c a -> RE c [a]
- sepBy :: RE c a -> RE c sep -> RE c [a]
- sepBy1 :: RE c a -> RE c sep -> RE c [a]
- endBy :: RE c a -> RE c sep -> RE c [a]
- endBy1 :: RE c a -> RE c sep -> RE c [a]
- sepEndBy :: RE c a -> RE c sep -> RE c [a]
- sepEndBy1 :: RE c a -> RE c sep -> RE c [a]
- chainl1 :: RE c a -> RE c (a -> a -> a) -> RE c a
- chainr1 :: RE c a -> RE c (a -> a -> a) -> RE c a
- reParse :: REText a -> Text -> Maybe a
- data Parser c a
- type ParserText = Parser TextToken
- compile :: RE c a -> Parser c a
- compileBounded :: Int -> RE c a -> Maybe (Parser c a)
- parse :: ParserText a -> Text -> Maybe a
- parseSure :: ParserText a -> Text -> a
- find :: REText a -> Text -> Maybe a
- findAll :: REText a -> Text -> [a]
- splitOn :: REText a -> Text -> [Text]
- replace :: REText Text -> Text -> Maybe Text
- replaceAll :: REText Text -> Text -> Text
RE
s
A regular expression. Operates on a sequence of elements of type c
and
capable of parsing into an a
.
A RE
is a Functor, Applicative, and Alternative.
pure
: Succeed without consuming input.liftA2
,<*>
,*>
,<*
: Sequential composition.empty
: Fail.<|>
: Alternative composition. Left-biased, i.e. the result of parsing usinga <|> b
is the result of parsing usinga
if it succeeds, otherwise it is the result of parsing usingb
if it succeeds, otherwise parsing fails.many
: Zero or more.many a
parses multiplea
s sequentially. Biased towards matching more. UsemanyMin
for a bias towards matching less. Also see the section "Looping parsers".some
: One or more.some a
parses multiplea
s sequentially. Biased towards matching more. UsesomeMin
for a bias towards matching less.
In addition to expected Functor, Applicative, and Alternative laws,
RE
obeys these Applicative-Alternative laws:
a <*> empty = empty empty <*> a = empty (a <|> b) <*> c = (a <*> c) <|> (b <*> c) a <*> (b <|> c) = (a <*> b) <|> (a <*> c)
Note that, because of bias, it is not true that a <|> b = b <|> a
.
Performance tip: Prefer the smaller of equivalent regexes, i.e. prefer
(a <|> b) <*> c
over (a <*> c) <|> (b <*> c)
.
type REText = RE TextToken Source #
A type alias for convenience.
A function which accepts a RE c a
will accept a REText a
.
token :: (Char -> Maybe a) -> REText a Source #
Parse a Char
into an a
if the given function returns Just
.
charIgnoreCase :: Char -> REText Char Source #
Parse the given Char
, ignoring case.
Comparisons are performed after applying simple case folding as described by the Unicode standard.
textIgnoreCase :: Text -> REText Text Source #
Parse the given Text
, ignoring case.
Comparisons are performed after applying simple case folding as described by the Unicode standard.
manyTextMin :: REText Text Source #
Parse any Text
. Minimal, i.e. biased towards matching less.
someTextMin :: REText Text Source #
Parse any non-empty Text
. Minimal, i.e. biased towards matching less.
manyTextOf :: CharSet -> REText Text Source #
Parse any Text
containing members of the CharSet
.
Biased towards matching more.
someTextOf :: CharSet -> REText Text Source #
Parse any non-empty Text
containing members of the CharSet
.
Biased towards matching more.
manyTextOfMin :: CharSet -> REText Text Source #
Parse any Text
containing members of the CharSet
.
Minimal, i.e. biased towards matching less.
someTextOfMin :: CharSet -> REText Text Source #
Parse any non-empty Text
containing members of the CharSet
.
Minimal, i.e. biased towards matching less.
Numeric RE
s
naturalDec :: REText Natural Source #
Parse a decimal Natural
.
Leading zeros are not accepted. Biased towards matching more.
integerDec :: REText a -> REText Integer Source #
Parse a decimal Integer
. Parse an optional sign, '-'
or '+'
,
followed by the given RE
, followed by the absolute value of the integer.
Leading zeros are not accepted. Biased towards matching more.
naturalHex :: REText Natural Source #
Parse a hexadecimal Natural
. Both uppercase 'A'..'F'
and lowercase
'a'..'f'
are accepted.
Leading zeros are not accepted. Biased towards matching more.
integerHex :: REText a -> REText Integer Source #
Parse a hexadecimal Integer
. Parse an optional sign, '-'
or '+'
,
followed by the given RE
, followed by the absolute value of the integer.
Both uppercase 'A'..'F'
and lowercase 'a'..'f'
are accepted.
Leading zeros are not accepted. Biased towards matching more.
wordRangeDec :: (Word, Word) -> REText Word Source #
Parse a decimal Word
in the range [low..high]
.
Leading zeros are not accepted. Biased towards matching more.
intRangeDec :: REText a -> (Int, Int) -> REText Int Source #
Parse a decimal Int
in the range [low..high]
. Parse an optional sign,
'-'
or '+'
, followed by the given RE
, followed by the absolute
value of the integer.
Leading zeros are not accepted. Biased towards matching more.
wordRangeHex :: (Word, Word) -> REText Word Source #
Parse a hexadecimal Word
in the range [low..high]
. Both uppercase
'A'..'F'
and lowercase 'a'..'f'
are accepted.
Leading zeros are not accepted. Biased towards matching more.
intRangeHex :: REText a -> (Int, Int) -> REText Int Source #
Parse a hexadecimal Int
in the range [low..high]
. Parse an optional
sign, '-'
or '+'
, followed by the given RE
, followed by the
absolute value of the integer.
Both uppercase 'A'..'F'
and lowercase 'a'..'f'
are accepted.
Leading zeros are not accepted. Biased towards matching more.
wordDecN :: Int -> REText Word Source #
Parse a Word
of exactly n decimal digits, including any leading zeros.
Will not parse values that do not fit in a Word
.
Biased towards matching more.
wordHexN :: Int -> REText Word Source #
Parse a Word
of exactly n hexadecimal digits, including any leading
zeros. Both uppercase 'A'..'F'
and lowercase 'a'..'f'
are
accepted. Will not parse values that do not fit in a Word
.
Biased towards matching more.
Combinators
foldlMany :: (b -> a -> b) -> b -> RE c a -> RE c b Source #
Parse many occurences of the given RE
. Biased towards matching more.
Also see the section "Looping parsers".
foldlManyMin :: (b -> a -> b) -> b -> RE c a -> RE c b Source #
Parse many occurences of the given RE
. Minimal, i.e. biased towards
matching less.
toMatch :: REText a -> REText Text Source #
Rebuild the RE
such that the result is the matched Text
instead.
withMatch :: REText a -> REText (Text, a) Source #
Rebuild the RE
to include the matched Text
alongside the result.
A repeating value or a finite list.
Instances
Eq1 Many Source # | |
Ord1 Many Source # | |
Defined in Regex.Internal.Regex | |
Show1 Many Source # | |
NFData1 Many Source # | |
Defined in Regex.Internal.Regex | |
Functor Many Source # | |
Foldable Many Source # | |
Defined in Regex.Internal.Regex fold :: Monoid m => Many m -> m # foldMap :: Monoid m => (a -> m) -> Many a -> m # foldMap' :: Monoid m => (a -> m) -> Many a -> m # foldr :: (a -> b -> b) -> b -> Many a -> b # foldr' :: (a -> b -> b) -> b -> Many a -> b # foldl :: (b -> a -> b) -> b -> Many a -> b # foldl' :: (b -> a -> b) -> b -> Many a -> b # foldr1 :: (a -> a -> a) -> Many a -> a # foldl1 :: (a -> a -> a) -> Many a -> a # elem :: Eq a => a -> Many a -> Bool # maximum :: Ord a => Many a -> a # | |
NFData a => NFData (Many a) Source # | |
Defined in Regex.Internal.Regex | |
Show a => Show (Many a) Source # | |
Eq a => Eq (Many a) Source # | |
Ord a => Ord (Many a) Source # | |
manyr :: RE c a -> RE c (Many a) Source #
Zero or more. Biased towards matching more.
Also see the section "Looping parsers".
optionalMin :: RE c a -> RE c (Maybe a) Source #
Zero or one. Minimal, i.e. biased towards zero.
Use Control.Applicative.
for the same but biased towards one.optional
betweenCount :: (Int, Int) -> RE c a -> RE c [a] Source #
Between m and n times (inclusive). Biased towards matching more.
atLeastMin :: Int -> RE c a -> RE c [a] Source #
At least n times. Minimal, i.e. biased towards matching less.
atMostMin :: Int -> RE c a -> RE c [a] Source #
At most n times. Minimal, i.e. biased towards matching less.
betweenCountMin :: (Int, Int) -> RE c a -> RE c [a] Source #
Between m and n times (inclusive). Minimal, i.e. biased towards matching less.
sepBy :: RE c a -> RE c sep -> RE c [a] Source #
r `sepBy` sep
parses zero or more occurences of r
, separated by
sep
. Biased towards matching more.
sepBy1 :: RE c a -> RE c sep -> RE c [a] Source #
r `sepBy1` sep
parses one or more occurences of r
, separated by
sep
. Biased towards matching more.
endBy :: RE c a -> RE c sep -> RE c [a] Source #
r `endBy` sep
parses zero or more occurences of r
, separated and
ended by sep
. Biased towards matching more.
endBy1 :: RE c a -> RE c sep -> RE c [a] Source #
r `endBy1` sep
parses one or more occurences of r
, separated and
ended by sep
. Biased towards matching more.
sepEndBy :: RE c a -> RE c sep -> RE c [a] Source #
r `sepEndBy` sep
parses zero or more occurences of r
, separated and
optionally ended by sep
. Biased towards matching more.
sepEndBy1 :: RE c a -> RE c sep -> RE c [a] Source #
r `sepEndBy1` sep
parses one or more occurences of r
, separated and
optionally ended by sep
. Biased towards matching more.
chainl1 :: RE c a -> RE c (a -> a -> a) -> RE c a Source #
chainl1 r op
parses one or more occurences of r
, separated by op
.
The result is obtained by left associative application of all functions
returned by op
to the values returned by p
. Biased towards matching more.
chainr1 :: RE c a -> RE c (a -> a -> a) -> RE c a Source #
chainr1 r op
parses one or more occurences of r
, separated by op
.
The result is obtained by right associative application of all functions
returned by op
to the values returned by p
. Biased towards matching more.
Combinators in base
Various combinators are available in base
that work with RE
s, by virtue
of RE
being Applicative
and Alternative
.
Since this package does not attempt to redefine or re-export such
combinators, you need to import and use them. Commonly used combinators
are:
- Control.Applicative:
liftA2
,<|>
,empty
,many
,some
,optional
- Control.Monad:
void
,replicateM
,replicateM_
- Data.Foldable:
traverse_
,for_
,sequenceA_
,asum
- Data.Traversable:
traverse
,for
,sequenceA
Compile and parse
reParse :: REText a -> Text -> Maybe a Source #
\(O(mn \log m)\). Parse a Text
with a REText
.
Parses the entire Text
, not just a prefix or a substring.
Uses compile
, see the note there.
If parsing multiple Text
s using the same RE
, it is wasteful to compile
the RE
every time. So, prefer to
- Compile once with
compile
orcompileBounded
and use the compiledParserText
withparse
as many times as required. - Alternately, partially apply this function to a
RE
and use the function as many times as required.
type ParserText = Parser TextToken Source #
A type alias for convenience.
A function which accepts a Parser c a
will accept a ParserText a
.
compile :: RE c a -> Parser c a Source #
\(O(m)\). Compile a RE c a
to a Parser c a
.
Note: compile
does not limit the size of the RE
. See compileBounded
if you would like to limit the size.
RE
s with size greater than (maxBound::Int) `div` 2
are not supported
and the behavior of such a RE
is undefined.
compileBounded :: Int -> RE c a -> Maybe (Parser c a) Source #
\(O(\min(l,m))\). Compile a RE c a
to a Parser c a
.
Returns Nothing
if the size of the RE
is greater than the provided limit
\(l\). You may want to use this if you suspect that the RE
may be too
large, for instance if the regex is constructed from an untrusted source.
While the exact size of a RE
depends on an internal representation, it can
be assumed to be in the same order as the length of a
regex pattern
corresponding to the RE
.
parse :: ParserText a -> Text -> Maybe a Source #
\(O(mn \log m)\). Parse a Text
with a ParserText
.
Parses the entire Text
, not just a prefix or a substring.
parseSure :: ParserText a -> Text -> a Source #
\(O(mn \log m)\). Parse a Text
with a ParserText
. Calls error
on
parse failure.
For use with parsers that are known to never fail.
Parses the entire Text
, not just a prefix or a substring.
Text operations
find :: REText a -> Text -> Maybe a Source #
\(O(mn \log m)\). Find the first occurence of the given RE
in a Text
.
Examples
>>>
find (text "meow") "homeowner"
Just "meow"
To test whether a Text
is present in another Text
, like above, prefer
Data.Text.
.isInfixOf
>>>
find (textIgnoreCase "haskell") "Look I'm Haskelling!"
Just "Haskell">>>
find (text "backtracking") "parser-regex"
Nothing
findAll :: REText a -> Text -> [a] Source #
\(O(mn \log m)\). Find all non-overlapping occurences of the given RE
in
the Text
.
Examples
>>>
findAll (text "ana") "banananana"
["ana","ana"]
data Roll = Roll Natural -- ^ Rolls Natural -- ^ Faces on the die deriving Show roll :: REText Roll roll = Roll <$> (naturalDec
<|> pure 1) <*char
'd' <*> naturalDec
>>>
findAll roll "3d6, d10, 2d10"
[Roll 3 6,Roll 1 10,Roll 2 10]
splitOn :: REText a -> Text -> [Text] Source #
\(O(mn \log m)\). Split a Text
at occurences of the given RE
.
Examples
>>>
splitOn (char ' ') "Glasses are really versatile"
["Glasses","are","really","versatile"]
For simple splitting, like above, prefer Data.Text.
,
words
Data.Text.
, lines
Data.Text.
or
split
Data.Text.
, whichever is applicable.splitOn
>>>
splitOn (char ' ' *> oneOf "+-=" *> char ' ') "3 - 1 + 1/2 - 2 = 0"
["3","1","1/2","2","0"]
If the Text
starts or ends with a delimiter, the result will contain
empty Text
s at those positions.
>>>
splitOn (char 'a') "ayaya"
["","y","y",""]
replace :: REText Text -> Text -> Maybe Text Source #
\(O(mn \log m)\). Replace the first match of the given RE
with its
result. If there is no match, the result is Nothing
.
Examples
>>>
replace ("world" <$ text "Haskell") "Hello, Haskell!"
Just "Hello, world!"
>>>
replace ("," <$ some (char '.')) "one...two...ten"
Just "one,two...ten"
replaceAll :: REText Text -> Text -> Text Source #
\(O(mn \log m)\). Replace all non-overlapping matches of the given RE
with their results.
Examples
>>>
replaceAll (" and " <$ text ", ") "red, blue, green"
"red and blue and green"
For simple replacements like above, prefer Data.Text.
.replace
>>>
replaceAll ("Fruit" <$ text "Time" <|> "a banana" <$ text "an arrow") "Time flies like an arrow"
"Fruit flies like a banana"
sep =oneOf
"-./" digits n =toMatch
(replicateM_
n (oneOfdigit
)) toYmd d m y = mconcat [y, "-", m, "-", d] date = toYmd <$> digits 2 <* sep <*> digits 2 <* sep <*> digits 4
>>>
replaceAll date "01/01/1970, 01-04-1990, 03.07.2011"
"1970-01-01, 1990-04-01, 2011-07-03"
Additional information
Recursive definitions
It is not possible to define a RE
recursively. If it were permitted, it
would be capable of parsing more than
regular languages.
Unfortunately, there is no good way* to make it impossible to write such
a regex in the first place. So it must be avoided by the programmer. As an
example, avoid this:
re :: REText [Text] re = liftA2 (:) (text "ha") re <|> [] <$ text "!" -- diverges!
Instead, use appropriate combinators from this module:
re = many (text "ha") <* text "!"
For the same reason, be cautious when using combinators from the other
packages on RE
s. Make sure that they do not attempt to construct a
recursive RE
.
If you find that your regex is impossible to write without recursion, you are attempting to parse a non-regular language! You need a more powerful parser than what this library has to offer.
*Unlifted datatypes can be used for this but they are too inconvenient to work with.
Laziness
Parsing is lazy in the result value, i.e. the a
in RE c a
or
Parser c a
. In fact, for the algorithm used in this library, this laziness
is essential for good runtime complexity. However, there is little reason
to be lazy in other aspects, such as the elements of the sequence, c
, or
the functions and regexes used in combinators. Functions are strict in such
arguments.
-- Lazy in the result reParse (pure ⊥) "" = Just ⊥ reParse (fmap (\_ -> ⊥) (char 'a')) "a" = Just ⊥ -- Strict in places like char ⊥ = ⊥ fmap ⊥ r = ⊥ liftA2 f r ⊥ = ⊥
Looping parsers
What should be the result of reParse (many (pure ())) ""
?
Since many r
parses r
as many times as possible, and pure ()
succeeds
without consuming input, the result should arguably be the infinite list
repeat ()
. Similarly, reParse (foldlMany f z (pure ())) ""
should
diverge. Note that this applies to not just pure x
, but any regex that
can succeed without consuming input, such as many x
, manyMin x
, etc.
This library considers that such an outcome is not desirable in practice. It
would be surprising to get an infinite structure from a parser. So, in the
case that many
succeeds an infinite number of times, this library treats it
as succeeding zero times.
By this rule, reParse (many (pure ())) ""
parses as []
and
reParse (foldlMany f z (pure ())) ""
parses as z
.
This behavior makes it impossible to distinguish between zero parses and
infinite parses. To address this, an alternate combinator manyr
is provided. This parses into a Many
, a type that clearly
indicates if parsing succeeded without consuming input into an infinite list,
or if it succeeded a finite number of times.
Performance
This section describes some performance characteristics of this library, without requiring a dive into the source code.
Parsing with a RE
is done in two distinct steps.
- A
RE
is compiled to aParser
, which is a nondeterministic finite automaton (NFA), in \(O(m)\) time. \(m\) here is the size of theRE
, which is the number of nodes in its internal tree representation. The resultingParser
has \(O(m)\) size. - The
Parser
is run on aText
in \(O(mn \log m)\) time, where \(n\) is the length of theText
. This assumes that each(TextToken -> Maybe a)
function used to parse individual elements takes \(O(1)\) time.
Performance tip: Use (<$)
over (<$>)
, and (<*)
/(*>)
over
liftA2
/(<*>)
when ignoring the result of a RE
. Knowing the result is
ignored allows compiling to a faster parser.
Memory usage for parsing is \(O(nm)\), but
- If the result of a
RE
is ignored using(<$)
,(<*)
, or(*>)
, only \(O(m)\) memory is required. - To parse some slice of the input
Text
(using one ofmanyText
,manyTextOf
, etc.), memory required is \(O(1)\). FortoMatch r
, memory required is \(O(m' \min (m',n))\) where \(m'\) is the size ofr
.
This applies even as subcomponents. So, any subcomponent RE
of a larger
RE
that is only recognizing text or parsing a slice is cheaper in terms of
memory.