parser-regex-0.2.0.1: Regex based parsers
Safe HaskellNone
LanguageHaskell2010

Regex.Base

Description

This module exports base types and functions. You can use these to define functions to work on arbitrary sequence types.

If you want to work with Text or String, import and use Regex.Text or Regex.List instead.

Synopsis

RE and Parser

data RE c a Source #

A regular expression. Operates on a sequence of elements of type c and capable of parsing into an a.

A RE is a Functor, Applicative, and Alternative.

  • pure: Succeed without consuming input.
  • liftA2, <*>, *>, <*: Sequential composition.
  • empty: Fail.
  • <|>: Alternative composition. Left-biased, i.e. the result of parsing using a <|> b is the result of parsing using a if it succeeds, otherwise it is the result of parsing using b if it succeeds, otherwise parsing fails.
  • many: Zero or more. many a parses multiple as sequentially. Biased towards matching more. Use manyMin for a bias towards matching less. Also see the section "Looping parsers".
  • some: One or more. some a parses multiple as sequentially. Biased towards matching more. Use someMin for a bias towards matching less.

In addition to expected Functor, Applicative, and Alternative laws, RE obeys these Applicative-Alternative laws:

a <*> empty = empty
empty <*> a = empty
(a <|> b) <*> c = (a <*> c) <|> (b <*> c)
a <*> (b <|> c) = (a <*> b) <|> (a <*> c)

Note that, because of bias, it is not true that a <|> b = b <|> a.

Performance tip: Prefer the smaller of equivalent regexes, i.e. prefer (a <|> b) <*> c over (a <*> c) <|> (b <*> c).

Instances

Instances details
Alternative (RE c) Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

empty :: RE c a #

(<|>) :: RE c a -> RE c a -> RE c a #

some :: RE c a -> RE c [a] #

many :: RE c a -> RE c [a] #

Applicative (RE c) Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

pure :: a -> RE c a #

(<*>) :: RE c (a -> b) -> RE c a -> RE c b #

liftA2 :: (a -> b -> c0) -> RE c a -> RE c b -> RE c c0 #

(*>) :: RE c a -> RE c b -> RE c b #

(<*) :: RE c a -> RE c b -> RE c a #

Functor (RE c) Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

fmap :: (a -> b) -> RE c a -> RE c b #

(<$) :: a -> RE c b -> RE c a #

Monoid a => Monoid (RE c a) Source #

mempty = pure mempty

Instance details

Defined in Regex.Internal.Regex

Methods

mempty :: RE c a #

mappend :: RE c a -> RE c a -> RE c a #

mconcat :: [RE c a] -> RE c a #

Semigroup a => Semigroup (RE c a) Source #

(<>) = liftA2 (<>)

Instance details

Defined in Regex.Internal.Regex

Methods

(<>) :: RE c a -> RE c a -> RE c a #

sconcat :: NonEmpty (RE c a) -> RE c a #

stimes :: Integral b => b -> RE c a -> RE c a #

data Parser c a Source #

A parser compiled from a RE c a.

Compile

compile :: RE c a -> Parser c a Source #

\(O(m)\). Compile a RE c a to a Parser c a.

Note: compile does not limit the size of the RE. See compileBounded if you would like to limit the size. REs with size greater than (maxBound::Int) `div` 2 are not supported and the behavior of such a RE is undefined.

compileBounded :: Int -> RE c a -> Maybe (Parser c a) Source #

\(O(\min(l,m))\). Compile a RE c a to a Parser c a.

Returns Nothing if the size of the RE is greater than the provided limit \(l\). You may want to use this if you suspect that the RE may be too large, for instance if the regex is constructed from an untrusted source.

While the exact size of a RE depends on an internal representation, it can be assumed to be in the same order as the length of a regex pattern corresponding to the RE.

Parse

The functions prepareParser, stepParser, and finishParser grant a large amount of control over the parsing process, making it possible to parse in a resumable or even branching manner.

parseFoldr and parseNext may be more convenient to use, depending on the sequence to parse.

data ParserState c a Source #

The state maintained for parsing.

prepareParser :: Parser c a -> Maybe (ParserState c a) Source #

\(O(m \log m)\). Prepare a parser for input.

Returns Nothing if parsing has failed regardless of further input. Otherwise, returns the initial ParserState.

stepParser :: ParserState c a -> c -> Maybe (ParserState c a) Source #

\(O(m \log m)\). Step a parser by feeding a single element c.

Returns Nothing if parsing has failed regardless of further input. Otherwise, returns an updated ParserState.

finishParser :: ParserState c a -> Maybe a Source #

\(O(1)\). Get the parse result for the input fed into the parser so far.

type Foldr f a = forall b. (a -> b -> b) -> b -> f -> b Source #

A fold function.

parseFoldr :: Foldr f c -> Parser c a -> f -> Maybe a Source #

\(O(mn \log m)\). Run a parser given a sequence f and a fold function.

Parses the entire sequence, not just a prefix or an substring. Returns early on parse failure, if the fold can short circuit.

Examples

Expand
import qualified Data.Vector.Generic as VG -- from vector

import Regex.Base (Parser)
import qualified Regex.Base as R

parseVector :: VG.Vector v c => Parser c a -> v c -> Maybe a
parseVector p v = R.parseFoldr VG.foldr p v
>>> import Control.Applicative (many)
>>> import qualified Data.Vector as V
>>> import Regex.Base (Parser)
>>> import qualified Regex.Base as R
>>> 
>>> let p = R.compile $ many ((,) <$> R.satisfy even <*> R.satisfy odd) :: Parser Int [(Int, Int)]
>>> parseVector p (V.fromList [0..5])
Just [(0,1),(2,3),(4,5)]
>>> parseVector p (V.fromList [0,2..6])
Nothing

parseNext :: Monad m => Parser c a -> m (Maybe c) -> m (Maybe a) Source #

\(O(mn \log m)\). Run a parser given a "next" action.

Calls next repeatedly to yield elements. A Nothing is interpreted as end-of-sequence.

Parses the entire sequence, not just a prefix or an substring. Returns without exhausting the input on parse failure.

Examples

Expand
import Conduit (ConduitT, await, sinkNull) -- from conduit

import Regex.Base (Parser)
import qualified Regex.Base as R

parseConduit :: Monad m => Parser c a -> ConduitT c x m (Maybe a)
parseConduit p = R.parseNext p await <* sinkNull
>>> import Control.Applicative (many)
>>> import Conduit ((.|), iterMC, runConduit, yieldMany)
>>> import Regex.Base (Parser)
>>> import qualified Regex.Base as R
>>> 
>>> let p = R.compile $ many ((,) <$> R.satisfy even <*> R.satisfy odd) :: Parser Int [(Int, Int)]
>>> let printYieldMany xs = yieldMany xs .| iterMC print
>>> runConduit $ printYieldMany [0..5] .| parseConduit p
0
1
2
3
4
5
Just [(0,1),(2,3),(4,5)]
>>> runConduit $ printYieldMany [0,2..6] .| parseConduit p
0
2
4
6
Nothing

Since: 0.2.0.0

REs and combinators

token :: (c -> Maybe a) -> RE c a Source #

Parse a c into an a if the given function returns Just.

anySingle :: RE c c Source #

Parse any c.

single :: Eq c => c -> RE c c Source #

Parse the given c.

satisfy :: (c -> Bool) -> RE c c Source #

Parse a c if it satisfies the given predicate.

foldlMany :: (b -> a -> b) -> b -> RE c a -> RE c b Source #

Parse many occurences of the given RE. Biased towards matching more.

Also see the section "Looping parsers".

foldlManyMin :: (b -> a -> b) -> b -> RE c a -> RE c b Source #

Parse many occurences of the given RE. Minimal, i.e. biased towards matching less.

data Many a Source #

A repeating value or a finite list.

Constructors

Repeat a

A single value repeating indefinitely

Finite [a]

A finite list

Instances

Instances details
Eq1 Many Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

liftEq :: (a -> b -> Bool) -> Many a -> Many b -> Bool #

Ord1 Many Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

liftCompare :: (a -> b -> Ordering) -> Many a -> Many b -> Ordering #

Show1 Many Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

liftShowsPrec :: (Int -> a -> ShowS) -> ([a] -> ShowS) -> Int -> Many a -> ShowS #

liftShowList :: (Int -> a -> ShowS) -> ([a] -> ShowS) -> [Many a] -> ShowS #

NFData1 Many Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

liftRnf :: (a -> ()) -> Many a -> () #

Functor Many Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

fmap :: (a -> b) -> Many a -> Many b #

(<$) :: a -> Many b -> Many a #

Foldable Many Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

fold :: Monoid m => Many m -> m #

foldMap :: Monoid m => (a -> m) -> Many a -> m #

foldMap' :: Monoid m => (a -> m) -> Many a -> m #

foldr :: (a -> b -> b) -> b -> Many a -> b #

foldr' :: (a -> b -> b) -> b -> Many a -> b #

foldl :: (b -> a -> b) -> b -> Many a -> b #

foldl' :: (b -> a -> b) -> b -> Many a -> b #

foldr1 :: (a -> a -> a) -> Many a -> a #

foldl1 :: (a -> a -> a) -> Many a -> a #

toList :: Many a -> [a] #

null :: Many a -> Bool #

length :: Many a -> Int #

elem :: Eq a => a -> Many a -> Bool #

maximum :: Ord a => Many a -> a #

minimum :: Ord a => Many a -> a #

sum :: Num a => Many a -> a #

product :: Num a => Many a -> a #

NFData a => NFData (Many a) Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

rnf :: Many a -> () #

Show a => Show (Many a) Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

showsPrec :: Int -> Many a -> ShowS #

show :: Many a -> String #

showList :: [Many a] -> ShowS #

Eq a => Eq (Many a) Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

(==) :: Many a -> Many a -> Bool #

(/=) :: Many a -> Many a -> Bool #

Ord a => Ord (Many a) Source # 
Instance details

Defined in Regex.Internal.Regex

Methods

compare :: Many a -> Many a -> Ordering #

(<) :: Many a -> Many a -> Bool #

(<=) :: Many a -> Many a -> Bool #

(>) :: Many a -> Many a -> Bool #

(>=) :: Many a -> Many a -> Bool #

max :: Many a -> Many a -> Many a #

min :: Many a -> Many a -> Many a #

manyr :: RE c a -> RE c (Many a) Source #

Zero or more. Biased towards matching more.

Also see the section "Looping parsers".

optionalMin :: RE c a -> RE c (Maybe a) Source #

Zero or one. Minimal, i.e. biased towards zero.

Use Control.Applicative.optional for the same but biased towards one.

someMin :: RE c a -> RE c [a] Source #

One or more. Minimal, i.e. biased towards matching less.

manyMin :: RE c a -> RE c [a] Source #

Zero or more. Minimal, i.e. biased towards matching less.

atLeast :: Int -> RE c a -> RE c [a] Source #

At least n times. Biased towards matching more.

atMost :: Int -> RE c a -> RE c [a] Source #

At most n times. Biased towards matching more.

betweenCount :: (Int, Int) -> RE c a -> RE c [a] Source #

Between m and n times (inclusive). Biased towards matching more.

atLeastMin :: Int -> RE c a -> RE c [a] Source #

At least n times. Minimal, i.e. biased towards matching less.

atMostMin :: Int -> RE c a -> RE c [a] Source #

At most n times. Minimal, i.e. biased towards matching less.

betweenCountMin :: (Int, Int) -> RE c a -> RE c [a] Source #

Between m and n times (inclusive). Minimal, i.e. biased towards matching less.

sepBy :: RE c a -> RE c sep -> RE c [a] Source #

r `sepBy` sep parses zero or more occurences of r, separated by sep. Biased towards matching more.

sepBy1 :: RE c a -> RE c sep -> RE c [a] Source #

r `sepBy1` sep parses one or more occurences of r, separated by sep. Biased towards matching more.

endBy :: RE c a -> RE c sep -> RE c [a] Source #

r `endBy` sep parses zero or more occurences of r, separated and ended by sep. Biased towards matching more.

endBy1 :: RE c a -> RE c sep -> RE c [a] Source #

r `endBy1` sep parses one or more occurences of r, separated and ended by sep. Biased towards matching more.

sepEndBy :: RE c a -> RE c sep -> RE c [a] Source #

r `sepEndBy` sep parses zero or more occurences of r, separated and optionally ended by sep. Biased towards matching more.

sepEndBy1 :: RE c a -> RE c sep -> RE c [a] Source #

r `sepEndBy1` sep parses one or more occurences of r, separated and optionally ended by sep. Biased towards matching more.

chainl1 :: RE c a -> RE c (a -> a -> a) -> RE c a Source #

chainl1 r op parses one or more occurences of r, separated by op. The result is obtained by left associative application of all functions returned by op to the values returned by p. Biased towards matching more.

chainr1 :: RE c a -> RE c (a -> a -> a) -> RE c a Source #

chainr1 r op parses one or more occurences of r, separated by op. The result is obtained by right associative application of all functions returned by op to the values returned by p. Biased towards matching more.

toFind :: RE c a -> RE c a Source #

Results in the first occurence of the given RE. Fails if no occurence is found.

toFindMany :: RE c a -> RE c [a] Source #

Results in all non-overlapping occurences of the given RE. Always succeeds.

Strict combinators

These combinators force the result before continuing parsing. But beware! If that particular parse ends up failing, the work done will have been for nothing. This can blow up the complexity of parsing. For instance, fmap' sum (many digit) is \(O(n^2)\).

These functions are intended to be used when the work done in forcing the result is guaranteed to be cheaper than creating a thunk, saving memory and time. For instance, liftA2' (:) is a good usage, since (:) does a small amount of work and a thunk is avoided. As another example, liftA2' ((+) @Int) is not a good usage, because (+) is strict and forces its arguments, performing an arbitrary amount of work. However, it is okay to use liftA2' ((+) @Int) if it is known for certain that its arguments will be in WHNF.

WARNING: If you are not sure whether to use these function, don't use these functions. Simply use fmap, liftA2, foldlMany or foldlManyMin instead.

fmap' :: (a -> b) -> RE c a -> RE c b Source #

liftA2' :: (a1 -> a2 -> b) -> RE c a1 -> RE c a2 -> RE c b Source #

foldlMany' :: (b -> a -> b) -> b -> RE c a -> RE c b Source #

foldlManyMin' :: (b -> a -> b) -> b -> RE c a -> RE c b Source #

Additional information

Recursive definitions

It is not possible to define a RE recursively. If it were permitted, it would be capable of parsing more than regular languages. Unfortunately, there is no good way* to make it impossible to write such a regex in the first place. So it must be avoided by the programmer. As an example, avoid this:

re :: RE Int [Int]
re = liftA2 (:) (single 1) re <|> [] <$ single 0  -- diverges!

Instead, use appropriate combinators from this module:

re = many (single 1) <* single 0

For the same reason, be cautious when using combinators from the other packages on REs. Make sure that they do not attempt to construct a recursive RE.

If you find that your regex is impossible to write without recursion, you are attempting to parse a non-regular language! You need a more powerful parser than what this library has to offer.

*Unlifted datatypes can be used for this but they are too inconvenient to work with.

Laziness

Parsing is lazy in the result value, i.e. the a in RE c a or Parser c a. In fact, for the algorithm used in this library, this laziness is essential for good runtime complexity. However, there is little reason to be lazy in other aspects, such as the elements of the sequence, c, or the functions and regexes used in combinators. Functions are strict in such arguments.

-- Lazy in the result
parseFoldr foldr (compile (pure ⊥)) [] = Just ⊥
parseFoldr foldr (compile (fmap (\_ -> ⊥) (single 1))) [1] = Just ⊥

-- Strict in places like
single ⊥ = ⊥
fmap ⊥ r = ⊥
liftA2 f r ⊥ = ⊥

Looping parsers

What should be the result of parsing an empty sequence with (many (pure ()))?

Since many r parses r as many times as possible, and pure () succeeds without consuming input, the result should arguably be the infinite list repeat (). Similarly, parsing an empty sequence with (foldlMany f z (pure ())) should diverge. Note that this applies to not just pure x, but any regex that can succeed without consuming input, such as many x, manyMin x, etc.

This library considers that such an outcome is not desirable in practice. It would be surprising to get an infinite structure from a parser. So, in the case that many succeeds an infinite number of times, this library treats it as succeeding zero times.

By this rule, (many (pure ())) on an empty sequence parses as [] and (foldlMany f z (pure ())) parses as z.

This behavior makes it impossible to distinguish between zero parses and infinite parses. To address this, an alternate combinator manyr is provided. This parses into a Many, a type that clearly indicates if parsing succeeded without consuming input into an infinite list, or if it succeeded a finite number of times.

Performance

This section describes some performance characteristics of this library, without requiring a dive into the source code.

Parsing with a RE is done in two distinct steps.

  1. A RE is compiled to a Parser, which is a nondeterministic finite automaton (NFA), in \(O(m)\) time. \(m\) here is the size of the RE, which is the number of nodes in its internal tree representation. The resulting Parser has \(O(m)\) size.
  2. The Parser is run on a sequence in \(O(mn \log m)\) time, where \(n\) is the length of the sequence. This assumes that each (c -> Maybe a) function used to parse individual elements takes \(O(1)\) time.

Performance tip: Use (<$) over (<$>), and (<*)/(*>) over liftA2/(<*>) when ignoring the result of a RE. Knowing the result is ignored allows compiling to a faster parser.

Memory usage for parsing is \(O(nm)\), but

  • If the result of a RE is ignored using (<$), (<*), or (*>), only \(O(m)\) memory is required.

This applies even as subcomponents. So, any subcomponent RE of a larger RE that is only recognizing a section of the list is cheaper in terms of memory.