-- | This module exports base types and functions. You can use these to define -- functions to work on arbitrary sequence types. -- -- If you want to work with @Text@ or @String@, import and use "Regex.Text" or -- "Regex.List" instead. -- module Regex.Base ( -- * @RE@ and @Parser@ R.RE , P.Parser -- * Compile , P.compile , P.compileBounded -- * Parse -- $parse , P.ParserState , P.prepareParser , P.stepParser , P.finishParser , P.Foldr , P.parseFoldr , P.parseNext -- * @RE@s and combinators , R.token , R.anySingle , R.single , R.satisfy , R.foldlMany , R.foldlManyMin , R.Many(..) , R.manyr , R.optionalMin , R.someMin , R.manyMin , R.atLeast , R.atMost , R.betweenCount , R.atLeastMin , R.atMostMin , R.betweenCountMin , R.sepBy , R.sepBy1 , R.endBy , R.endBy1 , R.sepEndBy , R.sepEndBy1 , R.chainl1 , R.chainr1 , R.toFind , R.toFindMany -- * Strict combinators -- $strict , R.fmap' , R.liftA2' , R.foldlMany' , R.foldlManyMin' -- * Additional information -- ** Recursive definitions -- $recursive-definitions -- ** Laziness -- $laziness -- ** Looping parsers -- $looping-parsers -- ** Performance -- $performance ) where import qualified Regex.Internal.Regex as R import qualified Regex.Internal.Parser as P -- $parse -- -- The functions @prepareParser@, @stepParser@, and @finishParser@ grant -- a large amount of control over the parsing process, making it possible to -- parse in a resumable or even branching manner. -- -- @parseFoldr@ and @parseNext@ may be more convenient to use, depending on the -- sequence to parse. -- -- $strict -- -- These combinators force the result before continuing parsing. But beware! -- If that particular parse ends up failing, the work done will have been for -- nothing. This can blow up the complexity of parsing. For instance, -- @fmap' sum (many digit)@ is \(O(n^2)\). -- -- These functions are intended to be used when the work done in forcing the -- result is guaranteed to be cheaper than creating a thunk, saving memory and -- time. -- For instance, @liftA2' (:)@ is a good usage, since @(:)@ does a small amount -- of work and a thunk is avoided. As another example, @liftA2' ((+) \@Int)@ is -- /not/ a good usage, because @(+)@ is strict and forces its arguments, -- performing an arbitrary amount of work. However, it is okay to use -- @liftA2' ((+) \@Int)@ if it is known for certain that its arguments will be -- in WHNF. -- -- __WARNING__: If you are not sure whether to use these function, -- /don't use these functions/. Simply use @fmap@, @liftA2@, @foldlMany@ or -- @foldlManyMin@ instead. -- -- $recursive-definitions -- -- It is not possible to define a @RE@ recursively. If it were permitted, it -- would be capable of parsing more than -- [regular languages](https://en.wikipedia.org/wiki/Regular_language). -- Unfortunately, there is no good way\* to make it impossible to write such -- a regex in the first place. So it must be avoided by the programmer. As an -- example, avoid this: -- -- @ -- re :: RE Int [Int] -- re = liftA2 (:) (single 1) re \<|> [] \<$ single 0 -- diverges! -- @ -- -- Instead, use appropriate combinators from this module: -- -- @ -- re = many (single 1) <* single 0 -- @ -- -- For the same reason, be cautious when using combinators from the other -- packages on @RE@s. Make sure that they do not attempt to construct a -- recursive @RE@. -- -- If you find that your regex is impossible to write without recursion, -- you are attempting to parse a non-regular language! You need a more powerful -- parser than what this library has to offer. -- -- \*[Unlifted datatypes](https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/primitives.html#unlifted-datatypes) -- can be used for this but they are too inconvenient to work with. -- -- $laziness -- -- Parsing is lazy in the result value, i.e. the @a@ in @RE c a@ or -- @Parser c a@. In fact, for the algorithm used in this library, this laziness -- is essential for good runtime complexity. However, there is little reason -- to be lazy in other aspects, such as the elements of the sequence, @c@, or -- the functions and regexes used in combinators. Functions are strict in such -- arguments. -- -- @ -- -- Lazy in the result -- parseFoldr foldr (compile (pure ⊥)) [] = Just ⊥ -- parseFoldr foldr (compile (fmap (\\_ -> ⊥) (single 1))) [1] = Just ⊥ -- -- -- Strict in places like -- single ⊥ = ⊥ -- fmap ⊥ r = ⊥ -- liftA2 f r ⊥ = ⊥ -- @ -- -- $looping-parsers -- -- What should be the result of parsing an empty sequence with -- @(many (pure ()))@? -- -- Since @many r@ parses @r@ as many times as possible, and @pure ()@ succeeds -- without consuming input, the result should arguably be the infinite list -- @repeat ()@. Similarly, parsing an empty sequence with -- @(foldlMany f z (pure ()))@ should diverge. Note that this applies to not -- just @pure x@, but any regex that can succeed without consuming input, such -- as @many x@, @manyMin x@, etc. -- -- This library considers that such an outcome is not desirable in practice. It -- would be surprising to get an infinite structure from a parser. So, in the -- case that @many@ succeeds an infinite number of times, this library treats it -- as succeeding /zero/ times. -- -- By this rule, @(many (pure ()))@ on an empty sequence parses as @[]@ and -- @(foldlMany f z (pure ()))@ parses as @z@. -- -- This behavior makes it impossible to distinguish between zero parses and -- infinite parses. To address this, an alternate combinator 'Regex.List.manyr' -- is provided. This parses into a 'Regex.List.Many', a type that clearly -- indicates if parsing succeeded without consuming input into an infinite list, -- or if it succeeded a finite number of times. -- -- $performance -- -- This section describes some performance characteristics of this library, -- without requiring a dive into the source code. -- -- Parsing with a @RE@ is done in two distinct steps. -- -- 1. A @RE@ is compiled to a @Parser@, which is a -- [nondeterministic finite automaton](https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton) -- (NFA), in \(O(m)\) time. \(m\) here is the size of the @RE@, which is the -- number of nodes in its internal tree representation. The resulting @Parser@ -- has \(O(m)\) size. -- 2. The @Parser@ is run on a sequence in \(O(mn \log m)\) time, where \(n\) is -- the length of the sequence. This assumes that each @(c -> Maybe a)@ function -- used to parse individual elements takes \(O(1)\) time. -- -- /Performance tip/: Use @(\<$)@ over @(\<$>)@, and @(\<*)@\/@(*>)@ over -- @liftA2@\/@(\<*>)@ when ignoring the result of a @RE@. Knowing the result is -- ignored allows compiling to a faster parser. -- -- Memory usage for parsing is \(O(nm)\), but -- -- * If the result of a @RE@ is ignored using @(\<$)@, @(\<*)@, or @(*>)@, only -- \(O(m)\) memory is required. -- -- This applies even as subcomponents. So, any subcomponent @RE@ of a larger -- @RE@ that is only recognizing a section of the list is cheaper in terms of -- memory. --