Copyright | (c) 2020 Composewell Technologies |
---|---|
License | BSD-3-Clause |
Maintainer | streamly@composewell.com |
Stability | experimental |
Portability | GHC |
Safe Haskell | Safe-Inferred |
Language | Haskell2010 |
Streaming and backtracking parsers.
Parsers just extend folds. Please read the Fold
design notes in
Streamly.Internal.Data.Fold.Type for background on the design.
Parser Design
The Parser
type or a parsing fold is a generalization of the Fold
type.
The Fold
type always succeeds on each input. Therefore, it does not need
to buffer the input. In contrast, a Parser
may fail and backtrack to
replay the input again to explore another branch of the parser. Therefore,
it needs to buffer the input. Therefore, a Parser
is a fold with some
additional requirements. To summarize, unlike a Fold
, a Parser
:
- may not generate a new value of the accumulator on every input, it may generate a new accumulator only after consuming multiple input elements (e.g. takeEQ).
- on success may return some unconsumed input (e.g. takeWhile)
- may fail and return all input without consuming it (e.g. satisfy)
- backtrack and start inspecting the past input again (e.g. alt)
These use cases require buffering and replaying of input. To facilitate
this, the step function of the Fold
is augmented to return the next state
of the fold along with a command tag using a Step
functor, the tag tells
the fold driver to manipulate the future input as the parser wishes. The
Step
functor provides the following commands to the fold driver
corresponding to the use cases outlined in the previous para:
Continue
: buffer the current input and optionally go back to a previous position in the streamPartial
: buffer the current input and optionally go back to a previous position in the stream, drop the buffer before that position.Done
: parser succeeded, returns how much input was leftoverError
: indicates that the parser has failed without a result
How a Parser Works?
A parser is just like a fold, it keeps consuming inputs from the stream and accumulating them in an accumulator. The accumulator of the parser could be a singleton value or it could be a collection of values e.g. a list.
The parser may build a new output value from multiple input items. When it
consumes an input item but needs more input to build a complete output item
it uses Continue 0 s
, yielding the intermediate state s
and asking the
driver to provide more input. When the parser determines that a new output
value is complete it can use a Done n b
to terminate the parser with n
items of input unused and the final value of the accumulator returned as
b
. If at any time the parser determines that the parse has failed it can
return Error err
.
A parser building a collection of values (e.g. a list) can use the Partial
constructor whenever a new item in the output collection is generated. If a
parser building a collection of values has yielded at least one value then
it is considered successful and cannot fail after that. In the current
implementation, this is not automatically enforced, there is a rule that the
parser MUST use only Done
for termination after the first Partial
, it
cannot use Error
. It may be possible to change the implementation so that
this rule is not required, but there may be some performance cost to it.
takeWhile
and
some
combinators are good examples of
efficient implementations using all features of this representation. It is
possible to idiomatically build a collection of parsed items using a
singleton parser and Alternative
instance instead of using a
multi-yield parser. However, this implementation is amenable to stream
fusion and can therefore be much faster.
Error Handling
When a parser's step
function is invoked it may terminate by either a
Done
or an Error
return value. In an Alternative
composition an error
return can make the composed parser backtrack and try another parser.
If the stream stops before a parser could terminate then we use the
extract
function of the parser to retrieve the last yielded value of the
parser. If the parser has yielded at least one value then extract
MUST
return a value without throwing an error, otherwise it uses the ParseError
exception to throw an error.
We chose the exception throwing mechanism for extract
instead of using an
explicit error return via an Either
type for keeping the interface simple
as most of the time we do not need to catch the error in intermediate
layers. Note that we cannot use exception throwing mechanism in step
function because of performance reasons. Error
constructor in that case
allows loop fusion and better performance.
Optimizing backtracking
Applicative Composition
If a parser once returned Partial
it can never fail after that. This is
used to reduce the buffering. A Partial
results in dropping the buffer and
we cannot backtrack before that point.
Parsers can be composed using an Alternative, if we are in an alternative
composition we may have to backtrack to try the other branch. When we
compose two parsers using applicative f $ p1 * p2
we can return a
Partial
result only after both the parsers have succeeded. While running
p1
we have to ensure that the input is not dropped until we have run p2
,
therefore we have to return a Continue instead of a Partial.
However, if we know they both cannot fail then we know that the composed
parser can never fail. For this reason we should have "backtracking folds"
as a separate type so that we can compose them in an efficient manner. In p1
itself we can drop the buffer as soon as a Partial
result arrives. In
fact, there is no Alternative composition for folds because they cannot
fail.
Alternative Composition
In p1 | p2
as soon as the parser p1 returns Partial
we know that it
will not fail and we can immediately drop the buffer.
If we are not using the parser in an alternative composition we can downgrade the parser to a backtracking fold and use the "backtracking fold"'s applicative for more efficient implementation. To downgrade we can translate the Error of parser to an exception. This gives us best of both worlds, the applicative as well as alternative would have optimal backtracking buffer.
The "many" for parsers would be different than "many" for folds. In case of folds an error would be propagated. In case of parsers the error would be ignored.
Implementation Approach
Backtracking folds have an issue with tee style composition because each fold can backtrack independently, we will need independent buffers. Though this may be possible to implement it may not be efficient especially for folds that do not backtrack at all. Three types are possible, optimized for different use cases:
- Non-backtracking folds: efficient Tee
- Backtracking folds: efficient applicative
- Parsers: alternative
Downgrade parsers to backtracking folds for applicative used without alternative. Upgrade backtracking folds to parsers when we have to use them as the last alternative.
Future Work
It may make sense to move "takeWhile" type of parsers, which cannot fail but need some lookahead, to splitting folds. This will allow such combinators to be accepted where we need an unfailing Fold type.
Based on application requirements it should be possible to design even a richer interface to manipulate the input stream/buffer. For example, we could randomly seek into the stream in the forward or reverse directions or we can even seek to the end or from the end or seek from the beginning.
We can distribute and scan/parse a stream using both folds and parsers and merge the resulting streams using different merge strategies (e.g. interleaving or serial).
Naming
As far as possible, try that the names of the combinators in this module are consistent with:
Synopsis
- data Initial s b
- data Step s b
- extractStep :: Monad m => (s -> m (Step s1 b)) -> Step s b -> m (Step s1 b)
- bimapOverrideCount :: Int -> (s -> s1) -> (b -> b1) -> Step s b -> Step s1 b1
- data Parser a m b = forall s. Parser (s -> a -> m (Step s b)) (m (Initial s b)) (s -> m (Step s b))
- newtype ParseError = ParseError String
- rmapM :: Monad m => (b -> m c) -> Parser a m b -> Parser a m c
- fromPure :: Monad m => b -> Parser a m b
- fromEffect :: Monad m => m b -> Parser a m b
- splitWith :: Monad m => (a -> b -> c) -> Parser x m a -> Parser x m b -> Parser x m c
- split_ :: Monad m => Parser x m a -> Parser x m b -> Parser x m b
- die :: Monad m => String -> Parser a m b
- dieM :: Monad m => m String -> Parser a m b
- splitSome :: Monad m => Parser a m b -> Fold m b c -> Parser a m c
- splitMany :: Monad m => Parser a m b -> Fold m b c -> Parser a m c
- splitManyPost :: Monad m => Parser a m b -> Fold m b c -> Parser a m c
- alt :: Monad m => Parser x m a -> Parser x m a -> Parser x m a
- concatMap :: Monad m => (b -> Parser a m c) -> Parser a m b -> Parser a m c
- lmap :: (a -> b) -> Parser b m r -> Parser a m r
- lmapM :: Monad m => (a -> m b) -> Parser b m r -> Parser a m r
- filter :: Monad m => (a -> Bool) -> Parser a m b -> Parser a m b
- noErrorUnsafeSplitWith :: Monad m => (a -> b -> c) -> Parser x m a -> Parser x m b -> Parser x m c
- noErrorUnsafeSplit_ :: Monad m => Parser x m a -> Parser x m b -> Parser x m b
- noErrorUnsafeConcatMap :: Monad m => (b -> Parser a m c) -> Parser a m b -> Parser a m c
Setup
>>>
:m
>>>
import Control.Applicative ((<|>))
>>>
import Data.Bifunctor (second)
>>>
import Data.Char (isSpace)
>>>
import qualified Data.Foldable as Foldable
>>>
import qualified Data.Maybe as Maybe
>>>
import Streamly.Data.Fold (Fold)
>>>
import Streamly.Data.Parser (Parser)
>>>
import qualified Streamly.Data.Fold as Fold
>>>
import qualified Streamly.Data.Parser as Parser
>>>
import qualified Streamly.Data.Stream as Stream
For APIs that have not been released yet.
>>>
import qualified Streamly.Internal.Data.Fold as Fold
>>>
import qualified Streamly.Internal.Data.Parser as Parser
Types
The type of a Parser'
s initial action.
Internal
IPartial !s | Wait for step function to be called with state |
IDone !b | Return a result right away without an input. |
IError !String | Return an error right away without an input. |
The return type of a Parser
step.
The parse operation feeds the input stream to the parser one element at a
time, representing a parse Step
. The parser may or may not consume the
item and returns a result. If the result is Partial
we can either extract
the result or feed more input to the parser. If the result is Continue
, we
must feed more input in order to get a result. If the parser returns Done
then the parser can no longer take any more input.
If the result is Continue
, the parse operation retains the input in a
backtracking buffer, in case the parser may ask to backtrack in future.
Whenever a 'Partial n' result is returned we first backtrack by n
elements
in the input and then release any remaining backtracking buffer. Similarly,
'Continue n' backtracks to n
elements before the current position and
starts feeding the input from that point for future invocations of the
parser.
If parser is not yet done, we can use the extract
operation on the state
of the parser to extract a result. If the parser has not yet yielded a
result, the operation fails with a ParseError
exception. If the parser
yielded a Partial
result in the past the last partial result is returned.
Therefore, if a parser yields a partial result once it cannot fail later on.
The parser can never backtrack beyond the position where the last partial result left it at. The parser must ensure that the backtrack position is always after that.
Pre-release
Partial !Int !s |
|
Continue !Int !s |
|
Done !Int !b | Done with leftover input count and result.
|
Error !String | Parser failed without generating any output. The parsing operation may backtrack to the beginning and try another alternative. |
extractStep :: Monad m => (s -> m (Step s1 b)) -> Step s b -> m (Step s1 b) Source #
Map an extract function over the state of Step
bimapOverrideCount :: Int -> (s -> s1) -> (b -> b1) -> Step s b -> Step s1 b1 Source #
Bimap discarding the count, and using the supplied count instead.
A parser is a fold that can fail and is represented as Parser step
initial extract
. Before we drive a parser we call the initial
action to
retrieve the initial state of the fold. The parser driver invokes step
with the state returned by the previous step and the next input element. It
results into a new state and a command to the driver represented by Step
type. The driver keeps invoking the step function until it stops or fails.
At any point of time the driver can call extract
to inspect the result of
the fold. If the parser hits the end of input extract
is called.
It may result in an error or an output value.
Pre-release
Instances
Monad m => MonadFail (Parser a m) Source # | |
Defined in Streamly.Internal.Data.Parser.ParserD.Type | |
(Monad m, MonadIO m) => MonadIO (Parser a m) Source # | |
Defined in Streamly.Internal.Data.Parser.ParserD.Type | |
Monad m => Alternative (Parser a m) Source # | Sequential alternative. The input is first passed to the first parser, and if it succeeds, the result is returned. However, if the first parser fails, the parser driver backtracks and tries the same input on the second parser, returning the result if it succeeds. Note: The implementation of
WARNING! this is not suitable for large scale use. As a thumb rule stream
fusion works well for less than 8 compositions of this operation, otherwise
consider using |
Monad m => Applicative (Parser a m) Source # |
|
Defined in Streamly.Internal.Data.Parser.ParserD.Type | |
Functor m => Functor (Parser a m) Source # | |
Monad m => Monad (Parser a m) Source # | See documentation of Although this implementation allows stream fusion, it has quadratic
complexity, making it suitable only for a small number of compositions. As a
thumb rule use it for less than 8 compositions, use |
newtype ParseError Source #
This exception is used when a parser ultimately fails, the user of the parser is intimated via this exception.
Pre-release
Instances
Exception ParseError Source # | |
Defined in Streamly.Internal.Data.Parser.ParserD.Type toException :: ParseError -> SomeException # fromException :: SomeException -> Maybe ParseError # displayException :: ParseError -> String # | |
Show ParseError Source # | |
Defined in Streamly.Internal.Data.Parser.ParserD.Type showsPrec :: Int -> ParseError -> ShowS # show :: ParseError -> String # showList :: [ParseError] -> ShowS # |
rmapM :: Monad m => (b -> m c) -> Parser a m b -> Parser a m c Source #
rmapM f parser
maps the monadic function f
on the output of the parser.
>>>
rmap = fmap
Constructors
fromPure :: Monad m => b -> Parser a m b Source #
A parser that always yields a pure value without consuming any input.
fromEffect :: Monad m => m b -> Parser a m b Source #
A parser that always yields the result of an effectful action without consuming any input.
splitWith :: Monad m => (a -> b -> c) -> Parser x m a -> Parser x m b -> Parser x m c Source #
Sequential parser application.
Apply two parsers sequentially to an input stream. The first parser runs and processes the input, the remaining input is then passed to the second parser. If both parsers succeed, their outputs are combined using the supplied function. If either parser fails, the operation fails.
This implementation is strict in the second argument, therefore, the following will fail:
>>>
Stream.parse (Parser.splitWith const (Parser.satisfy (> 0)) undefined) $ Stream.fromList [1]
*** Exception: Prelude.undefined ...
Although this implementation allows stream fusion, it has quadratic complexity, making it suitable only for a small number of compositions. As a thumb rule use it for less than 8 compositions, use ParserK otherwise.
Below are some common idioms that can be expressed using splitWith
and
other parser primitives:
>>>
span p f1 f2 = Parser.splitWith (,) (Parser.takeWhile p f1) (Parser.fromFold f2)
>>>
spanBy eq f1 f2 = Parser.splitWith (,) (Parser.groupBy eq f1) (Parser.fromFold f2)
Pre-release
split_ :: Monad m => Parser x m a -> Parser x m b -> Parser x m b Source #
Sequential parser application ignoring the output of the first parser. Apply two parsers sequentially to an input stream. The input is provided to the first parser, when it is done the remaining input is provided to the second parser. The output of the parser is the output of the second parser. The operation fails if any of the parsers fail.
This implementation is strict in the second argument, therefore, the following will fail:
>>>
Stream.parse (Parser.split_ (Parser.satisfy (> 0)) undefined) $ Stream.fromList [1]
*** Exception: Prelude.undefined ...
Although this implementation allows stream fusion, it has quadratic complexity, making it suitable only for a small number of compositions. As a thumb rule use it for less than 8 compositions, use ParserK otherwise.
Pre-release
die :: Monad m => String -> Parser a m b Source #
A parser that always fails with an error message without consuming any input.
dieM :: Monad m => m String -> Parser a m b Source #
A parser that always fails with an effectful error message and without consuming any input.
Pre-release
splitSome :: Monad m => Parser a m b -> Fold m b c -> Parser a m c Source #
See documentation of some
.
Pre-release
splitMany :: Monad m => Parser a m b -> Fold m b c -> Parser a m c Source #
See documentation of many
.
Pre-release
splitManyPost :: Monad m => Parser a m b -> Fold m b c -> Parser a m c Source #
Like splitMany, but inner fold emits an output at the end even if no input is received.
Internal
alt :: Monad m => Parser x m a -> Parser x m a -> Parser x m a Source #
Sequential alternative. The input is first passed to the first parser, and if it succeeds, the result is returned. However, if the first parser fails, the parser driver backtracks and tries the same input on the second parser, returning the result if it succeeds.
Note: This implementation is not lazy in the second argument. The following will fail:
> Stream.parse (Parser.satisfy (> 0) `Parser.alt` undefined) $ Stream.fromList [1..10]
- ** Exception: Prelude.undefined
Although this implementation allows stream fusion, it has quadratic complexity, making it suitable only for a small number of compositions. As a thumb rule use it for less than 8 compositions, use ParserK otherwise.
Time Complexity: O(n^2) where n is the number of compositions.
Pre-release
Input transformation
lmap :: (a -> b) -> Parser b m r -> Parser a m r Source #
lmap f parser
maps the function f
on the input of the parser.
>>>
Stream.parse (Parser.lmap (\x -> x * x) (Parser.fromFold Fold.sum)) (Stream.enumerateFromTo 1 100)
Right 338350
lmap = Parser.lmapM return
lmapM :: Monad m => (a -> m b) -> Parser b m r -> Parser a m r Source #
lmapM f parser
maps the monadic function f
on the input of the parser.
filter :: Monad m => (a -> Bool) -> Parser a m b -> Parser a m b Source #
Include only those elements that pass a predicate.
>>>
Stream.parse (Parser.filter (> 5) (Parser.fromFold Fold.sum)) $ Stream.fromList [1..10]
Right 40