Safe Haskell | Safe |
---|---|
Language | Haskell2010 |
A monad for writing pure tokenizers in an imperative-looking way.
Main idea: You walk
through the input string like a turtle, and everytime
you find a token boundary, you call emit
. If some specific kinds of tokens
should be suppressed, you can discard
them instead (or filter afterwards).
This module supports is specialized for strict text. The module Control.Monad.Tokenizer provides more general types.
Example for a simple tokenizer, that splits words by whitespace and discards stop symbols:
tokenizeWords :: T.Text -> [T.Text] tokenizeWords = runTokenizer $ untilEOT $ do c <- pop if isStopSym c then discard else if c `elem` (" \t\r\n" :: [Char]) then discard else do walkWhile (\c -> (c=='_') || not (isSpace c || isPunctuation' c)) emit
Synopsis
- type Tokenizer = Tokenizer Text
- runTokenizer :: Tokenizer () -> Text -> [Text]
- runTokenizerCS :: Tokenizer () -> Text -> [Text]
- untilEOT :: Tokenizer () -> Tokenizer ()
- peek :: Tokenizer Char
- isEOT :: Tokenizer Bool
- lookAhead :: [Char] -> Tokenizer Bool
- walk :: Tokenizer ()
- walkBack :: Tokenizer ()
- pop :: Tokenizer Char
- walkWhile :: (Char -> Bool) -> Tokenizer ()
- walkFold :: a -> (Char -> a -> Maybe a) -> Tokenizer ()
- emit :: Tokenizer ()
- discard :: Tokenizer ()
- restore :: Tokenizer ()
Monad
type Tokenizer = Tokenizer Text Source #
Tokenizer monad. Use runTokenizer or runTokenizerCS to run it
runTokenizer :: Tokenizer () -> Text -> [Text] Source #
Split a string into tokens using the given tokenizer
runTokenizerCS :: Tokenizer () -> Text -> [Text] Source #
Split a string into tokens using the given tokenizer, case sensitive version
untilEOT :: Tokenizer () -> Tokenizer () Source #
Repeat a given tokenizer as long as the end of text is not reached
Tests
lookAhead :: [Char] -> Tokenizer Bool Source #
Check if the next input chars agree with the given string
Movement
walkBack :: Tokenizer () Source #
Walk back to the previous character, unless it was discarded/emitted.
walkFold :: a -> (Char -> a -> Maybe a) -> Tokenizer () Source #
Proceed as long as a given fold returns Just (generalization of walkWhile)