Safe Haskell | Safe |
---|---|
Language | Haskell2010 |
A monad for writing pure tokenizers in an imperative-looking way.
Main idea: You walk
through the input string like a turtle, and everytime
you find a token boundary, you call emit
. If some specific kinds of tokens
should be suppressed, you can discard
them instead (or filter afterwards).
This module supports is specialized for strict ByteString
s. The module
Control.Monad.Tokenizer provides more general types, but does not export a
Tokenizable
instance for ByteString
s, as its implementation depends on the
encoding. This module assumes ASCII encoding (you have been warned!).
Example for a simple tokenizer, that splits words by whitespace and discards stop symbols:
tokenizeWords :: ByteString -> [ByteString] tokenizeWords = runTokenizer $ untilEOT $ do c <- pop if isStopSym c then discard else if c `elem` (" \t\r\n" :: [Char]) then discard else do walkWhile (\c -> (c=='_') || not (isSpace c || isPunctuation' c)) emit
Synopsis
- type Tokenizer = Tokenizer ByteString
- runTokenizer :: Tokenizer () -> ByteString -> [ByteString]
- runTokenizerCS :: Tokenizer () -> ByteString -> [ByteString]
- untilEOT :: Tokenizer () -> Tokenizer ()
- peek :: Tokenizer Char
- isEOT :: Tokenizer Bool
- lookAhead :: [Char] -> Tokenizer Bool
- walk :: Tokenizer ()
- walkBack :: Tokenizer ()
- pop :: Tokenizer Char
- walkWhile :: (Char -> Bool) -> Tokenizer ()
- walkFold :: a -> (Char -> a -> Maybe a) -> Tokenizer ()
- emit :: Tokenizer ()
- discard :: Tokenizer ()
- restore :: Tokenizer ()
Monad
type Tokenizer = Tokenizer ByteString Source #
Tokenizer monad. Use runTokenizer or runTokenizerCS to run it
runTokenizer :: Tokenizer () -> ByteString -> [ByteString] Source #
Split a string into tokens using the given tokenizer
runTokenizerCS :: Tokenizer () -> ByteString -> [ByteString] Source #
Split a string into tokens using the given tokenizer, case sensitive version
untilEOT :: Tokenizer () -> Tokenizer () Source #
Repeat a given tokenizer as long as the end of text is not reached
Tests
lookAhead :: [Char] -> Tokenizer Bool Source #
Check if the next input chars agree with the given string
Movement
walkBack :: Tokenizer () Source #
Walk back to the previous character, unless it was discarded/emitted.
walkFold :: a -> (Char -> a -> Maybe a) -> Tokenizer () Source #
Proceed as long as a given fold returns Just (generalization of walkWhile)
Transactions
Orphan instances
Tokenizable ByteString Source # | Assuming ASCII encoding |
tnull :: ByteString -> Bool Source # thead :: ByteString -> Char Source # ttail :: ByteString -> ByteString Source # ttake :: Int -> ByteString -> ByteString Source # tdrop :: Int -> ByteString -> ByteString Source # tlower :: ByteString -> ByteString Source # |