inchworm-1.1.1.1: Simple parser combinators for lexical analysis.

Safe HaskellSafe
LanguageHaskell98

Text.Lexer.Inchworm

Contents

Description

Parser combinator framework specialized to lexical analysis. Tokens can be specified via simple fold functions, and we include baked in source location handling.

Matchers for standard tokens like comments and strings are in the Text.Lexer.Inchworm.Char module.

No dependencies other than the Haskell base library.

If you want to parse expressions instead of performing lexical analysis then try the parsec or attoparsec packages, which have more general purpose combinators.

Minimal example

The following code demonstrates how to perform lexical analysis of a simple LISP-like language. We use two separate name classes, one for variables that start with a lower-case letter, and one for constructors that start with an upper case letter.

Integers are scanned using the scanInteger function from the Text.Lexer.Inchworm.Char module.

The result of scanStringIO contains the list of leftover input characters that could not be parsed. In a real lexer you should check that this is empty to ensure there has not been a lexical error.

import Text.Lexer.Inchworm.Char
import qualified Data.Char      as Char

-- | A source token.
data Token 
        = KBra | KKet | KVar String | KCon String | KInt Integer
        deriving Show

-- | A thing with attached location information.
data Located a
        = Located FilePath Location a
        deriving Show

-- | Scanner for a lispy language.
scanner :: FilePath
        -> Scanner IO Location [Char] (Located Token)
scanner fileName
 = skip Char.isSpace
 $ alts [ fmap (stamp id)   $ accept '(' KBra
        , fmap (stamp id)   $ accept ')' KKet
        , fmap (stamp KInt) $ scanInteger 
        , fmap (stamp KVar)
          $ munchWord (\ix c -> if ix == 0 then Char.isLower c
                                           else Char.isAlpha c) 
        , fmap (stamp KCon) 
          $ munchWord (\ix c -> if ix == 0 then Char.isUpper c
                                           else Char.isAlpha c)
        ]
 where  -- Stamp a token with source location information.
        stamp k (l, t) 
          = Located fileName l (k t)

main 
 = do   let fileName = "Source.lispy"
        let source   = "(some (Lispy like) 26 Program 93 (for you))"
        toks    <- scanStringIO source (scanner fileName)
        print toks
Synopsis

Basic Types

data Source m loc input Source #

An abstract source of input tokens that we want to perform lexical analysis on.

Each token is associated with a source location loc. A a sequence of tokens has type input, and a single token type (Elem input).

data Scanner m loc input a Source #

Scanner of input tokens that produces a result value of type a when successful.

Instances
Monad m => Monad (Scanner m loc input) Source # 
Instance details

Defined in Text.Lexer.Inchworm.Scanner

Methods

(>>=) :: Scanner m loc input a -> (a -> Scanner m loc input b) -> Scanner m loc input b #

(>>) :: Scanner m loc input a -> Scanner m loc input b -> Scanner m loc input b #

return :: a -> Scanner m loc input a #

fail :: String -> Scanner m loc input a #

Monad m => Functor (Scanner m loc input) Source # 
Instance details

Defined in Text.Lexer.Inchworm.Scanner

Methods

fmap :: (a -> b) -> Scanner m loc input a -> Scanner m loc input b #

(<$) :: a -> Scanner m loc input b -> Scanner m loc input a #

Monad m => Applicative (Scanner m loc input) Source # 
Instance details

Defined in Text.Lexer.Inchworm.Scanner

Methods

pure :: a -> Scanner m loc input a #

(<*>) :: Scanner m loc input (a -> b) -> Scanner m loc input a -> Scanner m loc input b #

liftA2 :: (a -> b -> c) -> Scanner m loc input a -> Scanner m loc input b -> Scanner m loc input c #

(*>) :: Scanner m loc input a -> Scanner m loc input b -> Scanner m loc input b #

(<*) :: Scanner m loc input a -> Scanner m loc input b -> Scanner m loc input a #

Generic Scanning

scanListIO Source #

Arguments

:: Eq i 
=> loc

Starting source location.

-> (i -> loc -> loc)

Function to bump the current location by one input token.

-> [i]

List of input tokens.

-> Scanner IO loc [i] a

Scanner to apply.

-> IO ([a], loc, [i]) 

Scan a list of generic input tokens in the IO monad, returning the source location of the final input token, along with the remaining input.

NOTE: If you just want to scan a String of characters use scanStringIO from Text.Lexer.Inchworm.Char

Source Construction

makeListSourceIO Source #

Arguments

:: Eq i 
=> loc

Starting source location.

-> (i -> loc -> loc)

Function to bump the current location by one input token.

-> [i]

List of input tokens.

-> IO (Source IO loc [i]) 

Make a source from a list of input tokens, maintaining the state in the IO monad.

Scanner Evaluation

scanSourceToList :: Monad m => Source m loc [i] -> Scanner m loc [i] a -> m ([a], loc, [i]) Source #

Apply a scanner to a source of input tokens, where the tokens are represented as a lazy list.

The result values are also produced in a lazy list.

Combinators

Basic

satisfies :: Monad m => (Elem input -> Bool) -> Scanner m loc input (Range loc, Elem input) Source #

Accept the next token if it matches the given predicate, returning that token as the result.

skip :: Monad m => (Elem input -> Bool) -> Scanner m loc input a -> Scanner m loc input a Source #

Skip tokens that match the given predicate, before applying the given argument scanner.

When lexing most source languages you can use this to skip whitespace.

Accept

accept :: (Monad m, Eq (Elem input)) => Elem input -> a -> Scanner m loc input (Range loc, a) Source #

Accept the next input token if it is equal to the given one, and return a result of type a.

accepts :: (Monad m, Sequence input, Eq input) => input -> a -> Scanner m loc input (Range loc, a) Source #

Accept a fixed length sequence of tokens that match the given sequence, and return a result of type a.

From

from :: Monad m => (Elem input -> Maybe a) -> Scanner m loc input (Range loc, a) Source #

Use the given function to check whether to accept the next token, returning the result it produces.

froms :: Monad m => Maybe Int -> (input -> Maybe a) -> Scanner m loc input (Range loc, a) Source #

Use the given function to check whether to accept a fixed length sequence of tokens, returning the result it produces.

Alternation

alt :: Monad m => Scanner m loc input a -> Scanner m loc input a -> Scanner m loc input a Source #

Combine two argument scanners into a result scanner, where the first argument scanner is tried before the second.

alts :: Monad m => [Scanner m loc input a] -> Scanner m loc input a Source #

Combine a list of argumenet scanners a result scanner, where each argument scanner is tried in turn until we find one that matches (or not).

Munching

munchPred Source #

Arguments

:: Monad m 
=> Maybe Int

Maximum number of tokens to consider, or Nothing for no maximum.

-> (Int -> Elem input -> Bool)

Predicate to decide whether to consider the next input token, also passed the index of the token in the prefix.

-> (input -> Maybe a)

Take the prefix of input tokens and decide whether to produce a result value.

-> Scanner m loc input (Range loc, a)

Scan a prefix of tokens of type is, and produce a result of type a if it matches.

Munch input tokens, using a predicate to select the prefix to consider.

Given munch (Just n) match accept, we select a contiguous sequence of tokens up to length n using the predicate match, then pass that sequence to accept to make a result value out of it. If match selects no tokens, or accept returns Nothing then the scanner fails and no tokens are consumed from the source.

For example, to scan natural numbers use:

scanNat :: Monad m => Scanner m loc [Char] (Range loc, Integer)
scanNat = munchPred Nothing match accept
        where match _ c = isDigit c
              accept cs = Just (read cs)

To match Haskell style constructor names use:

scanCon :: Monad m => Scanner m loc [Char] (Range loc, String)
scanCon = munchPred Nothing match accept
        where  match 0 c = isUpper    c
               match _ c = isAlphaNum c
               accept cs = Just cs

If you want to detect built-in constructor names like Int and Float then you can do it in the accept function and produce a different result constructor for each one.

munchWord Source #

Arguments

:: Monad m 
=> (Int -> Elem input -> Bool)

Predicate to decide whether to accept the next input token, also passed the index of the token in the prefix.

-> Scanner m loc input (Range loc, input) 

Like munchPred, but we accept prefixes of any length, and always accept the input tokens that match.

munchFold Source #

Arguments

:: Monad m 
=> Maybe Int

Maximum number of tokens to consider, or Nothing for no maximum.

-> (Int -> Elem input -> state -> Maybe state)

Fold function to decide whether to consider the next input token. The next token will be considered if the function produces a Just with its new state. We stop considering tokens the first time it returns Nothing.

-> state

Initial state for the fold.

-> (input -> Maybe a)

Take the prefix of input tokens and decide whether to produce a result value.

-> Scanner m loc input (Range loc, a)

Scan a prefix of tokens of type is, and produce a result of type a if it matches.

Like munchPred, but we can use a fold function to select the prefix of tokens to consider. This is useful when lexing comments, and string literals where consecutive tokens can have special meaning (ie escaped quote characters).

See the source of scanHaskellChar in the Text.Lexer.Inchworm.Char, module for an example of its usage.