| Safe Haskell | Safe-Inferred |
|---|---|
| Language | Haskell2010 |
NLP Tokenizer, adapted from the tokenize package to use Text instead of String.
Synopsis
- newtype EitherList a b = E { unE :: [Either a b] }
- type Tokenizer = Text -> EitherList Text Text
- tokenize :: Text -> [Text]
- run :: Tokenizer -> Text -> [Text]
- defaultTokenizer :: Tokenizer
- whitespace :: Tokenizer
- uris :: Tokenizer
- punctuation :: Tokenizer
- finalPunctuation :: Tokenizer
- initialPunctuation :: Tokenizer
- allPunctuation :: Tokenizer
- contractions :: Tokenizer
- negatives :: Tokenizer
Documentation
newtype EitherList a b Source #
The EitherList is a newtype-wrapped list of Eithers.
Instances
Applicative (EitherList a) Source #
  Defined in NLP.Tokenize.Text
  Methods
    pure :: a0 -> EitherList a a0 #
    (<*>) :: EitherList a (a0 -> b) -> EitherList a a0 -> EitherList a b #
    liftA2 :: (a0 -> b -> c) -> EitherList a a0 -> EitherList a b -> EitherList a c #
    (*>) :: EitherList a a0 -> EitherList a b -> EitherList a b #
    (<*) :: EitherList a a0 -> EitherList a b -> EitherList a a0 #
Functor (EitherList a) Source #
  Defined in NLP.Tokenize.Text
  Methods
    fmap :: (a0 -> b) -> EitherList a a0 -> EitherList a b #
    (<$) :: a0 -> EitherList a b -> EitherList a a0 #
Monad (EitherList a) Source #
  Defined in NLP.Tokenize.Text
  Methods
    (>>=) :: EitherList a a0 -> (a0 -> EitherList a b) -> EitherList a b #
    (>>) :: EitherList a a0 -> EitherList a b -> EitherList a b #
    return :: a0 -> EitherList a a0 #
type Tokenizer = Text -> EitherList Text Text Source #
A Tokenizer is a function which takes a Text and returns a list of Eithers (wrapped in a newtype). Right Texts will be passed on for processing to tokenizers further down the pipeline. Left Texts will be passed through the pipeline unchanged. Use Left Texts in a tokenizer to protect certain tokens from further processing (e.g. see the uris tokenizer).
You can define your own custom tokenizer pipelines by chaining tokenizers together:
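Since Tokenizer is a Kleisli arrow of the EitherList monad, tokenizers compose with (>=>) from Control.Monad. The following sketch assumes the exports listed in the synopsis above; the pipeline name and input are illustrative only:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad ((>=>))
import Data.Text (Text)
import NLP.Tokenize.Text (Tokenizer, run, whitespace, uris, allPunctuation)

-- A hypothetical custom pipeline. Each stage only sees Right tokens;
-- Left (frozen) tokens, e.g. URIs, pass through later stages untouched.
myTokenizer :: Tokenizer
myTokenizer = whitespace >=> uris >=> allPunctuation

-- run unwraps the EitherList, returning all tokens in order.
tokens :: [Text]
tokens = run myTokenizer "See http://example.org, okay?"
```

Placing uris before allPunctuation matters here: freezing URIs first protects their internal punctuation from being split.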
whitespace :: Tokenizer Source #
Split a string on whitespace. This is just a wrapper for Data.Text.words.
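Because it simply wraps words, runs of whitespace collapse and punctuation stays attached to its word. A small illustration (expected behaviour, assuming OverloadedStrings):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import NLP.Tokenize.Text (run, whitespace)

-- Splits only on whitespace; "there," and "world!" keep their punctuation.
--   run whitespace "Hello there,  world!"
--     == ["Hello", "there,", "world!"]
```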
punctuation :: Tokenizer Source #
Split off initial and final punctuation
finalPunctuation :: Tokenizer Source #
Split off word-final punctuation
initialPunctuation :: Tokenizer Source #
Split off word-initial punctuation
allPunctuation :: Tokenizer Source #
Split tokens on transitions between punctuation and non-punctuation characters. This tokenizer is not included in the defaultTokenizer pipeline because dealing with word-internal punctuation is quite application-specific.
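The difference from punctuation shows up on word-internal punctuation. The expected results below follow from the descriptions above and are illustrative, not taken from the library's test suite:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad ((>=>))
import NLP.Tokenize.Text (run, whitespace, punctuation, allPunctuation)

-- punctuation only strips initial/final punctuation, so the hyphen survives:
--   run (whitespace >=> punctuation)    "well-known."
--     == ["well-known", "."]
-- allPunctuation also splits at internal transitions:
--   run (whitespace >=> allPunctuation) "well-known."
--     == ["well", "-", "known", "."]
```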
contractions :: Tokenizer Source #
Split common contractions off and freeze them. Currently handles: 'm, 's, 'd, 've, 'll.
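"Freezing" means the split-off suffixes become Left tokens, so later pipeline stages (e.g. punctuation) cannot strip their apostrophes. A sketch of the expected behaviour, based on the handled suffix list above:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad ((>=>))
import NLP.Tokenize.Text (run, whitespace, contractions)

-- "she's" splits into "she" and a frozen "'s" token:
--   run (whitespace >=> contractions) "she's here"
--     == ["she", "'s", "here"]
```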