Copyright | Dmitry Zuikov 2020 |
---|---|
License | MIT |
Maintainer | dzuikov@gmail.com |
Stability | experimental |
Portability | unknown |
Safe Haskell | None |
Language | Haskell2010 |
A lightweight and multi-functional text tokenizer that supports different types of text tokenization depending on its settings.
It may be used in different situations: for DSLs, text markups, or even for parsing simple grammars more easily, and sometimes faster, than with mainstream parser combinators or parser generators.
The primary goal of this package is to parse unstructured text data; however, it also handles data formats such as CSV with ease.
Currently it supports the following types of entities: atoms, string literals (with a minimal set of escaped characters), punctuation characters and delimiters.
Examples
Simple CSV-like tokenization
>>>
tokenize (delims ":") "aaa : bebeb : qqq ::::" :: [Text]
["aaa "," bebeb "," qqq "]
>>>
tokenize (delims ":"<>sq<>emptyFields ) "aaa : bebeb : qqq ::::" :: [Text]
["aaa "," bebeb "," qqq ","","","",""]
>>>
tokenize (delims ":"<>sq<>emptyFields ) "aaa : bebeb : qqq ::::" :: [Maybe Text]
[Just "aaa ",Just " bebeb ",Just " qqq ",Nothing,Nothing,Nothing,Nothing]
>>>
tokenize (delims ":"<>sq<>emptyFields ) "aaa : 'bebeb:colon inside' : qqq ::::" :: [Maybe Text]
[Just "aaa ",Just " ",Just "bebeb:colon inside",Just " ",Just " qqq ",Nothing,Nothing,Nothing,Nothing]
>>>
let spec = sl<>delims ":"<>sq<>emptyFields<>noslits
>>>
tokenize spec " aaa : 'bebeb:colon inside' : qqq ::::" :: [Maybe Text]
[Just "aaa ",Just "bebeb:colon inside ",Just "qqq ",Nothing,Nothing,Nothing,Nothing]
>>>
let spec = delims ":"<>sq<>emptyFields<>uw<>noslits
>>>
tokenize spec " a b c : 'bebeb:colon inside' : qqq ::::" :: [Maybe Text]
[Just "a b c",Just "bebeb:colon inside",Just "qqq",Nothing,Nothing,Nothing,Nothing]
Notes
About the delimiter tokens
This type of token appears when processing "delimited" formats and disappears from the results. Currently you will never see it unless normalization is turned off by the nn option.
Delimiters make sense when processing CSV-like formats, but in that case you probably need only the values in the results.
This behavior may be changed later, but right now delimiter tokens seem pointless in results. If you process some sort of grammar where the delimiter character is important, you may use punctuation instead, i.e.:
>>>
let spec = delims " \t"<>punct ",;()" <>emptyFields<>sq
>>>
tokenize spec "( delimeters , are , important, 'spaces are not');" :: [Text]
["(","delimeters",",","are",",","important",",","spaces are not",")",";"]
Other
For CSV-like formats it makes sense to split the text into lines first; otherwise newline characters may lead to weird results.
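A minimal sketch of that approach for a line-oriented CSV-like input (the csvLines helper name is ours, not part of the library):

> import Data.Text (Text)
> import qualified Data.Text as T
> import Data.Text.Fuzzy.Tokenize
>
> -- Tokenize each line separately so that newlines never reach the tokenizer.
> csvLines :: Text -> [[Maybe Text]]
> csvLines = map (tokenize (delims ":" <> emptyFields)) . T.lines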
Synopsis
- data TokenizeSpec
- class IsToken a where
- tokenize :: IsToken a => TokenizeSpec -> Text -> [a]
- esc :: TokenizeSpec
- addEmptyFields :: TokenizeSpec
- emptyFields :: TokenizeSpec
- nn :: TokenizeSpec
- sq :: TokenizeSpec
- sqq :: TokenizeSpec
- noslits :: TokenizeSpec
- sl :: TokenizeSpec
- sr :: TokenizeSpec
- uw :: TokenizeSpec
- delims :: String -> TokenizeSpec
- comment :: Text -> TokenizeSpec
- punct :: Text -> TokenizeSpec
- indent :: TokenizeSpec
- itabstops :: Int -> TokenizeSpec
- keywords :: [Text] -> TokenizeSpec
- eol :: TokenizeSpec
Documentation
data TokenizeSpec Source #
Tokenization settings. Use mempty for an empty value and the construction functions below to change the settings.
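For example (a sketch; any of the construction functions from the synopsis combine the same way), a spec is usually built by appending settings with the Monoid instance:

> import Data.Text.Fuzzy.Tokenize
>
> -- Start from mempty and switch on only the features you need.
> spec :: TokenizeSpec
> spec = mempty <> delims ":" <> sq <> emptyFields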
Instances
class IsToken a where Source #
Typeclass for token values.
Note that some tokens appear in the results only when the nn option is set: normally, sequences of characters are turned into text tokens or string literals, and delimiter tokens are simply removed from the results.
mkChar :: Char -> a Source #
Create a character token
mkSChar :: Char -> a Source #
Create a string literal character token
mkPunct :: Char -> a Source #
Create a punctuation token
mkText :: Text -> a Source #
Create a text chunk token
mkStrLit :: Text -> a Source #
Create a string literal token
mkKeyword :: Text -> a Source #
Create a keyword token
mkEmpty :: a Source #
Create an empty field token
mkDelim :: a Source #
Create a delimiter token
Create an indent token
Create an EOL token
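A partial sketch of a custom token type (the MyTok type and its constructors are ours, not part of the library; it implements only the methods documented above, so the indent and EOL constructors would also need implementations if those features are enabled):

> import Data.Text (Text)
> import Data.Text.Fuzzy.Tokenize
>
> -- A custom token type covering the constructors documented above.
> data MyTok = TChar Char | TSChar Char | TPunct Char
>            | TText Text | TStrLit Text | TKeyword Text
>            | TEmpty
>   deriving (Eq, Show)
>
> instance IsToken MyTok where
>   mkChar    = TChar
>   mkSChar   = TSChar
>   mkPunct   = TPunct
>   mkText    = TText
>   mkStrLit  = TStrLit
>   mkKeyword = TKeyword
>   mkEmpty   = TEmpty
>   mkDelim   = TEmpty   -- delimiters folded into the empty token here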
Instances
IsToken Text Source #
Defined in Data.Text.Fuzzy.Tokenize
IsToken (Maybe Text) Source #
Defined in Data.Text.Fuzzy.Tokenize
Methods: mkChar :: Char -> Maybe Text, mkSChar :: Char -> Maybe Text, mkPunct :: Char -> Maybe Text, mkText :: Text -> Maybe Text, mkStrLit :: Text -> Maybe Text, mkKeyword :: Text -> Maybe Text, mkEmpty :: Maybe Text, mkDelim :: Maybe Text
esc :: TokenizeSpec Source #
Turn on character escaping inside string literals. Currently the following escaped characters are supported: \", \', \t, \n, \r, \a, \b, \f, \v.
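A sketch of the expected behaviour (an assumption, not taken from the package's own examples: the two-character sequence \n inside a single-quoted literal should decode to a real newline):
>>>
tokenize (sq<>esc) "'line1\\nline2'" :: [Text]
["line1\nline2"]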
addEmptyFields :: TokenizeSpec Source #
Produce empty field tokens (see the mkEmpty method) when no tokens are found before a delimiter. Useful for processing CSV-like data in order to distinguish empty columns.
emptyFields :: TokenizeSpec Source #
Same as addEmptyFields.
nn :: TokenizeSpec Source #
Turns off token normalization, making the tokenizer emit a raw character stream. Useful for debugging.
sq :: TokenizeSpec Source #
Turns on single-quoted string literals. The character stream after a '\'' character will be processed as a single-quoted string, treating all delimiter, comment and other special characters as part of the string literal until the next unescaped single-quote character.
sqq :: TokenizeSpec Source #
Enable double-quoted string literal support, analogous to sq for single-quoted strings.
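By analogy with the sq example above, double-quoted literals should behave the same way (a sketch, not taken from the package's own examples; the expected output mirrors the single-quote case):
>>>
tokenize (delims ":"<>sqq<>emptyFields ) "aaa : \"bebeb:colon inside\" : qqq ::::" :: [Maybe Text]
[Just "aaa ",Just " ",Just "bebeb:colon inside",Just " ",Just " qqq ",Nothing,Nothing,Nothing,Nothing]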
noslits :: TokenizeSpec Source #
Disable separate string literal tokens.
Useful when processing delimited data (CSV-like formats). Normally, sequential text chunks are concatenated together, but a consecutive text chunk and string literal produce two different tokens, which may cause weird results if the data is in a CSV-like format, i.e.:
>>>
tokenize (delims ":"<>emptyFields<>sq ) "aaa:bebe:'qq' aaa:next::" :: [Maybe Text]
[Just "aaa",Just "bebe",Just "qq",Just " aaa",Just "next",Nothing,Nothing]
Note how "qq" and " aaa" are turned into two separate tokens, which makes the result of CSV processing look wrong, as if there were an extra column. This behavior can be avoided with this option if you don't need to distinguish text chunks from string literals:
>>>
tokenize (delims ":"<>emptyFields<>sq<>noslits) "aaa:bebe:'qq:foo' aaa:next::" :: [Maybe Text]
[Just "aaa",Just "bebe",Just "qq:foo aaa",Just "next",Nothing,Nothing]
sl :: TokenizeSpec Source #
Strip spaces on the left side of a token.
Does not affect string literals, i.e. strings are processed normally. Useful mostly for processing CSV-like formats; otherwise delims may be used to skip unwanted spaces.
sr :: TokenizeSpec Source #
Strip spaces on the right side of a token.
Does not affect string literals, i.e. strings are processed normally. Useful mostly for processing CSV-like formats; otherwise delims may be used to skip unwanted spaces.
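A sketch combining sl and sr (not taken from the package's own examples; by analogy with the sl example above, the expected difference is that trailing spaces are stripped as well):
>>>
let spec = sl<>sr<>delims ":"<>sq<>emptyFields<>noslits
>>>
tokenize spec " aaa : 'bebeb:colon inside' : qqq ::::" :: [Maybe Text]
[Just "aaa",Just "bebeb:colon inside",Just "qqq",Nothing,Nothing,Nothing,Nothing]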
uw :: TokenizeSpec Source #
Strips spaces on the left and right sides and collapses multiple spaces into one. The name comes from unwords . words.
Does not affect string literals, i.e. strings are processed normally. Useful mostly for processing CSV-like formats; otherwise delims may be used to skip unwanted spaces.
delims :: String -> TokenizeSpec Source #
Specify the list of delimiter characters used to split the character stream into fields. Useful for CSV-like separated formats. Support for empty fields in the token stream may be enabled with the addEmptyFields function.
comment :: Text -> TokenizeSpec Source #
Specify the line comment prefix. All text after the line comment prefix is ignored up to the next newline character. Multiple line comments are supported.
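A sketch of line comments in use (not taken from the package's own examples; it assumes everything after the prefix is simply dropped):
>>>
tokenize (delims " "<>comment "--") "foo bar -- ignored till the end of line" :: [Text]
["foo","bar"]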
punct :: Text -> TokenizeSpec Source #
Specify the punctuation characters. Each punctuation character is emitted as a separate token, and any token is broken at a punctuation character.
Useful for handling ... er ... punctuation, like
> function(a,b)
or
> (apply function 1 2 3)
>>>
let spec = delims " \t"<>punct ",;()" <>emptyFields<>sq
>>>
tokenize spec "(apply function 1 2 3)" :: [Text]
["(","apply","function","1","2","3",")"]
indent :: TokenizeSpec Source #
Enable indentation support.
itabstops :: Int -> TokenizeSpec Source #
Set the tab expansion multiplier, i.e. each tab is expanded into n spaces before processing. It also turns on indentation support. Only tabs at the beginning of a line are expanded, i.e. before the first non-space character appears.
keywords :: [Text] -> TokenizeSpec Source #
Specify the keyword list. Each keyword will be treated as a separate token.
eol :: TokenizeSpec Source #
Turns on EOL token generation