tokenizer-streaming: A variant of tokenizer-monad that supports streaming.

[ gpl, library, text ]

This monad transformer is a modification of tokenizer-monad that can work on streams of text/string chunks or even on (Unicode) bytestring streams.

Dependencies base (>=4.9 && <5.0), bytestring, mtl, streaming, streaming-bytestring (>=0.1.6), streaming-commons (>= && <0.3), text, tokenizer-monad (>= && <0.3) [details]
License GPL-3.0-only
Copyright (c) 2019 Enum Cohrs
Author Enum Cohrs
Category Text
Source repo head: darcs get
Uploaded by implementation at 2019-01-22T21:41:50Z
Downloads 928 total (5 in the last 30 days)
Status Docs available [build log]
Last success reported on 2019-01-22 [all 1 reports]

Readme for tokenizer-streaming-

Motivation: You might have stumbled upon the package tokenizer-monad, another project of mine for writing tokenizers that act on pure text/strings. However, there are situations in which you cannot keep all the text in memory, for example when tokenizing text from network streams or from large corpus files.

Main idea: A monad transformer called TokenizerT implements exactly the same methods as Tokenizer from tokenizer-monad, so all tokenizers can be ported without code changes (provided you used MonadTokenizer in the type signatures).
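
The porting idea can be illustrated with a small self-contained sketch. It does not use the API of tokenizer-monad or tokenizer-streaming: the class MonadTok, the monad Tok and the names Source, Whole, Chunks and tokenize are hypothetical stand-ins, made up only to show how a tokenizer written against a type class can run unchanged over in-memory text and over chunked text. Unlike a real streaming tokenizer, this toy collects all tokens before returning them.

```haskell
import Data.Char (isSpace)

-- Hypothetical stand-in for a MonadTokenizer-style class; not the library's API.
class Monad m => MonadTok m where
  peekChar :: m (Maybe Char)  -- next character, or Nothing at end of input
  takeChar :: m ()            -- move the next character into the pending token
  skipChar :: m ()            -- drop the next character
  emitTok  :: m ()            -- emit the pending token if it is non-empty

-- A tokenizer written only against the class: split on whitespace.
wordsTok :: MonadTok m => m ()
wordsTok = do
  next <- peekChar
  case next of
    Nothing -> emitTok
    Just c
      | isSpace c -> emitTok >> skipChar >> wordsTok
      | otherwise -> takeChar >> wordsTok

-- Input sources: anything that can hand out one character at a time.
class Source i where
  unconsChar :: i -> Maybe (Char, i)

newtype Whole  = Whole String     -- all text in memory
newtype Chunks = Chunks [String]  -- text arriving in pieces

instance Source Whole where
  unconsChar (Whole (c:cs)) = Just (c, Whole cs)
  unconsChar (Whole [])     = Nothing

instance Source Chunks where
  unconsChar (Chunks ((c:cs):rest)) = Just (c, Chunks (cs:rest))
  unconsChar (Chunks ([]:rest))     = unconsChar (Chunks rest)
  unconsChar (Chunks [])            = Nothing

-- State: remaining input, pending token (reversed), emitted tokens (reversed).
data S i = S i String [String]

-- A hand-rolled state monad, so the sketch needs nothing beyond base.
newtype Tok i a = Tok { runTok :: S i -> (a, S i) }

instance Functor (Tok i) where
  fmap f (Tok g) = Tok $ \s -> let (a, s') = g s in (f a, s')

instance Applicative (Tok i) where
  pure a = Tok $ \s -> (a, s)
  Tok f <*> Tok g = Tok $ \s ->
    let (h, s')  = f s
        (a, s'') = g s'
    in (h a, s'')

instance Monad (Tok i) where
  Tok g >>= k = Tok $ \s -> let (a, s') = g s in runTok (k a) s'

instance Source i => MonadTok (Tok i) where
  peekChar = Tok $ \s@(S i _ _) -> (fmap fst (unconsChar i), s)
  takeChar = Tok $ \s@(S i buf out) -> case unconsChar i of
    Just (c, i') -> ((), S i' (c : buf) out)
    Nothing      -> ((), s)
  skipChar = Tok $ \s@(S i buf out) -> case unconsChar i of
    Just (_, i') -> ((), S i' buf out)
    Nothing      -> ((), s)
  emitTok = Tok $ \s@(S i buf out) ->
    if null buf then ((), s) else ((), S i [] (reverse buf : out))

-- Run a tokenizer over any source and collect the tokens in order.
tokenize :: Source i => Tok i () -> i -> [String]
tokenize t i = let (_, S _ _ out) = runTok t (S i [] []) in reverse out

main :: IO ()
main = do
  print (tokenize wordsTok (Whole "stream all the text"))
  print (tokenize wordsTok (Chunks ["stre", "am all t", "he text"]))
```

Both calls print ["stream","all","the","text"], even though the chunked input cuts two words in the middle. The tokenizer wordsTok never mentions the input type, which is the property the class-based type signature is meant to provide.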

Supported text types

  • streams of Char lists can be tokenized into streams of Char lists
  • streams of strict Text can be tokenized into streams of strict Text
  • streams of lazy Text can be tokenized into streams of lazy Text
  • streams of strict ASCII ByteStrings can be tokenized into streams of strict ASCII ByteStrings
  • streams of lazy ASCII ByteStrings can be tokenized into streams of lazy ASCII ByteStrings
  • bytestring streams (from streaming-bytestring) with Unicode encodings (UTF-8, UTF-16 LE & BE, UTF-32 LE & BE) can be tokenized into streams of strict Text
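
The last case is the subtle one, because a multi-byte character can be split across two chunks. The following is a minimal sketch of chunk-aware UTF-8 decoding, assuming a version of the text package that exports streamDecodeUtf8 from Data.Text.Encoding. It is not taken from tokenizer-streaming and may differ from how the package decodes its input; the helper name decodeChunks is made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (Decoding (..), streamDecodeUtf8)

-- Decode UTF-8 chunks one at a time, carrying an incomplete byte sequence
-- over to the next chunk. Bytes still pending after the last chunk are
-- dropped, and invalid input raises an exception.
decodeChunks :: [B.ByteString] -> [T.Text]
decodeChunks = go streamDecodeUtf8
  where
    go _ [] = []
    go step (chunk : rest) =
      case step chunk of
        Some text _leftover next -> text : go next rest

main :: IO ()
main = do
  -- "café ok" in UTF-8, cut between the two bytes of 'é' (0xC3 0xA9).
  let chunks = [B.pack [0x63, 0x61, 0x66, 0xC3], B.pack [0xA9, 0x20, 0x6F, 0x6B]]
  mapM_ print (decodeChunks chunks)
```

Decoding each chunk independently would fail on the dangling 0xC3 byte, so the decoder state has to be threaded from one chunk to the next, as the continuation in Some does here.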