text-2.1: An efficient packed Unicode text type.
LicenseBSD-style (see LICENSE)
Stabilityexperimental
Safe HaskellTrustworthy
LanguageHaskell2010

Data.Text.Internal.Encoding

Contents

Description

Warning: this is an internal module, and does not have a stable API or name. Functions in this module may not check or enforce preconditions expected by public modules. Use at your own risk!

Internals of Data.Text.Encoding.

Since: 2.0.2

Synopsis

Documentation

validateUtf8Chunk :: ByteString -> (Int, Maybe Utf8State) Source #

Validate a ByteString as UTF-8-encoded text. To be continued using validateUtf8More.

See also validateUtf8More for details on the result of this function.

validateUtf8Chunk = validateUtf8More startUtf8State

Properties

Given:

validateUtf8Chunk chunk = (n, ms)

validateUtf8More :: Utf8State -> ByteString -> (Int, Maybe Utf8State) Source #

Validate another ByteString chunk in an ongoing stream of UTF-8-encoded text.

Returns a pair:

  1. The first component n is the end position, relative to the current chunk, of the longest prefix of the accumulated bytestring which is valid UTF-8. n may be negative: that happens when an incomplete code point started in a previous chunk and is not completed by the current chunk (either that code point is still incomplete, or it is broken by an invalid byte).
  2. The second component ms indicates the following:

    • if ms = Nothing, the remainder of the chunk contains an invalid byte, within four bytes from position n;
    • if ms = Just s', you can carry on validating another chunk by calling validateUtf8More with the new state s'.

Properties

Given:

validateUtf8More s chunk = (n, ms)

decodeUtf8Chunk :: ByteString -> (StrictBuilder, ByteString, Maybe Utf8State) Source #

Decode a chunk of UTF-8 text. To be continued with decodeUtf8More.

See decodeUtf8More for details on the result.

Properties

decodeUtf8Chunk = decodeUtf8More startUtf8State

Given:

decodeUtf8Chunk chunk = (builder, rest, ms)

builder is a prefix and rest is a suffix of chunk.

encodeUtf8 (strictBuilderToText builder) <> rest = chunk

Since: 2.0.2

decodeUtf8More :: Utf8State -> ByteString -> (StrictBuilder, ByteString, Maybe Utf8State) Source #

Decode another chunk in an ongoing UTF-8 stream.

Returns a triple:

  1. A StrictBuilder for the decoded chunk of text. You can accumulate chunks with (<>) or output them with toText.
  2. The undecoded remainder of the given chunk, for diagnosing errors and resuming (presumably after skipping some bytes).
  3. Just the new state, or Nothing if an invalid byte was encountered (it will be within the first 4 bytes of the undecoded remainder).

Properties

Given:

(pre, suf, ms) = decodeUtf8More s chunk
  1. If the output pre is nonempty (alternatively, if length chunk > length suf)

    s2b pre `append` suf = p2b s `append` chunk
    

    where

    s2b = encodeUtf8 . toText
    p2b = partUtf8ToByteString
    
  2. If the output pre is empty (alternatively, if length chunk = length suf)

    suf = chunk
  3. Decoding chunks separately is equivalent to decoding their concatenation.

    Given:

    (pre1, suf1, Just s1) = decodeUtf8More s chunk1
    (pre2, suf2,     ms2) = decodeUtf8More s1 chunk2
    (pre3, suf3,     ms3) = decodeUtf8More s (chunk1 `B.append` chunk2)
    

    we have:

    s2b (pre1 <> pre2) = s2b pre3
    ms2 = ms3
    

decodeUtf8With2 :: OnDecodeError -> String -> Utf8State -> ByteString -> (StrictBuilder, ByteString, Utf8State) Source #

Helper for decodeUtf8With, streamDecodeUtf8With, and lazy decodeUtf8With, which use an OnDecodeError to process bad bytes.

See decodeUtf8Chunk for a more flexible alternative.

Since: 2.0.2

data Utf8State Source #

State of decoding a ByteString in UTF-8. Enables incremental decoding (validateUtf8Chunk, validateUtf8More, decodeUtf8Chunk, decodeUtf8More).

Since: 2.0.2

Instances

Instances details
Show Utf8State Source # 
Instance details

Defined in Data.Text.Internal.Encoding

Eq Utf8State Source # 
Instance details

Defined in Data.Text.Internal.Encoding

startUtf8State :: Utf8State Source #

Initial Utf8State.

Since: 2.0.2

data StrictBuilder Source #

A delayed representation of strict Text.

Since: 2.0.2

Instances

Instances details
Monoid StrictBuilder Source # 
Instance details

Defined in Data.Text.Internal.StrictBuilder

Semigroup StrictBuilder Source #

Concatenation of StrictBuilder is right-biased: the right builder will be run first. This allows a builder to run tail-recursively when it was accumulated left-to-right.

Instance details

Defined in Data.Text.Internal.StrictBuilder

Internal

skipIncomplete :: OnDecodeError -> String -> Utf8State -> StrictBuilder Source #

Call the error handler on each byte of the partial code point stored in Utf8State and append the results.

Exported for use in lazy decodeUtf8With.

Since: 2.0.2

getCompleteLen :: Utf8State -> Int Source #

Exported for testing.

getPartialUtf8 :: Utf8State -> ByteString Source #

Exported for testing.