---
layout: post
title: "Text Internals"
date: 2014-02-09
comments: true
external-url:
author: Eric Seidel
published: false
categories: benchmarks, text
demo: TextInternal.hs
---
So far we have mostly discussed LiquidHaskell in the context of
recursive data structures like lists, but there comes a time in
many programs when you have to put down the list and pick up an
array for the sake of performance.
In this series we're going to examine the `text` library, which
does exactly this in addition to having extensive Unicode support.
`text` is a popular library for efficient text processing.
It provides the high-level API haskellers have come to expect while
using stream fusion and byte arrays under the hood to guarantee high
performance.
The thing that makes `text` stand out as an interesting target for
LiquidHaskell, however, is its use of Unicode.
Specifically, `text` uses UTF-16 as its internal encoding, where
each character is represented with either two or four bytes.
We'll see later on how this encoding presents a challenge for
verifying memory-safety, but first let us look at how a `Text`
is represented.
\begin{code}
{-# LANGUAGE BangPatterns, CPP, MagicHash, Rank2Types,
RecordWildCards, UnboxedTuples, ExistentialQuantification #-}
{-@ LIQUID "--no-termination" @-}
module TextInternal where
import Control.Monad.ST.Unsafe (unsafeIOToST)
import Data.Bits (shiftR, xor, (.&.))
import Foreign.C.Types (CSize)
import GHC.Base (Int(..), ByteArray#, MutableByteArray#, newByteArray#,
writeWord16Array#, indexWord16Array#, unsafeCoerce#, ord,
iShiftL#)
import GHC.ST (ST(..), runST)
import GHC.Word (Word16(..))
import Language.Haskell.Liquid.Prelude
{-@ data Array = Array { aBA :: ByteArray#
, aLen :: Nat
}
@-}
{-@ aLen :: a:Array -> {v:Nat | v = (alen a)} @-}
{-@ data MArray s = MArray { maBA :: MutableByteArray# s
, maLen :: Nat } @-}
{-@ maLen :: a:MArray s -> {v:Nat | v = (malen a)} @-}
new :: forall s. Int -> ST s (MArray s)
unsafeWrite :: MArray s -> Int -> Word16 -> ST s ()
unsafeFreeze :: MArray s -> ST s Array
unsafeIndex :: Array -> Int -> Word16
copyM :: MArray s -- ^ Destination
-> Int -- ^ Destination offset
-> MArray s -- ^ Source
-> Int -- ^ Source offset
-> Int -- ^ Count
-> ST s ()
--------------------------------------------------------------------------------
--- Helper Code
--------------------------------------------------------------------------------
{-@ predicate RoomN N MA I = (I+N <= (malen MA)) @-}
{-@ writeChar :: ma:MArray s -> i:{Nat | i < (malen ma) - 1} -> Char
-> ST s {v:Nat | (RoomN v ma i)}
@-}
writeChar :: MArray s -> Int -> Char -> ST s Int
writeChar marr i c
| n < 0x10000 = do
unsafeWrite marr i (fromIntegral n)
return 1
| otherwise = do
unsafeWrite marr i lo
unsafeWrite marr (i+1) hi
return 2
where n = ord c
m = n - 0x10000
lo = fromIntegral $ (m `shiftR` 10) + 0xD800
hi = fromIntegral $ (m .&. 0x3FF) + 0xDC00
{-@ measure isUnknown :: Size -> Prop
isUnknown (Exact n) = false
isUnknown (Max n) = false
isUnknown (Unknown) = true
@-}
{-@ measure getSize :: Size -> Int
getSize (Exact n) = n
getSize (Max n) = n
@-}
{-@ invariant {v:Size | (getSize v) >= 0} @-}
data Size = Exact {-# UNPACK #-} !Int -- ^ Exact size.
| Max {-# UNPACK #-} !Int -- ^ Upper bound on size.
| Unknown -- ^ Unknown size.
deriving (Eq, Show)
{-@ upperBound :: k:Nat -> s:Size -> {v:Nat | v = ((isUnknown s) ? k : (getSize s))} @-}
upperBound :: Int -> Size -> Int
upperBound _ (Exact n) = n
upperBound _ (Max n) = n
upperBound k _ = k
data Step s a = Done
| Skip !s
| Yield !a !s
data Stream a =
forall s. Stream
(s -> Step s a) -- stepper function
!s -- current state
!Size -- size hint
{-@ shiftL :: i:Nat -> n:Nat -> {v:Nat | ((n = 1) => (v = (i * 2)))} @-}
shiftL :: Int -> Int -> Int
shiftL = undefined -- (I# x#) (I# i#) = I# (x# `iShiftL#` i#)
{-@ memcpyM :: MutableByteArray# s -> CSize -> MutableByteArray# s -> CSize -> CSize -> IO () @-}
memcpyM :: MutableByteArray# s -> CSize -> MutableByteArray# s -> CSize -> CSize -> IO ()
memcpyM = undefined
\end{code}
`text` splits the reading and writing array operations between two
types of arrays, immutable `Array`s and mutable `MArray`s. This leads to
the following general lifecycle:
![The lifecycle of a `Text`](/images/text-lifecycle.png)
\begin{code}
data Array = Array {
aBA :: ByteArray#
, aLen :: !Int
}
{-@ measure alen :: Array -> Int
alen (Array a n) = n
@-}
data MArray s = MArray {
maBA :: MutableByteArray# s
, maLen :: !Int
}
{-@ measure malen :: MArray s -> Int
malen (MArray a n) = n
@-}
\end{code}
Both types carry around with them the number of `Word16`s they can
hold (this is actually only true when you compile with asserts turned
on, but we use this to ease the verification process).
The main three array operations we care about are:
1. **writing** into an `MArray`,
2. **reading** from an `Array`, and
3. **freezing** an `MArray` into an `Array`.
But first, let's see how one creates an `MArray`.
\begin{code}
{-@ type MArrayN s N = {v:MArray s | (malen v) = N} @-}
{-@ new :: forall s. n:Nat -> ST s (MArrayN s n) @-}
new n
| n < 0 || n .&. highBit /= 0 = error $ "new: size overflow"
| otherwise = ST $ \s1# ->
case newByteArray# len# s1# of
(# s2#, marr# #) -> (# s2#, MArray marr# n #)
where !(I# len#) = bytesInArray n
highBit = maxBound `xor` (maxBound `shiftR` 1)
bytesInArray n = n `shiftL` 1
\end{code}
`new n` is an `ST` action that produces an `MArray s` with `n` slots,
denoted by the type alias `MArrayN s n`. Note that we are not talking
about bytes here, `text` deals with `Word16`s internally and as such
we actualy allocate `2*n` bytes. While this may seem like a lot of
code to just create an array, the verification process here is quite
simple. LiquidHaskell simply recognizes that the `n` used to construct
the returned array (`MArray marr# n`) is the same `n` passed to
`new`. It should be noted that we're abstracting away some detail here
with respect to the underlying `MutableByteArray#`, specifically we're
making the assumption that any *unsafe* operation will be caught and
dealt with before the `MutableByteArray#` is touched.
Once we have an `MArray` in hand, we'll want to be able to write our
data into it. A `Nat` is a valid index into an `MArray` `ma` if it is
less than the number of slots, for which we have another type alias
`MAValidI ma`. `text` includes run-time assertions that check this
property, but LiquidHaskell can statically prove the assertions will
always pass.
\begin{code}
{-@ type MAValidI MA = {v:Nat | v < (malen MA)} @-}
{-@ unsafeWrite :: ma:MArray s -> MAValidI ma -> Word16 -> ST s () @-}
unsafeWrite MArray{..} i@(I# i#) (W16# e#)
| i < 0 || i >= maLen = liquidError "out of bounds"
| otherwise = ST $ \s1# ->
case writeWord16Array# maBA i# e# s1# of
s2# -> (# s2#, () #)
\end{code}
So now we can write individual `Word16`s into an array, but maybe we
have a whole bunch of text we want to dump into the array. Remember,
`text` is supposed to be fast!
C has `memcpy` for cases like this but it's notoriously unsafe; with
the right type however, we can regain safety. `text` provides a wrapper around
`memcpy` to copy `n` elements from one `MArray` to another.
\begin{code}
{-@ type MAValidO MA = {v:Nat | v <= (malen MA)} @-}
{-@ copyM :: dest:MArray s
-> didx:MAValidO dest
-> src:MArray s
-> sidx:MAValidO src
-> {v:Nat | (((didx + v) <= (malen dest))
&& ((sidx + v) <= (malen src)))}
-> ST s ()
@-}
copyM dest didx src sidx count
| count <= 0 = return ()
| otherwise =
liquidAssert (sidx + count <= maLen src) .
liquidAssert (didx + count <= maLen dest) .
unsafeIOToST $ memcpyM (maBA dest) (fromIntegral didx)
(maBA src) (fromIntegral sidx)
(fromIntegral count)
\end{code}
`copyM` requires two `MArray`s and valid offsets into each -- note
that a valid offset is **not** necessarily a valid *index*, it may
be one element out-of-bounds -- and a `count` of elements to copy.
The `count` must represent a valid region in each `MArray`, in
other words `offset + count <= length` must hold for each array.
`memcpyM` is an FFI function writen in C, which we don't currently
support, so we simply leave it `undefined`.
Before we can package up our `MArray` into a `Text`, we need to
*freeze* it, preventing any further mutation. The key property here is
of course that the frozen `Array` should have the same length as the
`MArray`.
\begin{code}
{-@ type ArrayN N = {v:Array | (alen v) = N} @-}
{-@ unsafeFreeze :: ma:MArray s -> ST s (ArrayN (malen ma)) @-}
unsafeFreeze MArray{..} = ST $ \s# ->
(# s#, Array (unsafeCoerce# maBA) maLen #)
\end{code}
Again, LiquidHaskell is happy to prove our specification as we simply
copy the length parameter `maLen` over into the `Array`.
Finally, we will eventually want to read a value out of the
`Array`. As with `unsafeWrite` we require a valid index into the
`Array`, which we denote using the `AValidI` alias.
\begin{code}
{-@ type AValidI A = {v:Nat | v < (alen A)} @-}
{-@ unsafeIndex :: a:Array -> AValidI a -> Word16 @-}
unsafeIndex Array{..} i@(I# i#)
| i < 0 || i >= aLen = liquidError "out of bounds"
| otherwise = case indexWord16Array# aBA i# of
r# -> (W16# r#)
\end{code}
As before, LiquidHaskell can easily prove that the run-time assertions
will never fail.
Now we can finally define the core datatype of the `text` package!
A `Text` value consists of an *array*, an *offset*, and a *length*.
The offset and length are `Nat`s satisfying two properties:
1. `off <= alen arr`, and
2. `off + len <= alen arr`
These invariants ensure that any *index* we pick between `off` and
`off + len` will be a valid index into `arr`. If you're not quite
convinced, consider the following `Text`s.
![Multiple valid `Text` configurations, all using an `Array` with 10 slots. The valid slots are shaded. Note that the key invariant is that `off + len <= alen`.](/images/text-layout.png)
\begin{code}
data Text = Text Array Int Int
{-@ data Text [tlen] = Text (arr :: Array)
(off :: TValidO arr)
(len :: TValidL off arr)
@-}
{-@ measure tarr :: Text -> Array
tarr (Text a o l) = a
@-}
{-@ measure toff :: Text -> Int
toff (Text a o l) = o
@-}
{-@ measure tlen :: Text -> Int
tlen (Text a o l) = l
@-}
\end{code}
The liquid-type for `Text` makes use of the following two type-aliases to
express the core invariant.
\begin{code}
{-@ type TValidO A = {v:Nat | v <= (alen A)} @-}
{-@ type TValidL O A = {v:Nat | (v+O) <= (alen A)} @-}
\end{code}
Before we go, let's take a quick look at a function that combines
`MArray`s, `Array`s, and `Text`s. `unstream` is a major workhorse of
the `text` library. It transforms a `Stream` of `Char`s into a `Text`,
and enables GHC use a technique called *stream fusion* to combine
multiple loops over a sequence into a single loop.
\begin{code}
unstream :: Stream Char -> Text
unstream (Stream next0 s0 len) = runST $ do
let mlen = upperBound 4 len
arr0 <- new mlen
let outer arr top = loop
where
loop !s !i =
case next0 s of
Done -> do
arr' <- unsafeFreeze arr
return $! Text arr' 0 i
Skip s' -> loop s' i
Yield x s'
| j >= top -> do
let top' = (top + 1) `shiftL` 1
arr' <- new top'
copyM arr' 0 arr 0 top
outer arr' top' s i
| otherwise -> do
d <- writeChar arr i x
loop s' (i+d)
where j | ord x < 0x10000 = i
| otherwise = i + 1
outer arr0 mlen s0 0
\end{code}
`unstream` repeatedly writes the characters coming out of the `Stream`
into the `arr`, until it runs out of room. Then it has to allocate a
new, larger `MArray` and copy everything into the new array before
continuing. Note that LiquidHaskell has successfully inferred that
`arr'` is longer than `arr` and that `top` is a valid offset into
both, thus proving that the call to `copyM` is safe! Unfortunately for
us, however, the `writeChar` call is flagged as *unsafe*.. Astute
readers will notice that `writeChar` (whose implementation we haven't
yet seen) has a slightly odd type, it requires that the index `i` be
less than `malen arr - 1`. This is indeed odd and, I should add, not
the final type, but it is a safe approximation because not all `Char`s
are created equal. Depending on your encoding, some won't fit into a
single `Word16`, so we may need extra room to write!
Stay tuned, next time we'll dig into how `text` uses Unicode to
represent `Char`s internally.