---
layout: post
title: "Pointers Gone Wild"
date: 2014-05-28
comments: true
external-url:
author: Eric Seidel
published: true
categories: benchmarks, text
demo: TextInternal.hs
---
A large part of the allure of Haskell is its elegant, high-level ADTs
that ensure[^compilercorrectness] that programs won't be plagued by problems
like the infamous [SSL heartbleed bug](http://en.wikipedia.org/wiki/Heartbleed).
[^compilercorrectness]: Assuming the absence of errors in the compiler and run-time...
However, another part of Haskell's charm is that when you really really
need to, you can drop down to low-level pointer twiddling to squeeze the
most performance out of your machine. But of course, that opens the door
to the #heartbleeds.
Can we have have our cake and eat it too?
Can we twiddle pointers and still get the nice safety assurances of high-level types?
To understand the potential for potential bleeding,
let's study the popular `text` library for efficient
text processing. The library provides the high-level
API Haskellers have come to expect while using stream
fusion and byte arrays under the hood to guarantee
high performance.
Suppose we wanted to get the *i*th `Char` of a `Text`,
\begin{code} we could write a function[^bad]
charAt (Text a o l) i = word2char $ unsafeIndex a (o+i)
where
word2char = chr . fromIntegral
\end{code}
which extracts the underlying array `a`, indexes into it starting
at the offset `o` and casts the `Word16` to a `Char`, using
functions exported by `text`.
[^bad]: This function is bad for numerous reasons, least of which is that `Data.Text.index` is already provided, but stay with us...
\begin{code}Let's try this out in GHCi.
ghci> let t = pack ['d','o','g']
ghci> charAt t 0
'd'
ghci> charAt t 2
'g'
\end{code}
\begin{code}Looks good so far, what happens if we keep going?
ghci> charAt t 3
'\NUL'
ghci> charAt t 100
'\8745'
\end{code}
Oh dear, not only did we not get any sort of exception from Haskell,
we weren't even stopped by the OS with a segfault. This is quite
dangerous since we have no idea what sort of data we just read!
To be fair to the library's authors, we did use a function that
was clearly branded `unsafe`, but these functions, while not
intended for *clients*, pervade the implementation of the *library*.
Wouldn't it be nice to have these last two calls *rejected at compile time*?
In this post we'll see exactly how prevent invalid memory accesses
like this with LiquidHaskell.
\begin{code}
{-# LANGUAGE BangPatterns, MagicHash, Rank2Types,
RecordWildCards, UnboxedTuples, ExistentialQuantification #-}
{-@ LIQUID "--no-termination" @-}
module TextInternal (test, goodMain, badMain, charAt, charAt') where
import qualified Control.Exception as Ex
import Control.Applicative ((<$>))
import Control.Monad (when)
import Control.Monad.ST.Unsafe (unsafeIOToST)
import Data.Bits (shiftR, xor, (.&.))
import Data.Char
import Foreign.C.Types (CSize)
import GHC.Base (Int(..), ByteArray#, MutableByteArray#, newByteArray#,
writeWord16Array#, indexWord16Array#, unsafeCoerce#, ord,
iShiftL#)
import GHC.ST (ST(..), runST)
import GHC.Word (Word16(..))
import qualified Data.Text.Lazy.IO as TIO
import qualified Data.Text as T
import qualified Data.Text.Internal as T
import Language.Haskell.Liquid.Prelude
{-@ aLen :: a:Array -> {v:Nat | v = (aLen a)} @-}
{-@ maLen :: a:MArray s -> {v:Nat | v = (maLen a)} @-}
new :: forall s. Int -> ST s (MArray s)
unsafeWrite :: MArray s -> Int -> Word16 -> ST s ()
unsafeFreeze :: MArray s -> ST s Array
unsafeIndex :: Array -> Int -> Word16
copyM :: MArray s -- ^ Destination
-> Int -- ^ Destination offset
-> MArray s -- ^ Source
-> Int -- ^ Source offset
-> Int -- ^ Count
-> ST s ()
{-@ memcpyM :: MutableByteArray# s -> CSize -> MutableByteArray# s -> CSize -> CSize -> IO () @-}
memcpyM :: MutableByteArray# s -> CSize -> MutableByteArray# s -> CSize -> CSize -> IO ()
memcpyM = undefined
--------------------------------------------------------------------------------
--- Helper Code
--------------------------------------------------------------------------------
{-@ shiftL :: i:Nat -> n:Nat -> {v:Nat | ((n = 1) => (v = (i * 2)))} @-}
shiftL :: Int -> Int -> Int
shiftL = undefined -- (I# x#) (I# i#) = I# (x# `iShiftL#` i#)
pack :: String -> Text
pack = undefined -- not "actually" using
assert b a = Ex.assert b a
data Text = Text Array Int Int
{-@ tLength :: t:Text -> {v:_ | v = (tLen t)} @-}
tLength (Text _ _ n) = n
\end{code}
The `Text` Lifecycle
--------------------
`text` splits the reading and writing array operations between two
types of arrays, immutable `Array`s and mutable `MArray`s. This leads to
the following general lifecycle:
![The lifecycle of a `Text`](/images/text-lifecycle.png)
The main four array operations we care about are:
1. **creating** an `MArray`,
2. **writing** into an `MArray`,
3. **freezing** an `MArray` into an `Array`, and
4. **reading** from an `Array`.
Creating an `MArray`
--------------------
The (mutable) `MArray` is a thin wrapper around GHC's primitive
`MutableByteArray#`, additionally carrying the number of `Word16`s it
can store.
\begin{code}
data MArray s = MArray { maBA :: MutableByteArray# s
, maLen :: !Int
}
\end{code}
It doesn't make any sense to have a negative length, so we *refine*
the data definition to require that `maLen` be non-negative.
\begin{code}
{-@ data MArray s = MArray { maBA :: MutableByteArray# s
, maLen :: Nat
}
@-}
\end{code}
\begin{code} As an added bonus, the above specification generates **field-accessor measures** that we will use inside the refined types:
{-@ measure maLen :: MArray s -> Int
maLen (MArray a l) = l
@-}
\end{code}
We can use these accessor measures to define `MArray`s of size `N`:
\begin{code}
{-@ type MArrayN a N = {v:MArray a | (maLen v) = N} @-}
\end{code}
and we can use the above alias, to write a type that tracks the size
of an `MArray` at the point where it is created:
\begin{code}
{-@ new :: forall s. n:Nat -> ST s (MArrayN s n) @-}
new n
| n < 0 || n .&. highBit /= 0 = error "size overflow"
| otherwise = ST $ \s1# ->
case newByteArray# len# s1# of
(# s2#, marr# #) -> (# s2#, MArray marr# n #)
where !(I# len#) = bytesInArray n
highBit = maxBound `xor` (maxBound `shiftR` 1)
bytesInArray n = n `shiftL` 1
\end{code}
`new n` is an `ST` action that produces an `MArray s` with `n` slots each
of which is 2 bytes (as internally `text` manipulates `Word16`s).
The verification process here is quite simple; LH recognizes that
the `n` used to construct the returned array (`MArray marr# n`)
the same `n` passed to `new`.
Writing into an `MArray`
------------------------
Once we have *created* an `MArray`, we'll want to write our data into it.
A `Nat` is a valid index into an `MArray` if it is *strictly less than*
the size of the array.
\begin{code}
{-@ type MAValidI MA = {v:Nat | v < (maLen MA)} @-}
\end{code}
We use this valid index alias to refine the type of `unsafeWrite`
\begin{code}
{-@ unsafeWrite :: ma:MArray s -> MAValidI ma -> Word16 -> ST s () @-}
unsafeWrite MArray{..} i@(I# i#) (W16# e#)
| i < 0 || i >= maLen = assert False $ error "out of bounds"
| otherwise = ST $ \s1# ->
case writeWord16Array# maBA i# e# s1# of
s2# -> (# s2#, () #)
\end{code}
Note that, when compiled with appropriate options, the implementation of
`text` checks the bounds at run-time. However, LiquidHaskell can statically
prove that the error branch is unreachable, i.e. the `assert` **cannot fail**
(as long as the inputs adhere to the given specification) by giving `assert`
the type:
\begin{code}
{-@ assert assert :: {v:Bool | (Prop v)} -> a -> a @-}
\end{code}
Bulk Writing into an `MArray`
-----------------------------
So now we can write individual `Word16`s into an array, but maybe we
have a whole bunch of text we want to dump into the array. Remember,
`text` is supposed to be fast! C has `memcpy` for cases like this but
it's notoriously unsafe; with the right type however, we can regain safety.
`text` provides a wrapper around `memcpy` to copy `n` elements from
one `MArray` to another.
`copyM` requires two `MArray`s and valid offsets into each -- note
that a valid offset is **not** necessarily a valid *index*, it may
be one element out-of-bounds
\begin{code}
{-@ type MAValidO MA = {v:Nat | v <= (maLen MA)} @-}
\end{code}
-- and a `count` of elements to copy.
The `count` must represent a valid region in each `MArray`, in
other words `offset + count <= length` must hold for each array.
\begin{code}
{-@ copyM :: dest:MArray s
-> didx:MAValidO dest
-> src:MArray s
-> sidx:MAValidO src
-> {v:Nat | (((didx + v) <= (maLen dest))
&& ((sidx + v) <= (maLen src)))}
-> ST s ()
@-}
copyM dest didx src sidx count
| count <= 0 = return ()
| otherwise =
assert (sidx + count <= maLen src) .
assert (didx + count <= maLen dest) .
unsafeIOToST $ memcpyM (maBA dest) (fromIntegral didx)
(maBA src) (fromIntegral sidx)
(fromIntegral count)
\end{code}
Again, the two `assert`s in the function were in the original code as
(optionally compiled out) run-time checks of the precondition, but with
LiquidHaskell we can actually *prove* that the `assert`s **always succeed**.
Freezing an `MArray` into an `Array`
------------------------------------
Before we can package up our `MArray` into a `Text`, we need to
*freeze* it, preventing any further mutation. The key property
here is of course that the frozen `Array` should have the same
length as the `MArray`.
Just as `MArray` wraps a mutable array, `Array` wraps an *immutable*
`ByteArray#` and carries its length in `Word16`s.
\begin{code}
data Array = Array { aBA :: ByteArray#
, aLen :: !Int
}
\end{code}
As before, we get free accessor measures `aBA` and `aLen` just by
refining the data definition
\begin{code}
{-@ data Array = Array { aBA :: ByteArray#
, aLen :: Nat
}
@-}
\end{code}
so we can refer to the components of an `Array` in our refinements.
Using these measures, we can define
\begin{code}
{-@ type ArrayN N = {v:Array | (aLen v) = N} @-}
{-@ unsafeFreeze :: ma:MArray s -> ST s (ArrayN (maLen ma)) @-}
unsafeFreeze MArray{..} = ST $ \s# ->
(# s#, Array (unsafeCoerce# maBA) maLen #)
\end{code}
Again, LiquidHaskell is happy to prove our specification as we simply
copy the length parameter `maLen` over into the `Array`.
Reading from an `Array`
------------------------
Finally, we will eventually want to read a value out of the
`Array`. As with `unsafeWrite` we require a valid index into the
`Array`, which we denote using the `AValidI` alias.
\begin{code}
{-@ type AValidI A = {v:Nat | v < (aLen A)} @-}
{-@ unsafeIndex :: a:Array -> AValidI a -> Word16 @-}
unsafeIndex Array{..} i@(I# i#)
| i < 0 || i >= aLen = assert False $ error "out of bounds"
| otherwise = case indexWord16Array# aBA i# of
r# -> (W16# r#)
\end{code}
As before, LiquidHaskell can easily prove that the error branch
is unreachable, i.e. is *never* executed at run-time.
Wrapping it all up
------------------
Now we can finally define the core datatype of the `text` package!
A `Text` value consists of three fields:
A. an `Array`,
B. an `Int` offset into the *middle* of the array, and
C. an `Int` length denoting the number of valid indices *after* the offset.
We can specify the invariants for fields (b) and (c) with the refined type:
\begin{code}
{-@ data Text
= Text { tArr :: Array
, tOff :: {v:Nat | v <= (aLen tArr)}
, tLen :: {v:Nat | v+tOff <= (aLen tArr)}
}
@-}
\end{code}
These invariants ensure that any *index* we pick between `tOff` and
`tOff + tLen` will be a valid index into `tArr`.
As shown above with `new`, `unsafeWrite`, and `unsafeFreeze`, we can type the
top-level function that creates a `Text` from a `[Char]` as:
\begin{code}
{-@ pack :: s:String -> {v:Text | (tLen v) = (len s)} @-}
\end{code}
Preventing Bleeds
-----------------
Now, let us close the circle and return to potentially *bleeding* function:
\begin{code}
charAt' (Text a o l) i = word2char $ unsafeIndex a (o+i)
where
word2char = chr . fromIntegral
\end{code}
Aha! LiquidHaskell flags the call to `unsafeIndex` because of course, `i` may fall
outside the bounds of the given array `a`! We can remedy that by specifying
a bound for the index:
\begin{code}
{-@ charAt :: t:Text -> {v:Nat | v < (tLen t)} -> Char @-}
charAt (Text a o l) i = word2char $ unsafeIndex a (o+i)
where
word2char = chr . fromIntegral
\end{code}
That is, we can access the `i`th `Char` as long as `i` is a `Nat` less
than the the size of the text, namely `tLen t`. Now LiquidHaskell is convinced
that the call to `unsafeIndex` is safe, but of course, we have passed
the burden of proof onto users of `charAt`.
Now, if we try calling `charAt` as we did at the beginning
\begin{code}
test = [good,bad]
where
dog = ['d','o','g']
good = charAt (pack dog) 2
bad = charAt (pack dog) 3
\end{code}
we see that LiquidHaskell verifies the `good` call, but flags `bad` as
**unsafe**.
Enforcing Sanitization
----------------------
EDIT: As several folks have pointed out, the #heartbleed error was
due to inputs not being properly sanitized. The above approach
**ensures, at compile time**, that proper sanitization has been
performed.
To see this in action, lets write a little function that just shows the
character at a given position:
\begin{code}
{-@ showCharAt :: t:_ -> {v:Nat | v < (tLen t)} -> _ @-}
showCharAt t i = putStrLn $ show $ charAt t i
\end{code}
Now, the following function, that correctly sanitizes is **accepted**
\begin{code}
goodMain :: IO ()
goodMain
= do txt <- pack <$> getLine
i <- readLn
if 0 <= i && i < tLength txt
then showCharAt txt i
else putStrLn "Bad Input!"
\end{code}
but this function, which has insufficient sanitization, is **rejected**
\begin{code}
badMain :: IO ()
badMain
= do txt <- pack <$> getLine
i <- readLn
if 0 <= i
then showCharAt txt i
else putStrLn "Bad Input!"
\end{code}
Thus, we can use LiquidHaskell to block, at compile time, any serious bleeding
from pointers gone wild.