Safe Haskell | Safe |
---|---|
Language | Haskell2010 |
Overview and basic concepts
This library implements compilation and analysis facilities for language compilers supporting ML- or Haskell-style pattern matching. It provides a compiler from a matching to a decision tree, an efficient representation mapping easily into low level languages. It supports most features one would expect, such as variable bindings, or- and as-patterns, etc. and is also able to detect anomalies in the maching, such as non-exhaustivity or redundantness. It is based on Compiling pattern-matching to good decision trees and Warnings for pattern matching by Luc Maranget.
Pattern matching
Patterns are assumed to be linear and matched “from top to
bottom”. This library adopts a simplified view of patterns, or
pattern Skel
etons, that should be rich enough to accomodate most
compilers need. It is either a catch-all pattern, eventually
binding an identifier or a constructor pattern made of a tag
and
subpatterns, filtering only those expression sharing the same tag
and whose subexpressions are also filtered by the subpatterns.
As-patterns and or-patterns are also supported, while as-patterns have
there own Skel
eton, or-patterns must first be
decomposed into distinct lists of patterns.
In this documentation, a “row” is a list of patterns associated with an output, that will be selected if an expression matches all those patterns, while a “column” is a list of patterns that are tested against an expression from top to bottom.
Decision trees
Decision trees can be thought of as cascading switches. Each
Switch
checks the constructor of an expression to decide what
path to take, until it reaches a Leaf
(success) or encounters a
dead-end 'Fail. Consider this Haskell example:
case e of ([], 0) -> 0 (_ : _, 1) -> 1
A possible decision tree corresponding to this expression could be:
Switch e +--- (,) ---> Switch e(,).0 +--- [] ----> Switch e(,).1 +---- 0 ----> Leaf 0 | | | \---- _ ----> Fail [([], 1), ([], 2), ...] | \--- _:_ ----> Switch e(,).1 +---- 1 ----> Leaf 1 | \---- _ ----> Fail [(_:_, 0), (_:_, 2), (_:_, 3), ...]
First, the expression e
is checked for the tag (,)
. Since
there is no other constructor for (,)
, this always
succeeds. Matching on a tuple yields to subexpression that we name
e(,).0
and e(,).1
(the Select
type handles subexpression
selection), that must be matched to two “columns” of patterns:
e(,).0
against []
or _:_
and e(,).1
against 0
or
1
. Note that we have a choice which test to perform first. Here
we decide to check e(,).0
against []
and _:_
. Since this is
the set of all possible constructors for lists, there is no
possibility for the match to fail here. We are then left with
e(,).1
to match against 0
,in the branch where e(,).0
is []
and 1
when e(,).0
is _:_
. In either case, the matching can
fail since 0
and 1
do not cover the full range of integers.
Characteristics of decision trees
A decision tree is only one possible target to compile pattern-matching. An alternative is to compile to backtracking automata (see, for instance Compiling pattern matching). Unlike decision trees, backtracking automata guarantee linear code size, however, as the name suggests, they may backtrack, thus testing more than once the same expression, which decision trees are guaranteed never to do.
Heuristics
In the example above, we choose to test e(,).0
before e(,).1
,
but we could have made the opposite choice. Also, in the _:_
branch we entirely ommited to test e(,).0(_:_).0
, e(,).0(_:_).1
(i.e. the head and the tail of the list introducing by matching on
_:_
) against the two wildcards of _:_
. This would of course
have been useless, since matching against a wildcard always
succeeds. The algorithm can make similar choices as the one we did
through the use of Heuristic
s. The role of Heuristic
s is to
attribute a score to a given list of patterns, so that the
algorithm will first match against the list of patterns with the
best score. In this case, we attributed a bigger score to the
pattern 1
than to the two wildcards. A detailed list of how
heuristics work, as well as all the heuristics studied by Maranget
are presented later.
How to use?
The library is centered around the match
and anomalies
functions. match
compiles a matching to a decision tree while
anomalies
simply gathers the anomalies in a matching. Note that
the anomalies can only be retrieved from the structure of the
decision tree.
The documentation makes heavy use of polymorphism to accomodate the internal representation of most languages. The convention for the names of the parameters is:
ident
is the type of identifiers bound in patterns,tag
is the type of tags of constructors,pat
is the type of patterns in the user's language,expr
is the type of expressions in the user's language,out
is the type of the outputs of a matching.
To work, these functions need three things from the user (apart from the actual matching):
- a way to decompose the user's language patterns into the simplified
representation. This is a function of type
pat -> [Skel ident tag]
, returning a list allows to account for or-patterns. The list of skeletons returned is tested from left-to-right. - for the
tag
type to be a member of theIsTag
typeclass. This requires to be able to compute some informations from atag
, such as the range oftag
s it belongs to. Further information is given with theIsTag
class.
Complete example
Consider the following typed language with its own Pattern
representation:
data Typ = TInt | TList Typ data Pattern = VarPat Typ String | IntPat Int | NilPat Typ -- NilPat typ has type TList typ | ConsPat Typ Pattern Pattern -- ConsPat typ _ _ has type TList typ | OrPat Pattern Pattern | AsPat Pattern String
This language supports variables, integers and lists. It can have or- and as-patterns.
This custom representation must first be converted into a
Skel
-based representation. This implies defining the tag
of constructors:
data Tag = NilTag | ConsTag Typ | IntTag Int
and doing the conversion:
toSkel :: Pattern -> [Skel String Tag] toSkel (VarPat typ var) = [WildSkel (rangeOfTyp typ) (Just var)] toSkel (IntPat i) = [ConsSkel (cons (IntTag i) [])] toSkel (NilPat _) = [ConsSkel (cons NilTag [])] toSkel (ConsPat typ p ps) = [ ConsSkel (cons (ConsTag typ) [subp, subps]) | subp <- toSkel p , subps <- toSkel ps ] toSkel (OrPat p1 p2) = toSkel p1 ++ toSkel p2 toSkel (AsPat p i) = [ AsSkel s i | s <- toSkel p ]
where rangeOfTyp
defines the range of Tag
s patterns of a certain
type can have:
rangeOfTyp :: Typ -> [Tag] rangeOfTyp TInt = fmap IntTag [minBound .. maxBound] rangeOfTyp (TList typ) = [NilTag, ConsTag typ]
Finally, Tag
must be made an instance of IsTag
. IsTag
has two
functions:
that outputs the signature
a given tagRange
:: tag -> [tag]tag
belongs to and
. subTags
:: tag ->
[[tag]]
defines the range of subTags
ttag
s that can be
found in the subpatterns of a constructor with tag
t
. For
instance, a constructor tagged with ConsTag TInt
will have two
subfields: the first one (the head of the list), can contain any
integers, the second one (the tail of the list), can be either the
NilTag
or another ConsTag
. This gives us the following instance:
instance IsTag Tag where tagRange NilTag = [NilTag, ConsTag] tagRange ConsTag = [NilTag, ConsTag] tagRange (IntTag j) = fmap IntTag [minBound..maxBound] subTags NilTag = [] subTags (ConsTag typ) = [rangeOf typ, rangeOf (TList typ)] subTags (IntTag _) = []
and this all one needs to do (apart from choosing a Heuristic
) to
use the compiler.
Preserving sharing
The presence of or-patterns, like in this example, can cause duplication of outputs in leaves of the decision tree. Consider this example in OCaml syntax:
match e with | 0 | 1 -> e1 | _ -> e2
The resulting decision tree, would be:
Switch e +--- 0 ---> e1 | \--- 1 ---> e1 | \--- _ ---> e2
with e1 being duplicated, which is undesirable when compiling this decision tree further to machine code as it would lead to increased code size. As a result, it might be worth to consider using labels for outputs and a table linking these labels to expressions. This would make the decision tree suitable for compilation using jumps, avoiding duplication.
Synopsis
- match :: IsTag tag => Heuristic ident tag expr out -> (pat -> [Skel ident tag]) -> expr -> [(pat, out)] -> DecTree ident tag pat expr out
- data Anomalies ident tag pat = Anomalies {
- redundantPatterns :: Maybe [pat]
- unmatchedPatterns :: Maybe [Skel ident tag]
- anomalies :: IsTag tag => (pat -> [Skel ident tag]) -> [pat] -> Anomalies ident tag pat
- data Cons ident tag where
- cons :: IsTag tag => tag -> [Skel ident tag] -> Cons ident tag
- data Skel ident tag
- class Ord tag => IsTag tag where
- data Select expr tag
- data DecTree ident tag pat expr out
- = Fail [Skel ident tag]
- | Leaf {
- leafBindings :: [Binding ident (Select expr tag)]
- leafOut :: out
- leafRedundant :: Maybe [pat]
- | Switch {
- switchOn :: Select expr tag
- switchBranches :: Map tag (DecTree ident tag pat expr out)
- switchCatchAll :: Maybe (DecTree ident tag pat expr out)
- data Binding ident expr = (Maybe ident) := expr
- type Index = Int
- type Score = Int
- data Heuristic ident tag expr out
- seqHeuristics :: [Heuristic ident tag expr out] -> Heuristic ident tag expr out
- firstRow :: Heuristic ident tag expr out
- smallDefault :: Heuristic ident tag expr out
- smallBranchingFactor :: IsTag tag => Heuristic ident tag expr out
- arity :: Heuristic ident tag expr out
- leafEdge :: IsTag tag => Heuristic ident tag expr out
- fewerChildRule :: IsTag tag => Heuristic ident tag expr out
- neededColumns :: IsTag tag => Heuristic ident tag expr out
- neededPrefix :: IsTag tag => Heuristic ident tag expr out
- constructorPrefix :: IsTag tag => Heuristic ident tag expr out
- noHeuristic :: Heuristic ident tag expr out
- reverseNoHeuristic :: Heuristic ident tag expr out
- shorterOccurence :: (Select expr tag -> Score) -> Heuristic ident tag expr out
Documentation
:: IsTag tag | |
=> Heuristic ident tag expr out | The heuristic to use to resolve ambiguous choices |
-> (pat -> [Skel ident tag]) | A way to decompose the language's patterns into
|
-> expr | The expression being scrutanized |
-> [(pat, out)] | The list of patterns to match on with the output associated. Patterns are tried from left to right. |
-> DecTree ident tag pat expr out |
Compiles a matching to a decision tree, using the given heuristic.
data Anomalies ident tag pat Source #
Gathers all the anomalies present in a matching. Nothing
indicating the absence of an anomaly.
Anomalies | |
|
anomalies :: IsTag tag => (pat -> [Skel ident tag]) -> [pat] -> Anomalies ident tag pat Source #
Simplified version of match
, that simply gathers the anomalies of
the decision tree.
Generic pattern representation
data Cons ident tag where Source #
A generic description of a constructor pattern, made of a tag
and
subpatterns.
class Ord tag => IsTag tag where Source #
tagRange :: tag -> [tag] Source #
The range of tags a given tag could have had. t
should always
be an element of tagRange t
. In other words:
elem t (tagRange t) == True
The range of a tag
is used to generate missing patterns in
non-exhaustive matches. It might be interesting to consider the
order the range is given in, to improve the quality of error
messages. For instance, if one allows pattern-matching on
integers, instead of simply giving the range
[minBound..maxBound], it could be better to give the range
[0..maxBound] ++ [-1,-2..minBound]
(or a range alternating
positive and negative integers, starting at 0
) could help
generate simpler messages.
subTags :: tag -> [[tag]] Source #
The range of the tag
s that can appear in all the
subfields of a constructor with the given tag
.
Example
Consider the _:_
tag for patterns of type [Bool]
in
Haskell. This pattern has two subpatterns, the head can be either
True
or False
, while the tail can be either []
or
_:_
. Thus subTags
would simply be, in pseudo-Haskell:
[[trueTag, falseTag], [nilTag, consTag]]
Expression selection
Encodes the selection of a subexpression given a tag
.
NoSel expr | An untouched expression |
Sel (Select expr tag) tag Int |
For example, |
Decision trees
data DecTree ident tag pat expr out Source #
A decision tree can be thought of as a cascade of switches,
matching on the tag
of expressions and subexpressions until
reaching a result. They map fairly naturally to constructs in low
level languages, such as C.
Fail [Skel ident tag] | Pattern-matching failure, with a list of all the patterns
that aren't matched. The list is lazily generated and may be
infinite for |
Leaf | Pattern-matching success |
| |
Switch | Branching on an |
|
data Binding ident expr Source #
Binding of an identifier to an expression. Bindings of wildcards are conserved.
Heuristics
Most of the time, there are multiple ways to construct a decision
tree, since we are often faced with a choice as to which column
of pattern to match first. Doing the wrong choice can lead to
larger decision trees or to more tests on average. Heuristic
s
allows us to choose between those different choices.
In there simplest form, heuristics attribute a score to a column, given it's position in the list of columns to match, the expression to match it against and the column of patterns. Some more complicated heuristics exist that require having access to the entire list of columns.
Combining heuristics
A single heuristic may give the same score to several columns, leading to ambiguity on the one to choose. Combining heuristic allows to use a second heuristic to break such a tie.
Note that if there is a tie after applying the heuristic supplied by the user, the algorithm will choose the left-most pattern with the highest score.
Influence on the semantic
Heuristics might have an influence on the semantic of the language if the resulting decision tree is used to guide evaluation, as it can be the case in a lazy language.
“But how do I choose?”
Detailed benchmarks are given in section 9 of Maranget's paper, in terms of code size and average path length in a prototype compiler, both for single and combined heuristics (up to 3 combinations). A conclusion to his findings is given in section 9.2 and is reproduced here:
- Good primary heuristics are
firstRow
,neededPrefix
andconstructorPrefix
. This demonstrates the importance of considering clause order in heuristics. - If we limit choice to combinations of at most two heuristics,
fewerChildRule
is a good complement to all primary heuristics.smallBranchingFactor
looks sufficient to break the ties left byneededPrefix
andconstructorPrefix
. - If we limit choice to heuristics that are simple to compute,
that is if we eliminate
neededColumns
,neededPrefix
,fewerChildRule
andleafEdge
, then good choices are:
,seqHeuristics
[firstRow
,smallDefault
smallBranchingFactor
]
,seqHeuristics
[constructorPrefix
,smallBranchingFactor
]
(eventually further composed withseqHeuristics
[constructorPrefix
,smallBranchingFactor
]arity
orsmallDefault
).
data Heuristic ident tag expr out Source #
Score ([[Skel ident tag]] -> Index -> Select expr tag -> [Skel ident tag] -> Score) | Computes the |
Combine (Heuristic ident tag expr out) (Heuristic ident tag expr out) | Combine two heuristics: compute an initial score with the first heuristic and, if several columns have obtained the same score, use the second heuristic to choose among them. |
seqHeuristics :: [Heuristic ident tag expr out] -> Heuristic ident tag expr out Source #
Combine a list of heuristics from left-to-right, defaulting to
using no heuristic. Defined as foldr Combine noHeuristic
.
Simple heuristics
A set of simple and cheap to compute heuristics.
firstRow :: Heuristic ident tag expr out Source #
This heuristic favours columns whose top pattern is a generalized constructor pattern. If the first pattern is a wildcard, the heuristic gives \(0\) and \(1\) otherwise.
smallDefault :: Heuristic ident tag expr out Source #
This heuristic favours columns with the least number of wildcard patterns. If \(v(i)\) is the number of wildcards in column \(i\), then \(-v(i)\) is the score of column \(i\).
smallBranchingFactor :: IsTag tag => Heuristic ident tag expr out Source #
Favours columns resulting in smaller switches. The score of a column is the number of branches of the switch resulting of the compilation (including an eventually default branch), negated.
arity :: Heuristic ident tag expr out Source #
The sum of the arity of the constructors of this column, negated.
Expensive heuristics
The following heuristics are deemed expensive as they require manipulation on the matrix of patterns to compute a score.
leafEdge :: IsTag tag => Heuristic ident tag expr out Source #
The score is the number of children of the emitted switch node that are leaves.
fewerChildRule :: IsTag tag => Heuristic ident tag expr out Source #
This heuristic favours columns that lead to fewer rows to test.
Necessity based heuristics
A column \(i\) is deemed necessary for row \(j\) when all paths to \(j\), in all possible decision trees, tests column \(i\). A column \(i\) is necessary if it is necessary for all outputs.
It seems sensible to favour useful columns over non-useful ones as, by definition a useful column will be tested in all paths, whether we choose it or not. Choosing it early might however result in shorter paths.
Necessity is computed according to the algorithm in section 3 of Warnings for pattern matching.
neededColumns :: IsTag tag => Heuristic ident tag expr out Source #
The score is the number of output needing this column.
neededPrefix :: IsTag tag => Heuristic ident tag expr out Source #
The score is the number of consecutive outputs needing the column.
constructorPrefix :: IsTag tag => Heuristic ident tag expr out Source #
A cheaper version of neededPrefix
, where a pattern counts in the
score if it is a constructor pattern.
Pseudo heuristics
The following heuristics are called pseudo-heuristics as
they do not compute a score based on the patterns but rather on the
expressions being matched, such as shorterOccurence
or simply on
the position of the column in the matrix, such as noHeuristic
or
reverseNoHeuristic
. They make for good default heuristic, either
alone or as the last heuristic of a combination.
noHeuristic :: Heuristic ident tag expr out Source #
Leaves the column in the same order by giving the score \(-i\) to column \(i\).
reverseNoHeuristic :: Heuristic ident tag expr out Source #
Reverse the order of the columns by giving the score \(i\) to column \(i\). It can be useful in combination with another heuristic to reverse the left-to-right bias of this implementation.