I’ve started working on Plasma's type system, I’m currently implementing basic algebraic data types (ADTs) (sum and product types), which I’m quite comfortable with. I’ve also heard about these things called generalised algebraic data types (GADTs) but until recently I found it difficult to understand what they were and more importantly when are they useful. This article attempts to answer those questions. I assume that the reader is comfortable with ADTs and Haskell syntax.
String formatting
The problem we’re going to apply this to is string formatting.
data Format = Show | Quote | Pad Char Int printf :: Show a => Format -> a -> String printf Show s = show s printf Quote s = "\"" ++ (escape (show s)) ++ "\"" printf (Pad c n) s = pad c n $ show s -- I've included these functions for completeness. pad c n s = let len = length s in if len >= n then s else (replicate (n - len) c) ++ s escape [] = [] escape (x:xs0) = case x of '"' -> "\\\"" ++ xs _ -> x:xs where xs = escape xs0
This Haskell code compiles and works correctly, it’s simple and I’m almost
happy with it.
The printf
function is the main part of this program,
it simply uses its first argument to decide how to format its second argument.
The first argument is a format type, which allows for three different
formatting options, the first just calls show
, the second escapes and quotes
the string, and the third will pad the string - useful for tabular data.
The pad
and escape
functions are included for completeness.
The show
function belongs to the Show
typeclass, our printf
function
requires the type variable a
to be in this typeclass, this is what
Show a =>
means in printf
's type signature.
The problem
One thing I really like when presenting information is to be able to format numbers with thousands separators, let’s add that.
data Format = Show | Quote | Pad Char Int | ThousandsNumber printf :: Show a => Format -> a -> String printf Show s = show s printf Quote s = "\"" ++ (escape (show s)) ++ "\"" printf (Pad c n) s = pad c n $ show s printf ThousandsNumber n = thousandsNumber n -- Included for completeness thousandsNumber :: Integer -> String thousandsNumber n = if n < 1000 then show n else let bigpart = show $ n `div` 1000 littlepart = printf (Pad '0' 3) $ show (n `mod` 1000) in bigpart ++ "," ++ littlepart
It didn’t work:
adt.hs:13:44: Couldn't match expected type ‘Integer’ with actual type ‘a’ ‘a’ is a rigid type variable bound by the type signature for printf :: Show a => Format -> a -> String at adt.hs:9:11 Relevant bindings include n :: a (bound at adt.hs:13:24) printf :: Format -> a -> String (bound at adt.hs:10:1) In the first argument of ‘thousandsNumber’, namely ‘n’ In the expression: thousandsNumber n
This shouldn’t be surprising.
The new clause for printf
expects its parameter to be an Integer
,
but its signature (and my intentions) require it to be polymorphic (a
).
This is exactly the error that GHC is reporting.
At this point I’d like to acknowledge that this example is a little contrived.
It would be simple at this point to make each Format
a separate function that
does the formatting, and is either passed to printf
as a higher order value
or printf
is removed and these functions are used directly.
However for the sake of explaining GADTs let’s assume that for some reason we
need to stick with the design we have.
Note that this would be a more realistic example if we can imagine more than
one printf
function that all format data, but might format it differently,
eg: a formatter for HTML that might implement the Pad
format by
using right alignment rather than inserting spaces.
The ADT solution
To solve this problem without GADTs we can introduce a new type (Value
)
which allows us to implement the ThousandsNumber case for Integer
but not
for String
:
data Format = Show | Quote | Pad Char Int | ThousandsNumber data Value = String String | Integer Integer -- Value implements show, which makes it easy to keep most of the code the -- same as before. instance Show Value where show (String s) = s show (Integer i) = show i printf :: Format -> Value -> String printf Show v = show v printf Quote v = "\"" ++ (escape (show v)) ++ "\"" printf (Pad c n) v = pad c n $ show v printf ThousandsNumber (Integer n) = thousandsNumber n printf ThousandsNumber (String _) = error "Unhandled case"
I have omitted a small modification required to the thousandsNumber
function.
The main thing to notice with this solution, is that a call like
printf ThousandsNumber (String "foo")
will throw an exception. Or the code could be changed to return a Maybe
,
allowing the error to be caught more easily. Regardless of how errors are
raised, they can be raised, and I wouldn’t disagree with anyone who wanted
to say that this code isn’t really type safe.
This is the main problem that the GADT solution solves.
There are some other problems with this solution:
-
The code is now more complicated.
-
printf
is no-longer polymorphic on its second argument.
The latter problem can also be fixed using GADTs.
The GADT solution
We’ll take this a bit at a time, explaining each part and how it works: First enable the GADT language extension:
{-#LANGUAGE GADTs #-}
Here’s the tricky bit, the GADT itself:
data Format = Show | Quote | Pad Char Int | ThousandsNumber
data Format a where Raw :: Format String Quote :: Format String Pad :: Char -> Int -> Format String ThousandsNumber :: Format Integer
The very first difference between these data types is that the in the GADT case
I’ve added a type variable.
This is not part of understanding GADTs but it is part of understanding why
I’m using GADTs in this program,
this variable represents the type of data that is being formatted,
since what we’re trying to do is say that ThousandsNumber
is only applicable
to Integer
.
The syntax is also quite different,
which makes sense once we understand what is going on.
A normal ADT (left) has the same type (Format
) for all of its
constructors.
A GADT however (right) can associate a different but related type with each of
its constructors.
Therefore the GADT syntax allows us to give a type to each constructor,
each constructor has the form:
::
Type
The first two constructors' types are easy enough to understand.
They are both Format String
.
The second type is a function with two parameters.
I found this initially confusing,
until I realized that this is the type of Pad
before it is given its
arguments (the character to pad with, and the total width);
with suitable arguments it simply has the type Format String
.
We can see the same argument types in the same constructor on the left,
it’s just a different syntax.
Finally the fourth constructor has the type Format Integer
,
it formats integers, that’s what we’ve been working towards.
Now we’re ready to understand what a GADT is. It is like an ADT, except that each constructor may have a different type. The value of a variable now plays a role in deciding the variable’s type, this is a step towards dependent types, but it does not itself, make this dependent typing.
Finally, let’s see the printf
function:
printf :: Format a -> a -> String printf Raw s = s printf Quote s = "\"" ++ (escape s) ++ "\"" printf (Pad c n) s = pad c n s printf ThousandsNumber n = thousandsNumber n
Now that the Format type is parametric, we have to adjust the type signature
here.
We’ve also removed the Show a
constraint, and not calling show
as we once
were.
We can do this in each case because the constructor is known, the actual type
of the type variable a
is also known, and in the first three cases known to
be String
.
This makes each of the first three cases type correct, for example in the case
for Raw
this means that s
must be a String, and s
is the RHS of this
case, and printf
needs to return a string, then this is type correct.
In the ThousandsNumber case, the GADT ensures that n
is an Integer, because
the type variable a
is known to be Integer
. This is the same type that the
ThousandsNumber
function requires, and therefore this case is also type
correct.
This is why we made the Format
type parametric, it allows us to link the
type of each constructor to the type of the second argument to printf
.
Improving the GADT solution
One problem with the above solution is that it is no-longer polymorphic when it should be. This was deliberate and made explaining GADTs easier. Now we will show how GADTs can interact with typeclass constraints
data Format a where Show :: (Show a) => Format a Quote :: (Show a) => Format a Pad :: (Show a) => Char -> Int -> Format a ThousandsNumber :: Format Integer printf :: Format a -> a -> String printf Show s = show s printf Quote s = "\"" ++ (escape (show s)) ++ "\"" printf (Pad c n) s = pad c n $ show s printf ThousandsNumber n = thousandsNumber n
In this solution typeclass constraints have been added to the first three data
constructors.
They are part of the type of each data constructor and will affect printf’s
type when it is called with a particular constructor.
This means that printf
now looks more like the original printf
,
it contains calls to show
.
It is very likely that the type of ThousandsNumber
can be relaxed to work
with any Integral type,
in a similar way to the other constructors.
Are GADTs useful?
Maybe. In this article I tried to use an example that most programmers are somewhat likely to encounter. However as we mentioned above this could be solved in other ways, unless how you formatted something changed with output format. For example to right-align something on a console you would normally add spaces to the left, but in HTML you would use a style on that block. Therefore I think this example is still meaningful, I’ve just not shown that in my code. Another common example used is the one on the GADT Wikipedia page which is a bit more specialised and therefore some developers may find it less motivating.
I’ve never used GADTs "in anger" (in a real project), so I’m not aware of there are any pitfalls that you may be likely to encounter. This is going to be mostly because Mercury, which I tend to use most of the time, does not support GADTs. Also, until recently I did not know exactly what they were.
Finally, I think they could be useful, but they won’t be useful most of the time. As far as I can tell GADTs will rarely be the only solution to a given problem. If your language of choice already supports them, then that’s cool, you’ve got an extra tool in your belt. If not, it’s probably not worth worrying about, you can almost certainly get by with slightly less elegant code. It seems that the law of diminishing returns applies to type system features.
I’m no expert on GADTs, so if you disagree with anything I’ve said here, or can otherwise provide some information about how useful they are in practice, I’d be grateful.