Paul Bone

Case Sensitivity In Grammars

A recent change to Plasma caused me to think choices that various languages make with regard to case in their grammar. I originally assumed there are just two choices in certain grammars, Case sensitivity or sigils. But there are some other options and I’ll share them here.

Parsing

Computer languages are compiled or interpreted in a number of stages. The first four stages are:

  • Tokenisation (aka lexical analysis)

  • Parsing (aka syntactic analysis)

  • Symbol resolution

  • Semantic analysis

Often some of these stages are combined, many languages historically did symbol resolution during parsing, and some do tokenisation and parsing at the same time.

This article is concerned with parsing: taking a list of tokens and generating an abstract syntax tree. The parser must figure out what each token’s role is. For example in a C program does the '*' symbol represent multiplication, or pointer-dereference, figuring that out is the parser’s job.

One important consideration is that information should not flow backwards from later stages to earlier ones. This means that the parser should not need information from semantic analysis to make such decisions. It should be possible to decide from the token stream and grammar alone.

If you’re wondering why I’ve emphasised "should" and are thinking "but this just seems sensible, surely all languages do this?" there are popular languages that definitely break this guideline. One example is C++, which needs to lookup the meaning of symbols to figure out if a < symbol is a less than token or the beginning of a template variable list. So if C++ can do it why is it so bad?

It makes compilers more complex, usually this is handled by performing more than one compilation stage at a time, and that means both making the implementation more complex and requiring other changes to the language. In C and C++ declarations must be parsed and have some types of semantic analysis performed before uses of those declarations can be parsed. That means that the declaration must occur before the use in the source file, and even in header files. Which is why the order that header files are included matters. All this makes the language more difficult for programmers to use, that’s something we want to avoid with Plasma.

Additionally, by making the tools more complex it makes it harder to make new tools like documentation generators, linters and autoformatters. They may need to run more compilation stages to achieve their job when languages break this guideline.

Ambiguity

In Plasma the ambiguity we’re concerned about is that in some cases identifiers could refer to either a type or a type variable - syntactically different concepts we’d like to distinguish during parsing. The above guideline says that we have to be able to make this distinction without looking at anything but the parsing context and the identifier itself.

Note
Identifiers are the names of variables, functions, types and other symbols appearing in a language - they’re not keywords such as "return" or "if", or other symbols.

Languages with case sensitivity

Languages such as Haskell, Prolog and Mercury are case sensitive, they use case to distinguish the role of some symbols from others:

  • Haskell: Types begin with an uppercase letter while type variables begin with a lower case letter:

-- Haskell allows the [ ] syntax in type declarations,
-- but we use the word 'List' for our example.
length :: List a -> Integer
  • Prolog & Mercury: Atoms, predicates and functors begin with lower case letters and variable begin with upper case.

length([], 0).
length([X | Xs], N+1) :-
   length(Xs, N).
  • Mercury also distinguishes types from type variables like Haskell, but with upper and lower case reversed.

:- pred length(list(T)::in, int::out) is det.
  • There are other examples of languages with case sensitivity in their syntax, such as Java, but I don’t know if they’d be ambiguous without it.

Prolog is a particularly strong case where an ambiguity could be a problem. In Prolog both atoms and variables are created just by using them once. Mixing them would make Prolog even more permissive than it already is.

Languages with sigils

OCaml resolves the same ambiguity between types and type variables that both Haskell and Mercury have, but it does so using sigils (') for type variables:

val length : 'a list -> int

We’ve made the same change to Plasma, and used the same symbol. This was made to support programming in scripts without case, however I hope that while people program in a language with the roman script (e.g: English) we will use case as a matter of style:

// Plasma will allow the [ ] syntax in the type signature in the future
func length(list : List('x)) -> Int {
   match (list) {
      [] ->               { return 0 }
      [var x | var xs] -> { return 1 + length(xs) }
   }
}

Note also that we have changed the pattern matching syntax to also use the var keyword to introduce new variables. This functions a lot like the sigil in type expressions to disambiguate data constructors from new variables. It has the interesting property that it is now used in two out of three places where a new variable is introduced (the remaining one is parameter lists).

Live with ambiguity?

Of course, one option is to just live with ambiguity. Let’s look at a type declaration.

type List(x) = Nil
             | Cons { x, List(x) }

There’s actually nothing ambiguous here at all! Each identifier here is introduced in the same declaration. Nil and Cons are their own declarations, and the other two identifiers List and x are declared on the left-hand-side of = then used on the right. When they occur on the right we’ve already parsed their declaration on the left. This means, depending on how you look at it, the grammar is either context sensitive or we’ve broken the guideline above that parsing must not depend on later stages. Let’s call it a "weak" case of breaking the above guideline, because it’s within the same declaration we can allow it.

Even if other symbols appeared on the right-hand-side of = their role (in valid programs) would be unambiguous. They’d have to be names of types because all variables must be introduced on the left.

However a type expression when it appears inside a function declaration is ambiguous:

func length(list : List(x)) -> Int

Here x may be a type name or a type variable. We may resolve the ambiguity by checking if a type named x already exists, and if it does assume this is a name. This could be resolved after parsing, preserving our guideline above which would allow us to define the type x after using it.

Depending on the type system List may also be ambiguous, e.g: it could be a type variable of kind * → * or the name of a type. However if the language has only kind * then we know it must be a type’s name, because only type names can take parameters. Depending on what languages you’ve used before, this may have made no sense, if that’s the case I’ll just say it’s an important detail in Haskell because monads use the kind * → *.

How about that other case we had in Plasma, of pattern matching?

   match (list) {
      [] ->       { return 0 }
      [x | xs] -> { return 1 + length(xs) }
   }

Again we could lookup x and xs in a symbol table and find out if they’re data constructors, and if that fails we can assume they’re type names.

While this is possible, and means sense for valid programs, but ambiguity becomes a pain in the neck when a program is not what the programmer intends to write. I’ve not said invalid programs because it’s possible that many of these programs would be accepted by the compiler but then be buggy.

type MyTypeA
type MyList = List(MyTypeB)

There’s a spelling error here. The programmer has written MyTypeB instead of MyTypeA. Either way the compiler may accept this program but now MyList has a different meaning. Arguably that’s no worse than if there were two types with names with similar spellings, but that’s a code smell programmers may wish to avoid because it leads to the same type of confusion.

Likewise, let’s say a programmer writes:

func length(list : List(x)) -> Int

And they expect x to be a type variable, and use it as such:

var a = length([1, 2, 3])
var b = length(['a', 'b', 'c'])

That’s fine until elsewhere in the same module they add a type called 'x'! Now they have compilation errors where they expect to use their generic list. But of course their length function where the identifier list occurred won’t be where the compilation error is reported, it may even be reported in a separate module if length is exported but not used in this module. They’re going to be very annoyed with the language designer! This may also work fine until some time six months from now they add the type x!

I generally want to avoid this kind of ambiguity while designing Plasma. Plus it breaks one of my design philosophies for Plasma and that is a thing should look like what it is. That’s another way to say the language should be unambiguous, or even that the grammar should be context free (but I’m not making that a hard rule, there’ll be times when I break it).

I also considered allowing partial ambiguity. That is using case to hint to the compiler the role an identifier plays, and if it doesn’t make sense then the compiler can re-evaluate the meaning. This ends up being very similar to a complete ambiguity that it’s not worth much discussion.

Languages that declare type parameters

There’s one other way to avoid ambiguity. And we saw it above:

type List(x) = Nil
             | Cons { x, List(x) }

Here all the identifiers are introduced in such a way as they’re unambiguous. There’s other languages that do this with type variables appearing in function declarations and definitions, e.g: C++.

template<typename T>
int length(List<T> list);

The template keyword introduces a parameter list of type variables. We could do that in Plasma:

generic(x)
func length(list : List(x)) -> Int {
   match (list) {
      [] ->         { return 0 }
      var x, xs:
      [x | xs] ->   { return 1 + length(xs) }
   }
}

Note that we also did the same to introduce x and xs as variables before they occurred in the pattern.

Java doesn’t have the template keyword like C++, instead its grammar allows it to use use < > brackets to give the parameter list. I assume that C++ can’t do this because it’d be ambiguous. In Plasma that could look like:

func length<x>(list : List(x)) -> Int

That’s not terrible, and would certainly solve the ambiguity with type variables. I personally have a gut-reaction against this because it reminds me of C++ and gnarly template errors, but I realise as I’m writing this, I wouldn’t rule it out because it could also make it easier to add constraints (such as interfaces) on these type variables in a way that fits the syntax:

func sort<x : Ord>(list : List(x)) -> List(x)

We can’t reasonably extend this concept to the pattern match, there the functor (in the syntactical sense) is the and I wouldn’t want to add parameters to that.

We can however use this idea to remove Haskell’s case sensitive grammar without introducing sigils, with a slight modification to the typeclass constraint syntax:

list :: a => [a] -> Int

Wrapping up

There’s more than just the two choices I had initially assumed. I’ve found five:

  • Case sensitive grammar

  • Sigils

  • Ambiguity

  • Declare new type variables with a keyword (C++)

  • Declare new type variables with a parameter list (Java)

And I’m surprised to find that although I’ve made the decision to use sigils, I’m also drawn to the parameter list idea such as Java uses. Plasma is far from 1.0, and for now I reserve the right to change these decisions as necessary and may revisit this decision in the future. For now I will at the very least sleep on it.