Paul Bone

More on closures

After writing about closures last week I was pointed to some papers that describe prior work on closures. Like so many things in compiler construction, I had mostly picked this up from folklore and, hypocritically, failed to look up the research. I’m glad I was directed to these papers because they offer some insights I may be able to use.

To recap my last article: I proposed constructing closures as a code prelude (rather than a function pointer, for efficiency) followed by data containing the closure’s free variables. The prelude would load the free variables onto the stack and then call the actual closure (now a regular PZ function). There is a concrete sketch of this layout after the two lists below.

This has a few benefits:

  • it does not overly burden the branch predictor,

  • it has indirection only when a closure is used,

  • it allows closures to be passed as a pointer-wide value.

But it has one of the two following drawbacks:

  • it is potentially insecure to mix code and data, or

  • securely mixing code and data will be inefficient due to the mprotect() calls.
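To make that layout concrete, here is a minimal sketch in C for x86-64 Linux. It is illustrative only, not Plasma’s actual implementation: the names are mine, each closure wastefully gets its own page, and there is no error handling. The closure is a single allocation whose first bytes are a machine-code prelude and whose tail holds the free variables; callers treat its address as an ordinary function pointer.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* The real function behind the closure: it receives the free-variable
 * block as a hidden first argument. */
static long add_captured(long *env, long x) { return env[0] + x; }

/* Callers see closures as plain function pointers.  The dummy first
 * parameter reserves rdi, which the prelude overwrites with the
 * address of the free-variable block. */
typedef long (*closure_fn)(void *dummy, long x);

static closure_fn make_closure(long captured) {
    uint8_t *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    uint8_t code[] = {
        0x48, 0x8D, 0x3D, 0x11, 0, 0, 0,    /* lea rdi, [rip+17] ; -> data at +24 */
        0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, /* movabs rax, <add_captured>         */
        0xFF, 0xE0,                         /* jmp rax  ; tail call               */
    };
    uintptr_t target = (uintptr_t)add_captured;
    memcpy(code + 9, &target, sizeof target);
    memcpy(p, code, sizeof code);           /* prelude at offset 0           */
    memcpy(p + 24, &captured, sizeof captured); /* data at +24, 8-byte aligned */
    /* The tension from the list above: the page holds code, so it must
     * be flipped from writable to executable with a second syscall. */
    mprotect(p, 4096, PROT_READ | PROT_EXEC);
    return (closure_fn)p;
}

int main(void) {
    closure_fn add5 = make_closure(5);
    printf("%ld\n", add5(NULL, 37));        /* prints 42 */
    return 0;
}

The mprotect() call at the end is the security/efficiency trade-off above in miniature: a real runtime would either batch many closures per page to amortise it, or leave pages writable and executable and accept the risk.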

Paper 1: Compiling a Functional Language

Cardelli describes an optimising ML compiler; I gather that it was the first ML compiler to attempt to generate efficient code. The paper describes multiple aspects of code generation for ML. The part that’s interesting to me is Cardelli’s Functional Abstract Machine (FAM) and how it is used to handle closures. I’m informed that within compiler-construction folklore these are called "flat closures".

The FAM has an additional register (beyond the usual stack machine registers) that points to the current closure. Unlike in my method, no prelude is needed to copy free variables onto the stack; instead each function has access to the current closure and can retrieve variables from it directly.

We can see some other benefits if we then imagine that there is always an active closure, the first closure being the global environment or REPL. By linking environments together like this, and involving the GC, hot code loading becomes a lot more feasible. This isn’t a feature I had planned for Plasma, but it is definitely something I can see the benefit of having and am interested in; I’m tempted to explore it.
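For contrast with the prelude sketch above, here is a minimal sketch of a flat closure in C (again, the names are mine, and passing the closure as an explicit first argument stands in for the FAM’s closure register). The closure is pure data, a code pointer followed by the free variables, so no executable memory tricks are needed.

#include <stdio.h>
#include <stdlib.h>

/* A flat closure: one code pointer followed by the free variables.
 * Every call passes the closure itself, so the callee can fetch its
 * free variables from it directly. */
typedef struct Closure Closure;
struct Closure {
    long (*code)(Closure *self, long arg);
    long free_vars[];   /* flexible array of captured values */
};

static long add_captured(Closure *self, long x) {
    return self->free_vars[0] + x;  /* one extra indirection per access */
}

static Closure *make_closure(long captured) {
    Closure *c = malloc(sizeof *c + sizeof(long));
    c->code = add_captured;
    c->free_vars[0] = captured;
    return c;
}

int main(void) {
    Closure *add5 = make_closure(5);
    printf("%ld\n", add5->code(add5, 37));  /* indirect call: prints 42 */
    free(add5);
    return 0;
}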

Cardelli’s scheme also has the following disadvantages:

  • all function calls have an extra indirection,

  • potentially uses an extra machine register,

  • requires extra code to manage that extra register.

It’s fairly clear that these costs can be mitigated by handling plain calls, those without a closure, more directly. Then only indirect calls require management of the closure register, but every indirect call still pays the extra level of indirection. My prelude idea could avoid that extra indirection when a higher-order value is not a closure, but it re-introduces the security problems above.

Paper 2: Optimising Closures in O(0) time

This paper describes a series of transformations that can remove closures from a program. The transformations work with one another to:

  • remove closures, reducing them to direct calls or bare function pointers,

  • remove free variables,

  • de-duplicate closures,

  • and share closure structures.

These are mostly pure transformations that can be implemented almost entirely within the compiler, with no impact on the code generator. However, they do require some minor changes to closure structures: a closure may become a structure of data with no function pointer, or one with multiple function pointers plus their shared free variables. They can also reduce closures to bare function pointers, which the language runtime will need to support by boxing them, tagging them, or using a prelude trick; a sketch of the tagging option follows.
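As an illustration of the tagging option (only one of the three, and not necessarily what Plasma will do), here is a minimal sketch in C. It relies on heap closures being at least word-aligned, so the low bit of a pointer is free to mark bare function pointers.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Closure Closure;
struct Closure {
    long (*code)(Closure *self, long arg);
    long free_vars[];
};

/* Heap allocations are word-aligned, so the low bit of a closure
 * pointer is always clear; we use it to mark bare function pointers.
 * (Code addresses are also suitably aligned on mainstream ABIs.) */
#define TAG_BARE ((uintptr_t)1)

typedef uintptr_t hovalue;  /* a higher-order value: closure or bare fn */

static hovalue from_bare(long (*fn)(Closure *, long)) {
    return (uintptr_t)fn | TAG_BARE;
}

static hovalue from_closure(Closure *c) { return (uintptr_t)c; }

/* Every indirect call tests the tag, so only true closures pay for an
 * environment. */
static long call(hovalue v, long arg) {
    if (v & TAG_BARE) {
        long (*fn)(Closure *, long) =
            (long (*)(Closure *, long))(v & ~TAG_BARE);
        return fn(NULL, arg);   /* bare pointer: no environment */
    }
    Closure *c = (Closure *)v;
    return c->code(c, arg);     /* real closure: pass it along */
}

static long twice(Closure *unused, long x) { (void)unused; return 2 * x; }

static long add_captured(Closure *self, long x) {
    return self->free_vars[0] + x;
}

int main(void) {
    Closure *add5 = malloc(sizeof *add5 + sizeof(long));
    add5->code = add_captured;
    add5->free_vars[0] = 5;

    printf("%ld\n", call(from_bare(twice), 21));    /* prints 42 */
    printf("%ld\n", call(from_closure(add5), 37));  /* prints 42 */
    free(add5);
    return 0;
}

The branch in call() is the per-call cost of this scheme; boxing trades it for an allocation, while the prelude trick avoids it at the cost of mixing code and data as discussed above.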

Generally this paper is interesting because its transformations reduce the impact of closures on performance, easing the concerns above.

More about Plasma

I’m very tempted to use the ideas from both of these papers. They’ll be particularly relevant for Plasma’s loops. Loops will often be compiled to something more direct and efficient when the iterator’s type is known to the compiler; when it isn’t, closures are a reasonable solution, and I’d like them to be fast and to work across threads easily.

This:

p = ...
for [x <- xs] {
    accumulator a0, a initial ...

    # Code that uses p, x, a0 and produces a.
    ...
    a = ...

    output a = value of a
}

could be compiled to:

p = ...
reduce((\a, x -> ..., ...p...), xs, ...)

Before anyone asks or tells me that these loops are no good, I want to remind y’all (I’m currently on my way to Texas) that their expressive power will really shine as your loops get more and more complex: iterating over multiple structures, building multiple results, and so on. They’re also more declarative than typical functional code, giving the compiler more freedom for optimisation.

In this example p is a free variable in the closure. The closure is created only once but could be executed thousands of times, so I’d like to ensure that execution is efficient. Any of these strategies will be efficient enough, but Cardelli’s will be more efficient, and it could be optimised further in situations where we know the closure is a true closure.

If such a loop is executed in parallel then, in either case, the closure and x are the only items that need to be passed to the worker thread. If a thread performs multiple iterations and the compiler is able to specialise this, then the closure can be passed once at the beginning of the loop and the data passed for each iteration (probably shared via a work queue of some kind). This is where Cardelli’s solution seems particularly strong: it does much less work per iteration, for sequential and parallel loops alike, since it only needs to set up the closure once.