Compiling closures

Paul Bone — Dec 3, 2017

It’s time to implement closures in Plasma. This article describes some of different ways closures can be implemented in a language and justifies the particular approach I have chosen.

What is a closure?

Closure is a term that’s thrown around a fair bit in functional programming. Before we talk about how to compile them I want to give my favorite definition. A closure is a function with data. Let’s take a look at some Python code:

def f(x, y):
    return x + y

This (f) is not a closure.

y = 3
def f(x):
    return x + y

This (f) is. It closes over a value from its environment, the y. Closures can be functions like in these examples. They can also be lambdas. From a compiler implementer perspective they can also include partial application.

def add(x, y):
    return x + y
add3 = add(3)

Assuming this is valid python. add3 is a closure (by my definition) but add isn’t. That’s because add3 is equivilent to

x = 3
def add3(y):
    return x + y

(One of the examples we saw above.) Now we could have implemented that differently. For example:

def add3(y):
    return 3 + y

That isn’t a closure, the closure-ness of the previous add3 is an implementation detail. If instead of 3 we used some other variable, and/or the body of add was much larger and more expensive to inline. Then this implementation isn’t as good and a closure would be better choice.

To go further with my definition I’d like to use the declaration of pthread_create.

int pthread_create(pthread_t *thread,
                   const pthread_attr_t *attr,
                   void *(*start_routine) (void *),
                   void *arg);

start_routine and arg are a closure (when considered together). They match our definition of a function with data, even though the clumsiness of C makes this difficult to express. Finally I’ll also point out that most objects in OO languages are also closures.

Unsuitable options

Let’s consider some options that I am not considering for Plasma.

Compiling them as separate types

The challenge with compiling a closure is that it’s more than just a function pointer. The generated code must also pass around the included data, and arrange for it to be used when the closure is called.

This could be handled by creating distinct type for closures when compared with regular non-closure function pointers. I thought that I remembered that Pony did this, but when I checked before publishing this article I can’t seem to find it, so maybe I’m thinking of some other language.

This isn’t what I want for Plasma, since it makes the code less composable. If something expects a bare function and you need to pass a closure then you can’t.

Boxing all functions

Next we could box all functions. Closures and non-closures, giving them a pointer tag so that generated code can change how it calls them. This fixes the composability problem. Now every function pointer is potentially a tagged pointer, and every indirect call must perform a branch. Mispredicted branches are one of the two things that show down CPUs (to first approximation) the other being a cache miss.

# How we store a closure in memory
struct closure {
    func* function_ptr;
    void* closure_data;
};

# About to call an indirect pointer
if (PTR_IS_CLOSURE(ptr)) {
    call(CLOSURE_UNBOX(ptr), CLOSURE_GET_DATA(ptr));
} else {
    call(PTR_UNTAG(ptr));
}

If the compiler is smart enough to know which calls will be to closures and which are to bare functions, then it’d also know their callees and could have optimised these to direct calls. So attempting to use optimize these branches away via simplification is not worth discussing.

But small optimisations are possible, PTR_UNTAG can be a no-op if we arrange for the tag-bit to be 1 when there’s a closure and 0 when it’s a function.

Edit 2017-12-10

ericbb reminded me that if you do actually box all functions, and not simply tag them, then the extra branch can be removed. However this then means that all higher order function calls have an extra indirection and memory usage. However that extra memory cell can be allocated statically and shared reducing its cost. This may be worth while if most higher order calls are to closures and that seems likely.

Solution, moving the branch into the function pointer

We can move the branch decision and its actions into the callee itself.

# How we store a closure in memory
struct closure {
    asm {
        LOAD DATA closure_data
        CALL function_ptr
        RET
    }
    void* closure_data;
}

# How it's called
call(ptr);

Yes this isn’t really C code, it’s pseudo-code that might remind you of C + assembler.

If we need to call a closure, we can add some extra code to retrieve the data and provide it as an argument to the actual code. In our example, and in the PZ machine this is done by putting it in the right position on the stack. This example is simplified and assumes there’s just one closed-over argument at the first position on the stack, and no-other arguments.

There’s no branch to be mispredicted, if this is a closure then the code of the closure handles its extra arguments itself before jumping to the actual payload. If it’s not a closure then the function itself is called directly. The only significant costs when compared with a plain function is the memory required and the extra indirection, and there’s no avoiding either of those.

This solution is also tail-call agnostic. Whether this closure is called as a tail-call or a regular call simply doesn’t matter, its no different from a direct call.

While we’re thinking of tail-calls let’s use a tail-call for the closure trampoline code itself:

    LOAD DATA closure_data
    TCALL function_ptr

This works unless we also wanted to add the capability for the closure to re-arrange or drop some of the output values. Remember that Plasma functions may return more than one value. Only closures that re-arranged outputs wound need to avoid tail-calls.

Multiple arguments

Let’s say our function looks like:

func foo(a, b, c, d) {
    ...
}

And is called as:

a = 1
bar(lambda b, d: foo(a, b, 3, d))

Note that Plasma doesn’t have a lambda syntax yet, think of this as Python-like pseudo-code). So we have two arguments that need to be part of the closure, and two being passed into the lambda function. It’s the lambda function that is the closure (over a). The closure would look like:

closure {
    asm {
        # the stack is: b, d
        # add a,
        LOAD a
        rollr 3 # rotate 3 arguments, one position to the right.
        # Now the stack is: a, b, d
        3
        SWAP
        # Now the stack is: a, b, 3, d
        TCALL foo
    }
    type_a* a;
}

In this example 3 is an immediate value, because it’s known statically. If it were not known statically it could be stored in the closure after a.

Storing code and data together

Quite a few readers will now have some alarm bells going off about security due to the issue of storing code and writable data together. We’ll get back to that in a moment, for now I am only talking about performance hazards.

When data directly follows code the CPU may try to decode the data as if it is executable instructions. This can happen even if the data is unreachable (occurs after a ret or jmp instruction.) This can cause performance problems, the decoded data will take up resources in caches, and may even begin to be allocated execution resources within the CPU pipeline. The solution is either to pad the code with some nop instructions (and use long ones if possible) or to allocate the data before the code. The right amount of nop instructions may vary by CPU, even by model within the same architecture.

If we allocate the data before the code, the function pointer must still point to the code. Therefore the data is stored at negative offsets from this pointer. If we choose this option, at least for some CPU architectures then the GC must understand interior pointers. None of this is impossible, it’s all a set of choices that must be traded-off against one-another.

Is it best:

To pad the code to the next (eg: 16-byte) boundary
To allocate the data before the code and support interior pointers (some padding may still be required, but it may be less).
To confuse the CPU with data that it misinterprets as code.

We may have to make a different decision for each CPU type. This will also depend on how we execute code.

One other issue is that CPUs often have separate instruction and data caches. Some CPUs cannot handle writes into their code section, their caches simply don’t handle it and undefined behaviour ensures. x86/x86_64 is more tolerant, it will mealy slow-down heavily when it detects a write into anything in its instruction cache. In either case both of these are something you want to avoid.

Interpretation and compilation

Our other requirement is that our solution works correctly with the PZ abstract machine, that it is the right level of abstraction. This seems to be true for this solution.

We should be able to statically define the code section for each closure that the program may create, then when we create each closure allow the caller to refer to which code section they’d like copied in, and provide each piece of data required.

The runtime system itself will be able to choose how these structures are laid out, whether the data needs to be placed before the code, how much padding is required and so on. Depending on weather the program is compiled or interpreted, and how it is interpreted we may need to address the security and performance challenges of mixing code and writable data. Obviously if code is interpreted by a token-oriented interpreter (like the one we have now) then this and the CPU cache and pipeline issues are not a problem. However if we use any call-threaded execution, or compilation then these are a problem. We may be able to avoid some of this by allocating closures into different areas on the heap and allowing them to have different protection levels. For stronger security it’d be better to Exclude execute and write on the same pages at the same time. It will also make users lives easier since they’re less likely to need to break the security rules of their system. See also Code loading. Therefore being able to switch memory from writable to executable between the time a closure is created and when it’s executed may help. The challenge in this case is now knowing, at the granularity of a page, when to change permissions. There are a number of ways to do this, some static and some dynamic. We could have the compiler analyse when a good point to move closures from the closure nursery (where they are writable) into an older closure space. This could be combined with tag bits (adding a minor cost to all indirect function calls) or by trapping page faults when execution fails and moving the closures then.

I’m not ready to explore some of these details yet, but it looks like this would be a good topic for a follow up blog post.

Interaction with coroutining

So far we capture each parameter separately. Another option (again their are trade-offs) it to capture a pointer to the current environment. Usually when using a standard program stack this is difficult. The stack frame may have been popped (and clobbered!) by the time we’re executing the closure. There are numerous ways of capturing the current continuation, including representing stacks on the heap, once again each option with its own costs. If we ever change how we represent and manipulate stacks, then we should revisit how function pointers are stored and consider this option.