Bytecode Interpreter

Paul Bone — May 10, 2018

In this article I’m going to explain how basic bytecode interpreters work, such as the first one written for Plasma’s bytecode format. But first I’d like to cover some background.

Compilation and Interpretation

Computer science classes often teach us that there are two ways to build and run a computer program. To either interpret it or to compile (and then execute) it. For example the Java compiler will compile the Java source code into JVM bytecode, which can then be interpreted by the JVM (but is also often JITed - another type of compilation). So here are all the types of compilers and interpreters I can think of:

Compile source code into machine code producing an executable.
Compile source code into machine code (sometimes called object code) then link the object files which contain the machine code into one executable or library.
Compile source code into assembly code, assemble that code to produce machine code and usually have a separate linking step (but not always) as above.

I’ll pause my list now to say, yes there are (have been?) compilers that emitted executable code directly. Some also used anther intermediate format that I can’t remember the name of. S-code or something. Let’s continue:

Interpret the code by line, re-parsing then performing the action on each line when it is encounted:

10 PRINT "HELLO"
20 GOTO 10

Parse the whole program first then walk the Abstract Syntax Tree (AST) to execute it (known as an AST walker).
Parse and compile the program (or smaller compilation units of it) into some other language such as C, LLVM IR etc, then handle that somehow. This can mean calling other tool on the output (like the C compiler) or calling to a library (like LLVM, instead of using the LLVM IR). It also includes transpiling (sic) PureScript/ELM/etc to JavaScript.
Parse and compile the whole program into a Bytecode then either keep that bytecode in memory where it is interpreted (Python) or save it to the disk for later execution (Java, also Python, Plasma).

Once you have bytecode you can:

Treat the bytecode as an abstract machine and compile it to something else, such as using LLVM as a code generator. Or kind-of how Mercury uses its low level C backend although that’s not really a bytecode.
Interpret the bytecode more-or-less directly token-threaded execution.

Some fancier methods exist for interpreting bytecode, many come from a long history in interpreting Forth programs, you can read more here:

token-threaded (we will discuss below)
direct-threaded
indirect-threaded
call-threaded
subroutine-threaded
segment-threaded

Finally:

Just In Time compile (JIT) the bytecode (usually parts of it) to make machine code and execute that.

There’s a lot we could talk about here, but now we’ve painted the scene let’s assume you have some bytecode and want ti interpret it in the simplest way possible.

Instruction streams

There’s one more topic to discuss first, what bytecode actually is / looks like.

Instructions, either in machine code, assembly language, or a bytecode, tell the machine what to do. They’re made up of an opcode (operation code I guess?) and optionally some operands (parameters for the opcode). The opcode of an instruction tells the machine what to do, and the operands tell it what to do it with. I was going to use some x86 as an example here, however x86 has opcodes from 1-3 bytes long, optional prefix bytes and about 4 different types of operands (also of different lengths) instructions are anywhere from 1 to 15 bytes long, and encoding is very complex, sometimes operands are actually implied by the opcode, other times parts of them (like a high bit) is given as part of the prefix (for the extra registers in x86_64). The point I was going to make here, with a nice tidy example, was that real machines and virtual machines use the same opcode followed by operands paradigm.

In some CPU architectures, and many bytecode formats, the opcode is a fixed number of bytes (usually 1). Followed by the operands. This is true in the JVM. In the JVM an increment instruction looks like this:

84 02 01

This increments the 2nd local by 1. 84 is the opcode meaning 'increment' and 02 and 01 are each of the two operands, 02 selects the 2nd local variable, and 01 is how much to increment by. In this case 02 can be considered a selector, it tells the machine what to perform the operation on (the 2nd local variable), in a register machine this may refer to one of the registers. While 01 can be considered an immediate value, a value used by the program that is stored in the instruction stream. The JVM, PZ, and many other virtual machines are stack-based machines meaning they have no general-purpose registers and operations are almost-always performed on the values located at the top of the stack.

Plasma currently has two different bytecode formats, one for on disk, and one for in memory. On disk a PZ (that’s the Plasma abstract machine) bytecode instruction has a one byte opcode, two separate one byte operand size bytes, and an optional immediate value (pz_format.h). The opcode says whether the other bytes will be present. An instruction like 'add' has a single operand size byte, this tells the interpreter what size of data 'add' should add eg: 8-bit, 32-bit, pointer-sized etc data. Most instructions require one of these bytes, some don’t (like 'roll' and some require two like 'sign-extend'. 'Load' and 'call' instructions require an immediate value. For 'load' instructions this says what to place on the stack, for 'call' it is the number of the procedure to call.

While Plasma is under development the format is not fixed, so it doesn’t make sense to talk about the constants for particular opcodes and operand sizes, so there is no example here.

These instructions follow one-after-another to form an instruction stream. The program will execute each instruction in the stream one after another unless it encounters something like a 'call', 'jump' or 'return' instruction. For example if in the JVM we incremented local 2 by 1, then local 3 by 2. We would have an instruction stream like

84 02 01 84 03 02

The first three bytes make up the first instruction, and the next three make up the second instruction.

Interpreting Bytecode

Plasma currently has a single bytecode interpreter, a token-based interpreter. It is probably the slowest type of bytecode interpreter, however it is one of the most straight-forward to write and can be done portably. We will add faster but architecture-specific interpreters later, and hopefully support for native code generation.

Many of the techniques for interpreting bytecode transform the bytecode in some way before interpreting it. With token-threaded bytecode execution this is not necessary, however we do do this for PZ to remove the operand size bytes, and instead having a distinct opcodes for each on-disk opcode and operand size byte(s) combination.

A token-based interpreter will have a loop and switch somewhat like this:

uint8_t *ip;
uint8_t  opcode;

while (true)

   // Read the next token from the instruction stream.
   opcode = *ip;

   // Advance to the next byte in the stream.
   ip++;

   // Decide what to do
   switch (opcode) {
      case PZT_ADD_32:
         ...
         break;
      case PZT_SUB_32:
         ...
         break;
      ...
   }
}

The ip variable is the virtual machine’s instruction pointer. It points to the next byte from the instruction stream to be interpreted. The interpreter’s loop reads each byte, advances the pointer and then decides what action to perform depending on the opcode (the opcode is a token).

The next thing we need to handle is immediate values. Some instructions can load data from memory, or sometimes other places like IO ports (in real machines running in supervisor mode), but some will load data from the instruction stream itself. Program constants are usually loaded this way. In Plasma’s in-memory bytecode format these values are aligned (for portability and efficiency).

case PZT_LOAD_IMMEDIATE_32:
   // Align the instruction pointer to the next 4-byte boundary if
   // necessary.
   ip = PZ_ALIGN_UP(ip, 4);

   // Read the value from the instruction stream and push it onto the
   // expression stack.
   expr_stack[++esp].u32 = *(uint32_t *)ip;

   // Move the instruction pointer to point to the next instruction.
   ip += 4;

   break;

We’ve also introduced two new variables here. expr_stack and esp. The PZ machine has two stacks, the return stack and the expression stack (called the parameter stack in Forth literature). The expression stack stores values that we’re currently working with (local variables) and saves them across procedure calls. In the PZ interpreter expr_stack points to the base of the stack and esp is an offset within it. Each stack slot is a C union and u32 is a 32bit field in that union. Treating each stack slot as a C union is a little wasteful, but it simplifies the implementation and helps things remain fairly portable.

typedef union {
    uint8_t   u8;
    int8_t    s8;
    uint16_t  u16;
    int16_t   s16;
    uint32_t  u32;
    int32_t   s32;
    uint64_t  u64;
    int64_t   s64;
    uintptr_t uptr;
    intptr_t  sptr;
    void *    ptr;
} Stack_Value;

Stack_Value     expr_stack[EXPR_STACK_SIZE];
unsigned        esp = 0;

Finally our interpreter needs to handle jumps and calls. These use an immediate value like above, but interpret it as a code address to control the program’s flow.

case PZT_CALL:
    // Align the instruction pointer to read a machine-word sized value from
    // the instruction stream.
    ip = ALIGN_UP(ip, MACHINE_WORD_SIZE);

    // Save the address of the next instruction onto the return stack.
    return_stack[++rsp] = (ip + MACHINE_WORD_SIZE);

    // Read the address to jump to, the address of the next instruction, from
    // the instruction stream and then save it into the instruction pointer.
    ip = *(uint8_t **)ip;
    break;

There are two new variables here once more. The return stack and return stack pointer. Note that in the PZ Machine there are two stacks, the return stack is separated from the expression stack. This allows a little more expressitivity within PZ, which is common to stack machines.

Further Resources

The PZ machine supports many more instructions, only the three most interesting for how to write a stack machine have been shown here. The others can be derived from these or seen in pz_run_generic.c in the Plasma source code.

I also suggest reading the earlier link about forth interpreters for faster (but less portable) interpreters.
Crafting Interpreters was recommended to me and others as a good resource on this topic.