Paul Bone

Avoiding large immediate values

We’re often told that we shouldn’t worry about the small details in optimisation, that either "premature optimisation is the root of all evil" or "the compiler is smarter than you". These things are true, in general. Which is why, if you had told me about 10 years ago that I would be using knowledge of machine code (not just assembly!) to improve a browser’s benchmark score by 2.5%, I wouldn’t have believed you.

First off, I’m sorry (not sorry) for the gloating, and for what it’s worth the optimisation isn’t really that clever, and wasn’t even my idea. What I find almost funny is that younger-me would not have believed that such low-level details mattered this much.

Bump-pointer allocation

SpiderMonkey (Firefox’s JavaScript engine) separates its garbage-collected heap into two areas, the nursery and the tenured heap. New objects are typically allocated in the nursery first; when the nursery is collected, any object that is still alive is moved into the tenured heap. Collecting the nursery is faster than collecting the whole heap since less data needs to be scanned, and most objects die young. This is a fairly standard way to manage a garbage collector and is called generational garbage collection.

Allocating something in either heap should be fast, but since nursery allocation is more common it needs to be VERY fast. When the JIT compiles JavaScript code, the allocation code is generated right into the execution path at each place it is needed.

I was working on a change to this code: I wanted to count the number of tenured and nursery allocations. And above all, I had to avoid adding too much of a performance impact. That work is Bug 1473213 and isn’t actually the topic of this post; it’s just what drew my attention. (TL;DR: the work in this post is Bug 1479360.)

The nursery fast path looked like this; I’ve simplified it for easier reading, mostly by removing unnecessary details.

Register result(...), temp(...);
CompileZone* zone = GetJitContext()->realm->zone();
size_t totalSize = ...
void *ptrNurseryPosition = zone->addressOfNurseryPosition();
const void *ptrNurseryCurrentEnd = zone->addressOfNurseryCurrentEnd();

loadPtr(AbsoluteAddress(ptrNurseryPosition), result);
computeEffectiveAddress(Address(result, totalSize), temp);
branchPtr(Assembler::Below, AbsoluteAddress(ptrNurseryCurrentEnd), temp,
    fail);
storePtr(temp, AbsoluteAddress(ptrNurseryPosition));

That probably didn’t read easily for most readers. What we’re looking at here is the code generator of the JIT compiler: this is not the allocation code itself, but the code that creates the machine code that does the allocation. I’ve broken it into two sections: the first five lines prepare some values and have absolutely zero runtime cost; the last five lines generate the code that does the bump-pointer allocation. Function calls like loadPtr generate one or more machine code instructions:

loadPtr(AbsoluteAddress(ptrNurseryPosition), result)

Read a pointer-sized value from memory at ptrNurseryPosition and store it in the register result. ptrNurseryPosition points to a pointer that points to the next free cell in the heap. So this places the pointer to the next free cell into the result register.

computeEffectiveAddress(Address(result, totalSize), temp)

Use an lea or similar instruction to add totalSize (a displacement) to the contents of the result register, and store the result of this addition in temp. After executing this, temp will contain the pointer to what will be the next free cell once we perform the current allocation.

branchPtr(..., AbsoluteAddress(ptrNurseryCurrentEnd), temp, fail)

Compare the temp register’s contents against the contents of the memory at ptrNurseryCurrentEnd and, if temp is higher, branch to the fail label. This compares the would-be new allocation pointer to the end of the nursery; if the allocation would go beyond the end of the nursery then fail.

storePtr(temp, AbsoluteAddress(ptrNurseryPosition))

Store the new value for the next free cell (temp) into the memory at ptrNurseryPosition.
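Putting those four operations together, the generated code behaves roughly like this at runtime. Here is a C++ sketch for illustration only; the function and variable names are mine, not SpiderMonkey’s.

#include <stdint.h>
#include <stddef.h>

// A rough sketch of what the emitted machine code does at runtime.
// nurseryPosition and nurseryCurrentEnd stand in for the words of memory
// that ptrNurseryPosition and ptrNurseryCurrentEnd point at; the fail
// label becomes a nullptr return here.
static void* bumpAllocate(uintptr_t* nurseryPosition,
                          const uintptr_t* nurseryCurrentEnd,
                          size_t totalSize) {
  uintptr_t result = *nurseryPosition;     // loadPtr: current free cell
  uintptr_t temp = result + totalSize;     // computeEffectiveAddress (lea)
  if (*nurseryCurrentEnd < temp) {         // branchPtr(Assembler::Below, ...)
    return nullptr;                        // would run past the nursery end
  }
  *nurseryPosition = temp;                 // storePtr: bump the pointer
  return reinterpret_cast<void*>(result);  // the newly allocated cell
}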

Unfortunately this isn’t as efficient as it could be.

Immediates and displacements

I’ve recently written about addressing in x86, where I wrote that instructions refer to operands, and these operands may be registers, memory locations or immediate values. To recap, there are two main situations where a value can follow the instruction: either as an immediate value or as a displacement for a memory operand.

Displacement

A displacement may be either 8 or 32 bits (on x86 running in 32 or 64 bit mode).

Immediate

The size of an immediate value depends on the size of the operation, and may be 8, 16, 32 or 64 bits.
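To connect this back to the code generator: there are two kinds of memory operand in play in the MacroAssembler code above. Roughly speaking (these are simplified stand-ins of my own, not the real jit class definitions), they look like this:

#include <stdint.h>

struct Register { int code; };

struct Address {          // base register plus a small offset
  Register base;
  int32_t offset;         // emitted as an 8- or 32-bit displacement
};

struct AbsoluteAddress {  // a raw 64-bit pointer
  const void* addr;       // too wide for a displacement, so it has to be
                          // materialised via a 64-bit immediate first
};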

The point here is that displacements cannot store a 64-bit value, so:

branchPtr(Assembler::Below, AbsoluteAddress(ptrNurseryCurrentEnd), temp,
    fail);

Cannot directly use the 64-bit address (ptrNurseryCurrentEnd) as a displacement for its memory operand, and requires an extra instruction to first load this value into a scratch register from an immediate (which can be 64 bits) before doing the comparison. This operation will now need three instructions rather than two (the compare and jump are already separate instructions).

Intel provides a special exception to these rules about displacements for move instructions. There are four special opcodes for move that allow it to work with a 64-bit moffset. So:

loadPtr(AbsoluteAddress(ptrNurseryPosition), result);

Can almost be represented. But these opcodes hard-code the destination register to the accumulator (rax/eax/ax/al), and result here is whatever register the allocator chose, so they cannot be used. Therefore using 64-bit addresses also makes these loadPtr and storePtr operations use two instructions rather than one.

Here’s the disassembled code that this generates.

movabs $0x7ffff5d1b618,%r11
mov    (%r11),%rbx
lea    0x60(%rbx),%rbp
movabs $0x7ffff5d1b630,%r11
cmp    %rbp,(%r11)
jb     0x1f2f3ed1a351
movabs $0x7ffff5d1b618,%r11
mov    %rbp,(%r11)

This sequence, rather than being five instructions long, is now eight instructions long (and 49 bytes), and makes more use of a scratch register (which may impact instruction-level parallelism).

The instruction cache

Instructions aren’t the only cost. This code sequence contains four 64-bit addresses; that’s a total of 32 bytes in the instruction stream (including the target for the jump on failed allocations). That takes up room in the CPU’s caches and other resources in the processor front-end.

The front-end of a processor’s pipeline must fetch and decode instructions before they’re queued, scheduled, executed and retired. Processor front-ends have changed a lot, and there are multiple levels of caching and buffering. Let’s use the Intel Core microarchitecture as an example: it’s new enough to be in common use, and things got more complex in the next microarchitecture due to having two different front-end pathways. The source for this information is Intel’s optimisation reference manual.

Instructions are fetched 16 bytes at a time, and immediately following the fetch a pre-decode pass occurs: a fast calculation of instruction lengths. Once the processor knows the lengths (and boundaries) of the instructions within the 16 bytes, they’re written into a buffer (the instruction queue) six at a time; if there are more than six instructions in the 16-byte block, then more cycles are used to pre-decode the remaining instructions. If fewer than six instructions were in the 16 bytes, or a read of less than 16 bytes occurred due to alignment or branching, then the full bandwidth of the pre-decoder is not being utilised. If this happens often the instruction queue may starve.

The instruction queue is 18 instructions deep (though I think it’s shared between hyper-threads). Instructions are decoded from this queue four or five at a time by the four decoders; one of the decoders is special and can handle certain pairs of instructions, turning them into a single operation.

Our instruction sequence above contains eight instructions in 49 bytes. Assuming alignment is in our favour, this will take four fetch and pre-decode steps, averaging 2 instructions per pre-decode cycle; less than the CPU is capable of. (I don’t know how this behaves when an instruction crosses a 16-byte boundary, but back-of-the-envelope reasoning tells me it’s not a problem.)

This low instruction density might not be a problem in many situations, such as when the instruction cache already contains plenty of instructions and this bubble does not affect overall throughput. However, in a loop or when other things already stress the processor’s pipeline, it could definitely be an issue.

The change

My colleague sfink had left a comment in the nursery string allocation path where he attempted to experiment with this in the past. His solution was eventually removed because it was a little bit fiddly, but it was the inspiration for my eventual change.

The code (tidied up) now looks like:

// Work out the offset from the position pointer to the end pointer;
// it must fit in a 32-bit displacement.
CheckedInt<int32_t> endOffset = (CheckedInt<uintptr_t>(uintptr_t(curEndAddr)) -
    CheckedInt<uintptr_t>(uintptr_t(posAddr))).toChecked<int32_t>();
MOZ_ASSERT(endOffset.isValid(),
    "Position and end pointers must be nearby");

movePtr(ImmPtr(posAddr), temp);     // load the 64-bit address just once
loadPtr(Address(temp, 0), result);  // read the current position
addPtr(Imm32(totalSize), result);   // bump it by the allocation size
// Fail if the end pointer (at temp + endOffset) is below the new position.
branchPtr(Assembler::Below, Address(temp, endOffset.value()), result, fail);
storePtr(result, Address(temp, 0)); // write back the new position
subPtr(Imm32(size), result);        // recover the start of the new cell

This loads the 64-bit address once and uses a relative address to describe the end of the nursery (the Address argument to the branchPtr call), then re-uses the original address when updating the current pointer (storePtr). We have to add the object size to result and subtract it later because we can’t easily get guaranteed access to another register with the way the code generator is written. So there are six operations in this sequence; let’s see the machine code:

movabs $0x7ffff5d1b618,%rbp
mov    0x0(%rbp),%rbx
add    $0x60,%rbx
cmp    %rbx,0x18(%rbp)
jb     0x164f300ea154
mov    %rbx,0x0(%rbp)
sub    $0x60,%rbx

Seven instructions long rather than eight, and 36 bytes rather than 49. This can be retrieved in three 16-byte transfers rather than four. The average number of instructions per fetch is now 2 1/3 rather than 2.
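For comparison with the earlier sketch, here is roughly what the new sequence does at runtime. Again the names are mine, and I’ve used totalSize for both the add and the subtract, matching the 0x60 in the disassembly above.

#include <stdint.h>
#include <stddef.h>

// A rough sketch of the new emitted sequence.  posAddr points at the
// nursery position word, and the end word lives at posAddr + endOffset,
// so only one 64-bit address is needed.
static void* bumpAllocateRelative(uintptr_t* posAddr, ptrdiff_t endOffset,
                                  size_t totalSize) {
  uintptr_t* temp = posAddr;             // movePtr(ImmPtr(posAddr), temp)
  uintptr_t result = *temp + totalSize;  // loadPtr + addPtr
  const uintptr_t* end = reinterpret_cast<const uintptr_t*>(
      reinterpret_cast<const char*>(temp) + endOffset);  // Address(temp, endOffset)
  if (*end < result) {                   // branchPtr(Assembler::Below, ...)
    return nullptr;                      // fail
  }
  *temp = result;                        // storePtr: bump the pointer
  result -= totalSize;                   // subPtr: back to the cell start
  return reinterpret_cast<void*>(result);
}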

Results

It doesn’t look like a huge improvement, seven instructions compared with eight?! But now it uses one fewer 16-byte fetch, which means one less cycle to fill the pipeline for these instructions; in the right loop that could make a huge difference. It did make Firefox perform about 2.5% faster on the Speedometer benchmark when tested on my laptop (Intel Core i7-6600U, Skylake). Sadly we didn’t see any noticeable difference in our performance testing infrastructure (arewefastyet or perfherder). This could be because our CI systems have different CPUs that behave differently with regard to instruction lengths/density.

My examples above were for the simpler Core microarchitecture, whereas my testing was on a Skylake CPU, which behaves quite differently. Starting with Sandy Bridge there are two paths code can take through the CPU front-end, and which one is used depends on multiple conditions. To simplify: in tight enough loops the CPU is able to cache decoded instructions and execute them out of a μop cache.

Macro-fusion

Another difference is that with an absolute address used in the cmp instruction, it could behave differently with regard to macro-fusion (the cmp being fused with the following conditional jump to execute as a single operation). I’m not sure if large displacements affect macro-fusion.

Update 2018-09-18

I received some feedback from Robert O’Callahan; he wrote with three suggestions.

  • Allocate all JIT code and globals within a single 2GB region and use RIP-relative addressing (x86-64), so that addresses will not be larger than 32 bits. This is a good idea, and I considered it for the jump instruction in that sequence, which still uses a 64-bit address (because the jump is created before the label and its target address is written in afterwards, it must leave 64 bits of space for now).

  • Using known bit patterns in the nursery address range, we could test for overflow by checking the values of those bits, avoiding an extra memory read. This is a great idea but will require some other work first; a rough sketch of the idea appears after this list.

  • The final subtraction might be skippable if the caller can handle an address to the end of the structure and use negative offsets, eg by filling in slots in the object using negative offsets. I’m skeptical that this will provide much benefit compared to the effort required to avoid the subtraction, and at best it would probably only delay it.
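As a rough illustration of the second suggestion (purely hypothetical; this is not how SpiderMonkey’s nursery is actually laid out): if the nursery occupied a single region aligned to its power-of-two size, the limit check could test bits of the new position instead of loading the end pointer from memory.

#include <stdint.h>
#include <stddef.h>

// Hypothetical only: assume the nursery occupies a single region aligned to
// its own power-of-two size, so every address inside it shares the same
// high bits.  Neither the size nor the layout here is SpiderMonkey's.
constexpr uintptr_t kNurserySize = uintptr_t(1) << 20;
constexpr uintptr_t kNurseryMask = ~(kNurserySize - 1);

// True if bumping pos by totalSize would leave the region.  This checks the
// region boundary rather than the current nursery end (and is conservative
// when an allocation ends exactly on the boundary), so it's a sketch of the
// idea, not a drop-in replacement for the branchPtr above.
static bool wouldLeaveNursery(uintptr_t pos, size_t totalSize) {
  uintptr_t newPos = pos + totalSize;
  return (newPos & kNurseryMask) != (pos & kNurseryMask);
}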