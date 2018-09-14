Instructions aren’t the only cost. This code sequence contains four 64-bit addresses, a total of 32 bytes in the instruction stream (including the target for the jump on failed allocations). That takes up room in the CPU’s caches and other resources in the processor front-end.

The front-end of a processor’s pipeline must fetch and decode instructions before they’re queued, scheduled, executed and retired. Processor front-ends have changed a lot, and there are multiple levels of caching and buffering. Let’s use the Intel Core Microarchitecture as an example: it’s new enough to be in common use, and things got more complex in the next microarchitecture, which has two different front-end pathways. The source for this information is Intel’s optimisation reference manual.

Instructions are fetched 16 bytes at a time. Immediately following the fetch a pre-decode pass occurs: a fast calculation of instruction lengths. Once the processor knows the lengths (and boundaries) of the instructions within the 16 bytes, they’re written into a buffer (the instruction queue) six at a time. If there are more than six instructions in the 16-byte block, more cycles are used to pre-decode the remaining instructions. If fewer than six instructions were in the 16 bytes, or a read of less than 16 bytes occurred due to alignment or branching, then the full bandwidth of the pre-decode is not being utilised. If this happens often, the instruction queue may starve.
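To make the pre-decode rule concrete, here is a minimal sketch of it as a toy model. The function names and the assumption that a fetched block always costs at least one cycle are mine, not Intel’s; it only encodes the “six at a time” rule described above and ignores everything else the hardware does.

```python
import math

# Instructions written to the instruction queue per pre-decode cycle,
# as described in the text above.
PREDECODE_WIDTH = 6


def predecode_cycles(instructions_in_block):
    """Cycles to pre-decode one fetched 16-byte block (toy model).

    Assumption: a fetched block costs at least one cycle, even when it
    holds fewer than six instructions.
    """
    return max(1, math.ceil(instructions_in_block / PREDECODE_WIDTH))


def predecode_utilisation(instructions_in_block):
    """Fraction of the pre-decode bandwidth this block actually uses."""
    cycles = predecode_cycles(instructions_in_block)
    return instructions_in_block / (cycles * PREDECODE_WIDTH)


# A block of seven short instructions needs a second cycle for the seventh.
print(predecode_cycles(7))
# A block holding only two long instructions uses a third of the bandwidth.
print(predecode_utilisation(2))
```

Under this model, both very dense blocks (more than six instructions) and very sparse blocks (a few long instructions) leave pre-decode slots unused.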

The instruction queue is 18 instructions deep (but I think it’s shared by hyper-threading). Instructions are decoded from this queue four or five at a time by the four decoders. One of the decoders is special and can handle some pairs of instructions, turning them into a single operation, which is how five instructions can be decoded in one cycle.

Our instruction sequence above contains eight instructions in 49 bytes. Assuming alignment is in our favour, this will take four fetch and pre-decode steps (49 bytes span four 16-byte blocks), averaging two instructions per pre-decode cycle, fewer than the six the CPU is capable of. (I don’t know how this behaves when an instruction crosses the 16-byte boundary, but back-of-the-envelope reasoning tells me it’s not a problem.)
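The arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is hypothetical, and it assumes the sequence starts on a 16-byte boundary and that no single block holds more than six instructions, so each fetch costs exactly one pre-decode cycle.

```python
import math

FETCH_BYTES = 16  # bytes fetched per cycle, as described above


def predecode_estimate(total_bytes, instruction_count):
    """Return (fetches, average instructions per pre-decode cycle).

    Assumes the code starts 16-byte aligned and that each fetched block
    is pre-decoded in a single cycle.
    """
    fetches = math.ceil(total_bytes / FETCH_BYTES)
    return fetches, instruction_count / fetches


fetches, average = predecode_estimate(49, 8)
print(fetches, average)  # 4 2.0
```

Note that the 49th byte alone forces the fourth fetch; a sequence of 48 bytes or fewer would need only three.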