x86: An evolution of kludges

x86 chip artistic graphic

Dr. Paul Bone
paul.bone.id.au

How to view these slides

This slide deck is made with reveal.js. You can:

  • Press spacebar to go to the next slide. This is simplest and recommended if you just want to read front to back.
  • Press right arrow to go to the next section.
  • Press down arrow to view slides within that section. Some sections, like this one, have only a single slide, so down arrow will do nothing.
  • Press Escape to zoom out and get your bearings.

The slides make more sense with explanation but I'm afraid there was no recording.

Some history - 1978

  • Wuthering Heights - Kate Bush
  • First GPS satellite launches
  • Hitch Hiker's Guide to the Galaxy (first episode)
  • Grease (film)
  • Charon discovered
  • Rainbow flag for LGBTQIA+ flies for the first time
  • Ford Pinto recall
  • Space Invaders
  • Intel 8086 CPU

Birth of x86

Intel's first 16-bit processors (but not world-first):

An 8086
An 80186
1978: 8086
16-bit processor with 16-bit data bus & 20-bit address bus (1MB RAM/Mapped IO)
1979: 8088 (1st IBM PC)
As above except 8-bit data bus
1982: 80186
Intended for embedded devices
1982: 80286
24-bit addressing (16MB RAM), memory protection & multi tasking

x86

1985: 80386
32-bit w/ paging, used in 1st notable non-IBM PC-compatible, 32-bit addressing (4GB)
1989: i486
pipelined design, on-chip cache, on-chip FPU
1993: P5 microarchitecture (Pentium, i586 etc)
superscalar architecture

AFAIK none of these are world-firsts, they're just firsts for Intel and x86.

x86

1995: P6 microarchitecture
Pentium Pro — Pentium III
Speculative execution, out-of-order execution, register renaming. 36-bit (64GB) physical memory, CMOV, etc.
2000: Netburst microarchitecture
Pentium 4 etc
Hyper-threading, pipeline "improvements".
2003: AMD Opteron
AMD64 (aka x86-64, Intel 64, EM64T, x64; not IA-64, which is Itanium), 48-bit (256TB) virtual memory.

About CPUs

What is a CPU really?

What do CPUs do?

8d 0c ff          lea    (%rdi,%rdi,8),%ecx
ba 67 66 66 66    mov    $0x66666667,%edx
89 c8             mov    %ecx,%eax
f7 ea             imul   %edx
d1 fa             sar    %edx
c1 f9 1f          sar    $0x1f,%ecx
29 ca             sub    %ecx,%edx
8d 42 20          lea    0x20(%rdx),%eax
c3                retq
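
Reading the listing above by hand, it appears to compute 9*x/5 + 32 (a Celsius to Fahrenheit conversion), with the division by 5 replaced by a multiply with the constant 0x66666667. The sketch below replays the instructions one by one in Python. The function names are hypothetical, and the formula is inferred from the disassembly, not taken from the original source code.

```python
def to_signed32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def emulate_listing(x):
    """Replay each instruction of the listing on a signed 32-bit input."""
    ecx = to_signed32(x + x * 8)          # lea (%rdi,%rdi,8),%ecx   -> 9*x
    edx = 0x66666667                      # mov $0x66666667,%edx
    eax = ecx                             # mov %ecx,%eax
    edx = to_signed32((eax * edx) >> 32)  # imul %edx: high half of product in edx
    edx >>= 1                             # sar %edx
    ecx >>= 31                            # sar $0x1f,%ecx  -> -1 if negative, else 0
    edx -= ecx                            # sub %ecx,%edx
    return to_signed32(edx + 0x20)        # lea 0x20(%rdx),%eax


def reference(x):
    """9*x/5 + 32 using C-style division, which truncates toward zero."""
    n = 9 * x
    q = abs(n) // 5
    return (q if n >= 0 else -q) + 32


print(emulate_listing(100))  # 212
print(emulate_listing(-40))  # -40
```

The comparison only holds while 9*x fits in 32 bits, which is also true of the machine code.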

AT&T syntax

More than you ever wanted to know about operand encoding

(huh?)

Huh?

32-bit x86 can add 1 to the value at a 32-bit address:

addl   $1, (0xABCDEF08)

A 64-bit x86-64 cannot add 1 to the value at a 64-bit address:

addq   $1, (0xABCDEF08ABCDEF01)

But mov can load a 64-bit immediate (add cannot, its immediates stop at 32 bits).

movq   $0xABCDEF08ABCDEF08, %rax

So we have to get around this using two instructions.

movq   $0xABCDEF08ABCDEF01, %rbx
addq   $1, (%rbx)

Huh? All I want to do is increment a counter!

Hypothetical instruction encoding

              Opcode  Source  Destination
Machine code  12      01      02
Assembly      add     %r1     %r2

The operands can be:

  • registers,
  • literal values,
  • memory locations taken from registers,
  • literal memory locations,
  • etc.
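
As a minimal sketch of the hypothetical encoding above, the following decodes the three bytes back into assembly. Only the mapping of opcode 12 to add comes from the slide; the table and function name are made up for illustration, and only register operands are handled.

```python
# Hypothetical opcode table: only 0x12 -> "add" appears on the slide.
MNEMONICS = {0x12: "add"}


def disassemble(code):
    """Decode one 3-byte instruction: opcode, source register, destination register."""
    opcode, source, destination = code
    return f"{MNEMONICS[opcode]} %r{source} %r{destination}"


print(disassemble(bytes([0x12, 0x01, 0x02])))  # add %r1 %r2
```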

Understanding a CPU

// 16-bit add
add   %ax, %dx

// Also a 16-bit add
addw  %ax, %dx

// 32-bit add, size is determined by the
// register name.
add   %eax, %edx

// or
addl  %eax, %edx
            

8086 Instruction encoding

Mode:
Meaning of operand 2
Reg:
Operand 1
R/M:
Operand 2, register or memory pointed to by register with optional displacement

This allows memory operands of the form:

address = base + index + displacement

Good for looking up a field in a struct or an array, or a struct in an array.
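
The three fields above live in a single byte, usually called ModR/M: two bits of mode, three bits of reg and three bits of r/m. A small sketch that splits such a byte, using two bytes from the disassembly earlier in the deck as examples (the function name is hypothetical):

```python
def split_modrm(byte):
    """Split a ModR/M byte into its (mode, reg, rm) bit fields."""
    mode = (byte >> 6) & 0b11   # top two bits: meaning of operand 2
    reg = (byte >> 3) & 0b111   # middle three bits: operand 1
    rm = byte & 0b111           # low three bits: operand 2
    return mode, reg, rm


# "89 c8" is mov %ecx,%eax: mode 3 (register), reg 1 (ecx), rm 0 (eax).
print(split_modrm(0xC8))  # (3, 1, 0)

# "8d 42 20" is lea 0x20(%rdx),%eax: mode 1 (8-bit displacement), reg 0 (eax), rm 2 (rdx).
print(split_modrm(0x42))  # (1, 0, 2)
```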

80386 instruction encoding

If R/M = 100 then the SIB byte is used to specify more addressing forms

address = base + index × scale + displacement
scale = 1, 2, 4 or 8
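
A sketch of how the SIB byte feeds the formula above: two bits of scale, three bits of index, three bits of base. The helper names and the list-of-registers model are assumptions for illustration, and the special cases (no index, no base) are ignored.

```python
def split_sib(byte):
    """Split a SIB byte into (scale, index register, base register)."""
    scale = 1 << ((byte >> 6) & 0b11)  # 0..3 encodes a scale of 1, 2, 4 or 8
    index = (byte >> 3) & 0b111
    base = byte & 0b111
    return scale, index, base


def effective_address(registers, sib, displacement=0):
    """address = base + index * scale + displacement, for a list of 8 register values."""
    scale, index, base = split_sib(sib)
    return registers[base] + registers[index] * scale + displacement


# "8d 0c ff" from the earlier listing: SIB byte ff means base rdi (7), index rdi (7), scale 8.
print(split_sib(0xFF))  # (8, 7, 7)

registers = [0] * 8
registers[7] = 5  # pretend rdi holds 5
print(effective_address(registers, 0xFF))  # 45, i.e. 5 + 5 * 8
```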

x86-64 instruction encoding

x86-64 adds 8 new registers, but register fields are only 3 bits long. How will we solve this?

Instruction prefixes!

Instruction prefix with SIB byte
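
The prefix in question is known as REX in the vendor manuals: a byte of the form 0100WRXB, where R, X and B each supply a fourth bit for one of the 3-bit register fields. A minimal sketch, with hypothetical function names, of how the B bit turns r/m value 3 (rbx) into register 11 (r11):

```python
def split_rex(byte):
    """Split a REX prefix (0x40..0x4f) into its W, R, X and B bits."""
    if byte & 0xF0 != 0x40:
        raise ValueError("not a REX prefix")
    return {
        "W": (byte >> 3) & 1,  # 64-bit operand size
        "R": (byte >> 2) & 1,  # extends the ModR/M reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends the ModR/M r/m (or SIB base) field
    }


def extended_rm(rex_byte, modrm_byte):
    """Combine REX.B with the 3-bit r/m field to get a 4-bit register number."""
    return (split_rex(rex_byte)["B"] << 3) | (modrm_byte & 0b111)


# "41 83 03 01" encodes addl $1,(%r11): REX 0x41, opcode 0x83, ModR/M 0x03, immediate 0x01.
print(extended_rm(0x41, 0x03))  # 11
```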

I need two instructions on x86-64 to do the job of one on x86.

// 32-bit.
addl $1, disp32

The $1 is the immediate value in this instruction, and disp32 is a 32-bit address.

// 64-bit
movq imm64,  %r11
addl $1,     (%r11)

On x86-64 the pointer is too big to fit in a memory displacement. It must now be in imm64.

History we mostly left behind but it's kinda still lurking there:

Segmentation

Processor modes

  • Real mode (16-bit)
  • Protected mode (16-bit with protection on segments, multi-tasking)
  • 32-bit protected mode (paging)
  • Virtual 8086 mode (16-bit dosbox running in host OS)
  • Unreal mode (weird unofficial hack)
  • 64-bit
  • 64-bit features with 32-bit address (I forget the name)

Real mode segments

Why have a colon in a memory address?

The 8086 can address 20 bits of memory, with only 16-bit registers.

1234:0023
segment:address

Segment size and alignment

Segments are 64KB large and align on 16-byte boundaries

This means segments can overlap
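
A small sketch of the translation, assuming the usual real-mode rule that the segment is shifted left four bits (multiplied by 16) and added to the 16-bit address. The function name is hypothetical. It also shows two different segment:address pairs landing on the same byte, which is what overlapping segments means in practice.

```python
def physical_address(segment, offset):
    """Real-mode translation: segment * 16 + offset (no wrapping modelled here)."""
    return (segment << 4) + offset


print(hex(physical_address(0x1234, 0x0023)))  # 0x12363

# Segments start every 16 bytes, so the same byte has many names.
print(hex(physical_address(0x1236, 0x0003)))  # 0x12363
```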

Wrapping

With 16-bit segment bases, shifted up four bits, plus another 16-bit address, we can generate addresses such as:

(0xFFFF << 4) + 0x0011
= 0xFFFF0 + 0x0011
= 0x100001

But 0x100001 needs more than the 20 bits that 8086 supported!

It gets truncated
= 0x1

Some software relied on this "feature".
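
The truncation can be sketched by masking the sum to the 20 address lines the 8086 has. The names below are hypothetical.

```python
ADDRESS_LINES_8086 = 20


def physical_address_8086(segment, offset):
    """segment * 16 + offset, truncated to the 8086's 20 address lines."""
    linear = (segment << 4) + offset
    return linear & ((1 << ADDRESS_LINES_8086) - 1)


print(hex((0xFFFF << 4) + 0x0011))                  # 0x100001, needs 21 bits
print(hex(physical_address_8086(0xFFFF, 0x0011)))   # 0x1, wrapped to the bottom of memory
```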

Backwards compatibility

The 286 supports 24-bit physical addresses, but still has to run software made for the 8086, even software that abuses the wrapping of the 20-bit physical address space.

A20 gate

At boot, wrapping is enabled (address line 20 is held low), so an x86 can only access half its memory
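
A minimal model of the gate, assuming it simply forces address bit 20 to zero while disabled. The function name is hypothetical. Every odd megabyte then aliases the even megabyte below it, which is why only half of memory is reachable.

```python
A20_BIT = 1 << 20


def bus_address(address, a20_enabled):
    """Address seen by memory: bit 20 is forced low while the gate is disabled."""
    return address if a20_enabled else address & ~A20_BIT


print(hex(bus_address(0x100001, a20_enabled=False)))  # 0x1, the 8086-style wrap
print(hex(bus_address(0x100001, a20_enabled=True)))   # 0x100001
print(hex(bus_address(0x300000, a20_enabled=False)))  # 0x200000, odd megabyte aliases even
```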

Electrical diagram of A20 gate

A20 gate

  • BIOS-based PCs needed to handle this until recently
  • I suspect that if you boot DOS then you'll still get the older wrapping behaviour
  • IBM's fault, not Intel's

Guess what else the keyboard controller does?

On the 286, Intel didn't include a way to switch from protected mode back to real mode. This made systems that used both modes, e.g. to call a real-mode BIOS routine, really slow.

The original "hack" was to use the keyboard controller to hard-reset the CPU, get it to execute the BIOS code, then switch back into protected mode.

Later a faster method was found: triple-fault the CPU to reset it, avoiding the slow keyboard controller.

What do modern CPUs do?

Instruction delays (then)

  • Instructions take varying numbers of cycles.
  • mul/div could take a varying number of cycles depending on their operands.
  • Each new processor generation changed the cycles required for many instructions.

Optimising compute-bound tasks meant counting cycles. Reference manuals included tables showing the number of cycles for each instruction on each processor.

This is not (as) useful on modern processors. Two other effects dominate performance...

Memory access

CPU performance increased but memory latency and bandwidth lagged. Caches gave CPUs access to recently used data sooner.

              Latency
One cycle     0.4ns
L1 access     0.9ns
L2 access     2.8ns
L3 access     28ns
Miss          ~100ns
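
To put the table in terms of lost work, the latencies can be divided by the cycle time. This is only arithmetic on the figures above; the names are hypothetical and picoseconds are used to keep the division exact.

```python
CYCLE_PS = 400  # one cycle is 0.4ns, written in picoseconds


def cycles(latency_ps):
    """How many clock cycles pass while waiting latency_ps picoseconds."""
    return latency_ps / CYCLE_PS


print(cycles(900))      # L1 access (0.9ns): 2.25 cycles
print(cycles(100_000))  # cache miss (~100ns): 250.0 cycles
```

A single miss costs roughly as much as a couple of hundred simple instructions, which is why cycle counting alone no longer predicts performance.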

Pipelines

Used earlier but saw slightly heavier usage with the 486

Like a factory assembly line: while one instruction is executed, the next one is already being decoded.

Deep pipelines

Microarchitecture      Depth           Notes
486                    5
P5 (Pentium)           5               Super-scalar
P6 (Pentium II)        10 or 12-14     Speculative & out of order
NetBurst (Pentium IV)  20 or 31        SMT
Core                   12-14 or 20-24

Super-scalar (P5)

Stalls

A delay in the pipeline can occur for:

  • Cache miss
  • Unknown branch
  • Fetch and decode related delay

Optimising for modern machines mostly means reducing the first two types of stalls.

Out-of-order (P6)

Speculative (P6)

CPUs guess which branch might be taken and start processing instructions along the most likely branch.

Instructions are only committed (retired) when we know that branch is taken.

A mispredicted branch means a pipeline flush. The longer the pipeline, the bigger the flush.

The Spectre in the room

A CPU must not give private/kernel information to userspace processes.

It will issue a General Protection Fault before divulging secrets!

If:

  • Code that accesses a secret is behind some branch.
  • The branch is predicted as taken.
  • The secret is used to index an array, causing that cache line to be loaded into the cache.
if ...:
    array[secret]
else:
    // see which parts of array are in cache

Then:

  • The pipeline is flushed, loading the cache-line is not.
  • Access times to the array (cache hit/miss) now allow the attacker to infer the secret.
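
The last step, inferring the secret from access times, can be illustrated with a toy model. Nothing here touches a real cache or performs real speculation: the cache is a Python set, the timings are made-up constants, and all names are hypothetical. It only shows why a line that stays cached after a flush leaks information.

```python
HIT_NS, MISS_NS = 1, 100  # made-up timings for the toy model


def speculative_access(secret, cached_lines):
    """Model the mispredicted path: the result is thrown away, the cached line is not."""
    cached_lines.add(secret)  # stands in for loading array[secret]'s cache line


def access_time(line, cached_lines):
    """Pretend to time one access to the array."""
    return HIT_NS if line in cached_lines else MISS_NS


def recover_secret(cached_lines, candidates=256):
    """Probe every candidate line and pick the one that comes back fastest."""
    timings = [access_time(line, cached_lines) for line in range(candidates)]
    return timings.index(min(timings))


cache = set()
speculative_access(42, cache)
print(recover_secret(cache))  # 42
```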

Modern pipeline (Skylake)

Thank you

paul.bone.id.au

Common opcodes

  • Arithmetic and logic: add, subtract, multiply, divide, and, or, xor, shift, etc.
  • Data movement: load, store, push, pop, exchange, (kinda: in, out).
  • Control flow: jump, conditional jump, call, return.
  • System: sti, lgdt (etc, very architecture-specific).

Opcodes often encode the type of data to work with. (8-bit, 16-bit etc).

Registers

x86 has 8 kinda-general purpose registers, kinda because most have special powers.

Name  Description  Special powers
ax    Accumulator  Implicit for mul/div
bx    Base
cx    Counter      Used in string ops, shifts and rotates
dx    Data         Implicit for mul/div

Registers

Name  Description        Special powers
sp    Stack pointer      The call stack pointer
bp    Base pointer
si    Source index       String operations
di    Destination index  String operations

String operations: An instruction can be repeated advancing a counter (cx) until some condition is met.

The CPU also has an instruction pointer (ip) and a flags word and some segment registers.

Opcodes and mnemonics

In machine code, the part that says what to do is called the opcode; in assembler it is called a mnemonic. They don't map 1-to-1.

Sometimes opcodes encode the data type to work on, which is not always part of the mnemonic.

32-bit register file

eax is the Extended AX register on 32-bit x86 processors. Its low 16 bits are shared with ax.

rax is the Register A eXtended on x86-64 processors. Its low 32 bits are shared with eax.
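
The sharing can be pictured as different-sized windows onto the same 64 bits. A sketch using masks, with hypothetical helper names; it models reads only, not what happens to the upper bits when a narrower register is written.

```python
def eax(rax):
    """The low 32 bits of rax."""
    return rax & 0xFFFFFFFF


def ax(rax):
    """The low 16 bits of rax (and of eax)."""
    return rax & 0xFFFF


rax = 0x1122334455667788
print(hex(eax(rax)))  # 0x55667788
print(hex(ax(rax)))   # 0x7788
```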

Example of mnemonics

What does it mean to exchange the contents of a register with itself?

// assembles to opcode 90 (32-bit x86)
xchg eax, eax

Disassembling this opcode is often shown as the mnemonic nop meaning no-operation. CPUs recognize opcodes like this and know they don't have to do anything.

Assembling nop also produces the opcode 90.

CPUs also recognise other instructions

What does this do?

xor eax, eax

Exclusive-or of a value with itself results in zero.

This is the shortest instruction to clear the eax register, so developers used it to save space.

Because developers used it, Intel, AMD etc. began to optimise for it.

Favourite instructions

// Swap two values
xchg eax, ebx
xchg eax, [edx]

// population count (count the number of bits
// set to 1).
popcnt ecx, eax

// memcpy, copy ecx 32-bit words from ds:[esi]
// to es:[edi]
rep movsd
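
For readers who have not met them, here is a rough Python model of what the last two instructions do. The function names are hypothetical, memory is modelled as a bytearray, segments are ignored, and the repeated move assumes the 32-bit form with the direction flag clear, so it copies ecx groups of four bytes.

```python
def popcnt(value):
    """Count the bits set to 1 by repeatedly clearing the lowest set bit."""
    count = 0
    while value:
        value &= value - 1
        count += 1
    return count


def rep_movs_dwords(memory, esi, edi, ecx):
    """Copy ecx 32-bit words from memory[esi] to memory[edi], advancing both pointers."""
    while ecx:
        memory[edi:edi + 4] = memory[esi:esi + 4]
        esi += 4
        edi += 4
        ecx -= 1
    return esi, edi, ecx  # register values after the instruction finishes


print(popcnt(0b1011))  # 3

memory = bytearray(range(16))
print(rep_movs_dwords(memory, 0, 8, 2))  # (8, 16, 0)
print(memory[8:16] == memory[0:8])       # True
```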