Dr. Paul Bone
paul.bone.id.au
This slide deck is made with reveal.js, you can:
The slides make more sense with explanation but I'm afraid there was no recording.
Intel's first 16-bit processors (but not world-first):
AFAIK none of these are world-firsts, they're just firsts for Intel and x86.
8d 0c ff lea (%rdi,%rdi,8),%ecx
ba 67 66 66 66 mov $0x66666667,%edx
89 c8 mov %ecx,%eax
f7 ea imul %edx
d1 fa sar %edx
c1 f9 1f sar $0x1f,%ecx
29 ca sub %ecx,%edx
8d 42 20 lea 0x20(%rdx),%eax
c3 retq
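To check what the disassembly above computes, here is a minimal Python sketch that steps through it one instruction at a time. My guess (an assumption, the slides don't say) is that it was compiled from something like `x * 9 / 5 + 32`, with the division by 5 replaced by a multiply with the "magic number" 0x66666667. The names `asm_routine` and `c_reference` are made up for illustration.

```python
def to_signed32(value):
    """Wrap an integer to a signed 32-bit value, like a 32-bit register."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def asm_routine(x):
    """Follow the listing instruction by instruction (x arrives in %edi)."""
    ecx = to_signed32(x + x * 8)        # lea (%rdi,%rdi,8),%ecx   -> x * 9
    product = ecx * 0x66666667          # imul %edx                -> edx:eax
    edx = to_signed32(product >> 32)    # keep the high 32 bits of the product
    edx >>= 1                           # sar %edx                 -> roughly ecx / 5
    ecx >>= 31                          # sar $0x1f,%ecx           -> 0, or -1 if negative
    edx -= ecx                          # sub %ecx,%edx            -> round toward zero
    return to_signed32(edx + 0x20)      # lea 0x20(%rdx),%eax      -> + 32


def c_reference(x):
    """Assumed source: x * 9 / 5 + 32 with C-style truncating division."""
    n = x * 9
    q = abs(n) // 5
    return (q if n >= 0 else -q) + 32


print(asm_routine(100))  # 100 degrees C in Fahrenheit
```

The multiply-and-shift trick only has to be right for 32-bit inputs, which is why the compiler can get away with it.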
AT&T syntax
32-bit x86 can add 1 to the value at a 32-bit address:
addl $1, (0xABCDEF08)
A 64-bit x86-64 cannot add 1 to the value at a 64-bit address:
addq $1, (0xABCDEF08ABCDEF01)
But it can load a 64-bit immediate into a register.
movq $0xABCDEF08ABCDEF08, %rax
So we have to get around this using two instructions.
movq $0xABCDEF08ABCDEF01, %rbx
addq $1, (%rbx)
Huh? All I want to do is increment a counter!
 | Opcode | source | destination |
---|---|---|---|
Machine code | 12 | 01 | 02 |
Assembly | add | %r1 | %r2 |
The operands can be:
// 16-bit add
add %ax, %dx
// Also a 16-bit add
addw %ax, %dx
// 32-bit add, size is determined by the
// register name.
add %eax, %edx
// or
addl %eax, %edx
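The operand size decides where the result wraps. A small Python sketch of that idea, where `add` is a made-up helper and not a model of the full flags register:

```python
def add(dest, src, bits):
    """Add two values as a register of the given width would.

    Returns the wrapped result and whether a carry out occurred.
    """
    mask = (1 << bits) - 1
    total = (dest & mask) + (src & mask)
    return total & mask, total > mask


print(add(0xFFFF, 1, 16))  # 16-bit add (addw): wraps around
print(add(0xFFFF, 1, 32))  # 32-bit add (addl): plenty of room
```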
This allows memory operands of the form:
address = base + index + displacement
Good for looking up a field in a struct or an array, or a struct in an array.
If R/M = 100 then the SIB byte is used to specify more addressing forms
address = base + index × scale + displacement
scale = 1, 2, 4 or 8
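As a rough sketch, the address calculation and the SIB byte layout (2 bits of scale, 3 bits of index, 3 bits of base) can be written out in Python. The function names are made up, and the special cases (for example index 100 meaning "no index") are left out.

```python
def effective_address(base, index=0, scale=1, displacement=0, bits=32):
    """address = base + index * scale + displacement, wrapped to the address size."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + displacement) & ((1 << bits) - 1)


def decode_sib(sib):
    """Split a SIB byte into (scale, index register number, base register number)."""
    scale = 1 << (sib >> 6)      # top 2 bits: 1, 2, 4 or 8
    index = (sib >> 3) & 0b111   # middle 3 bits
    base = sib & 0b111           # low 3 bits
    return scale, index, base


# A field at offset 4 of the 4th element in an array of 8-byte structs at 0x1000.
print(hex(effective_address(0x1000, index=3, scale=8, displacement=4)))

# The ff in "8d 0c ff" from the disassembly earlier: register 7 is rdi.
print(decode_sib(0xFF))
```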
x86-64 adds 8 new registers. But register fields are only 3 bits long. How will we solve this?
Instruction prefixes!
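The prefix in question is the REX prefix: a byte of the form 0100WRXB, where R, X and B each supply a fourth bit for one of the 3-bit register fields. A minimal Python sketch, assuming the instruction has a ModRM byte and no SIB byte (`decode_rex_modrm` is a made-up name):

```python
def decode_rex_modrm(rex, modrm):
    """Combine a REX prefix with a ModRM byte to get 4-bit register numbers."""
    if rex >> 4 != 0b0100:
        raise ValueError("not a REX prefix")
    w = (rex >> 3) & 1   # 1 = 64-bit operand size
    r = (rex >> 2) & 1   # extends the ModRM reg field
    b = rex & 1          # extends the ModRM r/m field
    return {
        "w": w,
        "mod": modrm >> 6,
        "reg": (r << 3) | ((modrm >> 3) & 0b111),
        "rm": (b << 3) | (modrm & 0b111),
    }


# addl $1, (%r11) encodes as 41 83 03 01: prefix 41, opcode 83, ModRM 03.
print(decode_rex_modrm(0x41, 0x03))
```

Here the r/m field is 011 (which would be rbx on its own) and REX.B turns it into register 11. For opcode 83 the reg field is part of the opcode, not a register.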
I need two instructions on x86-64 to do the job of one on x86.
// 32-bit.
addl $1, disp32
The $1 is the immediate value in this instruction, and disp32 is a 32-bit address.
// 64-bit
movq imm64, %r11
addl $1, (%r11)
On x86-64 the pointer is too big to fit in a memory displacement. It must now be in imm64.
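A quick way to see which pointers still fit: in 64-bit mode the 32-bit displacement is sign-extended to 64 bits, so only the lowest and highest 2 GiB of the address space can be reached with an absolute displacement. A Python sketch, with `fits_disp32` as a made-up helper:

```python
def fits_disp32(address):
    """True if a 64-bit address can be written as a sign-extended 32-bit displacement."""
    signed = address - (1 << 64) if address >= (1 << 63) else address
    return -(1 << 31) <= signed < (1 << 31)


print(fits_disp32(0x00001000))          # low address: fits
print(fits_disp32(0xABCDEF08ABCDEF01))  # the pointer from earlier: needs imm64
print(fits_disp32(0xABCDEF08))          # would sign-extend to 0xFFFFFFFFABCDEF08
```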
Why have a colon in a memory address?
The 8086 has a 20-bit address space, but only 16-bit registers.
1234:0023
segment:address
Segments are 64 KiB in size and start on 16-byte boundaries
This means segments can overlap
With 16-bit segment bases, shifted up four bits, plus another 16-bit address, we can generate addresses such as:
(0xFFFF << 4) + 0x0011
= 0xFFFF0 + 0x0011
= 0x100001
But 0x100001 needs more than the 20 bits that 8086 supported!
It gets truncated
= 0x1
Some software relied on this "feature".
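The segment arithmetic is easy to play with in Python. This is a minimal sketch; `physical_address` is a made-up name and `address_bits` stands in for the width of the address bus.

```python
def physical_address(segment, offset, address_bits=20):
    """Real-mode translation: shift the segment up four bits, add the offset, truncate."""
    return ((segment << 4) + offset) & ((1 << address_bits) - 1)


print(hex(physical_address(0x1234, 0x0023)))        # the 1234:0023 example
print(hex(physical_address(0x1236, 0x0003)))        # overlapping segment, same byte
print(hex(physical_address(0xFFFF, 0x0011)))        # wraps around on a 20-bit bus
print(hex(physical_address(0xFFFF, 0x0011, 24)))    # no wrap with 24 address bits
```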
The 286 supports 24-bit physical addresses, but still wants to run software made for the 8086, even software that abuses the 20-bit physical address space.
At boot, wrapping is enabled and an x86 can only access half of its memory
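The wrapping is done by forcing address line 20 (the "A20" line) to zero, which is why half the memory disappears: every odd megabyte aliases the even one below it. A tiny Python sketch, with `bus_address` as a made-up helper:

```python
A20_BIT = 1 << 20


def bus_address(address, a20_enabled):
    """Address that reaches the memory bus, with address line 20 optionally forced to 0."""
    return address if a20_enabled else address & ~A20_BIT


print(hex(bus_address(0x100001, a20_enabled=False)))  # wraps like an 8086
print(hex(bus_address(0x100001, a20_enabled=True)))   # the real address
print(hex(bus_address(0x300000, a20_enabled=False)))  # 3 MiB aliases 2 MiB
```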
Intel didn't include a way to switch from one of its protected modes back to real mode. This made systems that used both (e.g. to call into a real-mode BIOS routine) really slow.
The original "hack" was to use the keyboard controller to hard-reset the CPU, get it to execute the BIOS code, then switch back into protected mode.
Later a faster method was found: triple-fault the CPU to reset it, avoiding the slow round trip through the keyboard controller.
Optimising compute-bound tasks meant counting cycles. Reference manuals included tables showing the number of cycles for each instruction on each processor.
This is not (as) useful on modern processors. Two other effects dominate performance...
CPU performance increased but memory latency and bandwidth lagged. Caches gave CPUs access to recently used data sooner.
Operation | Latency |
---|---|
One Cycle | 0.4ns |
L1 Access | 0.9ns |
L2 Access | 2.8ns |
L3 Access | 28ns |
Miss | ~100ns |
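To get a feel for the table, it helps to convert the latencies into cycles. A small Python sketch using the figures above (the labels and the helper name are mine, and real numbers vary a lot between machines):

```python
CYCLE_NS = 0.4  # "One Cycle" from the table

LATENCIES_NS = [
    ("L1 access", 0.9),
    ("L2 access", 2.8),
    ("28 ns cache level", 28.0),
    ("miss", 100.0),
]


def latency_in_cycles(latency_ns, cycle_ns=CYCLE_NS):
    """Roughly how many cycles the CPU could have run while waiting."""
    return round(latency_ns / cycle_ns)


for name, latency_ns in LATENCIES_NS:
    print(f"{name}: about {latency_in_cycles(latency_ns)} cycles")
```

A miss costs on the order of a couple of hundred cycles, which is why counting instruction cycles alone no longer tells you much.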
Pipelining was used earlier but saw heavier use starting with the 486
Like a factory assembly line: while one instruction is executed, the next one is already being decoded.
Microarchitecture | Depth | Notes |
---|---|---|
486 | 5 | |
P5 (Pentium) | 5 | Super-scalar |
P6 (Pentium II) | 10 or 12-14 | Speculative & out of order |
NetBurst (Pentium 4) | 20 or 31 | SMT |
Core | 12-14 or 20-24 | |
A delay in the pipeline can occur for:
Optimising for modern machines mostly means reducing the first two types of stalls.
CPUs guess which branch might be taken and start processing instructions along the most likely branch.
Instructions are only committed (retired) when we know that branch is taken.
A mispredicted branch means a pipeline flush. The longer the pipeline, the bigger the flush.
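To make the guessing concrete, here is a Python sketch of a textbook 2-bit saturating counter. This is an assumption for illustration only: real x86 predictors are far more elaborate and mostly undocumented, and charging one pipeline depth per flush is only a rough estimate.

```python
def count_mispredictions(outcomes, state=2):
    """Run a 2-bit saturating counter over a sequence of branch outcomes.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


def flush_cost(misses, pipeline_depth):
    """Very rough cost of the flushes: one pipeline's worth of cycles each."""
    return misses * pipeline_depth


loop_branch = [True] * 9 + [False]   # a loop that runs ten times
alternating = [True, False] * 5      # a branch that flips every time

print(count_mispredictions(loop_branch), count_mispredictions(alternating))
print(flush_cost(count_mispredictions(alternating), pipeline_depth=20))
```

The loop branch is only mispredicted on exit, while the alternating branch is wrong half the time, and a deeper pipeline makes each of those misses cost more.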
A CPU must not give private/kernel information to userspace processes.
It will issue a General Protection Fault before divulging secrets!
If:
if ...:
    array[secret]
else:
    // see which parts of array are in cache
Then:
Opcodes often encode the type of data to work with. (8-bit, 16-bit etc).
x86 has 8 kinda-general purpose registers, kinda because most have special powers.
Name | Description | Special powers |
---|---|---|
ax | Accumulator | Implicit for mul/div |
bx | Base | |
cx | Counter | Used in string ops, shifts and rotates |
dx | Data | Implicit for mul/div |
sp | Stack pointer | the call stack pointer |
bp | Base pointer | |
si | Source index | String operations |
di | Destination index | String operations |
String operations:
An instruction can be repeated, advancing a counter (cx), until some condition is met.
The CPU also has an instruction pointer (ip), a flags word and some segment registers.
In machine code, the part that says what to do is called the opcode; in assembly language it is called a mnemonic. They don't map 1-to-1.
Sometimes opcodes encode the data type to work on, which is not always part of the mnemonic.
eax is the Extended AX register on 32-bit x86 processors. Its low 16 bits are shared with ax.
rax is the Register A eXtended on x86-64 processors. Its low 32 bits are shared with eax.
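The sharing can be sketched in Python by treating rax as one 64-bit integer and the smaller names as views of its low bits. The helper names are made up. One detail the slide doesn't mention: on x86-64, writing eax clears the upper 32 bits of rax, while writing ax leaves the rest alone.

```python
def read_eax(rax):
    """eax is the low 32 bits of rax."""
    return rax & 0xFFFFFFFF


def read_ax(rax):
    """ax is the low 16 bits of rax (and of eax)."""
    return rax & 0xFFFF


def write_ax(rax, value):
    """Writing ax replaces the low 16 bits and keeps everything above them."""
    return (rax & ~0xFFFF) | (value & 0xFFFF)


def write_eax(rax, value):
    """Writing eax on x86-64 zero-extends into the whole of rax."""
    return value & 0xFFFFFFFF


rax = 0x1122334455667788
print(hex(read_eax(rax)), hex(read_ax(rax)))
print(hex(write_ax(rax, 0xBEEF)))
print(hex(write_eax(rax, 0x1)))
```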
What does it mean to exchange the contents of a register with itself?
// assembles to opcode 90 (x86-64)
xchg rax, rax
Disassemblers usually show this opcode as the mnemonic nop, meaning no-operation.
CPUs recognize opcodes like this and know they don't have to do anything.
Assembling nop also produces the opcode 90.
What does this do?
xor eax, eax
Exclusive-or of something with itself results in zero.
This is the shortest instruction to clear the eax register, so developers used it to save space.
Because developers used it, Intel, AMD and others began to optimise for it.
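For a sense of the saving, here are the two 32-bit encodings side by side in a small Python sketch. The byte values are the standard encodings of `xor eax, eax` and `mov eax, 0`; the constant names are mine.

```python
XOR_EAX_EAX = bytes.fromhex("31 c0")           # xor eax, eax
MOV_EAX_0 = bytes.fromhex("b8 00 00 00 00")    # mov eax, 0 (opcode plus a 32-bit immediate)

eax = 0xDEADBEEF
eax ^= eax  # anything xor itself is zero

print(eax, len(XOR_EAX_EAX), len(MOV_EAX_0))
```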
// Swap two values
xchg eax, ebx
xchg eax, [edx]
// population count (count the number of bits
// set to 1).
popcnt ecx, eax
// memcpy-style copy: copy ecx dwords (4 bytes
// each) from ds:[esi] to es:[edi]
rep movsd
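What the repeated string move does can be sketched in Python as a loop over a flat byte array. This is a simplified model that assumes the direction flag is clear (addresses count upwards) and ignores segments; `rep_movs` is a made-up name.

```python
def rep_movs(memory, esi, edi, ecx, size=4):
    """Copy ecx elements of `size` bytes from memory[esi] to memory[edi].

    Returns the final (esi, edi, ecx), as the registers would be left afterwards.
    """
    while ecx != 0:
        memory[edi:edi + size] = memory[esi:esi + size]  # one movs
        esi += size
        edi += size
        ecx -= 1                                         # rep counts down in ecx
    return esi, edi, ecx


memory = bytearray(b"abcdefgh" + bytes(8))
print(rep_movs(memory, esi=0, edi=8, ecx=2))  # two 4-byte moves
print(bytes(memory[8:16]))
```

With `size=1` the same loop models a byte-by-byte copy.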