Dr. Paul Bone
This slide deck is made with reveal.js.
The slides make more sense with explanation but I'm afraid there was no recording.
Intel's first 16-bit processors (but not world-first):
AFAIK none of these are world-firsts, they're just firsts for Intel and x86.
8d 0c ff          lea    (%rdi,%rdi,8),%ecx
ba 67 66 66 66    mov    $0x66666667,%edx
89 c8             mov    %ecx,%eax
f7 ea             imul   %edx
d1 fa             sar    %edx
c1 f9 1f          sar    $0x1f,%ecx
29 ca             sub    %ecx,%edx
8d 42 20          lea    0x20(%rdx),%eax
c3                retq
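Read by hand, the listing above appears to compute x * 9 / 5 + 32 (a Celsius to Fahrenheit conversion) without a divide instruction: the multiply by 0x66666667 followed by shifts stands in for a division by 5. The Python sketch below mimics the instructions one by one. The function names are made up for illustration, and plain Python integers stand in for the 32-bit registers.

```python
def to_signed32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def run_listing(x):
    """Step through the listing with x as the incoming %edi value."""
    ecx = to_signed32(x + x * 8)   # lea (%rdi,%rdi,8),%ecx  -> 9 * x
    product = ecx * 0x66666667     # imul %edx: signed 64-bit product in edx:eax
    edx = product >> 32            # keep the high half (%edx)
    edx >>= 1                      # sar %edx
    sign = ecx >> 31               # sar $0x1f,%ecx -> -1 if negative, else 0
    edx -= sign                    # sub %ecx,%edx: round towards zero
    return to_signed32(edx + 0x20) # lea 0x20(%rdx),%eax -> add 32


if __name__ == "__main__":
    for celsius in (-40, 0, 37, 100):
        print(celsius, run_listing(celsius))
```

The subtraction of the sign bit is what makes the result round towards zero for negative inputs, matching C integer division.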
32-bit x86 can add 1 to the value at a 32-bit address:
addl $1, (0xABCDEF08)
x86-64 cannot add 1 to the value at a 64-bit absolute address in a single instruction:
addq $1, (0xABCDEF08ABCDEF08)
But mov can load a 64-bit immediate into a register (add itself only accepts immediates of up to 32 bits).
movq $0xABCDEF08ABCDEF08, %rax
So we have to get around this using two instructions.
movq $0xABCDEF08ABCDEF08, %rbx
addq $1, (%rbx)
Huh? All I want to do is increment a counter!
The operands can be:
// 16-bit add
add %ax, %dx
// Also a 16-bit add
addw %ax, %dx
// 32-bit add, size is determined by the
// register name.
add %eax, %edx
// or
addl %eax, %edx
This allows memory operands of the form:
address = base + index + displacement
Good for looking up a field in a struct or an array, or a struct in an array.
If R/M = 100 (and Mod is not 11) then the SIB byte is used to specify more addressing forms:
address = base + index * scale + displacement
scale = 1, 2, 4 or 8
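As a rough illustration of the scaled-index form, the Python sketch below splits a SIB byte into its three fields and computes an address from them. The function names are hypothetical, and the special cases (index 100 meaning "no index", base 101 with Mod 00 meaning "no base") are deliberately ignored.

```python
def decode_sib(sib):
    """Split a SIB byte into (scale, index, base)."""
    scale = 1 << ((sib >> 6) & 0b11)  # top 2 bits select 1, 2, 4 or 8
    index = (sib >> 3) & 0b111        # middle 3 bits: index register number
    base = sib & 0b111                # low 3 bits: base register number
    return scale, index, base


def effective_address(base, index, scale=1, displacement=0):
    """Combine register values the way a scaled-index operand does."""
    return base + index * scale + displacement


if __name__ == "__main__":
    # 0xFF is the SIB byte of the lea (%rdi,%rdi,8) instruction: scale 8,
    # index register 7 and base register 7.
    print(decode_sib(0xFF))
    # Field at offset 4 of element 3 in an array of 8-byte structs at 0x1000.
    print(hex(effective_address(0x1000, 3, 8, 4)))
```

This is why the form suits "a struct in an array": the base register holds the array, index * scale picks the element, and the displacement picks the field.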
x86-64 adds 8 new registers. But register fields are only 3 bits long, how will we solve this?
I need two instructions on x86-64 to do the job of one on x86.
// 32-bit.
addl $1, disp32
$1 is the immediate value in this instruction,
disp32 is a 32-bit address.
// 64-bit
movq imm64, %r11
addl $1, (%r11)
On x86-64 the pointer is too big to fit in a memory displacement. It must now be in imm64.
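The difference can be sketched as a tiny code generator. The helper below is hypothetical and only prints AT&T-style text; it does not assemble anything, and it ignores the fact that a 64-bit program can still use a displacement for addresses that happen to fit in 32 bits.

```python
def emit_increment(address, mode_bits):
    """Return the instructions needed to add 1 to the counter at address."""
    if mode_bits == 32:
        if not 0 <= address < 2**32:
            raise ValueError("address does not fit in a 32-bit displacement")
        # The whole pointer fits in the instruction's displacement field.
        return [f"addl $1, {address:#x}"]
    if mode_bits == 64:
        if not 0 <= address < 2**64:
            raise ValueError("address does not fit in 64 bits")
        # The pointer must first be loaded into a register as an imm64.
        return [f"movq ${address:#x}, %r11", "addl $1, (%r11)"]
    raise ValueError("mode_bits must be 32 or 64")


if __name__ == "__main__":
    print(emit_increment(0xABCDEF08, 32))
    print(emit_increment(0xABCDEF08ABCDEF08, 64))
```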
Why have a colon in a memory address?
The 8086 has a 20-bit address space (1 MiB), with only 16-bit registers.
Segments are 64 KiB large and are aligned on 16-byte boundaries
This means segments can overlap
With 16 bit segment bases, shifted up four bits, plus another 16 bit address, we can generate addresses such as:
(0xFFFF << 4) + 0x0011
= 0xFFFF0 + 0x0011
= 0x100001
But 0x100001 needs more than the 20 bits that 8086 supported!
It gets truncated to 20 bits, so the address wraps around to 0x00001
Some software relied on this "feature".
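The segment arithmetic and the wrap-around can be checked with a few lines of Python. This is only a sketch of the address calculation; the function name and the wrap_20bit parameter are made up for illustration.

```python
def physical_address(segment, offset, wrap_20bit=True):
    """Compute an 8086-style physical address from segment:offset."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    if wrap_20bit:
        # Only 20 address bits exist, so the 21st bit is dropped.
        linear &= 0xFFFFF
    return linear


if __name__ == "__main__":
    # The example from the slide: FFFF:0011 wraps to the bottom of memory.
    print(hex(physical_address(0xFFFF, 0x0011)))
    # Without truncation the same pair reaches just past 1 MiB.
    print(hex(physical_address(0xFFFF, 0x0011, wrap_20bit=False)))
    # Overlapping segments: two different pairs name the same byte.
    print(physical_address(0x1234, 0x0010) == physical_address(0x1235, 0x0000))
```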
The 286 supports 24-bit physical addresses, but still has to run software made for the 8086, even software that abuses the wrap-around of the 20-bit physical address space.
At boot, wrapping is enabled and an x86 can only access half of its memory
Intel didn't include a way to switch the 286 from protected mode back to real mode. This made systems that used both (e.g. for calls into a real-mode BIOS routine) really slow.
The original "hack" was to use the keyboard controller to hard-reset the CPU, get it to execute the BIOS code, then switch back into protected mode.
Later a faster method was found: deliberately triple-faulting the CPU, which resets it without waiting for the keyboard controller.
Optimising compute-bound tasks meant counting cycles. Reference manuals included tables showing the number of cycles for each instruction on each processor.
This is not (as) useful on modern processors. Two other effects dominate performance...
CPU performance increased but memory latency and bandwidth lagged. Caches gave CPUs access to recently used data sooner.
Used earlier but saw slightly heavier usage with the 486
Like a factory assembly line: while one instruction is being executed, the next one is already being decoded.
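A back-of-the-envelope model shows why the assembly line helps. The sketch below assumes an idealised pipeline in which every stage takes one cycle and nothing ever stalls; the function names are illustrative only.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction passes through every stage before the next starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """Ideal pipeline: fill it once, then finish one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


if __name__ == "__main__":
    for stages in (5, 20):
        print(stages,
              cycles_unpipelined(100, stages),
              cycles_pipelined(100, stages))
```

Real pipelines fall short of this ideal whenever an instruction has to wait, which is what the rest of this section is about.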
| Microarchitecture | Pipeline stages | Notes |
|---|---|---|
| P6 (Pentium II) | 10 or 12-14 | Speculative & out of order |
| NetBurst (Pentium 4) | 20 or 31 | SMT |
| Core | 12-14 or 20-24 | |
A delay in the pipeline can occur for:
Optimising for modern machines mostly means reducing the first two types of stalls.
CPUs guess which branch might be taken and start processing instructions along the most likely branch.
Instructions are only committed (retired) when we know that branch is taken.
A mispredicted branch means a pipeline flush. The longer the pipeline, the bigger the flush.
A CPU must not give private/kernel information to userspace processes.
It will issue a General Protection Fault before divulging secrets!
if ...:
    array[secret]
else:
    // see which parts of array are in cache
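The idea behind the pseudocode above can be shown with a toy model. This is not an exploit and measures no real timings: the ToyCache class, the 64-byte line size and the function names are all assumptions made for illustration. It only shows that a cache line loaded during speculation stays loaded after the speculative work is thrown away, and that probing the cache then reveals which line it was.

```python
LINE_SIZE = 64  # assumed cache line size in bytes


class ToyCache:
    """Remembers which cache lines have been touched."""

    def __init__(self):
        self.lines = set()

    def touch(self, address):
        self.lines.add(address // LINE_SIZE)

    def is_cached(self, address):
        return (address // LINE_SIZE) in self.lines


def speculative_access(cache, secret):
    """Model array[secret] running speculatively on the wrong branch.

    The result of the load is discarded, but the cache line it pulled in
    is not evicted.
    """
    cache.touch(secret * LINE_SIZE)


def recover_secret(cache, n_values=256):
    """Probe one line per possible value and report the one that is cached."""
    for guess in range(n_values):
        if cache.is_cached(guess * LINE_SIZE):
            return guess
    return None


if __name__ == "__main__":
    cache = ToyCache()
    speculative_access(cache, 42)
    print(recover_secret(cache))
```

On real hardware the probe step is done by timing memory accesses rather than by asking the cache directly.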
Opcodes often encode the type of data to work with (8-bit, 16-bit, etc.).
x86 has 8 kinda-general purpose registers, kinda because most have special powers.
| Register | Name | Special powers |
|---|---|---|
| ax | Accumulator | Implicit for mul/div |
| cx | Counter | Used in string ops, shifts and rotates |
| dx | Data | Implicit for mul/div |
| sp | Stack pointer | The call stack pointer |
| si | Source index | String operations |
| di | Destination index | String operations |
An instruction can be repeated, advancing a counter (cx), until some condition is met.
The CPU also has an instruction pointer (ip), a flags word and some segment registers.
In machine code, the part that says what to do is called the opcode; in assembler it is called a mnemonic. They don't map 1-to-1.
Sometimes opcodes encode the data type to work on, which is not always part of the mnemonic.
eax is the Extended AX register on 32-bit x86 processors. Its low 16 bits are shared with ax.
rax is the Register A eXtended on x86-64 processors. Its low 32 bits are shared with eax.
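The sharing can be pictured as masks over a single 64-bit value. The sketch below uses Python functions named after the registers purely for illustration; it models how the narrower names read the register, not what happens on writes.

```python
def eax(rax):
    """The low 32 bits of rax."""
    return rax & 0xFFFFFFFF


def ax(rax):
    """The low 16 bits of rax (and therefore of eax)."""
    return rax & 0xFFFF


if __name__ == "__main__":
    rax = 0x1122334455667788
    print(hex(eax(rax)))  # 0x55667788
    print(hex(ax(rax)))   # 0x7788
```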
What does it mean to exchange the contents of a register with itself?
// assembles to opcode 90 (x86-64)
xchg rax, rax
Disassemblers often show this opcode as the mnemonic nop, meaning no-operation.
CPUs recognize opcodes like this and know they don't have to do any work.
nop also produces the opcode 90.
What does this do?
xor eax, eax
Exclusive-or of something with itself results in zero.
This is the shortest instruction to clear the eax register, so developers used it to save space.
Because developers used it, Intel, AMD, etc. began to optimise for it.
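Both claims are easy to check. The sketch below compares the usual 32-bit encodings of the two ways to zero eax (31 c0 for xor eax, eax and b8 00 00 00 00 for mov eax, 0) and confirms the xor identity; the constant and function names are illustrative.

```python
# Machine code for the two ways of clearing eax.
XOR_EAX_EAX = bytes.fromhex("31c0")        # xor eax, eax
MOV_EAX_ZERO = bytes.fromhex("b800000000")  # mov eax, 0


def xor32(a, b):
    """32-bit exclusive-or, as the instruction would compute it."""
    return (a ^ b) & 0xFFFFFFFF


if __name__ == "__main__":
    print(len(XOR_EAX_EAX), len(MOV_EAX_ZERO))
    print(all(xor32(v, v) == 0 for v in (0, 1, 0xDEADBEEF, 0xFFFFFFFF)))
```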
// Swap two values
xchg eax, ebx
xchg eax, [edx]
// population count (count the number of bits
// set to 1).
popcnt ecx, eax
// memcpy-like copy: copy ecx 32-bit words
// from ds:[esi] to es:[edi]
rep movsl