Dougall

The Rosetta 2 instruction size expansion factor for an sqlite3 binary is ~1.64x (1.05MB of x86 instructions vs 1.72MB of ARM instructions). Surprisingly good, especially given Firestorm cores have six times the instruction cache of Ice Lake (192KiB vs 32KiB).

Something I'm not sure I said is that the goal is to have a single, equivalent ARM instruction for each x86 instruction. And, in real-world code, combining all those tricks allows Rosetta 2 to achieve that surprisingly often.

Dougall

I said that typically, converting an x86 instruction to ARM will require an expansion, and I stand by it, but some of the counter-examples are rather entertaining. A lot of x86 instructions have 32-bit immediates, which become much more compact when most of those bits are unused.

For example, this instruction carries two 32-bit constants (a displacement and an immediate):
48 C7 83 D8 01 00 00 00 00 00 00 | mov qword ptr [rbx+1D8h], 0

And gets translated to:
7F EC 00 F9 | str xzr, [x3,#0x1D8]
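
To see where the four ARM bytes come from, here is a minimal sketch that builds the encoding by hand. It assumes the standard AArch64 "STR (immediate, unsigned offset)" layout for 64-bit registers, where the byte offset is stored divided by 8 in a 12-bit field. The function name `encode_str_x_uimm` is made up for this illustration and is not part of Rosetta 2 or any library.

```python
import struct

def encode_str_x_uimm(rt: int, rn: int, byte_offset: int) -> bytes:
    """Encode `STR Xt, [Xn, #byte_offset]` (64-bit, unsigned scaled offset).

    Hypothetical helper for illustration only.
    """
    if byte_offset % 8 != 0 or not 0 <= byte_offset // 8 < 4096:
        raise ValueError("offset must be a multiple of 8 in [0, 32760]")
    imm12 = byte_offset // 8                 # offset is stored in units of 8 bytes
    word = 0xF9000000 | (imm12 << 10) | (rn << 5) | rt
    return struct.pack("<I", word)           # instruction words are little-endian

# The 11-byte x86 instruction from the post: REX.W, opcode, ModRM, disp32, imm32.
X86_BYTES = bytes.fromhex("48 C7 83 D8 01 00 00 00 00 00 00")

# Register number 31 in the Rt field means xzr here; x3 holds rbx in the listing.
arm_bytes = encode_str_x_uimm(rt=31, rn=3, byte_offset=0x1D8)

print(arm_bytes.hex(" ").upper())            # 7F EC 00 F9
print(len(X86_BYTES), "->", len(arm_bytes), "bytes")
```

The 0x1D8 displacement shrinks from four bytes to a 12-bit field (0x3B after scaling), and the zero immediate disappears entirely because the zero register supplies it.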

Dougall

A correction:

I've discovered one more inter-instruction optimisation: prologue and epilogue combining, equivalent to the "stack engine" in hardware implementations of x86. This pairs loads and stores, and delays stack-pointer updates:

push rbp
mov rbp, rsp
push rbx
push rax

Becomes:

stur x5, [x4,#-8]
sub x5, x4, #8
stp x0, x3, [x4,#-0x18]!

(If I'd realised how popular this post would be, I'd have been a bit more thorough.)
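
As a sanity check on the prologue example above, here is a small sketch that runs both sequences on a toy machine and compares the results. The register mapping (x0=rax, x3=rbx, x4=rsp, x5=rbp) is inferred from the listings in this thread, and memory is simplified to a dict of 8-byte slots keyed by address. The function names are made up for this illustration.

```python
def run_x86(regs, mem):
    """push rbp; mov rbp, rsp; push rbx; push rax, one instruction at a time."""
    r = dict(regs)

    def push(name):
        r["rsp"] -= 8                 # every push updates the stack pointer
        mem[r["rsp"]] = r[name]

    push("rbp")
    r["rbp"] = r["rsp"]               # mov rbp, rsp
    push("rbx")
    push("rax")
    return r

def run_arm(regs, mem):
    """The three translated instructions (register mapping is an assumption)."""
    x = {"x0": regs["rax"], "x3": regs["rbx"], "x4": regs["rsp"], "x5": regs["rbp"]}
    mem[x["x4"] - 8] = x["x5"]        # stur x5, [x4,#-8]   (rsp not updated yet)
    x["x5"] = x["x4"] - 8             # sub  x5, x4, #8     (rbp = rsp after 1st push)
    x["x4"] -= 0x18                   # stp  x0, x3, [x4,#-0x18]!  pre-index writeback
    mem[x["x4"]] = x["x0"]            # first register of the pair at the new x4
    mem[x["x4"] + 8] = x["x3"]        # second register 8 bytes above it
    return {"rax": x["x0"], "rbx": x["x3"], "rsp": x["x4"], "rbp": x["x5"]}

start = {"rax": 0xAAAA, "rbx": 0xBBBB, "rsp": 0x1000, "rbp": 0x2000}
x86_mem, arm_mem = {}, {}
x86_regs = run_x86(start, x86_mem)
arm_regs = run_arm(start, arm_mem)

print(x86_regs == arm_regs and x86_mem == arm_mem)   # True
```

Three stack-pointer updates on the x86 side collapse into the single writeback of the `stp`, which is the "delayed stack-pointer update" described above.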

Giovanni Mascellani

@dougall I wonder how hard it is to do proper inter-instruction optimization while at the same time retain enough bookkeeping so that you can still do jumps and interrupts.

Blake Patterson

@dougall Could you expand a bit on this, from your Rosetta 2 post?

"The Apple M1 has an undocumented extension that, when enabled, ensures instructions like ADDS, SUBS and CMP compute PF and AF and store them as bits 26 and 27 of NZCV respectively, providing accurate emulation with no performance penalty."

I see PF has to do with data parity and AF is sometimes used with writes to devices (serial port, etc.) -- but I'm not capturing what you're conveying here. Thanks.
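
For context on the quoted passage, here is a rough sketch of what the two x86 flags mean for an addition, and of packing them into the bit positions the quote names. On x86, PF is set when the low 8 bits of the result contain an even number of 1 bits, and AF is the carry out of bit 3 (the half-carry used by the BCD instructions). The NZCV positions 31..28 are standard AArch64; bits 26 and 27 are taken from the quote alone. The function names are made up for this illustration, and nothing here models how the hardware extension actually behaves for subtraction.

```python
def x86_add_pf_af(a: int, b: int, bits: int = 64):
    """Return (result, PF, AF) for an x86-style ADD of two unsigned values."""
    mask = (1 << bits) - 1
    result = (a + b) & mask
    # PF: 1 if the low byte of the result has an even number of set bits.
    pf = 1 if bin(result & 0xFF).count("1") % 2 == 0 else 0
    # AF: carry out of bit 3, recovered as bit 4 of a ^ b ^ result.
    af = ((a ^ b ^ result) >> 4) & 1
    return result, pf, af

def pack_extended_nzcv(n, z, c, v, pf, af):
    """Pack flags into one word: N, Z, C, V in bits 31..28 (standard),
    AF in bit 27 and PF in bit 26 (positions as stated in the quote)."""
    return (n << 31) | (z << 30) | (c << 29) | (v << 28) | (af << 27) | (pf << 26)

result, pf, af = x86_add_pf_af(0x0F, 0x01)
print(hex(result), pf, af)                               # 0x10 0 1
print(hex(pack_extended_nzcv(0, 0, 0, 0, pf, af)))       # 0x8000000
```

Without hardware support, a translator would need extra instructions along these lines after every flag-setting arithmetic operation whose PF or AF might later be read, which is the cost the quote says the extension avoids.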
