Dougall

New blog post: "Why is Rosetta 2 fast?"

https://dougallj.wordpress.com/2022/11/09/why-is-rosetta-2-fast/

9 Nov 2022 at 15:21 | Open on mastodon.social

6 comments

Dougall

The Rosetta 2 instruction size expansion factor for an sqlite3 binary is ~1.64x (1.05MB of x86 instructions vs 1.72MB of ARM instructions). Surprisingly good, especially given Firestorm cores have six times the instruction cache of Ice Lake (192KiB vs 32KiB).
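As a quick check of the arithmetic, here is a minimal sketch that uses only the figures quoted above (the variable names are just for illustration):

```python
# Code sizes quoted in the post, in MB.
x86_mb = 1.05
arm_mb = 1.72
expansion = arm_mb / x86_mb
print(f"expansion factor: {expansion:.2f}x")

# Instruction cache sizes quoted in the post, in KiB.
firestorm_kib = 192
ice_lake_kib = 32
cache_ratio = firestorm_kib / ice_lake_kib
print(f"instruction cache ratio: {cache_ratio:.0f}x")
```

So the translated code is roughly 1.64 times larger, while the instruction cache it runs from is six times larger.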

Something I'm not sure I said is that the goal is to have a single, equivalent ARM instruction for each x86 instruction. And, in real-world code, combining all those tricks allows Rosetta 2 to achieve that surprisingly often.

10 Nov 2022 at 7:34 | Open on mastodon.social

Dougall

I said that typically, converting an x86 instruction to ARM will require an expansion, and I stand by it, but some of the counter-examples are rather entertaining. A lot of x86 instructions have 32-bit immediates, which become much more compact in translation when most of those bits are unused.

For example, this instruction has two 32-bit immediates:
48 C7 83 D8 01 00 00 00 00 00 00 | mov qword ptr [rbx+1D8h], 0

And gets translated to:
7F EC 00 F9 | str xzr, [x3,#0x1D8]
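To see where the saving comes from, here is a rough Python sketch that pulls the two 32-bit fields out of the x86 bytes and rebuilds the ARM encoding. It assumes the standard A64 encoding for `STR Xt, [Xn, #imm]` with an unsigned scaled offset; the helper name `encode_str_x_uimm` is made up for this example.

```python
import struct

def encode_str_x_uimm(rt: int, rn: int, offset: int) -> bytes:
    """Encode A64 `STR Xt, [Xn, #offset]` (64-bit store, unsigned scaled offset)."""
    if offset % 8 != 0 or not 0 <= offset // 8 < 4096:
        raise ValueError("offset must be a multiple of 8 in the range 0..32760")
    # The 12-bit immediate field holds the offset divided by the access size (8).
    word = 0xF9000000 | ((offset // 8) << 10) | (rn << 5) | rt
    return struct.pack("<I", word)  # A64 instructions are stored little-endian

XZR = 31  # register number 31 in the Rt field of a store means the zero register

x86_bytes = bytes.fromhex("48 C7 83 D8 01 00 00 00 00 00 00")
# Layout: REX.W prefix, opcode C7, ModRM 83, then disp32 and imm32.
disp32 = int.from_bytes(x86_bytes[3:7], "little")
imm32 = int.from_bytes(x86_bytes[7:11], "little")

# rbx is x3 in the translated code shown above.
arm_bytes = encode_str_x_uimm(rt=XZR, rn=3, offset=disp32)

print(hex(disp32), imm32)
print(arm_bytes.hex(" ").upper())
print(len(x86_bytes), "x86 bytes ->", len(arm_bytes), "ARM bytes")
```

The displacement fits the scaled 12-bit field and the zero immediate becomes the zero register, so 8 of the 11 x86 bytes collapse into fields of a single 4-byte instruction.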

10 Nov 2022 at 8:46 | Open on mastodon.social

Dougall

A correction:

I've discovered one more inter-instruction optimisation: prologue and epilogue combining, equivalent to the "stack engine" in hardware implementations of x86. This pairs loads and stores, and delays stack-pointer updates:

push rbp
mov rbp, rsp
push rbx
push rax

Becomes:

stur x5, [x4,#-8]
sub x5, x4, #8
stp x0, x3, [x4,#-0x18]!
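A toy Python model can check that the two sequences agree. This is only a sketch: memory is a dict of 8-byte slots keyed by address, the register mapping (x0=rax, x3=rbx, x4=rsp, x5=rbp) is read off the listing above, and the function names are invented for the example.

```python
def run_x86(regs):
    """Run push rbp; mov rbp, rsp; push rbx; push rax one instruction at a time."""
    r, mem = dict(regs), {}

    def push(name):
        r["rsp"] -= 8            # every push updates the stack pointer
        mem[r["rsp"]] = r[name]

    push("rbp")
    r["rbp"] = r["rsp"]
    push("rbx")
    push("rax")
    return r, mem

def run_arm(regs):
    """Run the three translated instructions on the same starting state."""
    x = {0: regs["rax"], 3: regs["rbx"], 4: regs["rsp"], 5: regs["rbp"]}
    mem = {}
    mem[x[4] - 8] = x[5]         # stur x5, [x4,#-8]  (store rbp, rsp untouched)
    x[5] = x[4] - 8              # sub  x5, x4, #8    (rbp = address of saved rbp)
    x[4] -= 0x18                 # stp  x0, x3, [x4,#-0x18]!  (one rsp update...)
    mem[x[4]] = x[0]             # ...then rax at the new rsp
    mem[x[4] + 8] = x[3]         # ...and rbx in the slot above it
    return {"rax": x[0], "rbx": x[3], "rsp": x[4], "rbp": x[5]}, mem

start = {"rax": 0xAAAA, "rbx": 0xBBBB, "rsp": 0x1000, "rbp": 0x2000}
x86_state = run_x86(start)
arm_state = run_arm(start)
print(x86_state == arm_state)
```

Both versions end with the same registers and stack contents, but the ARM version adjusts the stack pointer once instead of three times.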

(If I'd realised how popular this post would be, I'd have been a bit more thorough.)

10 Nov 2022 at 10:17 | Open on mastodon.social

Giovanni Mascellani

@dougall I wonder how hard it is to do proper inter-instruction optimization while still retaining enough bookkeeping to handle jumps and interrupts.

10 Nov 2022 at 12:56 | Open on mastodon.social

Blake Patterson

@dougall Could you expand a bit on this, from your Rosetta 2 post?

"The Apple M1 has an undocumented extension that, when enabled, ensures instructions like ADDS, SUBS and CMP compute PF and AF and store them as bits 26 and 27 of NZCV respectively, providing accurate emulation with no performance penalty."

I see PF has to do with data parity and AF is sometimes used with writes to devices (serial port, etc.) -- but I'm not capturing what you're conveying here. Thanks.
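For reference, a minimal Python sketch of how x86 defines these two flags for an addition: PF is set when the low 8 bits of the result contain an even number of 1 bits, and AF is set when the addition carries out of bit 3 (it exists for BCD arithmetic). The bit positions in `pack_extra_flags` come only from the sentence quoted above, and both function names are invented for this example.

```python
def add_with_flags(a: int, b: int, bits: int = 64):
    """Return (result, PF, AF) for an x86-style addition of `a` and `b`."""
    mask = (1 << bits) - 1
    result = (a + b) & mask
    # PF: even number of 1 bits in the low byte of the result.
    pf = bin(result & 0xFF).count("1") % 2 == 0
    # AF: carry out of bit 3 into bit 4, recovered from a ^ b ^ result.
    af = ((a ^ b ^ result) & 0x10) != 0
    return result, pf, af

def pack_extra_flags(pf: bool, af: bool) -> int:
    """Place PF at bit 26 and AF at bit 27, as the quoted sentence describes."""
    return (int(pf) << 26) | (int(af) << 27)

result, pf, af = add_with_flags(0x0F, 0x01)
print(hex(result), pf, af)
print(hex(pack_extra_flags(pf, af)))
```

Neither flag has an equivalent among the standard ARM N, Z, C and V flags, so without hardware support a translator would need extra instructions to compute them whenever x86 code might read them.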

10 Nov 2022 at 19:32 | Open on oldbytes.space

Anisse

@dougall I see you mention Windows on ARM's emulator. Did you look at another high-performance emulator like FEX?

10 Nov 2022 at 14:48 | Open on octodon.social