leah & asm & forth, oh my!

it's becoming depressingly clear that speculative execution is an inherently insecure - and unsecurable - feature. it speeds single threaded systems up massively, and obliterates any hope of keeping processes safe from each other in multiprogrammed environments.

leah & asm & forth, oh my!

"but we can't go back to the days when computers ran in lockstep with memory! how slow would things be if we did that?!" - well, all the mitigations for speculative execution are going to slow things down to that point anyway. and hey, now that CPU speed has hit a wall even with all our architectural hacks, maybe now the semiconductor companies can go and spend the money where it really matters - on RAM that can keep up with modern processors, rather than on ensuring processors only rarely have to slow down for RAM

leah & asm & forth, oh my!

maybe it's time to forget about DRAM, which in architectural terms is like using magnetostrictive delay lines for the memory in a PDP-8

Adrian Cochrane

@millihertz When I thought through reengineering an OS/browser from the hardware on up, here's what made sense to me:

Have a large amount of memory with fast burst transfers, processing the data in it linearly. So we can still process reasonably large files!

And a smaller amount of memory in which we can arbitrarily rearrange data.

Vertigo #$FF

@alcinnz @millihertz That's what synchronous RAM is. Access times are still in the 50 to 70 ns range (so, figure, 14 to 20 million accesses per second), so to go faster than that, an SDRAM will fetch a huge word of memory at once, and then parcel it out to the bus one smaller word at a time. A burst size of 8 to 16 isn't atypical these days, so with a 64-bit data path, you're looking at a (64*16)=1024-bit word inside the RAM chip itself.
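
To put rough numbers on that burst-versus-random trade-off, here's a small back-of-the-envelope sketch in C; the 60 ns access time and 5 ns-per-beat bus figure are assumptions for illustration, not specs for any particular part:

    #include <stdio.h>

    int main(void) {
        /* Assumed figures, roughly matching the post above. */
        double access_ns = 60.0;  /* opening a row: somewhere in the 50-70 ns range */
        double bus_ns    = 5.0;   /* one 64-bit beat on the external bus (assumed)  */
        int    burst_len = 16;    /* beats per burst                                */
        double bytes     = 8.0 * burst_len;  /* 64-bit bus * 16 beats = 128 bytes   */

        double burst_time  = access_ns + burst_len * bus_ns;  /* one streamed burst */
        double random_time = burst_len * access_ns;           /* 16 isolated accesses */

        printf("streamed: %.0f bytes in %.0f ns (~%.2f GB/s)\n",
               bytes, burst_time, bytes / burst_time);
        printf("random  : %.0f bytes in %.0f ns (~%.2f GB/s)\n",
               bytes, random_time, bytes / random_time);
        return 0;
    }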

And, yes, if you can arrange your data to be serialized in such a manner, you can get native throughputs without the need for caches. The problem is that, unless you deal almost exclusively with vectors, it is almost never the case that you can stream data for that long. Consider that research into compilers shows that nearly all programs have an average "basic block" size of about 8 instructions. Meaning, the computer will run at most about 8 instructions before it needs to process a conditional or unconditional branch.

This is why loop unrolling and function inlining are such significant optimizations, and it's a contributing factor in why code tends to get larger over time, even for the same input source listing.
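
As a toy illustration of the unrolling point (hypothetical code, only to show how the one-branch-per-iteration cost gets amortised at the price of larger code):

    /* Rolled: one conditional branch per element, a very short basic block. */
    long sum_rolled(const long *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Unrolled by 4: one branch per four elements, so the straight-line run
     * is longer and the code is bigger -- the size-for-speed trade above. */
    long sum_unrolled(const long *a, int n) {
        long s = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; i++)   /* leftover elements */
            s += a[i];
        return s;
    }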

BTW, when I set my Kestrel Project's CPU and main bus speed from 16 to 25 MHz, it was largely due to the access speed of external memory. Going faster all but requires a cache, and caches in turn are optimized for synchronous memory.


Adrian Cochrane

@vertigo @millihertz In my case... I was discussing a "string-centric" system primarily for decoding HTML, audio, images, video, HTTP, etc for display.

Though it probably helped that I described a hypothetical where I was rewriting everything!

Adrian Cochrane

@vertigo @millihertz Regarding that average basic-block size: I had an interesting (at least to me) solution for this use case of parsing (which probably brings the average down), though I'm not sure how well it generalises.

What if we split the processor in two, so one half executes machine code that's near-entirely branches (thus relying mainly on code density), and the other half primarily deals in straight-line code?

I saw a parser generator which included a tight-loop interpreter for such a machine.
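
A minimal sketch of how such a split might look in software, assuming a table-driven parser (the states, character classes, and action routines below are made up for illustration): the tight dispatch loop is almost nothing but table lookups and branches, while the actions it invokes are straight-line code.

    #include <stddef.h>

    enum { ST_OUT, ST_IN_TAG, NSTATES };            /* toy HTML-ish states */
    enum { CL_LT, CL_GT, CL_OTHER, NCLASSES };      /* character classes   */

    /* "Branchy half": a dense transition table the tight loop walks. */
    static const unsigned char next_state[NSTATES][NCLASSES] = {
        [ST_OUT]    = { [CL_LT] = ST_IN_TAG, [CL_GT] = ST_OUT, [CL_OTHER] = ST_OUT },
        [ST_IN_TAG] = { [CL_LT] = ST_IN_TAG, [CL_GT] = ST_OUT, [CL_OTHER] = ST_IN_TAG },
    };

    /* "Straight-line half": per-state actions with no internal branching. */
    static void emit_text(char c) { (void)c; /* append c to the text buffer */ }
    static void emit_tag(char c)  { (void)c; /* append c to the tag buffer  */ }

    static int classify(char c) { return c == '<' ? CL_LT : c == '>' ? CL_GT : CL_OTHER; }

    void parse(const char *buf, size_t len) {
        int state = ST_OUT;
        for (size_t i = 0; i < len; i++) {          /* the tight dispatch loop */
            int cls = classify(buf[i]);
            if (state == ST_OUT && cls == CL_OTHER)    emit_text(buf[i]);
            if (state == ST_IN_TAG && cls == CL_OTHER) emit_tag(buf[i]);
            state = next_state[state][cls];
        }
    }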

Adrian Cochrane

@vertigo @millihertz In other words: yes, my hypothetical did rely on a code cache.

Even if I toyed with an alternate way of handling it!

Vertigo #$FF

@alcinnz @millihertz It's funny that you mentioned that. I have considered exactly this on several occasions, but we can actually generalize it. We can have one processor whose job it is to coordinate computations across various "thread units". Each thread unit processes instructions in a straight-ahead manner for as long as it can. The control processor then serves as a job coordinator. Performance can be enhanced by throwing more straight-ahead thread units into the mix.

If the control program tries to launch more threads than are available in hardware, it'll block until a thread has completed its task. In this way, the control processor itself always appears to be single-threaded.
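
A software analogy of that blocking behaviour, sketched with POSIX threads and a counting semaphore standing in for the pool of hardware thread units (the unit count and job loop are made up for illustration):

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define THREAD_UNITS 4            /* assumed number of hardware thread units */

    static sem_t free_units;          /* counts how many units are idle */

    /* A straight-ahead work unit: runs to completion, then frees its slot. */
    static void *thread_unit(void *arg) {
        long job = (long)arg;
        printf("unit running job %ld\n", job);
        sem_post(&free_units);        /* this unit becomes available again */
        return NULL;
    }

    int main(void) {
        sem_init(&free_units, 0, THREAD_UNITS);
        for (long job = 0; job < 10; job++) {
            /* The "control processor": blocks here if every unit is busy,
             * so from its own point of view it stays single-threaded. */
            sem_wait(&free_units);
            pthread_t t;
            pthread_create(&t, NULL, thread_unit, (void *)job);
            pthread_detach(t);
        }
        /* Drain: wait until every unit has finished before exiting. */
        for (int i = 0; i < THREAD_UNITS; i++)
            sem_wait(&free_units);
        return 0;
    }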


leah & asm & forth, oh my!

@vertigo @alcinnz there's also the problem that even if we do solve memory speed, signals can only go so fast across a circuit board before they end up getting out of sync, corrupted by noise, etc. it'd be far better to have static RAM and a little processor on the same chip, where they could keep up with each other - but that limits the size of the RAM (and also the complexity of the processor, but that's a good thing). it also means that adding more RAM would add more processing power... which could only be used if your system were sufficiently parallel to take advantage of it already


leah & asm & forth, oh my!

@vertigo @alcinnz basically, the transputer is the biggest missed boat in the history of computing

Adrian Cochrane

@millihertz @vertigo What I can say is: I thoroughly enjoyed thinking through reengineering an app from the hardware on up, & found it quite educational to write! I'm getting the impression this kind of imagination could be quite valuable to the future of computing!

I'm keen to do so again, & would love to see others' takes!

That said I don't consider myself a hardware designer...

Vertigo #$FF

@millihertz @alcinnz It is unfortunate that fabrication of logic and of RAM on the same die is exceptionally expensive to do.

But now that we've moved into the era of "chiplets", maybe we should revisit this architecture.

Adrian Cochrane

@millihertz Besides, how about if our software relied less on fast hardware? What if we removed a couple of layers of abstraction?

Some of these abstraction layers seem to be running around in circles...

A more nuanced memory model could also help...

Chip

@millihertz I’ve kind of come to that conclusion too. And relatedly, that there are probably a lot of vulnerabilities we’re not seeing because computer security researchers are rarely electrical engineers.
