@alcinnz @millihertz That's what synchronous RAM is. Random access times are still only good for around 14 to 20 million accesses per second (so, figure, 70 to 50 ns), so to go faster than that, an SDRAM will fetch a huge word of memory at once, and then parcel it out to the bus one smaller word at a time. A burst size of 8 to 16 isn't atypical these days, so with a 64-bit data path, you're looking at a (64*16)=1024-bit word inside the RAM chip itself.
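To make the burst idea concrete, here's a rough C model (the naming is entirely mine, not any real SDRAM interface): one slow internal fetch of an aligned 1024-bit chunk, then sixteen 64-bit beats handed out in sequence.

```c
/* Minimal sketch of burst-oriented reads; illustrative names only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WORDS_PER_BURST 16               /* 16 beats * 64 bits = 1024 bits */

static uint64_t dram[1024];              /* pretend this is the DRAM array */

/* One "activate + burst read": a single slow internal fetch of the wide
 * word, followed by WORDS_PER_BURST fast transfers on the external bus. */
static void burst_read(uint64_t addr, uint64_t out[WORDS_PER_BURST]) {
    uint64_t base = addr & ~(uint64_t)(WORDS_PER_BURST - 1); /* align down */
    memcpy(out, &dram[base], sizeof(uint64_t) * WORDS_PER_BURST);
}

int main(void) {
    for (int i = 0; i < 1024; i++) dram[i] = i;

    uint64_t beats[WORDS_PER_BURST];
    burst_read(37, beats);               /* fetches words 32..47 in one go */
    for (int i = 0; i < WORDS_PER_BURST; i++)
        printf("beat %2d: %llu\n", i, (unsigned long long)beats[i]);
    return 0;
}
```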
And, yes, if you can arrange your data to be serialized in such a manner, you can get native throughputs without the need for caches. The problem is that, unless you deal almost exclusively with vectors, it is almost never the case that you can stream data for that long. Consider that research into compilers shows that nearly all programs have an average "basic block" size of around 8 instructions. Meaning, the computer will typically run only about 8 instructions before it needs to process a conditional or unconditional branch.
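A quick sketch of that contrast in C (illustrative only, no timing measured here): a sequential array walk keeps every beat of a burst useful, while pointer chasing makes each address depend on the previous load and throws most of the burst away.

```c
/* Burst-friendly vs burst-hostile access patterns; a sketch, not a benchmark. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Burst-friendly: consecutive addresses, so every 64-bit beat of the
 * 1024-bit internal fetch gets consumed. */
uint64_t sum_array(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

struct node { uint64_t value; struct node *next; };

/* Burst-hostile: the next address depends on the last load, so the RAM
 * fetches a whole wide word and the CPU uses 64 bits of it. */
uint64_t sum_list(const struct node *p) {
    uint64_t s = 0;
    while (p) { s += p->value; p = p->next; }
    return s;
}

int main(void) {
    uint64_t a[16];
    struct node nodes[16];
    for (int i = 0; i < 16; i++) {
        a[i] = i;
        nodes[i].value = i;
        nodes[i].next = (i < 15) ? &nodes[i + 1] : NULL;
    }
    printf("array sum: %llu, list sum: %llu\n",
           (unsigned long long)sum_array(a, 16),
           (unsigned long long)sum_list(&nodes[0]));
    return 0;
}
```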
This is why loop unrolling and function inlining are such significant optimizations, and it's a contributing factor in why code tends to get larger over time, even for the same input source listing.
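For instance, here's a hand-written sketch of what the compiler effectively does when it unrolls by 8: you trade code size for a longer straight-line run between branches.

```c
/* Rolled vs 8x-unrolled loop; assumes n is a multiple of 8 for brevity. */
#include <stddef.h>
#include <stdint.h>

void scale_rolled(uint64_t *a, size_t n, uint64_t k) {
    for (size_t i = 0; i < n; i++)
        a[i] *= k;                       /* a few ops, then a branch */
}

void scale_unrolled8(uint64_t *a, size_t n, uint64_t k) {
    for (size_t i = 0; i < n; i += 8) {  /* one branch per 8 elements */
        a[i + 0] *= k;  a[i + 1] *= k;
        a[i + 2] *= k;  a[i + 3] *= k;
        a[i + 4] *= k;  a[i + 5] *= k;
        a[i + 6] *= k;  a[i + 7] *= k;
    }
}

int main(void) {
    uint64_t a[16];
    for (int i = 0; i < 16; i++) a[i] = i;
    scale_unrolled8(a, 16, 3);
    return (int)a[1];                    /* 1 * 3 = 3; just uses the result */
}
```

The unrolled version is visibly bigger, which is exactly the code-growth trade-off mentioned above.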
BTW, when I set my Kestrel Project's CPU and main bus speed to the 16 to 25 MHz range, it was largely due to the access speed of external memory. Going faster all but requires a cache, and caches in turn are optimized for synchronous memory.
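The arithmetic behind that, for anyone who wants to check it (a trivial sketch using only the figures from this thread):

```c
/* Bus cycle time vs asynchronous RAM access time, per the numbers above. */
#include <stdio.h>

int main(void) {
    double mhz[] = { 16.0, 25.0 };
    for (int i = 0; i < 2; i++)
        printf("%5.1f MHz -> %.1f ns per bus cycle\n",
               mhz[i], 1000.0 / mhz[i]);
    /* 16 MHz -> 62.5 ns, so a 50 ns RAM answers within one cycle;
     * 25 MHz -> 40.0 ns, so even a 50 ns RAM misses the cycle,
     * hence wait states or a cache. */
    return 0;
}
```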
@vertigo @millihertz In my case... I was discussing a "string-centric" system primarily for decoding HTML, audio, images, video, HTTP, etc. for display.
Though it probably helped that I described a hypothetical where I was rewriting everything!