I've been thinking quite a bit about a "Good Parallel Computer" which would overcome the worst limitations of GPUs, which I find increasingly frustrating. Basically, instead of launching compute shaders in a fixed (x, y, z) grid, you have a programmable controller which launches workgroups of several kinds as their inputs become ready, which enables queues and other dynamic scheduling patterns.
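To make the idea concrete, here is a toy CPU-side sketch in Python of a controller that launches different kinds of workgroups once their input queues have enough items. It is only an illustration under assumptions: the names `Stage`, `run`, `make_stages`, and the `batch` parameter are hypothetical and are not taken from any real GPU API or from the projects mentioned here, and the "launch" is just a sequential function call.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Stage:
    """One kind of workgroup (hypothetical model, not a real GPU API)."""
    name: str
    batch: int                      # items consumed per workgroup launch
    kernel: Callable[[list], list]  # returns (target_stage_or_None, item) pairs
    queue: deque = field(default_factory=deque)


def run(stages, seed_stage, seed_items):
    """Toy controller: launch a stage whenever its queue has input ready."""
    stages[seed_stage].queue.extend(seed_items)
    launches, results = [], []
    while True:
        ready = [s for s in stages.values() if s.queue]
        if not ready:
            break
        # Prefer stages with a full batch; otherwise flush a partial one.
        full = [s for s in ready if len(s.queue) >= s.batch]
        stage = full[0] if full else ready[0]
        n = min(stage.batch, len(stage.queue))
        items = [stage.queue.popleft() for _ in range(n)]
        launches.append((stage.name, n))
        for target, item in stage.kernel(items):
            if target is None:
                results.append(item)                # final output
            else:
                stages[target].queue.append(item)   # feed a downstream queue
    return launches, results


def split(items):
    # Route each item to a different kind of workgroup based on its data.
    return [("square" if x % 2 == 0 else "negate", x) for x in items]


def square(items):
    return [(None, x * x) for x in items]


def negate(items):
    return [(None, -x) for x in items]


def make_stages():
    return {
        "split": Stage("split", 4, split),
        "square": Stage("square", 2, square),
        "negate": Stage("negate", 2, negate),
    }


if __name__ == "__main__":
    launches, results = run(make_stages(), "split", range(8))
    print(launches)
    print(results)
```

The point of the sketch is that the amount of work per stage is decided by the data at run time, which is awkward to express when every dispatch has to be a grid whose size is fixed up front.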
I know of the GRAMPS paper, Vortex, and Tenstorrent. What other things are out there? Who should I be talking to?
Blog post before long.
@raph
How close to theoretical peak do you want to get? What shape of problem are you trying to optimize for?
Have you heard of Epic's Verse language project?