I've been thinking quite a bit about a "Good Parallel Computer" which would overcome the worst limitations of GPUs, which I find increasingly frustrating. Basically, instead of launching compute shaders in an (x, y, z) cube, you have a programmable controller which launches workgroups (multiple kinds) when inputs are ready, enabling queues and other things.

I know of the GRAMPS paper, Vortex, and Tenstorrent. What other things are out there? Who should I be talking to?

Blog post before long.

14 Sep 2023 at 1:25 | Open on mastodon.online
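To make the idea in the original post more concrete, here is a minimal Python sketch of a programmable launcher. It assumes a toy model in which each kernel has an input queue and a maximum workgroup size. The names `Kernel`, `Controller`, `ready`, and `launch` are hypothetical and do not come from any real GPU API. The controller fires a workgroup whenever a queue holds a full batch, and it flushes partial batches only when nothing else can run.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Kernel:
    name: str
    workgroup_size: int                  # max items one workgroup consumes (assumed)
    body: Callable[[list], list]         # maps a batch of inputs to a list of outputs
    downstream: Optional[str] = None     # kernel that receives the outputs, if any
    queue: deque = field(default_factory=deque)


class Controller:
    """Hypothetical launcher: instead of a fixed (x, y, z) grid, it launches
    a workgroup whenever a kernel's input queue is ready."""

    def __init__(self, kernels):
        self.kernels = {k.name: k for k in kernels}
        self.results = []   # outputs of kernels with no downstream
        self.launches = []  # (kernel name, batch size) for each launch

    def push(self, name, items):
        self.kernels[name].queue.extend(items)

    def ready(self, k, draining):
        # Launch policy: a full workgroup, or any leftovers once nothing else can run.
        return len(k.queue) >= k.workgroup_size or (draining and len(k.queue) > 0)

    def launch(self, k):
        n = min(k.workgroup_size, len(k.queue))
        batch = [k.queue.popleft() for _ in range(n)]
        self.launches.append((k.name, n))
        out = k.body(batch)
        if k.downstream is None:
            self.results.extend(out)
        else:
            self.push(k.downstream, out)

    def run(self):
        draining = False
        while True:
            runnable = [k for k in self.kernels.values() if self.ready(k, draining)]
            if runnable:
                self.launch(runnable[0])
                draining = False  # new inputs may have filled a downstream queue
            elif not draining and any(k.queue for k in self.kernels.values()):
                draining = True   # nothing full is left; allow partial workgroups
            else:
                break             # every queue is empty


# Usage: a two-stage pipeline, "scale" feeding "sum".
scale = Kernel("scale", 4, lambda xs: [2 * x for x in xs], downstream="sum")
total = Kernel("sum", 3, lambda xs: [sum(xs)])
ctl = Controller([scale, total])
ctl.push("scale", range(10))
ctl.run()
print(ctl.results)  # [6, 24, 42, 18]
```

Because `ready` is ordinary code rather than a fixed grid dimension, the launch policy (batch size, flush rule, which kernel goes first) can change without touching the kernels themselves.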

8 comments

David Neto

@raph
How close to theoretical peak do you want to get? What shape of problem are you trying to optimize for?
Have you heard of Epic's Verse language project?

14 Sep 2023 at 2:56 | Open on mastodon.gamedev.place

Raph Levien

@dneto Ideally very close to theoretical peak. Basically, problems with "interesting" parallelism: 2D rendering, of course, but also sparse matrix multiplication, parsers, and so on. As for Verse: yes, but that's at a much higher level than what I'm talking about. Futhark and Taichi are probably closer to the mark as languages that might compile down to run on GPC (as, of course, would PyTorch and the MLIR ecosystem).

14 Sep 2023 at 2:59 | Open on mastodon.online

Raph Levien

@dneto I realize what I posted here isn't very clear about exactly what I'm going for. There's more detail on my current thinking in a Zulip thread: https://xi.zulipchat.com/#narrow/stream/197075-gpu/topic/Vello-like.20pipeline.20on.20parallel.20CPU

14 Sep 2023 at 3:03 | Open on mastodon.online

David Neto

@raph
This made me think of the parallel kernels connected with real channels that Altera made about 10 years ago. It's in their OpenCL FPGA optimization guide. I don't recall whether the channel operations synchronized global memory writes; my hunch is they don't/didn't.

14 Sep 2023 at 4:11 | Open on mastodon.gamedev.place

Raph Levien

@dneto I'm interested in prior art, so pointers are welcome (I'll look into this). There's also CUDA streams, which is maybe the closest existing thing, though I haven't yet carefully studied the alternatives in the CUDA world.

14 Sep 2023 at 4:15 | Open on mastodon.online

David Neto

@raph
Intel presented some overviews of this at IWOCL.

E.g.

https://www.iwocl.org/wp-content/uploads/iwocl2017-andrew-ling-fpga-sdk.pdf

The CNN work was published more formally too.

14 Sep 2023 at 4:35 | Open on mastodon.gamedev.place

Aras Pranckevičius

@raph thoughts on D3D “work graphs” that are already out there? https://devblogs.microsoft.com/directx/d3d12-work-graphs-preview/

14 Sep 2023 at 4:01 | Open on mastodon.gamedev.place

Raph Levien

@aras Right, that's what basically started the exploration. I'm unhappy with them because of three limitations: no joins (yet, though that's planned), fixed size data only, and no ordering guarantees. That makes them pretty much unsuitable for Vello. The centerpiece of my idea is that the launch logic is programmable, so you can express joins and other things.

14 Sep 2023 at 4:03 | Open on mastodon.online
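The point about programmable launch logic can be illustrated with a join that also preserves ordering. This is a minimal sketch under stated assumptions, not how D3D work graphs or Vello actually work: `OrderedJoin` is a hypothetical launch rule that buffers two input streams keyed by an integer index. It fires the consumer for index `i` only once both inputs for `i` have arrived and every lower index has already fired.

```python
from typing import Callable, Dict, List, Tuple


class OrderedJoin:
    """Hypothetical launch rule: run the consumer for index i only when both
    inputs for i are present, and run indices in increasing order."""

    def __init__(self, consumer: Callable[[int, object, object], object]):
        self.consumer = consumer
        self.left: Dict[int, object] = {}
        self.right: Dict[int, object] = {}
        self.next_index = 0  # lowest index not yet launched (ordering guarantee)
        self.outputs: List[Tuple[int, object]] = []

    def arrive(self, side: str, index: int, value: object) -> None:
        # Buffer the input, then re-evaluate the launch condition.
        (self.left if side == "left" else self.right)[index] = value
        self._try_launch()

    def _try_launch(self) -> None:
        # Launch as many in-order consumers as the buffered inputs allow.
        while self.next_index in self.left and self.next_index in self.right:
            i = self.next_index
            a = self.left.pop(i)
            b = self.right.pop(i)
            self.outputs.append((i, self.consumer(i, a, b)))
            self.next_index += 1


# Usage: inputs arrive out of order from two producers.
join = OrderedJoin(lambda i, a, b: a + b)
events = [("right", 1, 10), ("left", 1, 1), ("left", 0, 2),
          ("right", 2, 30), ("right", 0, 20), ("left", 2, 3)]
for side, i, v in events:
    join.arrive(side, i, v)
print(join.outputs)  # [(0, 22), (1, 11), (2, 33)]
```

In this usage, both inputs for index 1 arrive before those for index 0, yet the consumer for index 1 is held back until index 0 has run.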