Raph Levien

I've uploaded a video of my "I want a good parallel computer" talk: youtube.com/watch?v=c52ziyKOAr

I also have a half-finished blog post where I intended to go into more detail, but I'm posting this now because I hope it can provoke some interesting discussion. I'll put a little more context here, and I'm also happy to answer questions.

(1/4)

12 comments
Raph Levien

One of the main things I can't do with current graphics APIs is run a 2D renderer within bounded memory, at least not without a fence and readback to the CPU, which can tank performance. The underlying algorithm can work within bounded memory, but you need to be able to dynamically dispatch the various parts of the problem and use queues to connect the pieces, which compute shaders can't do (see the sketch below). The recent development of work graphs can do queues and bounded memory, but...

(2/4)

Raph Levien

...can't sustain the ordering guarantees you need for correct 2D rendering.

I'm more bullish on work graphs now, especially after attending HPG (highperformancegraphics.org/20) and seeing the two work graph talks there. However, they need more baking; the current version is still pretty limited.

(3/4)
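
A rough CPU-side Rust sketch of the structure described in (2/4), not actual renderer code: two pipeline stages connected by a fixed-capacity queue, so the producer blocks rather than growing an unbounded intermediate buffer. This dynamic-dispatch-plus-queues shape is what compute shaders can't currently express.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // Capacity 64: the only intermediate storage between the two stages.
    let (tx, rx) = sync_channel::<u32>(64);

    // Stage 1: coarse pass producing work items (think per-tile commands).
    let producer = thread::spawn(move || {
        for item in 0..10_000u32 {
            // Blocks when the queue is full instead of growing a buffer.
            tx.send(item).unwrap();
        }
    });

    // Stage 2: fine pass consuming work items as they arrive.
    let consumer = thread::spawn(move || {
        let mut n = 0u64;
        while rx.recv().is_ok() {
            n += 1; // stand-in for per-tile rasterization work
        }
        n
    });

    producer.join().unwrap();
    println!("processed {} items", consumer.join().unwrap());
}
```

On a GPU the equivalent queue would live in device memory with producers and consumers scheduled dynamically, which is roughly what work graphs are starting to offer.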

Raph Levien

Another thing missing from the talk is a deeper look at the Cell architecture. I think that's every bit as relevant as Larrabee. A great introduction is Copetti's site (copetti.org/writings/consoles/).

(4/4)

Tom Forsyth

@raph Well now - I may have some very strong views on this! The mantra I constantly yelled at people when developing Larrabee was "don't build the Cell". I think we very much succeeded in that goal.

Raph Levien

@TomF I'd love to hear you expand on that, in any format you'd like. As I say, I have a half-finished blog post, and polishing that might be an opportunity to include your perspective, if nothing else by linking to something.

Tom Forsyth

@raph Just watching the talk now. I very much disagree with the "AVX-512 was too power hungry" statement - that was not a factor.

There's a robust argument to be made that requiring a 100% coherent memory fabric took more power than the GPU's much weaker fabric. On the other hand, your whole lecture is kinda wishing they HAD that fabric, so... :-)

The real problem was that it was 20 years too early. Ironically, what it absolutely destroyed contemporary GPUs at was very short AA lines and splines.

Raph Levien

@TomF I will be happy to update and correct the talk. There *might* be a bit of an element of Cunningham's Law there.

To respond, though: I don't think I need a coherent memory fabric; for the stuff I'm doing I'm fairly happy using atomics to indicate explicit communication between workgroups.

Interesting correction re texture queries. From my perspective I don't see a huge difference between "CISC instruction" and "send a packet" but from yours I can see it's pretty different.
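
A minimal sketch of what "atomics for explicit communication between workgroups" can look like, written here as a CPU analogy in Rust (real GPU code would use device-scope atomics in a shader language, and would also have to respect the GPU's weaker forward-progress guarantees):

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let payload = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    // "Workgroup" A: produce a value, then publish it with a release store.
    let (p, r) = (payload.clone(), ready.clone());
    let a = thread::spawn(move || {
        p.store(42, Ordering::Relaxed);   // write the data
        r.store(true, Ordering::Release); // signal: data is now visible
    });

    // "Workgroup" B: spin on the flag with acquire, then read the data.
    let (p, r) = (payload.clone(), ready.clone());
    let b = thread::spawn(move || {
        while !r.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        assert_eq!(p.load(Ordering::Relaxed), 42);
    });

    a.join().unwrap();
    b.join().unwrap();
    println!("handoff complete");
}
```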

Tom Forsyth

@raph Yup - unless you literally need the machine to run an off-the-shelf OS with very few changes, you clearly want to be able to bypass the coherent fabric for all sorts of traffic. It burns a lot of power and limits your bandwidth.

We had lots of plans for turning it off for certain areas of memory and making the traffic look more like a GPU's, but we never got the chance to implement those. Ah well.

Tom Forsyth

@raph By the way, I'm enjoying the whole discussion of your renderer because it sounds like all the same problems we had with the Larrabee renderer.

We had a tile-based renderer, also for load-balancing reasons, and we also had problems with potentially massive intermediate buffers. Even though the cores were general x86 cores that could indeed call malloc() in the middle of things (though note there's no backing store if you run out!), the overhead of that is huge, so you try to avoid it.
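
As an illustration of the intermediate-buffer problem (not Larrabee or Vello code): a tile binner with fixed per-tile capacity, so nothing allocates in the hot loop and overflowing tiles are flagged for a fallback pass.

```rust
const TILE: u32 = 16;   // tile size in pixels
const CAP: usize = 64;  // fixed per-tile primitive capacity

struct Tile {
    prims: [u32; CAP],  // indices of primitives touching this tile
    len: usize,
    overflowed: bool,
}

// Bin axis-aligned bounding boxes (x0, y0, x1, y1) into tiles without
// allocating per primitive; tiles that fill up are flagged so a
// fallback path could handle them later.
fn bin(prims: &[(u32, u32, u32, u32)], width: u32, height: u32) -> Vec<Tile> {
    let tw = (width + TILE - 1) / TILE;
    let th = (height + TILE - 1) / TILE;
    let mut tiles: Vec<Tile> = (0..tw * th)
        .map(|_| Tile { prims: [0; CAP], len: 0, overflowed: false })
        .collect();
    for (i, &(x0, y0, x1, y1)) in prims.iter().enumerate() {
        // Walk every tile the primitive's bounding box touches.
        for ty in y0 / TILE..=y1.min(height - 1) / TILE {
            for tx in x0 / TILE..=x1.min(width - 1) / TILE {
                let t = &mut tiles[(ty * tw + tx) as usize];
                if t.len < CAP {
                    t.prims[t.len] = i as u32;
                    t.len += 1;
                } else {
                    t.overflowed = true; // handled by a slower fallback pass
                }
            }
        }
    }
    tiles
}

fn main() {
    let prims = [(0u32, 0u32, 40u32, 40u32), (100, 100, 130, 110)];
    let tiles = bin(&prims, 256, 256);
    println!("{} tiles touched", tiles.iter().filter(|t| t.len > 0).count());
}
```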

Tom Forsyth

@raph Thanks - happy to collaborate in any way you want - feel free to send me email or whatever.

zellyn

@raph Hey Raph, just watched/listened to the talk. Amazing stuff. I have a comment, and a few questions, if you have time 🙂

Comment: Array of low-powered CPUs? TIS-100 did it first!!! :-)

Questions:
- What do you think of Mojo? IIUC, it's a big advance in "lowering computation graphs to heterogeneous hardware" (currently Python + SIMD, potentially arbitrary computation graphs)
- I was going to ask about your instincts for what the "right" language might look like, but you covered it in Q&A

Raph Levien

@zellyn TIS-100 wasn't first; I think of the Connection Machine, but there were others.

I'm cautiously optimistic about Mojo, but we haven't seen their GPU game yet. If you can really express parallel algorithms, then I think it takes the game. But so far MLIR has been disappointing for non-AI tasks; they need to come up with better dialects.

There are really no other good candidates for general-purpose languages that compile to GPU (though there are some compelling research prototypes, and I'm also liking Slang).
