Raph Levien

One of the main things I can't do in the current graphics API is run a 2D renderer within bounded memory, at least without a fence and readback to the CPU, which could tank performance. The underlying algorithm can do it, but you need to be able to dynamically dispatch the various stages and use queues to connect the pieces, which compute shaders can't do. The recent development of work graphs can do queues and bounded memory, but...

(2/4)

Raph Levien

...can't sustain the ordering guarantees you need for correct 2D rendering.

I'm more bullish on work graphs now, especially after attending HPG (highperformancegraphics.org/20) and seeing the two work graph talks there. However, they need more baking; the current version is still pretty limited.

(3/4)

Raph Levien

Another thing missing from this talk is a deeper look at the Cell architecture. I think that's every bit as relevant as Larrabee. A great introduction is Copetti's site (copetti.org/writings/consoles/).

(4/4)

Tom Forsyth

@raph Well now - I may have some very strong views on this! The mantra I constantly yelled at people when developing Larrabee was "don't build the Cell". I think we very much succeeded in that goal.

Raph Levien

@TomF I'd love to hear you expand on that, in any format you'd like. As I say, I have a half-finished blog post, and polishing that might be an opportunity to include your perspective, if nothing else by linking to something.

Tom Forsyth

@raph Just watching the talk now. I very much disagree with the "AVX-512 was too power hungry" statement - that was not a factor.

There's a robust argument to be made that requiring a 100% coherent memory fabric took more power than the GPU's much weaker fabric. On the other hand, your whole lecture is kinda wishing they HAD that fabric, so... :-)

The real problem was that it was 20 years too early. Ironically, the one thing it absolutely destroyed contemporary GPUs at was very short AA lines and splines.

Raph Levien

@TomF I will be happy to update and correct the talk. There *might* be a bit of an element of Cunningham's Law there.

To respond though, I don't think I need a coherent memory fabric; for the stuff I'm doing, I'm fairly happy using atomics to indicate explicit communication between workgroups.

Interesting correction re texture queries. From my perspective I don't see a huge difference between "CISC instruction" and "send a packet" but from yours I can see it's pretty different.
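(The "atomics instead of a coherent fabric" handoff Raph mentions can be sketched on the CPU. This is not code from his renderer - the names `BoundedQueue` and `stream_sum` are invented for illustration - but it shows the shape of the idea: one thread plays the producer workgroup, another the consumer, and a fixed-size ring buffer stands in for a bounded GPU allocation, with acquire/release atomics as the only synchronization.)

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Fixed-capacity single-producer/single-consumer queue built only on
/// atomics. The bounded ring buffer stands in for a fixed GPU buffer
/// allocation; no locks, no dynamic growth.
struct BoundedQueue {
    slots: Vec<AtomicU32>,
    head: AtomicUsize, // next slot the consumer will read
    tail: AtomicUsize, // next slot the producer will write
}

impl BoundedQueue {
    fn new(capacity: usize) -> Self {
        BoundedQueue {
            slots: (0..capacity).map(|_| AtomicU32::new(0)).collect(),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Returns false when full: memory is bounded, so the producer
    /// must back off rather than allocate.
    fn push(&self, v: u32) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        if tail - self.head.load(Ordering::Acquire) == self.slots.len() {
            return false; // full
        }
        self.slots[tail % self.slots.len()].store(v, Ordering::Relaxed);
        // Release-publish the slot so the consumer's Acquire load sees it.
        self.tail.store(tail + 1, Ordering::Release);
        true
    }

    fn pop(&self) -> Option<u32> {
        let head = self.head.load(Ordering::Relaxed);
        if head == self.tail.load(Ordering::Acquire) {
            return None; // empty
        }
        let v = self.slots[head % self.slots.len()].load(Ordering::Relaxed);
        // Release the slot back to the producer for reuse.
        self.head.store(head + 1, Ordering::Release);
        Some(v)
    }
}

/// Stream `n` items through a queue of the given capacity and return
/// their sum on the consumer side.
fn stream_sum(n: u32, capacity: usize) -> u32 {
    let q = Arc::new(BoundedQueue::new(capacity));
    let producer = {
        let q = Arc::clone(&q);
        thread::spawn(move || {
            for i in 0..n {
                while !q.push(i) {
                    thread::yield_now(); // bounded: wait, don't grow
                }
            }
        })
    };
    let (mut sum, mut received) = (0, 0);
    while received < n {
        if let Some(v) = q.pop() {
            sum += v;
            received += 1;
        } else {
            thread::yield_now();
        }
    }
    producer.join().unwrap();
    sum
}

fn main() {
    // 0 + 1 + ... + 99 = 4950, moved through an 8-slot buffer.
    println!("{}", stream_sum(100, 8));
}
```

On a GPU the picture differs in important ways - forward progress between workgroups is not guaranteed the way it is between OS threads, which is part of why work graphs are interesting - but the bounded-buffer-plus-atomics structure is the same.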

Tom Forsyth

@raph Yup - unless you literally need the machine to run an off-the-shelf OS with very few changes, you clearly want to be able to bypass the coherent fabric for all sorts of traffic. It burns a lot of power and limits your bandwidth.

We had lots of plans for turning it off for certain areas of memory and making the traffic look more like a GPU's, but we never got the chance to implement those. Ah well.

Tom Forsyth

@raph By the way, enjoying the whole discussion of your renderer because it sounds like all the same problems we had with the Larrabee renderer.

We had a tile-based renderer, also for load-balancing reasons, and we also had problems with potentially massive intermediate buffers. Even though the cores were general x86 cores that could indeed call malloc() in the middle of things (though note there's no backing store if you run out!), the overhead of that is huge, so you try to avoid it.

Tom Forsyth

@raph Thanks - happy to collaborate in any way you want - feel free to send me email or whatever.
