Email or username:

Password:

Forgot your password?
Top-level
Az

@raph The post optimizes with workgroup shared memory when possible but fallbacks on a global implementation if not possible, depending on the step size. In my case I'm sorting 1M floats (not necessarily a power of 2) with a few thousand workgroups.

2 comments
Raph Levien

@Az Right, I've read the blog post more closely since last replying to you, and see pretty clearly how it works. Do you have any kind of performance numbers for that?

The other thing I'm looking at (based on another tip) is OneSweep, which is state of the art from Nvidia:

arxiv.org/pdf/2206.01784.pdf

Single pass might work here, even on WebGPU, because the value plus flags can fit in 32 bits.

Az

@raph I'm still very new to this so I won't be able to say much, and my implementation doesn't exploit locality yet. (which also helps diminishing dispatches). Still my rough implementation seems to take around 5ms for 1M elements on a 3080Ti. Take it with a grain of salt though, I'm sure it will do way better once I find the time to improve it (and my profiling isn't the most accurate).

Go Up