Email or username:

Password:

Forgot your password?
Raph Levien

Are you interested in working on parallel sorting algorithms in #WebGPU? If so, get in touch. I'm especially interested in segmented merge sort (moderngpu.github.io/segsort.ht), but other classic parallel sort algorithms including bitonic and radix are on the table.

I would find this fascinating to work on myself (and that still might happen), but my time is pretty packed. Even so, I'm excited about the prospects to where I might be able to fund some work out of my own pocket. Let's discuss.

12 comments
synlogic

@raph I might be interested. perf/scalability is a specialty of mine and fairly good at concurrency/multi-threading work in past. author of a FOSS Golang lib for latency instrum, controls and reporting (mkramlich/LatLearn on Github). I'd be a newb at Rust, if required, but decades in other langs prior including C. wrote my own 2D/realtime graphics game/GUI engines many times as well

Will Usher

@raph This would be a great effort, a data-parallel algorithms library like Thrust/CUB for WebGPU would make developing new algorithms a lot easier. In a previous publication we had to make our own radix sort by key, which might be interesting to look at: github.com/Twinklebear/webgpu- . It predates WGSL, so the shaders are GLSL and would need to be ported. I know @mighdoll has been working on some stuff in this area too, on re-usable GPU compute primitives.

Az

@raph Funny, I wrote a (naive) bitonic sort in #wgpu no later than on Tuesday! Two resources that helped me were the webgpu sample (in typescript) webgpu.github.io/webgpu-sample and a great blogpost by @tgfrerer poniesandlight.co.uk/reflect/b
Have fun, seeing those pixels getting sorted is really satisfying.

Raph Levien

@Az @tgfrerer Very cool! My understanding is that this implements only a single workgroup, while one of my constraints is dealing with a large and potentially quite variable number of elements. Also, from what I'm seeing so far, bitonic is not a great match from this application, but at the end of the day you have to measure and compare.

Az

@raph The post optimizes with workgroup shared memory when possible but fallbacks on a global implementation if not possible, depending on the step size. In my case I'm sorting 1M floats (not necessarily a power of 2) with a few thousand workgroups.

Raph Levien

@Az Right, I've read the blog post more closely since last replying to you, and see pretty clearly how it works. Do you have any kind of performance numbers for that?

The other thing I'm looking at (based on another tip) is OneSweep, which is state of the art from Nvidia:

arxiv.org/pdf/2206.01784.pdf

Single pass might work here, even on WebGPU, because the value plus flags can fit in 32 bits.

Az

@raph I'm still very new to this so I won't be able to say much, and my implementation doesn't exploit locality yet. (which also helps diminishing dispatches). Still my rough implementation seems to take around 5ms for 1M elements on a 3080Ti. Take it with a grain of salt though, I'm sure it will do way better once I find the time to improve it (and my profiling isn't the most accurate).

David Neto

@raph it would be awesome to have a bunch of mostly adaptable demos of key algorithms.
Matrix multiply seems important.
And performance tuning across devices.

There is a nifty little project that does a few things like this for Vulkan compute.

github.com/google/uVkCompute

And a nice demo of NVidia's cooperative matrix Vulkan extension.

github.com/jeffbolznv/vk_coope

@raph it would be awesome to have a bunch of mostly adaptable demos of key algorithms.
Matrix multiply seems important.
And performance tuning across devices.

There is a nifty little project that does a few things like this for Vulkan compute.

github.com/google/uVkCompute

And a nice demo of NVidia's cooperative matrix Vulkan extension.

Raph Levien

@dneto Ah, uVkCompute looks good, I agree an analog of that for WebGPU would be great.

I've thought seriously about doing the prefix sum part of that (and dipped my toe into it in the piet-gpu days), and could possibly be cajoled if someone else would run the project.

Now I'm reading up on the sort literature, and it's a pretty deep rabbithole. On CUDA, Onesweep looks very good, but I might be finding out that, for this algorithm, the gap between CUDA and WebGPU is like a yawning chasm.

Raph Levien

I got a bunch of responses to this, some private. Thanks for those!

I also did a bit of an exploration myself, including a bit of a fusion of FidelityFX sort and Onesweep ported to WebGPU, with the warp-local multi-split adapted to use shared memory instead of subgroups (warp operations). Details are in this Zulip thread: xi.zulipchat.com/#narrow/strea

This is a preliminary investigation, which shows that performant WebGPU sorting is likely feasible. I hope it gets that conversation started.

I got a bunch of responses to this, some private. Thanks for those!

I also did a bit of an exploration myself, including a bit of a fusion of FidelityFX sort and Onesweep ported to WebGPU, with the warp-local multi-split adapted to use shared memory instead of subgroups (warp operations). Details are in this Zulip thread: xi.zulipchat.com/#narrow/strea

Jed Brown

@raph Thanks for this deep dive. It has me one again thinking about structural mechanics simulation on WebGPU.

Jonathan Olson

@raph I’ve been working a bit on similar things, have a radix sort working, and have most things set for a merge sort (was planning bitonic for raked workgroups, then merge for the rest, but I’m seeing comments that bitonic approaches work for the whole thing it seems)

Go Up