@raph it would be awesome to have a bunch of mostly adaptable demos of key algorithms.
Matrix multiply seems important.
And performance tuning across devices.
There is a nifty little project that does a few things like this for Vulkan compute.
https://github.com/google/uVkCompute
And a nice demo of NVIDIA's cooperative matrix Vulkan extension.
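To make the matmul idea concrete, here's a toy CPU sketch of the tiling pattern a WebGPU matrix-multiply demo would typically use: each "workgroup" computes one TILE x TILE block of C, marching tiles of A and B along the k dimension (the names and tile size here are illustrative, not from uVkCompute or any existing codebase).

```python
TILE = 2  # illustrative tile size; real shaders pick this per-device

def matmul_tiled(a, b):
    """Tiled matrix multiply for square n x n matrices, n divisible by TILE.
    One (bi, bj) pair plays the role of one GPU workgroup."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, TILE):
        for bj in range(0, n, TILE):
            for bk in range(0, n, TILE):  # march tiles along k
                # On a GPU these inner tiles would live in workgroup
                # (shared) memory; here we just index directly.
                for i in range(TILE):
                    for j in range(TILE):
                        acc = 0.0
                        for k in range(TILE):
                            acc += a[bi + i][bk + k] * b[bk + k][bj + j]
                        c[bi + i][bj + j] += acc
    return c

print(matmul_tiled([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

A real demo would then vary TILE and workgroup size per device, which is where the cross-device performance-tuning angle comes in.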
@dneto Ah, uVkCompute looks good, I agree an analog of that for WebGPU would be great.
I've thought seriously about doing the prefix sum part of that (and dipped my toe into it in the piet-gpu days), and could possibly be cajoled if someone else would run the project.
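For context, the workgroup-level building block of a prefix-sum demo is a barrier-synchronized scan; a minimal CPU sketch of the Hillis-Steele variant (names illustrative, not from piet-gpu):

```python
def inclusive_scan(data):
    """Hillis-Steele inclusive scan: log2(n) passes, each pass adding in
    the element 2^pass slots to the left. Each pass mirrors one
    barrier-separated step inside a GPU workgroup."""
    buf = list(data)
    n = len(buf)
    offset = 1
    while offset < n:
        nxt = buf[:]  # double-buffer, as shaders do to avoid read/write races
        for i in range(offset, n):
            nxt[i] = buf[i] + buf[i - offset]
        buf = nxt
        offset *= 2
    return buf

print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# -> [3, 4, 11, 11, 15, 16, 22, 25]
```

The hard part of a full demo is stitching workgroup scans into a global scan (e.g. decoupled look-back), which is where WebGPU's memory-model constraints start to bite.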
Now I'm reading up on the sort literature, and it's a pretty deep rabbit hole. On CUDA, Onesweep looks very good, but I might be finding out that, for this algorithm, the gap between CUDA and WebGPU is a yawning chasm.
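For anyone following along, the digit-binning core that Onesweep builds on is an ordinary LSD radix-sort pass: histogram, exclusive prefix sum of bucket counts, stable scatter. A CPU sketch (illustrative names; Onesweep's actual contribution is fusing the global scan into a single chained pass, which is the part that's hard to express in WebGPU):

```python
def radix_sort(keys, bits=32, radix_bits=8):
    """LSD radix sort of non-negative ints, one digit per pass."""
    buckets = 1 << radix_bits
    for shift in range(0, bits, radix_bits):
        # 1. Histogram of the current digit.
        counts = [0] * buckets
        for k in keys:
            counts[(k >> shift) & (buckets - 1)] += 1
        # 2. Exclusive prefix sum gives each bucket's base offset --
        #    this is the step a GPU version needs a global scan for.
        offsets = [0] * buckets
        total = 0
        for d in range(buckets):
            offsets[d], total = total, total + counts[d]
        # 3. Stable scatter into the output buffer.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & (buckets - 1)
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# -> [2, 24, 45, 66, 75, 90, 170, 802]
```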