@Az Right, I've read the blog post more closely since last replying to you, and see pretty clearly how it works. Do you have any kind of performance numbers for that?
The other thing I'm looking at (based on another tip) is OneSweep, which is state of the art from Nvidia:
https://arxiv.org/pdf/2206.01784.pdf
Single pass might work here, even on WebGPU, because the value plus flags can fit in 32 bits.
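A rough sketch of why the 32-bit fit matters: in a single-pass (chained scan with look-back) scheme, each tile publishes its status flag and its value together, so one atomic 32-bit store or load is enough and no wider atomics are needed. The sketch below simulates that on the CPU. The 2-bit flag / 30-bit value split, the flag names, and the `pack`, `unpack`, and `lookback` helpers are all assumptions for illustration, not code from the paper or from either implementation discussed here.

```python
# Assumed layout: top 2 bits hold the tile status, low 30 bits hold the value.
FLAG_NOT_READY = 0  # tile has not published anything yet
FLAG_REDUCTION = 1  # value is this tile's local count only
FLAG_INCLUSIVE = 2  # value is the inclusive prefix up to and including this tile

VALUE_BITS = 30
VALUE_MASK = (1 << VALUE_BITS) - 1


def pack(flag, value):
    """Pack a status flag and a value into one 32-bit word."""
    assert 0 <= value <= VALUE_MASK, "value must fit in 30 bits"
    return (flag << VALUE_BITS) | value


def unpack(word):
    """Split a packed word back into (flag, value)."""
    return word >> VALUE_BITS, word & VALUE_MASK


def lookback(tiles, i):
    """Exclusive prefix for tile i, walking backwards over published words.

    Returns None if a predecessor has not published yet; a real GPU kernel
    would spin on an atomic load of that word instead.
    """
    total = 0
    j = i - 1
    while j >= 0:
        flag, value = unpack(tiles[j])
        if flag == FLAG_NOT_READY:
            return None
        total += value
        if flag == FLAG_INCLUSIVE:
            break  # everything before tile j is already folded into value
        j -= 1
    return total
```

With this split the value is capped at 2^30 - 1, so it only works while the per-digit counts stay under roughly a billion.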
@raph I'm still very new to this, so I can't say much, and my implementation doesn't exploit locality yet (which would also help reduce the number of dispatches). Still, my rough implementation seems to take around 5 ms for 1M elements on a 3080 Ti. Take that with a grain of salt, though: I'm sure it will do much better once I find the time to improve it, and my profiling isn't the most accurate.