@dneto I realize what I posted here isn't very clear about exactly what I'm going for. There's more detail of my current thinking in a Zulip thread: https://xi.zulipchat.com/#narrow/stream/197075-gpu/topic/Vello-like.20pipeline.20on.20parallel.20CPU
@dneto I'm interested in prior art, so pointers are welcome (I'll look into this). There's also CUDA streams, which is maybe the closest existing thing, though I haven't yet carefully studied the alternatives in CUDA world.
@raph E.g. https://www.iwocl.org/wp-content/uploads/iwocl2017-andrew-ling-fpga-sdk.pdf The CNN work was published more formally too.
@raph
This made me think of the parallel kernels connected with real channels that Altera made about 10 years ago. It's in their OpenCL FPGA optimization guide. I don't recall whether the channel operations synchronized global memory writes; my hunch is they don't/didn't.