@alcinnz @millihertz It's funny that you mentioned that. I have on several occasions thought about exactly this, but we can actually generalize it. We can have one processor whose job is to control computations on various "thread units". Each thread unit processes instructions in a straight-ahead manner for as long as it can. The control processor then serves as a job coordinator. Performance can be enhanced by throwing more straight-ahead thread units into the mix.

If the control program tries to launch more threads than are available in hardware, it'll block until a thread has completed its task. In this way, the control processor itself always appears to be single-threaded.
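
To make the blocking behaviour concrete, here's a rough software sketch of the idea, not a hardware design. It assumes OS threads stand in for the thread units and a counting semaphore stands in for "units available in hardware". The names `ControlProcessor`, `launch`, `wait_all`, and `demo` are made up for illustration.

```python
import threading
import time


class ControlProcessor:
    """Hypothetical job coordinator: launch() blocks while every unit is busy."""

    def __init__(self, num_units):
        # One semaphore permit per (simulated) hardware thread unit.
        self._free_units = threading.Semaphore(num_units)
        self._workers = []
        self._lock = threading.Lock()
        self.running = 0
        self.peak_running = 0

    def launch(self, task, *args):
        # Blocks here until some thread unit has finished its task.
        self._free_units.acquire()
        worker = threading.Thread(target=self._run, args=(task, args))
        self._workers.append(worker)
        worker.start()

    def _run(self, task, args):
        with self._lock:
            self.running += 1
            self.peak_running = max(self.peak_running, self.running)
        try:
            task(*args)  # the unit runs straight ahead until it is done
        finally:
            with self._lock:
                self.running -= 1
            self._free_units.release()  # hand the unit back to the coordinator

    def wait_all(self):
        for worker in self._workers:
            worker.join()


def demo(num_units=2, num_jobs=6):
    """Run more jobs than units; return sorted results and peak concurrency."""
    results = []
    results_lock = threading.Lock()

    def job(n):
        time.sleep(0.01)  # stand-in for real work
        with results_lock:
            results.append(n * n)

    control = ControlProcessor(num_units)
    for n in range(num_jobs):
        control.launch(job, n)  # the control program itself stays sequential
    control.wait_all()
    return sorted(results), control.peak_running
```

Note that the loop in `demo` reads like plain single-threaded code: the only place it ever waits is inside `launch`, which is the point. Raising `num_units` should shorten the total run time without changing the control program at all.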