Email or username:

Password:

Forgot your password?
Devil Lu Linvega

Let's say you're implementing a map function to run in parallel, how do you decide how many threads to spawn. There seems to be a pretty heavy cost to spawning a thread per cell when the list is very long, instead of having each thread process a few cells each. What's this topic called? What do I need to search for best practices?

17 comments | Expand all CWs
wakame

@neauoire

IMHO this goes deep into algorithm engineering. And likely depends highly on your architecture and even the function you are applying.

E.g. for sorting algorithms, quicksort performs badly if the size of your data exceeds the CPU cache (and then again if it exceeds a memory page etc.).
This is because quicksort jumps around in memory a lot.

Merge sort on the other hand work mostly linear, so it performs better independent of the concrete numbers for cache and mem sizes.

That said: Depending on the kind of calculation you are doing, it might be easiest to just test the performance of the code with different numbers of threads.

But I am curious about other answers :blobcatgiggle:

@neauoire

IMHO this goes deep into algorithm engineering. And likely depends highly on your architecture and even the function you are applying.

E.g. for sorting algorithms, quicksort performs badly if the size of your data exceeds the CPU cache (and then again if it exceeds a memory page etc.).
This is because quicksort jumps around in memory a lot.

Devil Lu Linvega

@wakame In this case I'm just modifying each cell, not even moving data. I was trying to figure in how many threads I should split the job in. I've tried all sorts of things, and for some reason, a bit more more threads than the number of cores my laptop has, is faster than exactly the number that I have. So I thought maybe there was some logic to it.

wakame

@neauoire I am sure there is. Reminds me of some work a friend of mine did many years ago. He was testing a graphics pipeline with several threads and... the results were mostly counter-intuitive.

Since your computer likely also use threads for UI etc. The cores likely benefit from doing "two things at (almost) once".

There is at least some memory management involved, for getting the code into the CPU, and moving data around means "waiting" for the CPU.

So
thread_count = 2*cores - os_thread_count
seems like a good hypothesis to me.

@neauoire I am sure there is. Reminds me of some work a friend of mine did many years ago. He was testing a graphics pipeline with several threads and... the results were mostly counter-intuitive.

Since your computer likely also use threads for UI etc. The cores likely benefit from doing "two things at (almost) once".

Kartik Agaram

@neauoire @wakame Aaah, I really need to click on the top-level message before reading.

The rough reasoning is: your CPU is never perfectly busy. Lots of things can cause threads to stall. An extra thread or two can keep the CPU going when other threads stall.

But too many threads and the CPU wastes time trying to decide which one to do next.

Paul Lalonde

@neauoire OpenMP calls this a parallel-for; it detects non-dependencies between elements and unrolls/parallelizes as appropriate.
Usual advice applies: use a library first, unless you want the rabbit hole, in which case measure your single-thread perf, thread launch cost, and back-of-the-envelope tradeoffs, then measure again. And again.

mcc

@neauoire I would think spawning one thread per CPU and then having them sleep until work is ready to be sent to them would be the regular way to do things

Devil Lu Linvega

@mcc I hadn't considered to poll the number of cores to make my implementation responsive :)

Mauve 👁💜

@neauoire @mcc I usually do n-1 to leave a CPU for my OS's renderer. Else it can lock up a system something fierce if your processing is heavy enough. 😅

Renaud Bédard

@neauoire that's called data parallelism!
en.wikipedia.org/wiki/Data_par

The rule of thumb (afaik) is to partition the array such that each hardware thread on your CPU has an equal portion of the work to do.

Devil Lu Linvega

@renaudbedard YES! That's it! Thank you, that's what I was looking for.

PsySal

@neauoire Usual guidance is 2x number of CPUs. Split the data up beforehamf or, depending, have a queue of work and workers peel the work off when they are free (second option makes more sense when workers sometimes block on i/o, mutexes, etc.)

John Kaniarz

@neauoire it’s usually solved with a thread pool. Typically you’d spawn 1 thread per core at the start of your program and have them ready to go. Then for your map you divide the work into chunks.

If every item takes the same effort then for N cores you give each core 1/N items, and each thread does their share in a for loop. If the work is in uneven, you divide the work into 2-3x smaller chunks on a queue and the threads will grab as the go.

Trammell Hudson

@neauoire the Cilk parallel language is built around this idea and uses a work-stealing queue that tries to maximise data cache hits for the processing. en.wikipedia.org/wiki/Cilk

tinspin

@neauoire If you have a perfectly async. system you never want more threads than cores. There simply is no point. Unfortunately the world is sync. even processors are sync. with a clock. Now if you really want to think about something that will tickle your brain try to imagine an async. processor with no clock. It has been done apparently but I still fail to imagine it.

meejah

@neauoire Absolutely not a boring question :) that's like half of all programming.

Go Up