Devine Lu Linvega

Devine Lu Linvega's posts Post Back to profile

Boring Programming Question

Let's say you're implementing a map function to run in parallel, how do you decide how many threads to spawn. There seems to be a pretty heavy cost to spawning a thread per cell when the list is very long, instead of having each thread process a few cells each. What's this topic called? What do I need to search for best practices?

Like 4 Jul 2023 at 2:06 | Open on merveilles.town

17 comments | Expand all CWs

wakame

Boring Programming Question

@neauoire

IMHO this goes deep into algorithm engineering. And likely depends highly on your architecture and even the function you are applying.

E.g. for sorting algorithms, quicksort performs badly if the size of your data exceeds the CPU cache (and then again if it exceeds a memory page etc.).
This is because quicksort jumps around in memory a lot.

Merge sort on the other hand work mostly linear, so it performs better independent of the concrete numbers for cache and mem sizes.

That said: Depending on the kind of calculation you are doing, it might be easiest to just test the performance of the code with different numbers of threads.

But I am curious about other answers :blobcatgiggle:

@neauoire

IMHO this goes deep into algorithm engineering. And likely depends highly on your architecture and even the function you are applying.

Expand text...

4 Jul 2023 at 2:18 | Open on tech.lgbt

Devine Lu Linvega

Boring Programming Question

@wakame In this case I'm just modifying each cell, not even moving data. I was trying to figure in how many threads I should split the job in. I've tried all sorts of things, and for some reason, a bit more more threads than the number of cores my laptop has, is faster than exactly the number that I have. So I thought maybe there was some logic to it.

4 Jul 2023 at 2:21 | Open on merveilles.town

wakame

Boring Programming Question

@neauoire I am sure there is. Reminds me of some work a friend of mine did many years ago. He was testing a graphics pipeline with several threads and... the results were mostly counter-intuitive.

Since your computer likely also use threads for UI etc. The cores likely benefit from doing "two things at (almost) once".

There is at least some memory management involved, for getting the code into the CPU, and moving data around means "waiting" for the CPU.

So
thread_count = 2*cores - os_thread_count
seems like a good hypothesis to me.

@neauoire I am sure there is. Reminds me of some work a friend of mine did many years ago. He was testing a graphics pipeline with several threads and... the results were mostly counter-intuitive.

Since your computer likely also use threads for UI etc. The cores likely benefit from doing "two things at (almost) once".

Expand text...

4 Jul 2023 at 2:30 | Open on tech.lgbt

Kartik Agaram

Boring Programming Question

@neauoire @wakame Aaah, I really need to click on the top-level message before reading.

The rough reasoning is: your CPU is never perfectly busy. Lots of things can cause threads to stall. An extra thread or two can keep the CPU going when other threads stall.

But too many threads and the CPU wastes time trying to decide which one to do next.

4 Jul 2023 at 3:06 | Open on merveilles.town

Paul Lalonde

Boring Programming Question

@neauoire OpenMP calls this a parallel-for; it detects non-dependencies between elements and unrolls/parallelizes as appropriate.
Usual advice applies: use a library first, unless you want the rabbit hole, in which case measure your single-thread perf, thread launch cost, and back-of-the-envelope tradeoffs, then measure again. And again.

4 Jul 2023 at 3:06 | Open on wandering.shop

mcc

Boring Programming Question

@neauoire I would think spawning one thread per CPU and then having them sleep until work is ready to be sent to them would be the regular way to do things

4 Jul 2023 at 3:10 | Open on mastodon.social

Devine Lu Linvega

Boring Programming Question

@mcc I hadn't considered to poll the number of cores to make my implementation responsive :)

4 Jul 2023 at 4:34 | Open on merveilles.town

Mauve 👁💜

Boring Programming Question

@neauoire @mcc I usually do n-1 to leave a CPU for my OS's renderer. Else it can lock up a system something fierce if your processing is heavy enough. 😅

4 Jul 2023 at 6:42 | Open on mastodon.mauve.moe

Renaud Bédard

Boring Programming Question

@neauoire that's called data parallelism!
https://en.wikipedia.org/wiki/Data_parallelism

The rule of thumb (afaik) is to partition the array such that each hardware thread on your CPU has an equal portion of the work to do.

4 Jul 2023 at 3:19 | Open on merveilles.town

Devine Lu Linvega

Boring Programming Question

@renaudbedard YES! That's it! Thank you, that's what I was looking for.

4 Jul 2023 at 4:26 | Open on merveilles.town

ティージェーグレェ

Boring Programming Question

@neauoire @renaudbedard

Meanwhile here's the pertinent man section for xargs:

https://www.gnu.org/software/findutils/manual/html_node/find_html/Controlling-Parallelism.html

4 Jul 2023 at 4:36 | Open on rap.social

PsySal

Boring Programming Question

@neauoire Usual guidance is 2x number of CPUs. Split the data up beforehamf or, depending, have a queue of work and workers peel the work off when they are free (second option makes more sense when workers sometimes block on i/o, mutexes, etc.)

4 Jul 2023 at 3:47 | Open on mastodon.social

John Kaniarz

Boring Programming Question

@neauoire it’s usually solved with a thread pool. Typically you’d spawn 1 thread per core at the start of your program and have them ready to go. Then for your map you divide the work into chunks.

If every item takes the same effort then for N cores you give each core 1/N items, and each thread does their share in a for loop. If the work is in uneven, you divide the work into 2-3x smaller chunks on a queue and the threads will grab as the go.

4 Jul 2023 at 4:22 | Open on mastodon.gamedev.place

Trammell Hudson

Boring Programming Question

@neauoire the Cilk parallel language is built around this idea and uses a work-stealing queue that tries to maximise data cache hits for the processing. https://en.wikipedia.org/wiki/Cilk

4 Jul 2023 at 6:15 | Open on social.v.st

David JONES

Boring Programming Question

@neauoire 4

4 Jul 2023 at 6:42 | Open on typo.social

tinspin

Boring Programming Question

@neauoire If you have a perfectly async. system you never want more threads than cores. There simply is no point. Unfortunately the world is sync. even processors are sync. with a clock. Now if you really want to think about something that will tickle your brain try to imagine an async. processor with no clock. It has been done apparently but I still fail to imagine it.

4 Jul 2023 at 6:47 | Open on chaos.social

meejah

Boring Programming Question

@neauoire Absolutely not a boring question :) that's like half of all programming.

4 Jul 2023 at 8:11 | Open on mastodon.social