Does the number of threads increase the CPU execution time?


I was doing an exercise in C of matrix multiplication using threads and I realized that as I increased the number of threads the same happened with the CPU time. My computer has 2 cores, each of which can run two threads (Intel Hyper-Threading), but why exactly does this happen?

In: Technology

There’s a lot of work being done behind the scenes to share time on your CPU. The OS and every background app get time to run code on the CPU, even while your main app runs. If your CPU can only run one thread at a time, the OS has to keep switching between applications to check whether each one has code to run.

With multiple threads, you can keep active code running, but if it ever performs a task that needs to wait for something outside of the CPU (like reading stuff off of a hard drive, or downloading a file over network), the second thread can pick up that idle time without a lot of work being done to park the one thread and pick up the new one.

When you start creating multiple threads in an application, you get to have more things being done while others are in those idle cycles. If you don’t have a lot of idle time in your application, then you won’t get much benefit of having more threads than CPU’s.

Did you make the same observation when you went from 1 to 2 threads, or just from 2 to 4 threads?

Going from 2 to 4 threads, this would be expected, because hyperthreading just allows two threads to share the same computing units. This can be more efficient when one thread stalls out – for example when it has to wait for some data to be read from memory before it can continue. Without hyperthreading, the core would be twiddling its thumbs for a few clock cycles. With hyperthreading, the other thread can take over until the stalled one can resume working.

So if your code is running without any stalling issues, hyperthreading doesn’t give you any performance benefits.

Dividing up a task so it can use multiple threads is not free. You need to do some extra work to split it apart and collect the results.
If the matrix you multiply is small, the overhead can be greater than the time you gain by doing the calculation in parallel. So splitting it up can increase the total time.

Try it with large matrices and you will likely see a decrease in time.

Hyperthreading is sharing the same core between two threads. You can get a bit of extra performance if there is a lot of waiting for data from memory. So going from 2 to 4 threads likely only increases performance if the matrix size is larger than the CPU cache size.

You’re paying a cost for allocating the kernel resources and scheduling the threads. If you’re naive about your implementation, threads can end up costing you more than they save, assuming there’s any concurrency to be had in the first place.

Matrix multiplication can be parallelized, but there are lower-level implementation details you’re going to want to tackle first. Maybe the Intel compiler ($$$) will generate SSE instructions for you, but typically you’ll have to use intrinsics yourself to explicitly generate those instructions. This is instruction-level parallelization. After you do that, then you’ll want to consider thread-level parallelization to utilize all your cores.

Don’t be too impressed with hyperthreading – it’s a technology that uses underutilized pipeline paths to squeeze in a few extra work units, but by then you’re going to have already saturated your SSE pipelines, so you won’t see any benefit from hyperthreading anyway. Thread-level parallelism has to pay for the cost of the threads, amortizing them over time. This means it’s only worthwhile if your datasets are large enough; otherwise, you could have computed your results faster than it takes to get your threads set up.

Creating threads is really computationally expensive. The OS does a lot of work that you’re not seeing directly. This is why most multi-threaded applications use something called a thread pool, where worker threads are already created and just waiting to be assigned work.

In your case you should create a thread pool on program startup and then feed it the matrix calculations as jobs to do. And it’s not going to be particularly useful unless you plan on doing a lot of calculations.

Another thing to keep in mind is that your dual-core, hyperthreaded CPU has quirks that impact performance. Dual core means it has 2 physical cores, so the CPU can actually do 2 things concurrently. Hyperthreading semi-doubles that and lets you do 4 things at once. It only semi-doubles it because the two hardware threads on a hyperthreaded core act like 2 cores but share some parts. One of those parts is the ALU, which does the actual computation. In your case you’re doing arithmetic on data from RAM, so it’s very likely that hyperthreading actually slows you down.