Why can’t we just pack more and more ALUs in a CPU to increase processing throughput instead of increasing clock speeds? Wouldn’t the gain be just as significant?

In: Engineering

6 Answers

Anonymous 0 Comments

Same reason why 9 pregnant women can’t make a baby in 1 month.

Say you want to tell a CPU to calculate this expression: **(2 * 3) + (5 * 6)**

That’s 3 operations, so if you have 1 ALU this would take 3 steps, but if you have 2 ALUs you can compute **2 * 3** and **5 * 6** at the same time, and finish it in 2 steps.

BUT, if you increased ALU count to 3, you’d still need 2 steps to finish the calculation, because the 3rd step requires the results of the first two.
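The step counts above can be sketched with a toy scheduler. This is purely illustrative (the op names and the one-op-per-ALU-per-step model are assumptions, not how real hardware is described), but it shows why the third ALU buys nothing here: the final add cannot start until both multiplies finish.

```python
# Toy sketch: count how many steps (2 * 3) + (5 * 6) takes with a given
# number of ALUs, assuming each operation takes one step and can only
# start once all of its inputs are ready.

def steps_needed(num_alus):
    # Each op maps to the set of ops it depends on.
    ops = {
        "m1": set(),          # 2 * 3
        "m2": set(),          # 5 * 6
        "add": {"m1", "m2"},  # m1 + m2 (must wait for both multiplies)
    }
    done, steps = set(), 0
    while len(done) < len(ops):
        # Ops whose dependencies are all finished may run this step.
        ready = [op for op, deps in ops.items()
                 if op not in done and deps <= done]
        done.update(ready[:num_alus])  # at most one op per ALU per step
        steps += 1
    return steps

print(steps_needed(1))  # 3 steps: m1, m2, add
print(steps_needed(2))  # 2 steps: {m1, m2} together, then add
print(steps_needed(3))  # still 2 steps: the add has to wait anyway
```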

Alternatively, if you had something that can be computed in parallel, then you’d write multithreaded code to take advantage of multiple cores, or GPUs which have even more cores.
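As a minimal sketch of that "write multithreaded code" idea: split an embarrassingly parallel job into independent chunks and hand each chunk to a worker. (The chunking scheme here is an arbitrary choice for illustration; in CPython, truly CPU-bound work would need `ProcessPoolExecutor` to sidestep the GIL, but the structure is the same.)

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(chunk):
    # Each chunk is independent: no worker needs another worker's result.
    return sum(x * x for x in chunk)

numbers = list(range(1000))
chunks = [numbers[i::4] for i in range(4)]  # 4 independent slices

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum_of_squares, chunks))

total = sum(partials)
print(total == sum(x * x for x in numbers))  # True
```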

Anonymous 0 Comments

It’s a balance between area/power requirements and actual boost in performance.

And in a way that’s what a GPU is, except they’re called SIMD units rather than ALUs; it’s the same idea.

The difference between a CPU and a GPU, though, is what they are meant to run. A CPU is meant to execute programs that are largely serial in nature (e.g. compiling a program), whereas a GPU is meant to execute programs that are highly parallel, like evaluating shader expressions over millions of pixels.

Anonymous 0 Comments

You can, to a point. That’s why processors now have more cores than they used to. But there are limits. For starters, there’s a speed cost in coordinating between multiple cores.

The other half is that many programs aren’t written to take good advantage of multiple cores. Unless a program is specifically written to exploit them, having access to extra cores won’t help: without such code, each program will only run on one core at a time.

I can go into a lot more detail, just ask.

Anonymous 0 Comments

That’s what manufacturers *are* doing, by and large. That’s why every fabrication generation with a smaller process is such a big deal: they can now pack more transistors into the same size package. CPUs haven’t really gotten much faster in clock speed in a long time. 4 GHz processors hit the consumer market 8 years ago, and that’s still approximately the same top-end speed today.

Anonymous 0 Comments

A 5 year old doesn’t know what an ALU is, but here’s an ELI5 explanation: I want you to color this picture, and we are going to time how long it takes. Now I want you to color this picture, but use both hands. Why did you not get it done in half the time? Sometimes the two spots you want to color are close together, and your hands are trying to take up the same space. Sometimes they are far apart, and you can’t look at both of them at the same time. Sometimes all the parts that are left need to be colored blue, and you only have one blue crayon. All the time, it’s very challenging for your brain to move two crayons at the same time and stay inside the lines.

A more technical explanation: It sounds like you are familiar with a basic CPU model with one register file, one arithmetic-logic unit, one load-store unit, and some miscellaneous parts. No modern processor that I am familiar with is like that — they all have redundant parts to support parallelization. But however many ALUs a processor has, they could always have more, right? So let’s talk about why we get to diminishing returns rather quickly.

Another commenter already explained the concept of data dependence, but it is important, so I’m going to do so also. If your first instruction is “c = a + b” and your second instruction is “d = c - 4”, you can’t start working on the second instruction until you know the answer to the first instruction. Now you are only using one of your ALUs at a time. Maybe your instructions can be re-ordered so that some later instruction “e = f + g” can be done at the same time as “c = a + b”. In fact, modern processors all do this. But the hardware necessary to logically determine which instructions can be started when and how to forward results between instructions is very big and complicated, and big and complicated electronics are also (relatively) slow, power-hungry, and heat-producing.
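The re-ordering decision can be sketched as a toy: given the three instructions above and their register dependencies, figure out which can be issued together. Everything here (the two-ALU limit, the register names, the greedy issue policy) is an assumption for illustration; real out-of-order logic is enormously more involved.

```python
# Toy model of out-of-order issue: each instruction lists the registers
# it reads; an instruction can issue once all of its inputs exist.

instrs = [
    ("c", {"a", "b"}),  # c = a + b
    ("d", {"c"}),       # d = c - 4   (depends on c)
    ("e", {"f", "g"}),  # e = f + g   (independent of c and d)
]

ready_regs = {"a", "b", "f", "g"}   # values available at the start
issued, cycles = set(), []
while len(issued) < len(instrs):
    # Issue every instruction whose inputs are already computed,
    # at most 2 per cycle (pretend we have 2 ALUs).
    group = [dst for dst, srcs in instrs
             if dst not in issued and srcs <= ready_regs][:2]
    cycles.append(group)
    issued.update(group)
    ready_regs.update(group)

print(cycles)  # [['c', 'e'], ['d']]: e fills the slot d cannot use
```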

Even if we didn’t have the headache of trying to figure out how to utilize all of our hardware as much as possible without getting invalid results, just the fact that the processor has more parts is already a problem. More total parts means a longer distance between the two furthest-apart parts, which means that it takes longer for a signal to get from one to the other, which forces your clock to slow down. As mentioned earlier, more parts means more energy consumed, which means more heat produced. And heat is a big problem, because you don’t want your CPU to melt.

So the trend over the last decade-plus has been neither for higher clock speeds nor for more complex processors, but instead for a larger number of relatively slow, relatively simple processors all working on independent things. Going back to the ELI5 explanation, you are giving another page and another box of crayons to a friend. Since you are working on independent problems using independent tools, you now have double the coloring speed, whereas trying to use both hands yourself probably made you even slower than one hand alone.

Anonymous 0 Comments

That is done, and it plays a large role in the recent performance increases.

For example, since the Pentium 1 (earlier, if you count exotic processors), desktop CPUs have been able to execute more than one instruction per cycle (normally you need more than one ALU for this). That is not as effective as a 2x speed increase would be; it has some limitations. For example, if you had `a + b + c`, then at 2x speed you could compute it in half the time, but with 2 simultaneous instructions you can’t, because the second operation needs the result of the first operation, so the second instruction has to wait for that result.

For another example, you can make some special new instructions that do a lot of work in one go, for example “add these 4 numbers to these other 4 numbers”. You can use multiple ALUs to make operations like that fast. This is what SIMD is. It has even more limitations in addition to only working for independent operations. E.g. maybe you can do 4 additions, or 4 multiplications, but you can’t mix and match, and if you only needed 3 additions, the 4th still happens but is wasted. As an analogy, it’s like having a bunch of clones of you to help you out, and they exactly copy your movements: you can’t use them to do different chores at the same time, and they don’t take out the trash any faster than you would on your own, but they would be great for raking the lawn.
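The lockstep-lanes idea can be sketched as a toy 4-wide add. The lane width and padding scheme here are illustrative assumptions, but they show both limitations from the answer: every lane performs the same operation, and a lane you don’t need still runs, with its result thrown away.

```python
# Toy model of a 4-wide SIMD add: all lanes do the same operation in
# lockstep, whether or not every lane's result is wanted.

WIDTH = 4

def simd_add(a, b):
    # One instruction, WIDTH element-wise additions at once.
    assert len(a) == len(b) == WIDTH
    return [a[i] + b[i] for i in range(WIDTH)]

# We only need 3 additions, but the hardware is 4 wide: pad with a
# dummy lane whose result is simply discarded.
xs = [1, 2, 3] + [0]
ys = [10, 20, 30] + [0]
result = simd_add(xs, ys)[:3]  # drop the wasted 4th lane
print(result)  # [11, 22, 33]
```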