Note: Some readers have suggested that I am conflating concurrency and parallelism. As Rob Pike discusses here, concurrency allows you to structure a problem so that it can be solved in parallel. The former is required for the latter. Here I argue that only by writing concurrent software can we fully utilize modern parallel hardware.
A short while back, Linus Torvalds wrote a bit about parallelism, and it showed up on Reddit a week or two ago. As is usually the case when it comes to Linus, he said some things that were a bit hyperbolic and got a strong reaction from the rest of the internet. In other news, water is wet. But parallelism and concurrency are vital to modern computing (as we’ll see shortly). This deserves some more attention.
Why does parallelism matter?
The discussion about parallelism has a lot to do with modern computer hardware and the challenges it faces. Unlike everyone else who had to walk uphill in snow, both ways, the life of the hardware designer was easier in the old days than it is today. You arranged transistors into gates, arranged those gates into logic circuits and memory, and those circuits could generally be trusted to work. Today we are not so lucky. The push to make processors faster and faster is accomplished in part by making their components smaller and smaller. The problem is that if things get much smaller than they are today, everything goes to hell.
Chips are made by etching transistors into silicon using UV light, then using layers of metal as “wires” to connect the transistors into gates and circuits. It looks something like this (image courtesy of Wikipedia):
You can see the transistors on the bottom, with the metal “wiring” shown in orange. The big bulb on top is a solder dot to connect the chip to other chips.
The processor I have in my desktop is built out of transistors that are 22 nanometers wide. (For comparison, a human hair is about 80,000–100,000 nm wide.) At this ludicrously small size, weird things start happening.
- The thinnest part of the transistor, the oxide, gets really thin. Current oxides are about five atoms thick. If you make them any thinner, quantum mechanics starts to kick in, electrons start magically tunneling through the oxide, and your transistor stops working like a switch. This breaks the computer.
- The metal tracks that connect the transistors get so close together that they start acting like parallel plate capacitors. This “parasitic capacitance” means that it takes more power just to get a signal from Point A to Point B on the chip, because the wires themselves soak up some of the charge.
- What were previously tiny, acceptable errors in manufacturing become showstoppers.
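The parallel-plate analogy can be made concrete with the standard physics formula (my addition, not something from the original post): for two conductors of overlapping area $A$ separated by distance $d$ in a material with permittivity $\varepsilon$,

$$C = \frac{\varepsilon A}{d}$$

As tracks are packed closer together, $d$ shrinks, so the parasitic capacitance $C$ grows, and with it the charge $Q = CV$ that must be pushed onto the wire just to swing it to a given voltage.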
These challenges (and many more) mean transistors just can’t get much smaller, and shrinking them further comes with considerable heat and power problems. So, instead of making cores faster, designers started putting several of them onto a single CPU. This has its own set of design challenges, but at least it is feasible.
The free lunch is over.
Unfortunately, this makes the programmer’s job harder. In the past, we could safely assume that newer processors would have faster clock speeds, which meant they could run our programs faster (or our programs could do more things) without any special effort on our part to improve performance. As Herb Sutter noted all the way back in 2005, this is no longer the case. An individual core of a next-gen processor probably won’t be much faster than current ones.
In order to take full advantage of modern hardware, we need to break problems up into pieces that can run independently. We need concurrency. This is what Linus glosses over (or even dismisses) when he says:
The whole “let’s parallelize” thing is a huge waste of everybody’s time. There’s this huge body of “knowledge” that parallel is somehow more efficient, and that whole huge body is pure and utter garbage.
If I do all of my work together in a single process (or thread, etc.) on an eight-core processor, I can only utilize, at most, 12.5% of its power. If I can break my work into eight pieces that can run concurrently, each of those pieces can be divvied out to a separate core and all run in parallel. We’re back to where we were in the early 2000s — I can use the whole CPU again!
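To make that concrete, here is a minimal sketch in Go (my illustration, not anything from Linus or the original post): summing squares over a range, split into one independent chunk per worker so that each chunk can be scheduled on its own core.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares adds up i*i for i in [0, n), splitting the range into one
// independent chunk per worker so the chunks can run in parallel.
func sumSquares(n, workers int) uint64 {
	partial := make([]uint64, workers) // one result slot per worker: no locks needed
	var wg sync.WaitGroup
	chunk := n / workers
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			lo, hi := w*chunk, (w+1)*chunk
			if w == workers-1 {
				hi = n // the last worker picks up any remainder
			}
			var s uint64
			for i := lo; i < hi; i++ {
				s += uint64(i) * uint64(i)
			}
			partial[w] = s
		}(w)
	}
	wg.Wait()

	var total uint64
	for _, s := range partial {
		total += s
	}
	return total
}

func main() {
	// With eight workers on an eight-core machine, the chunks run in parallel.
	fmt.Println(sumSquares(1_000_000, 8))
}
```

The result is identical no matter how many workers you use; only the wall-clock time changes, and only if the cores are actually there.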
Now, very few programs you run use every available clock cycle on every available core of your processor. Why? Your program isn’t the only thing running on the computer, and it has to share:

- The processor cores themselves
- Memory and the cache of memory on the processor
- I/O devices such as the hard drive and network card, which are much slower than the processor even if your program is the only one using them
So if your program is doing relatively little computation and spends most of its time interacting with the hard drive or network, parallelism isn’t as big of a deal, and it is probably not worth the trouble of breaking your problem up into concurrent pieces. To be fair, most things that most people do with their computers probably fall into this category. But if you are doing some serious number crunching and the CPU is the bottleneck, parallelism is a must.
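A back-of-the-envelope estimate using Amdahl’s law makes the same point with numbers (the law is standard; the fractions below are made up for illustration and are not from the post):

```go
package main

import "fmt"

// amdahl returns the overall speedup predicted by Amdahl's law for a program
// in which a fraction p of the work can be parallelized across n cores.
func amdahl(p float64, n int) float64 {
	return 1.0 / ((1.0 - p) + p/float64(n))
}

func main() {
	// Hypothetical I/O-bound program: only 10% of its time is parallelizable compute.
	fmt.Printf("I/O-bound, 8 cores: %.2fx\n", amdahl(0.10, 8)) // roughly 1.1x
	// Hypothetical CPU-bound cruncher: 95% of its time is parallelizable.
	fmt.Printf("CPU-bound, 8 cores: %.2fx\n", amdahl(0.95, 8)) // roughly 5.9x
}
```

If 90% of your program’s time is spent waiting on the disk or the network, eight cores buy you almost nothing; if 95% of it is computation, they buy you nearly a 6x speedup.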
Don’t parallelize for parallelization’s sake.
It is also silly to bend over backwards to break a problem (or a processor) up into a million pieces, which Linus notes with
…crazies talking about scaling to hundreds of cores are just that - crazy… End users are fine with roughly on the order of four cores, and you can’t fit any more anyway without using too much energy to be practical in that space. And nobody sane would make the cores smaller and weaker in order to fit more of them - the only reason to make them smaller and weaker is because you want to go even further down in power use, so you’d still not have lots of those weak cores.
But his claim that
The whole ‘parallel computing is the future’ is a bunch of crock.
is at best a sweeping generalization and at worst a bunch of crock.
Until there is a fundamental change in the way we build processors, cores are going to get more numerous, not faster, and concurrency is the only way to leverage this simple fact.