Stating the Obvious: Error Handling

I’ve been at this software nonsense for over a decade now. While the idea that I have anything all figured out is hilarious, I hope I’ve at least noticed some useful patterns—things that work and things that don’t—across the projects, peers, languages, and systems I’ve seen.

Let’s try to distill some basics that I wish I’d known when I started. In what’s hopefully the first in a series of discussions…

Let’s talk about ~~error handling~~ assumptions.

All software, on a fundamental level, works on assumptions. Take something as simple as:

int a = b + c;

What happens if b + c overflows because it’s larger than the biggest value int can hold? How is that behavior defined? What do you expect to happen?¹

Your answers to those questions are beside the point. What matters is that they form your working assumptions about how your code behaves. And when those assumptions are broken, things go sideways, fast. If you’re lucky, your program will crash. If you’re unlucky, your broken assumptions will silently creep into other code, spreading madness wherever they go. Once upon a time, the software that secures the whole Internet broke because someone assumed an integer wouldn’t overflow. Another time, Steam deleted all the files on your hard drive because someone assumed a variable was set.

How can we avoid such catastrophes? Simple:

1. Express your assumptions! (As code, not comments, please.)
2. When your assumptions don’t hold, do something!

We’ve just described error handling in two easy steps. That’s all it is—expressing what you expect to happen, and doing something when those expectations don’t pan out.
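
Here’s a minimal sketch of both steps applied to the addition from earlier. The helper name checked_add is made up for illustration, and returning false is just one possible "something". The assumption (the sum fits in an int) is written as code, and the caller hears about it when it doesn’t hold.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: adds b and c, reporting failure instead of overflowing. */
bool checked_add(int b, int c, int *sum) {
    /* Step 1: express the assumption that the result fits in an int.
       The comparisons are arranged so that they cannot overflow themselves. */
    if ((c > 0 && b > INT_MAX - c) || (c < 0 && b < INT_MIN - c)) {
        /* Step 2: the assumption is broken, so tell the caller. */
        return false;
    }
    *sum = b + c;
    return true;
}
```

A caller would write something like `if (!checked_add(b, c, &a)) { /* handle it */ }` in place of the bare `int a = b + c;`.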

Complaining is not actually doing something.

You’ll often see stuff like this from people who don’t know better:

void doFoo() {
    if ( /* preconditions are not met */ ) {
        printf("Can't foo");
        return;
    }
    // ...Do the actual work...
}

Logging and printing are useful debugging tools, but they’re no substitute for actually handling the problem. You can’t assume every developer who calls doFoo() will see your message, and even if you could, it’s much better to give the calling code an opportunity to do something about it.
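
One way to give the caller that opportunity, sketched with an error code. Everything here is an assumption for the sake of the example: the foo_status enum, the int parameter, and the "negative input" precondition are stand-ins, not anything from a real API.

```c
#include <stdio.h>

/* Hypothetical status codes; the names are illustrative only. */
typedef enum { FOO_OK, FOO_BAD_INPUT } foo_status;

/* Returns a status instead of printing a complaint and bailing out. */
foo_status do_foo(int input) {
    if (input < 0) {           /* stand-in for "preconditions are not met" */
        return FOO_BAD_INPUT;  /* the caller decides what happens next */
    }
    /* ...do the actual work... */
    return FOO_OK;
}

/* A caller that reacts to the failure: it logs, then retries with a fallback. */
foo_status foo_with_fallback(int input, int fallback) {
    foo_status status = do_foo(input);
    if (status != FOO_OK) {
        fprintf(stderr, "do_foo(%d) failed, retrying with %d\n", input, fallback);
        status = do_foo(fallback);
    }
    return status;
}
```

The log message is still there, but it now accompanies a decision made by code that has enough context to make one.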

Knowing when to quit

What to do really depends on the situation. Some errors are recoverable—you would be angry if Photoshop or Word crashed when it couldn’t open the file you selected. Ideally, you just get an error message that tells you what went wrong!

But on the other hand, some errors are unrecoverable. They express assumptions so foundational that there’s nothing sane to do if they break. If you try to get the 21st item in a ten-item list, or get a null reference when you expect valid data, give up. There’s a bug in your code and it clearly isn’t working the way you thought it did. Stop before things get even worse!

Errors are expressed in all sorts of ways—recoverable ones can be handled with exceptions, error codes, algebraic data types, and more.² Unrecoverable errors are usually tested with assertions that kill the program when they turn out to be false.³ Specifics vary from language to language, or even project to project, but they’re much less important than this fundamental distinction: what’s recoverable, and what isn’t?
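
To make the distinction concrete, here’s a rough sketch in C. The REQUIRE macro and both function names are invented for this example. A file that can’t be opened is an ordinary event, so it’s reported to the caller. An out-of-range index means the calling code is wrong, so the program stops.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical always-on assertion: say where the assumption broke, then quit. */
#define REQUIRE(cond)                                              \
    do {                                                           \
        if (!(cond)) {                                             \
            fprintf(stderr, "%s:%d: assumption failed: %s\n",      \
                    __FILE__, __LINE__, #cond);                    \
            abort();                                               \
        }                                                          \
    } while (0)

/* Recoverable: returns the file size in bytes, or -1 if the file
   cannot be opened or measured. The caller can show an error message. */
long file_size(const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        return -1;
    }
    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0) {
        size = ftell(f);   /* ftell itself returns -1 on failure */
    }
    fclose(f);
    return size;
}

/* Unrecoverable: a bad pointer or index here is a bug in the caller. */
int item_at(const int *items, size_t count, size_t index) {
    REQUIRE(items != NULL);
    REQUIRE(index < count);
    return items[index];
}
```

Asking item_at for the 21st item of a ten-item list aborts with a file and line number, rather than handing back whatever happens to sit past the end of the array.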

Lies real people believe about unrecoverable errors

You’ll hear programmers say the damnedest things:

“I don’t want assertions in my code! Quitting when something goes wrong makes my code more fragile!”

Like we’ve discussed, software is always a tightrope act, where every part of your program assumes every other part has behaved in ways you expect. Refusing to assert your assumptions doesn’t make them go away; it just makes bugs harder to track down (and possibly much more severe) when those assumptions don’t hold.

“I don’t want assertions in optimized builds because it slows my program down.”

Unless your code is in a tight loop or other hot path, you don’t need to worry about a couple of branches that check its basic assumptions.⁴

If you are in a hot path where every instruction counts, and you can prove it with profilers and benchmarks, it might make sense to have some debug_assert() that users can disable in release builds. But using that sparingly is a far cry from, “shut off all the assertions for maximum speed!”
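
A sketch of what that split might look like in C. The DEBUG_ASSERT macro and sum_samples are hypothetical names, and I’m assuming the usual convention that release builds define NDEBUG (the same switch the standard assert() macro uses). The cheap, once-per-call check stays on in every build, and only the per-element check inside the loop compiles away.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical debug-only assertion: expands to nothing when NDEBUG is defined. */
#ifdef NDEBUG
#define DEBUG_ASSERT(cond) ((void)0)
#else
#define DEBUG_ASSERT(cond)                                              \
    do {                                                                \
        if (!(cond)) {                                                  \
            fprintf(stderr, "%s:%d: debug assertion failed: %s\n",      \
                    __FILE__, __LINE__, #cond);                         \
            abort();                                                    \
        }                                                               \
    } while (0)
#endif

/* Sums non-negative samples; imagine a profiler flagged this loop as hot. */
long sum_samples(const int *samples, size_t count) {
    /* Always-on check: one branch per call, so it stays in release builds. */
    if (samples == NULL && count > 0) {
        fprintf(stderr, "sum_samples: NULL buffer with nonzero count\n");
        abort();
    }
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        DEBUG_ASSERT(samples[i] >= 0);   /* per-element check, debug builds only */
        total += samples[i];
    }
    return total;
}
```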

“Critical software shouldn’t have assertions.”

Put another way, “critical software shouldn’t have unrecoverable errors”. This is mostly wishful thinking: we dream of a world where code that flies airplanes or controls medical devices doesn’t have bugs. But we don’t eliminate bugs by burying our heads in the sand and refusing to check for them; we eliminate bugs with careful design and testing. In many critical systems, a failed assertion restarts the program, since it’s better to get back to a known good state than to careen off the deep end. After all, integer overflow has blown up rockets.

As always, it depends (on things that depend).

What counts as an unrecoverable error depends on your application. We used “being unable to read a file” as an example of something you can usually recover from, but what if that file you’re trying to read tells the operating system about the hardware connected to the CPU? It would be a really bad idea to keep trying to boot the OS and drivers when you can’t even tell what hardware you’re booting!

On the other hand, sometimes the OS might want to continue in the face of what any other software would consider unrecoverable, because crashing the machine would mean there’s no way to report the problem. (Linus Torvalds and some Rust developers debated this recently in an interesting, if abrasive, exchange.)

Like most things in life, there’s no one correct answer, and good judgment comes with experience. Use your brain and don’t be an absolutist. (Your peers will thank you.)

¹ Terrifyingly, it depends. Many languages will silently roll the value over, following the rules of two’s complement addition, since that’s how modern CPUs work anyways. In C or C++, signed integer overflow is undefined behavior, so the resulting program could wipe your hard drive and steal your cat, at least as far as the language standard is concerned. In Rust, overflow panics in debug mode and happens silently in release mode, unless you replace your nice + operator with methods like wrapping_add() anywhere it might happen!

² Arguing which of these is better is for another day. I’ll briefly say here that if you’re going to be doing this over and over in every program you write, being expressive and terse seem like important qualities.

³ C and C++ programmers shouldn’t narrowly take assertions here to mean the assert() macro from <assert.h>, but instead any sort of machinery that records an unrecoverable error and immediately exits the program. In C++ this might involve throwing a special exception type and catching it in main() so that you unwind the stack before you go. (Compare this to Python’s AssertionError or Rust’s panic!(), which do the same thing.)

⁴ We can even make these branches faster! Many compilers have somewhat-misnamed “likely” and “unlikely” intrinsics, which we can use to optimize for the case that our assertions are true.