Anybody who says "I can write correct multi threaded code" probably should be saying "I don't test my multi-threaded code".

It is very difficult to write correct multi-threaded code.

One way to appreciate this is various "find-the-bug" pop quizzes that occasionally come up.

Another approach that I'll go after here is to look at the state space.

If single-threaded apps have O[n] execution paths where n = number of atomic operations,  just two threads have O[n!] paths or worse. (It's not O[n*n], or even polynomial - see below). And n is usually a very large number.

Each of those execution paths is an angle to hit your program and find a bug. Maybe you can test against O[n], but you're delusional if you think you can protect against O[n!] or worse.

15 is a small number, and 15 factorial is 1,307,674,368,000 = ~1.3 trillion. Testing a million cases is hard enough. "Trillion" may sound like an unrealistic number, but listen to some of the ship-stopper stress bugs that show up in product's end games followed by the "what are the odds that could ever have actually happened" comments, and it's not so unrealistic.

Practically, that means multi-threaded apps have so many states that you can't possibly keep track of them all in your head. And even if you do, the next guy that goes and changes a line of code just ruined your model.

Counting the execution paths

Suppose you have a function that executes sequence of atomic operations:  A0, A1, A2, A3.

I want to just compare single-threaded and multi-threaded cases, so for now, let's ignore loops, conditionals, etc. because that's common across both and so it just complicates the comparison. (This simplification actually downplays the complexity of adding additional threads). In this model, there's only 1 code-path through this sequence of N states: A0, A1, A2, A3.

Now suppose you have a second function B0, B1, B2, B3 executing on another thread.

The threads can interleave these operations in any order. The convenient case would be (A0,A1,A2, A3, ... B0,B1,B2,B3). But maybe it's something goofy like: (A0,A1,B0,A2,B1,B2,A3,B3). If the bug only occurs when the pattern (A1,B0,A2, B1) is executed, good luck finding it.

It turns out for 2 sequences of lengths N and M, there are choose(N+M, N) ways to interleave them. That's (N+M)! / ( M! N! ) execution paths.   (see quiz here ). Hence O[n!] for 2 threads.

You can see how much more complex just simple execution gets with threads. Now imagine how much more complicated threads make loops, conditionals, etc.

How many states?

Locks don't make your code safe. They just reduce the number of states and hedge your bets.  At one extreme, 1 giant lock around everything can reduce you to 1 state per thread. If you put 4 states under 1 big lock (eg, A0...A3), and you put that same lock around every other state (eg, B0...B3), then you've reduced those 4 states to 1 state, thus eliminating 3 states.

You likely don't control the number of states. An operation that appears atomic (like a memory read) may actually be multiple states. Maybe you're looking at the source-line and the compiler split it out into multiple states. Or maybe there are windows where you communicate with the file-system, or other processes - and so they can be effectively injecting states into your state space. Or maybe the hardware has multiple states (usually around memory models).  That's why there's InterlockedIncrement(&i) instead of just doing i++.