Exceptions should be exceptional

I spend a lot of time looking at systems that are not working very well. Sometimes they are unstable. Sometimes they are not scaling well. Very often, these applications will be throwing many handled exceptions and sometimes a few unhandled ones. I don’t get to look at many working systems because very few people call support to say how wonderfully well their systems are working but I suspect that they are not throwing many exceptions.

There is a lot to be said for an exception mechanism. It is a very easy-to-write way to get from a low-level routine to a much higher-level routine, and I have seen people use it as a general control-flow mechanism. Please note that I said "easy to write" rather than "quick and easy". There isn’t anything very quick about exceptions.

When I see that a process throws a lot of exceptions, I always ask the person that I am working with what causes the exceptions. Responses vary but have included:

1. I don’t know but they are all first chance so we can ignore them

2. That is normal. Don’t worry about it.

3. That is a change notification system. We throw an exception when a record has changed

4. Oh! Well, class A may have a reference to class B. If it doesn’t then we get an exception when we try to access it but that is OK because it is perfectly valid for class A not to reference class B

Um, sorry but no. If there are first chance exceptions then you are going through error handling code. How well did you test that code? Typically, people test the expected code path very well and a lot of error paths are never, ever run. Do you want to run untested code on your live server? I didn’t think that you did. Also, exceptions which are not expected are proof that your flow of execution is not what you intended. In other words, your application isn’t working right. Points 1 and 2 do not have a lot of validity in my opinion.

What about exceptions which are supposed to happen? I am not very keen on those either, for two reasons: they hide real problems and they are slow. If you expect and handle a lot of exceptions then you probably are not going to notice an exception that means a real problem. An empty try…catch {} is not just sloppy programming. It should really be something that the compiler warns you about. This might seem like an extreme position so I would like to suggest a scenario or two.

In the first scenario, we are writing to a raw pointer which we got from a client of our DLL. We expect it to be a pointer to an employee record but it might rarely be null if we need to add an employee. If it is null then we catch the exception, and so we save a comparison on the pointer, which is clearly efficient programming.

There are two problems with this. The first is that there is an assumption which may or may not be true: that the only values ever passed will be null or a pointer to a valid employee record. What happens if it is neither? The answer is of course that it will AV (access violation) or it will overwrite something that it shouldn’t. So, this optimisation has opened us up to a number of bugs. If we get a pointer into invalid memory and we catch the exception then we assume that it was a null, create a new employee record that we don’t need and so corrupt the database. If we get a pointer to some other address inside the process then we trash someone else’s state and they will later crash.

The second problem is that this is a high price to pay for saving a comparison, especially when the comparison was virtually free. If we are about to dereference a pointer then it will almost certainly be in a register. Testing a register for 0 is incredibly cheap, so cheap that it is normally faster to test and branch than to do a long jump. It is also very easy for the processor to pipeline so that it can be ready for either case. Exception handling is many times slower. As a result, it is almost always faster to check the data rather than just catch the exception. It is also safer.
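To make the first scenario concrete, here is a minimal C++ sketch of the test-and-branch version. The names Employee, EmployeeStore, SaveResult and save_employee are all invented for illustration; the text does not say what the real DLL interface looked like, and the "database" here is just an in-memory vector.

```cpp
#include <string>
#include <vector>

// Hypothetical record type; the real layout is not given in the text.
struct Employee {
    int id;
    std::string name;
};

// Hypothetical in-memory stand-in for the database.
struct EmployeeStore {
    std::vector<Employee> rows;
};

enum class SaveResult { Updated, Created };

// Explicit test-and-branch: null means "add a new employee",
// any other value is treated as an existing record to update.
SaveResult save_employee(EmployeeStore& store, Employee* existing,
                         const std::string& name) {
    if (existing == nullptr) {  // a cheap compare on a value already in a register
        int next_id = static_cast<int>(store.rows.size()) + 1;
        store.rows.push_back(Employee{next_id, name});
        return SaveResult::Created;
    }
    existing->name = name;  // no exception handler involved on either path
    return SaveResult::Updated;
}
```

Note that a null check cannot tell a valid pointer from a garbage one. What it does is stop the normal "add an employee" case from depending on a fault, so that a fault, if one ever happens, means a real bug rather than business as usual.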

The second scenario is a real-world example. There was a try block in managed code that did a few things. The first thing was to access an update record which may or may not exist. If this failed then the code skipped over the database update code and went to the catch. The catch block assumed that we had never got as far as opening the database connection before we had an exception. Actually, the OLEDB provider for the database threw an exception while it was holding a critical section, but the caller "handled" the exception and carried on, leaking the critical section. That thread ran on, apparently more or less fine. A lot of other threads waited on the critical section. Sooner or later, the same thread would go through the same code and succeed, which freed up the critical section, and all the other threads were free to go again. As you might imagine, the system limped rather than ran.
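The original problem was in managed code and inside a provider that the caller did not own, so the following is only an analogy. It is a small C++ sketch of the usual way to make sure a lock cannot be leaked when an exception passes through the code that holds it. TrackedMutex, update_record and handled_failure_releases_lock are hypothetical names, and the held() flag exists only so that this single-threaded demonstration can observe the lock state.

```cpp
#include <mutex>
#include <stdexcept>

// Hypothetical wrapper around std::mutex that remembers whether it is held.
// The flag is for this single-threaded demonstration only.
class TrackedMutex {
public:
    void lock() { m_.lock(); held_ = true; }
    void unlock() { held_ = false; m_.unlock(); }
    bool held() const { return held_; }
private:
    std::mutex m_;
    bool held_ = false;
};

// Hypothetical update routine that fails while it is holding the lock.
void update_record(TrackedMutex& lock, bool record_exists) {
    std::lock_guard<TrackedMutex> guard(lock);  // unlocks on every exit path
    if (!record_exists) {
        throw std::runtime_error("update record not found");
    }
    // ... the database update would go here ...
}

// The caller "handles" the exception and carries on.
// Returns true if the lock was released despite the failure.
bool handled_failure_releases_lock(TrackedMutex& lock) {
    try {
        update_record(lock, false);
    } catch (const std::runtime_error&) {
        // By the time we get here the guard's destructor has already run.
    }
    return !lock.held();
}
```

The guard only protects code that you wrote. If a third-party component leaks a lock when it throws, the caller cannot repair that from a catch block, which is one more reason not to provoke the exception in the first place.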

In both these cases, real errors were masked by bad exception handling.

What about performance? That isn’t good either. We have to go down to a low level to see why. Let us go to the heart of the CPU. The fetch-execute cycle is pumping away. Your code is more or less sequential. Everything that you will be doing for a while is in level 1 cache and the pipeline is full of instructions and second guessing for all it is worth. The code throws an exception. So much for the pipeline. The exception handler might not be in L1 cache. If it is several levels up the call stack then it probably won’t be.

Using figures shamelessly stolen from Raymond Chen, a reference to memory in L1 cache takes about 2-3 times longer than an average instruction. Level 2 cache will take about the same as 6-10 instructions. Main memory is 25-50 instructions. If it is not in the working set and has to come off disk? That is 10,000,000 instructions. Is it just me or is that expensive?

However, it is not actually quite as rosy as all that. When an exception is thrown, the kernel gets the first look, so we have a switch to ring 0 (not free), and if the processor is busy with your application then the exception handler won’t be in cache. That will mean a lot of page faults. On XP SP2 and Server 2003 SP1, there are a lot of additional security checks that happen when you get an exception and they take a while because they have to access some external lists such as the DEP exception list. You can look at KB 875352 for details if you like. After it has done this, control is handed back to your user mode application (a switch to ring 3). Will the code page be in cache? Almost certainly not after all that.
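If you would like to see the difference on your own machine rather than take my word for it, a rough measurement can be sketched like this. Everything here is invented for illustration: the "lookup" simply treats odd keys as missing, and time_us is a naive wall-clock timer, not a proper benchmark harness. It measures C++ exceptions, and the numbers will vary a great deal with platform, compiler and settings, so no results are quoted.

```cpp
#include <chrono>
#include <cstdint>
#include <stdexcept>

// Hypothetical lookup that reports "not found" through its return value.
bool find_with_code(int key, int& out) {
    if (key % 2 != 0) return false;  // odd keys are "missing" in this sketch
    out = key / 2;
    return true;
}

// The same lookup, reporting "not found" by throwing.
int find_with_throw(int key) {
    if (key % 2 != 0) throw std::out_of_range("missing key");
    return key / 2;
}

// Counts missing keys in [0, n) using the return-code version.
int count_missing_with_codes(int n) {
    int missing = 0;
    for (int key = 0; key < n; ++key) {
        int value = 0;
        if (!find_with_code(key, value)) ++missing;
    }
    return missing;
}

// Counts missing keys in [0, n) using the throwing version.
int count_missing_with_throws(int n) {
    int missing = 0;
    for (int key = 0; key < n; ++key) {
        try {
            (void)find_with_throw(key);
        } catch (const std::out_of_range&) {
            ++missing;
        }
    }
    return missing;
}

// Wall-clock time for one call of f(n), in microseconds.
template <typename F>
std::int64_t time_us(F f, int n) {
    auto start = std::chrono::steady_clock::now();
    volatile int sink = f(n);  // keep the call from being optimised away
    (void)sink;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling time_us(count_missing_with_codes, n) and time_us(count_missing_with_throws, n) with the same n gives two figures to compare. Both functions do the same work and return the same count; the only difference is how "missing" is reported.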

So, an exception requires very little code in an application but can result in many thousands of lines of code being run and many, many page faults. It is hard to imagine a worse way of handling a usual condition. Accordingly, exceptions should be exceptional.
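Going back to the fourth response in the list above, where class A may or may not reference class B: if the absence of B is perfectly valid then it is a usual condition, and it can be asked about directly. A minimal sketch follows, assuming a non-owning pointer is an acceptable way to model the optional reference. The members has_b and b_label_or are hypothetical, since the text says nothing about what A and B actually contain.

```cpp
#include <string>

// Hypothetical stand-ins for the "class A" and "class B" in the text.
struct B {
    std::string label;
};

class A {
public:
    A() = default;                    // an A without a B is perfectly valid
    explicit A(B* b) : b_(b) {}

    // Ask first instead of finding out via an exception.
    bool has_b() const { return b_ != nullptr; }

    // Returns B's label, or the fallback when there is no B.
    std::string b_label_or(const std::string& fallback) const {
        return b_ ? b_->label : fallback;
    }

private:
    B* b_ = nullptr;  // optional, non-owning reference to a B
};
```

With this shape, an exception on the access path can only mean that something is actually wrong, which is exactly what you want an exception to mean.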

Being intelligent people familiar with the very best in programming practice, naturally you won’t misuse exception handling. However, it is very likely that you will use some third party components in your system. If they are throwing a lot of exceptions then you might want to have a quick chat with them about it. Just a suggestion, of course.

Signing off

Mark