The Implications of Fixing a Corner Case Bug in a Common Code Path

The Implications of Fixing a Corner Case Bug in a Common Code Path

  • Comments 8

My name is Bogdan Mihalcea and I’m a developer in VC++ Development Team. For the last three years in the team I worked in almost all the areas of IDE (Intellisense, Project System, Resource Editor, Debugger). Recently we were engaged in fixing various bugs in our product for VS9 SP1. In multiple instances we were facing similar bugs that can be easily being described as “a crash in a corner case scenario, with a fix in a common code path which might impact everybody”.

 

Typically corner case bugs surface when there are unforeseen circumstances early in the development process, or major design changes applied to an old code base. Like any other mature code base, VC is no different and the same rules apply; occasional design changes due to new requirements or breaking changes in our dependencies, new developers joining the team, all together with constant changes due to market requirements and evolution can be a significant causing factor for these bugs.

So how do we approach this kind of bug when it surfaces? It is not easy and usually there is no clear path to follow. We use the experiences that we gained along the way while trying to minimize risk at the same time. There are always risks: to break the backwards compatibility, to degrade the performance, and yes to introduce new bugs (no developer is perfect, if that will be the case we will not have bugs to begin with). Any of these risks are important to us as they might impact millions of customers.

For a service packs usually we try to follow a very strict plan as we try to avoid all the above risks. In a recent bug that was reported to us a customer experienced a project system crash when building a project that uses short paths (i.e. DOS 8.3) in Additional Directories. Analyzing this bug we concluded the following:

-          Although a corner case it is a real crash in the product

-          The customer had a context that required a fix

-          The fix was in a common code path, with potential to affect every customer that uses the VC project system


Not going too much into the implementation details, but enough for this exercise, in order to increase the performance we used a file hash map, to prevent unnecessary File Manager Object recreation. The file path was used as key. There were 3 operations involving the map that concerned us: Create, Lookup, and Remove. The Create and Lookup were operations revolving only around the map; Remove was an operation belonging to the destructor of the File Manager Object.

 

And here was the bug, in the customer case the object was added into the map using the path “as is”, in this case was short, but when destroyed we used the path stored in the object which was processed and stored as long path. The remove operation failed and we were left with an object in the map that was already destroyed.  There is no need for a lot of imagination to see what happened, next, when we queried the map for this file path and we used the object we got in return.

 

The first iteration of the fix was to normalize the access to the map by using always lowered long path names. This was a clean and safe approach but it introduced degradation in our project performance load tests by 2ms per item, which might seem small but in reality when your projects have thousands of items it adds up easily to seconds.

 

The reason why it introduced this extra cost was the Windows’ API we used to get the long path, which hits the disk in all cases. So even if we have a short path and we convert it to long, or we try to convert an already long path to a long path we pay the same cost, we hit the disk.  Most customers have their projects using always long paths, so with this fix we were addressing a small number of the customers’ cases, and we introduced degradation of the performance on all build operations.  In order to validate this fix we had to run all our tests for build system including the performance tests. What we gained after the first iteration was only the fact we had clear borders of problem and we confirmed that the fix was good and targeted in the right place.

 

The second iteration strategy was to optimize the first fix and try to minimize the number of times we use the Windows API. So we made sure that the key is always in sync with the path stored in the File Manager object, this prevented the object from failing to remove itself from the map when destroyed. We deliberately didn’t change the look-up and the destroy functions to not introduce unnecessary API calls. Everything looked nice but far from being ready. During the routinely code review we had for the fix, we noticed that we still don’t prevent the map from having a short and a long path of the same file which in complicated scenarios will end up confusing the dependency graph.

For the third iteration we leveraged the fact that before the creation, we were performing a look-up in the table to see if we have the required object already in the map, if we did, we returned the existing object. To address the second iteration’s problem, we introduced a second look-up in the table. This time we used the path stored in the File Manager Obj as it has potential to be different, if the target file exists on disk (the API has no impact on paths that do not exist), than the original one we were asked to create the object. We ran our tests and this time we didn’t notice any performance drop.

 

Even if all 3 iterations fixes were functionally correct, we chose to go with the fix from Third Iteration; the main reasons were, based on the risks described above:

-          Smaller more surgical fix, with less chances to break other scenarios

-          No performance hit for the common code path (long paths always)

-          Addresses the dependency graph issue for existing files

 

The process is far from being perfect and we do our best to improve it every day; I used this past experience only to give you a glimpse of the complexity and the impact of apparently small bugs with obvious solutions on the product cycle and to the end customer (that includes us as well because of course we use the same version of VS to develop it, which is “fun” when it is during development, missing features, slow and most of the time built “Debug”, but I will save this for another blog).

 

Best regards,

Bogdan Mihalcea

VC++ Development

  • "a customer experienced a project system crash when building a project that uses short paths (i.e. DOS 8.3) in Additional Directories."

    +

    "There were 3 operations involving the map that concerned us: Create, Lookup, and Remove."

    In the middle of a build, you do some Removes from Additional Directories?

    Even though I can't figure out what scenario calls for that, I'd have started by storing both the long name and short name in the internal list when creating a node.

    The user could still defeat it by renaming a directory on disk while the compiler is in the middle of compiling, but I'm not sure if that's worth worrying about.  If it is, then the compiler could lock the directory against renaming, which itself is a pretty familiar scenario.

  • I don't understand why you were hitting the disk every time, isn't there a file system cache?

  • > The first iteration of the fix was to normalize the access to the map by using always lowered long path names

    Normalization of paths should be to uppercase them to match the operating system, not to lowercase. Otherwise, you going to hit problems with 'Σ' and uppercase Georgian characters on OS's earlier than Vista.

  • To David:

    I agree. We have an open work item to track this issue. As we have to update it in various places this change was bigger (multiple places) and the implications much broader than we were willing to risk taking in SP1.

  • To Norman:

    No. All these operations happen because the map has a broader use across project system and build system.

  •  wqjkwqwjsklsksuws jiow jsdskuowhwepsaaawm sskuoesup,sisksuss;;aklwjskld jddskskssskdoiwjs ';jk

    kdksksisj ]ss9[]s6sisa0-3=aksewliw

    skssa

    ldsl9ds

    ljasllksaklajhs

  • I am surprised to hear that the VC++ team uses the IDE because there are so many glaring problems when you use the product for anything larger than a small toy solution file.

    Examples:

    File tab channel opening tabs on left and ordering based on MRU. In any solution with lots of files and developers that need to bounce between them, this scheme falls apart and the developer is left wondering, Where did my file go? The VS 2003 behavior was way better. Don't you guys experience this in what must be a large solution?

    Intellisense peformance: this is mostly worked around now with various hotfixes, but I am stunned that this ever made it out the door if any real developers had used the product for anything larger than a toy test. How did you guys not see these issues?

    Why does Visual Studio build timing not report the entire solution build time? This is another stunning omission. As you are probably well aware, timing how long it takes to build is an important step in optimizing a team's workflow. Requiring users to watch it and time it with a stopwatch manually is absurd when the IDE already knows when a build starts and stops. Have you guys never needed to time how long it takes to build VC++ itself, or any test projects?

    Why can I only open 2 Find in Files windows? When searching a very large codebase, I need more than 2 quite often when searching for related terms.

    I recommend that you head over to spend a few days watching someone that works on a large C++ app use the product. After interviewing them and watching them work you will be amazed at just how inadequate many aspects of the product are. Surely some of these must be feasible to fix in a few days time (as opposed to things that require massive many-month efforts).

  • Jimmy Sieben talk about a very importent feature that the VS lost, that is output the total building time of the entire solution.

    I really hope VS team could add this feature for VS 2005/2008, I'm nearly mad now for statistic the total building time for the reason I have so many projects in my solution,and I must write a perl script to do this, which is so disgusting!

Page 1 of 1 (8 items)