So I’ve recently fixed two bugs that shared a root cause.  I thought they were interesting enough to be worth sharing with you.  To make sense of them, I first need to give a little architectural background.

As you probably know, I’m responsible for the core IntelliSense™ architecture.  What does that mean?  Well, I work on the language analysis component that watches what you type and keeps our internal understanding of your source code up to date so that we can proffer help in many forms (tooltips, parameter help, completion lists, etc.).  This is a fairly complex task, and there are many ways one could architect a system like this.  In my opinion, the ideal way would be a system that watched what you did and kept itself up to date after every action you performed.  Such a system would be very easy to reason about, and it would always be able to help you the most, since there would be no disparity between what you had written and what it knew about.

Unfortunately, we do not have such a system.  Why not?  Well, first let me discuss the work that is done.  Consider an action like typing.  After a user has typed a character we might have to do all of the following steps:

  1. Update our token stream for that document.  Certain characters might require the entire document to be re-tokenized (imagine adding a “/*”).
  2. Update our parse tree for that document.  How much the token stream churned determines how much of the file needs to be reparsed and fixed up (imagine changing top-level scoping by adding or removing an open curly brace “{”).
  3. Update our annotated expression graph.  A single token change could have an impact on the expression graph for your entire solution (imagine changing a namespace name).
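To make the incremental nature of step 1 concrete, here’s a toy sketch (Python, with made-up names — nothing like the real C# engine) of how an editor might classify how much work a single edit forces:

```python
def retokenize_scope(inserted_text):
    """Classify how much of a document a single edit invalidates.

    Toy heuristic only: opening (or closing) a block comment can change
    the lexical class of every character after it, so the whole file
    must be re-scanned; brace churn can reshape the parse tree for the
    enclosing scopes; most other single-character edits stay local.
    """
    if "/*" in inserted_text or "*/" in inserted_text:
        return "whole-file"
    if inserted_text in ("{", "}"):
        return "enclosing-scope"
    return "local"

print(retokenize_scope("/*"))  # whole-file
print(retokenize_scope("{"))   # enclosing-scope
print(retokenize_scope("x"))   # local
```

The real decision is vastly more nuanced, but the shape is the same: the cost of a keystroke depends on how far its lexical and grammatical effects can propagate.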

This is quite a bit of work to do, and we would potentially have to do all of that in the scant milliseconds between the characters a user has typed.  As it stands today with our current architecture, it’s too much work to do in too little time. 

So what did we decide on instead?  Well, the C# Language Service is divided into two threads: 

  1. The foreground thread, which responds to user interaction events and presents the results of a request (like populating and bringing up a completion list after <dot> is hit).
  2. The background thread, which is responsible for keeping our internal expression representation up to date after a user has performed an action.  I.e., after a character is typed, it ensures that we understand the user’s code, given the massive change in meaning that keystroke could have caused.
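A minimal sketch of that two-thread split (Python, with hypothetical names — the real service is far more involved) might look like:

```python
import queue
import threading
import time

class LanguageService:
    """Toy model: a background thread folds edits into a 'symbol graph'
    while the foreground thread answers queries immediately from
    whatever snapshot currently exists, never blocking the user."""

    def __init__(self):
        self._edits = queue.Queue()
        self._lock = threading.Lock()
        self._symbols = set()
        threading.Thread(target=self._analyze, daemon=True).start()

    def on_type(self, method_name):
        # Foreground: record the edit and return instantly.
        self._edits.put(method_name)

    def _analyze(self):
        # Background: slowly incorporate pending edits.
        while True:
            name = self._edits.get()
            time.sleep(0.05)              # simulate expensive re-analysis
            with self._lock:
                self._symbols.add(name)
            self._edits.task_done()

    def completions(self):
        # Foreground query: may return stale results, by design.
        with self._lock:
            return set(self._symbols)

svc = LanguageService()
svc.on_type("Bar")
stale = svc.completions()   # quite possibly empty: analysis hasn't caught up
svc._edits.join()           # demo only: wait for the background thread
fresh = svc.completions()   # now contains "Bar"
print(sorted(stale), sorted(fresh))
```

The key property is that `completions` never waits on `_analyze`; the price is that its answer can lag the user’s most recent keystrokes.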

By choosing this sort of model, though, we have made a tacit decision that the information we present to you might not be completely up to date or accurate.  I.e., when the foreground thread decides to do something like present a tooltip when you hover over an identifier, it is requesting the information from an expression graph that could very well be out of date.  Now, you might be saying to yourself, “well, why don’t you just wait until the information is up to date and then present it to the user?”  Interestingly enough, this is exactly what some language services choose to do.  When you’re in VB and you move away from the current bit of code you’re typing, you might notice a pause.  What’s going on at that point in time?  Well, the VB language analysis system is taking what you’ve just written and bringing its entire system up to date.  However, in C# land we made a conscious decision that that was not the model we wanted.  We feel that pausing while a user types is something our users will absolutely hate, and it’s imperative that we not block the user while typing.  It should be noted that our architecture does exceedingly well on multi-proc and hyper-threaded machines.  In those cases the system runs both of these threads simultaneously, and in effect we often appear to be up to date in between your keystrokes.

Now, is this such a big deal?  In practice the answer is almost always “no”.  While not always fast enough to finish its work between your keystrokes, our system is still extremely fast and uses many snazzy techniques to do that work quickly.  So if you were to have the following code:

    public void Foo()
    {
        this.
    }

and you then added a new method after “Foo” like so:

    public void Foo()
    {
        this.
    }

    public void Bar()
    {
    }

and you went back up to “Foo” and requested a completion list after “this”, you would almost certainly see “Bar” in the list.  We’re not necessarily fast enough to work within your keystrokes, but we are fast enough to deal with keystrokes and a bit of navigation.

What are the cases where this being out of date can cause an issue?  One is WinForms.  In order to accurately display what your form will look like, the WinForms designer needs us to accurately examine, decompose, and report back the meaning of your InitializeComponent method.  In the past we would just depend on our potentially out-of-date symbol graph, but that ended up being the source of many bugs and major headaches for the user (ever had all your controls disappear?  There’s a good chance it was due to that).  So, in VS2005, we changed our model so that if WinForms asks us for data, we block the foreground thread until the background thread has finished working.  In effect, for this circumstance we’ve moved to the VB model.  Now, in order to not give you a horrible experience where you’re asking yourself “why the heck isn’t the system responding,” we pop up a progress dialog to tell you what’s happening, why, and how much time is left.  We considered it important enough in this case because if we sent bad information to WinForms, we could end up corrupting your form.
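Sketched in the same toy style (assumed names, not the real VS internals), the WinForms path simply drains the background queue before answering, while the typing path keeps the non-blocking behavior:

```python
import queue
import threading
import time

class AnalysisEngine:
    """Toy sketch: one engine, two query contracts. Fast-but-maybe-stale
    for typing scenarios; blocking high-fidelity for consumers (like a
    form designer) that would corrupt data if given stale answers."""

    def __init__(self):
        self._pending = queue.Queue()
        self._symbols = set()
        threading.Thread(target=self._worker, daemon=True).start()

    def enqueue_edit(self, symbol):
        self._pending.put(symbol)

    def _worker(self):
        while True:
            sym = self._pending.get()
            time.sleep(0.02)          # simulate analysis cost
            self._symbols.add(sym)
            self._pending.task_done()

    def query_fast(self):
        # May be stale; never blocks the caller.
        return set(self._symbols)

    def query_high_fidelity(self, progress=print):
        # Block until every queued edit has been analyzed, reporting
        # progress so the user knows why the UI is waiting.
        progress("Analyzing %d pending edit(s)..." % self._pending.qsize())
        self._pending.join()
        return set(self._symbols)

engine = AnalysisEngine()
engine.enqueue_edit("InitializeComponent")
result = engine.query_high_fidelity()
print(sorted(result))
```

The `progress` callback stands in for the progress dialog: the blocking contract is tolerable precisely because the user can see what is happening and why.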

What we had here was a case where the requirements and capabilities of two different systems were not being met.  WinForms requires that the language service provide high-fidelity analysis of the code in question, whereas the C# Language Service was designed to be fast, but not necessarily to provide high-fidelity information.  Unfortunately, this realization and formalization wasn’t clear to either the WinForms team or our team in the VS2003 timeframe, which is why many of these bugs existed.


Ok, so that was the background bit; now onto the bugs I was fixing recently.  I’ve known for a while that there are other areas where requirements and capabilities are not in sync, but it wasn’t clear to me how best to fix the problem.

The first is “Bring Up Completion List On Identifier” (BUCLOI).  If you haven’t played with VS2005 yet, here’s how it works: when you start typing any C# identifier (or keyword), we’ll automatically bring up the completion list, populated with all the relevant identifiers for that location (FYI: if you don’t like this feature, it is simple to disable in the tools|options dialog).  I.e., if you’re starting to write a method and you type “pu”, then we’ll show you:

[Screenshot: the completion list with “public” selected]

and you can then select “public” without wasting time hitting <ctrl><space> to bring up the completion list.  Now, in order for this feature to be useful it has to be the case that we accurately provide you with the identifiers that would be valid in that location.  This was a great way for us to find bugs in the language service when we introduced this feature more than a year ago.  If the identifier wasn’t in the list then people would be frustrated and send us nasty bug reports telling us that this feature was getting in the way.  However, it allowed us to know that we weren’t doing a good enough job analyzing your code, or we weren’t getting all our internal information up to date fast enough.  By pushing this right up in peoples’ faces we were able to improve performance and accuracy by probably an order of magnitude.  However, even with all the work we did, there was one place where we were still running into problems and getting people frustrated.  Specifically, in this case:

[Screenshot: typing the return type of a generic method]

As it turns out, this is one of those areas where a single character change ends up having a massive effect on our internal symbol graph.  By adding the return type for this generic method, we change the parse tree extensively, forcing a lot of re-analysis.  On top of that, generic methods are special in how overloads and constraints are determined, so there’s a fair bit of work needed to stitch all that information together.  Because of this, it’s possible that on some occasions when you bring up the completion list for the return type, it might not contain the generic type parameter.  And what’s worse, because the type parameter is highly likely to be one character long, there’s a very high chance that it will be the prefix of some other item in the list.  If you then hit space, it will complete to that longer identifier and completely throw you off.  Ugh.  What an awful thing to do to you.  You’re trying to type completely legal code and we totally screw you over.  Definitely a bad thing, and something that I think will leave some users cursing at us.  Once again, this is a case where the high-fidelity requirement of BUCLOI contrasts with the lackadaisical nature of the analysis engine.
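The failure mode is easy to model.  Here’s a toy sketch (hypothetical commit-on-space logic, not our actual code) of why a missing one-character type parameter is so painful:

```python
def commit_on_space(typed_prefix, completion_list):
    """Toy model of committing a completion when space is hit: if the
    typed prefix is itself in the list, keep it; otherwise commit the
    first item the prefix matches."""
    if typed_prefix in completion_list:
        return typed_prefix
    for item in completion_list:
        if item.startswith(typed_prefix):
            return item          # completes to the longer identifier
    return typed_prefix

# If the stale list is missing the type parameter "T", typing "T"
# followed by a space silently becomes something else entirely:
print(commit_on_space("T", ["TextReader", "TextWriter"]))  # TextReader
# Once the list is up to date, "T" survives:
print(commit_on_space("T", ["T", "TextReader"]))           # T
```

A one-character identifier is almost always a prefix of something, so a stale list doesn’t just omit the right answer — it actively replaces it.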

Normally we don’t run into these situations, but here was one case where it did become important.  In fact, of all the grammatical areas in the C# language, this is one of the rare ones where, even though we’re so close to the definition of the item, adding or removing the reference ends up having very costly effects on us.  In pretty much all other places it’s not a problem.  Now, unlike the WinForms case, we couldn’t just block in this situation (since we’re dead set against that — and how would you feel about a dialog coming up saying “please wait” while you were trying to type the return type for a method!).  So what did we end up doing?  Well, something I really dislike, but which I think was an acceptable choice for VS2005: we special cased.  Instead of having fairly generic and consistent rules for how we process changes in your code, we now special case what we do if you’re changing the return type of a generic method.  In that case we make darn sure that our understanding of the generic method’s type parameters is brought up to date very quickly in the process.  I would have preferred not to do this, but hopefully I can keep this hack abstracted away while keeping the rest of the architecture fairly clean.
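In toy form, the special case is just a guarded fast path in front of the general machinery (illustrative names only — not how the real engine is factored):

```python
def apply_edit(edit, symbol_graph):
    """Toy dispatcher: most edits go through the deferred incremental
    path, but an edit to the return type of a generic method takes a
    special-cased fast path that refreshes the method's type parameters
    immediately, so the completion list can offer them right away."""
    if edit["kind"] == "generic-method-return-type":
        # Special case: stitch the type parameters in synchronously.
        symbol_graph.setdefault("type_params", set()).update(edit["type_params"])
        return "fast-path"
    # Everything else waits for the background thread as usual.
    symbol_graph.setdefault("pending", []).append(edit)
    return "deferred"

graph = {}
apply_edit({"kind": "generic-method-return-type", "type_params": {"T"}}, graph)
print(graph["type_params"])  # "T" is available before full re-analysis runs
```

This is exactly the kind of hack the paragraph above laments: it works, but it punches a hole through otherwise uniform update rules.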

The second place this came up was in a great bug found by QA a while back.  The basic gist was that immediately after you performed an “extract method” refactoring, the “generate method stub” smart tag would appear, offering to generate the method you’d just created!  After reading everything so far, it’s probably pretty clear to you what had happened.  The smart tag checked to see if the method call it was on could bind to an existing method.  Of course, because that method had *just* been created (probably <5 milliseconds prior), the IntelliSense™ background thread hadn’t yet incorporated it into the symbol graph.  And so the smart tag saw that it wasn’t there and said: “hey, perfect place to display myself!”  So how did this get missed by us devs?  I mean, you’d think we’d have seen it when we created these features.  Well, as it turns out, I develop on a dual-proc machine, and on such machines this never repro’ed because the internal state was up to date by the time the smart tag queried it.  (Similarly, the first example with method return types never repro’ed on these machines either.)  So once again this was a case where a feature wanted up-to-date information but was depending on a system that didn’t guarantee any such thing.  Any guess as to how we eventually ended up solving this one?
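The race is easy to model with a toy timeline (illustrative only — not the actual smart-tag code, and it deliberately doesn’t spoil the eventual fix):

```python
# State right after 'extract method' runs on the foreground thread:
symbol_graph = set()                 # background analysis hasn't run yet
pending_analysis = ["NewMethod"]     # the refactoring just created this

def smart_tag_should_appear(call_name):
    # The smart tag only consulted the (possibly stale) symbol graph.
    return call_name not in symbol_graph

print(smart_tag_should_appear("NewMethod"))   # True: bogus smart tag!

# Once the background thread catches up, the tag would not appear:
symbol_graph.update(pending_analysis)
print(smart_tag_should_appear("NewMethod"))   # False
```

On a dual-proc machine the second thread usually wins the race before the query lands, which is exactly why the bug never reproduced on the development boxes.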

As we move forward, I think it’s very important to consider and document these design decisions very carefully.  They have far-reaching consequences, and they need to be understood and planned for when designing features that will be affected by them.