Larry Osterman's WebLog

Confessions of an Old Fogey

Larry goes to Layer Court

Two weeks ago, my boss, another developer in my group, and I had the opportunity to attend "Layer Court".

Layer Court is the end product of a really cool part of the quality gate process we've introduced for Windows Vista.  This is a purely internal process, but the potential end-user benefits are quite cool.

As systems get older, and as features get added, systems grow more complex.  The operating system (or database, or whatever) that started out as a 100,000 line of code paragon of elegant design slowly turns into fifty million lines of code that have a distinct resemblance to a really big plate of spaghetti.

This isn't something specific to Windows or Microsoft; it's a fundamental principle of software engineering.  The only way to avoid it is extreme diligence - you have to be 100% committed to ensuring that your architecture remains pure forever.

It's no secret that, regardless of how architecturally pure the Windows codebase was originally, lots of spaghetti-like issues have crept into the product over time.

One of the major initiatives that was ramped up with the Longhorn (now Windows Vista) reset was the architectural layering initiative.  The project had existed for quite some time, but with the reset, the layering team got serious.

What they've done is really quite remarkable.  They wrote tools that perform static analysis of the Windows binaries and work out the architectural and engineering dependencies between various system components.

These can be as simple as DLL dependencies (program A references DLLs B and C, DLL B references DLL D, DLL D in turn references DLL C), or as complicated as RPC dependencies (DLL A has a dependency on process B because DLL A contacts an RPC server that is hosted in process B).

The architectural layering team then went out and assigned a number to every single part of the system starting at ntoskrnl.exe (which is the bottom, at layer 0).

Everything that depended only on ntoskrnl.exe (things like win32k.sys or kernel32.dll) was assigned layer 1, the pieces that depend on those (for example, user32.dll) got layer 2, and so forth (btw, I'm making these numbers up - the actual layering is somewhat more complicated, but this is enough to show what's going on).
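To make the numbering concrete, here is a minimal sketch in Python of how such layer numbers could be derived from a dependency map.  It is not the layering team's tool; the dependency table just mirrors the made-up numbers above, and the names DEPENDS_ON and assign_layers are invented for illustration.

```python
# Toy dependency map mirroring the simplified example in the post.
# These are NOT the real Windows dependencies or layer numbers.
DEPENDS_ON = {
    "ntoskrnl.exe": [],
    "win32k.sys": ["ntoskrnl.exe"],
    "kernel32.dll": ["ntoskrnl.exe"],
    "user32.dll": ["kernel32.dll", "win32k.sys"],
}


def assign_layers(depends_on):
    """Give each component a layer: 0 if it depends on nothing,
    otherwise one more than the highest layer it depends on."""
    layers = {}

    def layer_of(name, stack=()):
        if name in layers:
            return layers[name]
        if name in stack:
            # A cycle means no consistent layer number exists.
            raise ValueError("dependency cycle involving " + name)
        deps = depends_on.get(name, [])
        if deps:
            value = 1 + max(layer_of(d, stack + (name,)) for d in deps)
        else:
            value = 0
        layers[name] = value
        return value

    for name in depends_on:
        layer_of(name)
    return layers


if __name__ == "__main__":
    for name, layer in sorted(assign_layers(DEPENDS_ON).items(), key=lambda kv: kv[1]):
        print(layer, name)
```

Note that a cycle in the map makes the numbering impossible, which is why the sketch raises an error instead of guessing.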

As long as the layering is simple, this is pretty straightforward.  But then the spaghetti problem starts to show up.  Raymond may get mad, but I'm going to pick on the shell team as an example of how a layering violation can appear.  Consider a DLL like SHELL32.DLL.  SHELL32 contains a host of really useful low level functions that are used by lots of applications (like PathIsExe, for example).  These functions do nothing but string manipulation of their input parameters, so they have virtually no lower level dependencies.  But other functions in SHELL32 (like DefScreenSaverProc or DragAcceptFiles) manipulate windows and interact with a large number of lower level components.  As a result of these high level functions, SHELL32 sits relatively high in the architectural layering map (since some of its functions require high level functionality).

So if a relatively low level component (say the Windows Audio service) calls into SHELL32, that's what is called a layering violation - the low level component has taken an architectural dependency on a high level component, even if it's only using the low level functions (like PathIsExe).
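Here is a rough sketch in Python of what checking for that kind of violation could look like, assuming every component already has an assigned layer.  The layer numbers, the component labels and the find_violations helper are all made up for the example; the real analysis is more involved.

```python
# Hypothetical layer assignments; the real numbers are different.
ASSIGNED_LAYER = {
    "ntoskrnl.exe": 0,
    "kernel32.dll": 1,
    "user32.dll": 2,
    "audio service": 3,
    "shell32.dll": 6,
}

# (caller, callee) pairs, as a static analysis tool might report them.
CALLS = [
    ("audio service", "kernel32.dll"),
    ("audio service", "shell32.dll"),  # low level calling high level
    ("shell32.dll", "user32.dll"),
]


def find_violations(layer, calls):
    """Return every call where the caller sits below the callee."""
    return [(caller, callee) for caller, callee in calls
            if layer[caller] < layer[callee]]


if __name__ == "__main__":
    for caller, callee in find_violations(ASSIGNED_LAYER, CALLS):
        print(caller, "->", callee, "is a layering violation")
```

The check looks only at which component is called, not which function, so calling just PathIsExe is flagged the same as calling DragAcceptFiles.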

They also looked for engineering dependencies - cases where low level component A gets code that's delivered from high level component B.  The DLLs and other interfaces might be just fine, but if A gets code from a higher level component, it still has a dependency on that higher level component - it's a build-time dependency instead of a runtime dependency, but it's STILL a dependency.

Now there are times when low level components have to call into higher level components - it happens all the time (Windows Media Player calls into skins, which in turn depend on functionality hosted within Windows Media Player).  Part of the layering work was to ensure that when this type of violation occurred, it fit into one of a series of recognized "plug-in" patterns - the layering team defined what were "recognized" plug-in design patterns and factored this into their analysis.

The architectural layering team went through the entire Windows product and identified every single instance of a layering violation.  They then went to each of the teams in turn and asked them to resolve their dependencies (either by changing their code (good), or by explaining why their code matches the plug-in pattern (also good), or by explaining the process by which their component will change to remove the dependency (not good, because it means that the dependency is still present)).  For this release, they weren't able to deal with all the existing problems, but at least they are preventing new ones from being introduced.  And, since there's a roadmap, we can rely on the fact that things will get better in the future.

This was an extraordinarily painful process for most of the teams involved, but it was totally worth the effort.  We now have a clear map of which Windows components call into which other Windows components.  So if a low level component changes, we can now clearly identify which higher level components might be affected by that change.  We finally have the ability to understand how changes ripple throughout the system, and more importantly, we now have mechanisms in place to ensure that no lower level components ever take new dependencies on higher level components (which is how spaghetti software gets introduced).
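As an illustration of the "ripple" idea, here is a small Python sketch that walks a dependency map backwards to find everything that might be affected by a change.  The map reuses the program A / DLL B / DLL C / DLL D example from earlier in the post; the affected_by function is invented for this sketch and is not the real tooling.

```python
from collections import deque

# Program A references DLLs B and C, DLL B references DLL D,
# and DLL D in turn references DLL C.
DEPENDS_ON = {
    "A": ["B", "C"],
    "B": ["D"],
    "D": ["C"],
    "C": [],
}


def affected_by(changed, depends_on):
    """Return every component that depends, directly or indirectly,
    on the component that changed."""
    # Invert the map: for each component, who depends on it?
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(affected_by("C", DEPENDS_ON)))
```

In this toy map, a change to DLL C can reach everything else, while a change to program A affects nothing, which is the property you want from the top of the stack.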

In order to ensure that we never introduce a layering violation that isn't understood, the architectural layering team has defined a "quality gate" that ensures that no new layering violations are introduced into the system (there is a finite set of known layering violations that are allowed for a number of reasons).  Chris Jones mentioned "quality gates" in his Channel9 video.  Essentially they are a series of hurdles that are placed in front of a development team - the team is not allowed to check code into the main Windows branches unless they have met all the quality gates.  So by adding the architectural layering quality gate, the architectural layering team is drawing a line in the sand to ensure that no new layering violations ever get added to the system.
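The gate idea boils down to comparing the violations found in a build against the list of known, allowed ones.  A minimal sketch in Python, assuming violations are reported as (caller, callee) pairs; check_layering_gate and all of the sample data are hypothetical, not the actual gate.

```python
def check_layering_gate(found, known_exceptions):
    """Pass only if every violation found is already on the known list.

    Returns (passed, new_violations)."""
    new_violations = sorted(set(found) - set(known_exceptions))
    return len(new_violations) == 0, new_violations


if __name__ == "__main__":
    # Made-up data: one violation with a roadmap, one brand new one.
    known = {("audio service", "shell32.dll")}
    found = [
        ("audio service", "shell32.dll"),
        ("some low level dll", "shell32.dll"),
    ]
    passed, new = check_layering_gate(found, known)
    print("gate passed" if passed else "gate failed: %r" % (new,))
```

Known violations don't fail the gate, they just stay on the list until the roadmap retires them; anything not on the list blocks the checkin.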

So what's this "layer court" thingy I talked about in the title?  Well, most of the layering issues can be resolved via email, but for some set of issues, email just doesn't work - you need to get in front of people with a whiteboard so you can draw pretty pictures and explain what's going on.  And that's where we were two weeks ago - one of the features I added for Beta2 restored some functionality that was removed in Beta1, but restoring the functionality was flagged as a layering violation.  We tried, but were unable to resolve it via email, so we had to go to explain what we were doing and to discuss how we were going to resolve the dependency.

The "good" news (from our point of view) is that we were able to successfully resolve the issue - while we are still in violation, we have a roadmap to ensure that our layering violation will be fixed in the next release of Windows.  And we will be fixing it :)

 

  • Larry, thank you for such an in-depth and informative post. It's intriguing to see not just what new features are coming in Windows, but to also see the new processes in place and how they will improve quality.

    A question: Is it a layering (or other) violation for a DLL marked as Win32 Console to call into a Win32 GUI DLL?
  • By "the next release of Windows" do you mean "Vista RC1" or "Vista RTM"?
  • No, it's not a layering violation.

    You can think of it this way: Higher level components have more dependencies than lower level components (by definition). So when a low level component takes a dependency on the higher level component, it moves higher in the layer map. But that is a bad thing, since it moves everything that depends on the low level component higher in the map, and potentially introduces cycles (which are horribly bad).

    This is a purely internal process involved in ensuring the architectural cleanliness of Windows; for 3rd party applications, it isn't as important (to 3rd party platforms, the platform SDK is essentially their layer 0)
  • Awesome !
  • Personally, I found this post by Larry Osterman on 'layering violations'
    to be incredibly interesting. ...
  • Thanks for this post, very interesting.

    Just a simple question:

    Why isn't hal.dll layer 0? Which depends on the other: ntoskrnl.exe or hal.dll?
  • Awesome. I'd like to someday see that process firsthand.
  • Kazi, good question - the simple answer (as I understand it) is that the layering below ntoskrnl.exe isn't NEARLY as interesting as the layering above ntoskrnl.exe.

  • kernel32.dll depends on ntdll.dll, that depends on ntoskrnl.exe, so "Everything that depended only on ntoskrnl.exe (things like win32k.sys or kernel32.dll) was assigned layer 1" is wrong, but still it was a very interesting post.
    Kazi: They depend on each other

    Ivan.
  • I'm a little slow, could you please give another example of where a layer violation would be allowed?

    I get why this is a bad thing... I mean cycles are just messy so splitting everything off into a layer and saying "Only communicate down the tree!" seems perfectly logical... But what I don't get is why this would be allowed, ever...

    Of course peers should be allowed to communicate, so driver A can use a function in driver B, ditto for libs and/or applications. But I can think of no example where a low level lib would be allowed to call something like the graphic lib...
  • Oh, Kazi almost beat me to it, but not quite.

    Since hal.dll is layer -1, that puts ntoskrnl.exe at layer -2, which puts hal.dll at layer -3, etc. The layering below ntoskrnl.exe is INFINITELY interesting.
  • Holy ****. I would love to have that code. There is a project I am working on that I bet has serious layering issues. Even on a small project, it can be a very good thing to keep layering in mind. If your classes aren't layered properly, then there probably are a few non-trivial relationships that are just going to cause endless problems.
  • I think the current state of the SHELL32 utility APIs reflects worse on the core OS team than on the shell team. File path manipulation on Windows can be surprisingly tricky with absolute paths, relative paths, mount points, UNC paths, device paths, escaped wide char paths, etc. That everyone is using the shell APIs points to a deficiency in the core filesystem APIs. One way to rectify this would be to move the functions and forward the old exports.

    Also, are naming conventions being scrutinized as well? Some of the APIs are rather collision-prone in their naming; I had code break a while back because OpenRaw was #defined A/W in a recent Platform SDK, and someone on the Windows Media SDK team needs to be brought to trial for naming a struct WM_GET_LICENSE_DATA.
  • Hi Larry,

    just a dumb question. How does Microsoft "remember" these architectural flaws (or left-out features) that are not being fixed/implemented within the "current" release but are "moved" to the next one?
  • Larry,

    Great blog, very informative reading. Larry, are you planning to post more details on the build process for Windows Vista?

    I find it very interesting how you guys manage to pull so many pieces together.

    Do you have any figures on the compile time for the builds for Windows Vista?

    What are the specs for the systems that you are compiling Windows Vista on? - I would expect these machines would have a fair amount of "horsepower" behind them.


