September, 2006

Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog



    Most of our customers (and most professional software developers) don't really understand how much is involved in making a feature.  It's not just the specification and code work.  It's all the other things that have to be handled.

    I like to lump these into a great big aggregate that I call the "*bilities".

    The *bilities are all the things that are a critical part of deploying a product but aren't a part of your normal day-to-day process.

    You see, when you come up with a new feature, typically you'll sit down and brainstorm about the product, then dev, PM, and Test go off and write the specifications for the feature, then dev writes the code and test writes the test cases, and PM does whatever PM does while dev is writing the code and test is writing the test cases.

    And then the testers test the code, and the developers fix the bugs and you're done - the feature is ready to put into the product, right?


    Maybe.  It depends on how well you've dealt with the *bilities.  And what are the *bilities?

    Well, there's all the things that you've got to have before you can ship.  Here are some examples:

    • Reliability
    • Localizability
    • Internationalizability (it's different from Localizability)
    • Geopoliticalizability
    • Compatibility
    • Accessibility
    • Usability (this applies to both UI AND APIs)
    • Securability (ok, it's Security, but I had to figure out a *bility for Security)
    • Deployability
    • Manageability
    • Serviceability
    • Upgradability
    • Extensibility
    • etc.

    If you've not covered ALL of these, your feature simply isn't ready to ship.

    Some of these are pretty straightforward for many features.  For example, if your code doesn't interact with users, accessibility isn't that big a deal.  Or if your feature is totally new, compatibility may not be a big deal.  On the other hand, you need to think about how upgradable your code is - what happens when you want to add something new to your API in the future?

    For Vista, we covered most of the *bilities in what were called the Basics - the Basics were just a formal process we deployed to ensure that the *bilities were covered in every single feature - all the features we deployed had to have a Basics section in their spec that discussed how the feature intersected with the Basics.

    The *bilities can quite literally be make-or-break for a product.  For instance, if you don't handle Accessibility, Localizability, Internationalizability, and Geopoliticalizability you may find your product banned in certain countries.  In the US, if your application (or operating system) isn't fully accessible, you can't sell to certain governments.

    In many ways, the *bilities are a tax associated with delivering a product, but there are some things that you just have to do.

  • Larry Osterman's WebLog

    A bedtime story :)


    Once upon a time (all the best stories start with "Once upon a time"), there was a computer manufacturer, let's call it by a totally random collection of letters, let's say "JCN".

    JCN had a brand new computer that it was producing (called the QD-BU).  One of the really cool things about this computer was that the hard disks on it were HUGE.  Twice the capacity of any PC available at the time.  This machine came with a 20 MEGABYTE hard disk!

    Oh, and did I tell you it was FAST?  It was so fast, you could move the heads around on the hard disk in only 10 milliseconds!

    JCN was quite proud of their new computer, and they made sure that it was rigorously tested.  They made sure that every component in the computer was of the highest quality, even the hard disks.  It was especially important that the hard disks work well, because with all that disk space, people would store more and more data on the hard disk.

    After months and months of testing, JCN decided that their computer was ready to fledge its wings and take flight.  They launched it with MUCH fanfare, and it was very well received.


    Unfortunately, they realized soon after the launch that there was a problem.  Users started reporting that hard drives in the QD-BU were starting to act "badly".  They would corrupt data seemingly at random.  JCN was NOT happy at hearing this, after all, they had spent a lot of time and money ensuring that the QD-BU would be better than any other computer available.

    The problem was that the manufacturer of the hard drives couldn't produce them in quantity.  Their quality suffered when producing drives in quantity.

    JCN worked to fix the problem with the manufacturer, and eventually everyone was happy.


    Of course, JCN tried very hard to learn from the lessons of the QD-BU.  And their key takeaway?  Not that they needed to be careful about which vendor they chose to make their drives.

    Instead the lesson they learned was "Fast hard drives break".  So for their next generation of computers (the QT/3), the hard drives were much slower.  They took 85 milliseconds to move the heads, but they were MUCH more reliable than the old QD-BU drives.


    Of course this is only a bedtime story.  It has no relationship to the real world whatsoever.

  • Larry Osterman's WebLog

    It's Bedlam all over again...


    A really long time ago, I wrote a post about the "Bedlam DL3" event at Microsoft.

    Well, a couple of days ago, we had another Bedlam DL3 event.  For some reason, the permissions on one of our internal DLs were messed up, and someone had granted "send-as" permission to all 2500 members of the DL.

    Someone then sent a message to the DL with the From: field set to the DL (I have no idea why, or who did it, but they did it).

    That person then realized that they had made a mistake and they tried to recall the message.

    The problem is that message recalls are handled on the Outlook client (all 2500 of them).  So every recipient of the message sent a "Recall Success" or "Recall Failure" message to the sender of the email message.


    And the Exchange servers proceeded to.....


    Slow WAY down.  Not surprisingly, given that they were handling what was estimated at 36G of email (2500 emails sent to 2500 recipients is 6.25 million emails, each email was about 6K bytes long).
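
    For anyone who wants to check the arithmetic, here is a minimal sketch in C++.  It assumes every one of the 2500 recall replies fans out to all 2500 members of the DL and that each copy is 6 * 1024 bytes; the function names are made up for illustration.

```cpp
#include <cstdint>
#include <iostream>

// Total bytes delivered when every member of a list sends one message
// that is delivered to every member of the same list.
// members: size of the distribution list (2500 in the story above).
// bytesPerMessage: assumed average message size ("about 6K bytes").
constexpr std::uint64_t ReplyStormBytes(std::uint64_t members,
                                        std::uint64_t bytesPerMessage)
{
    return members * members * bytesPerMessage;
}

// Prints the message count and the approximate volume in GiB.
void PrintReplyStormEstimate()
{
    constexpr std::uint64_t members = 2500;
    constexpr std::uint64_t bytesPerMessage = 6 * 1024; // assumption: 6K = 6144 bytes
    constexpr std::uint64_t total = ReplyStormBytes(members, bytesPerMessage);

    std::cout << members * members << " messages, roughly "
              << total / (1024.0 * 1024.0 * 1024.0) << " GiB\n";
}
```

    Under those assumptions the total is 38,400,000,000 bytes, or a little under 36 GiB, which is where the rough 36G figure comes from.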

    But they handled it with aplomb.  Even a 36G email bomb was handled by the servers.  Email backed up for several hours, but the servers didn't crash.  Man, things have improved since way back when.


    The best part of this was that the email alias in question was a security-related email alias.  So everyone on the DL was sending emails to the DL speculating about who was pen-testing the live Exchange servers :)  All the while the queues were being drained and clients were actively using the system.


    I was pretty impressed, to be honest.

  • Larry Osterman's WebLog

    Going gaga over XGL


    Chris Pirillo's been making a ton of noise over a YouTube video he posted showing off a demo of the XGL desktop running on KDE.

    He then turns around and asks "Why can't Vista look like this?".  I'm not a UX (user experience) guy, but I have watched the video and I've got some pretty strong opinions about it.


    First off, he's right - this is a pretty amazing demo.  It has TONS of eye candy.  The "bouncy" effects on the windows are very pretty.  The rotating cube is cool, as is the "windows bump into each other" effect.  Having said all that, there's a TON of distance between a cool demo (or proof of concept, or whatever it is you call something that's not shipping in a product for millions of consumers) and a shipping product.


    For instance, the bouncy windows make you seasick after a while.  And the cube desktop, while slick, has some serious issues - for instance, you've got a strong potential for "losing" your windows (because they're on a face of the cube that's obscured).

    The key thing to realize is that it's relatively easy to make a cool UI.  I've seen the most amazing proof of concepts for Windows UI coming from our advanced UX team.  Really compelling stuff, that just knocks your socks off. 

    And not one of them has ever seen the light of day outside of Microsoft (to my knowledge).


    Why is this?  Because making a good user experience is HARD.  It's easy to make a cool user experience, it's REALLY hard to make one that's good, and that works for millions of users.  There are a ton of things you need to consider.  You need to consider usability, accessibility, localizability (yeah, it matters - Right-To-Left languages may have different visual conventions than Left-To-Right languages), all sorts of other *bilities.  I've been through enough and read enough UX reviews over the redesigned multimedia control panel in Vista to realize the complexity of the things that these guys have to deal with.  It's a lot harder than you think.  John Gruber over at Daring Fireball has a classic post entitled "Ronco Spray-On Usability" where he talks about some of the issues.


    Take floppy windows for example.  The Shell Fit&Finish dude (Dave Vroney) just put out a post explaining why they disabled floppy windows.  The answer is that they significantly reduce the usability of the system.  They may be cool but they get really annoying really soon.


    And, of course, Vista is only V1 of the DWM.  This release is about getting the heavy lifting done and building a new desktop compositing engine.  Future releases are likely to have a ton more cool stuff coming from the UI wizards now that they have a platform on which they can do really cool things.

  • Larry Osterman's WebLog

    My life is a House episode


    Fox TV here in the US has a show called "House".  Valorie and I started watching it sometime towards the end of the 2nd season; the 3rd season started last week.  House stars Hugh Laurie as a genius, drug-addicted, lame doctor who, with his brilliant associates, finds the root cause of impossibly complicated diseases.

    Each episode starts with someone arriving at the hospital with some mysterious ailment, and House and his impossibly pretty team go to work trying to diagnose the person's problem.  They almost always succeed and the patient goes home cured (with several notable exceptions).

    Last week, I realized that aspects of my life are very similar to House's (without the drug addiction, the handicap, and the crazy-good-looking sidekicks part (sorry folks, but nobody on the audio team quite matches House's team, especially me :)).  I'm also not the boss of the team, just a peon.  One of the hallmarks of the show is that they perform a "differential diagnosis" - diagnosis based on the symptoms of the disease.  Their original diagnosis is almost always wrong, but they eventually find the root cause.


    But there's so much of my life that works like a House episode.  Take last week.

    One of the people on my team was looking at the Vista RC1 OCA information and noticed that we had a single crash bucket that had a significant number of hits in one of our components.

    I took a look at the crash dump and immediately diagnosed a concurrency issue.  I worked up a fix based on the call stack of the crash (by default OCA crash dumps contain the call stacks for the threads in the process and the registers and not too much more), and I was done.  Nothing out of the ordinary.

    I built the fix, verified it on my machine and started the checkin process (there are a number of steps that have to be taken for any checkin, including code reviews, test signoff, etc).


    Unfortunately, I had this nagging feeling about my fix - the call stack didn't have quite enough information to completely diagnose the problem - my fix would explain the crash, but if the problem was the one I thought it was, I would have expected that there would be side effects.  Things didn't quite add up (the doctor's original diagnosis was wrong - the patient should have had other symptoms).

    So I went and I asked the internal OCA web site to collect more information from our customers - I wanted a more detailed version of the crash dump that contained the contents of the heap (the doctors asked for more tests to be performed).

    It didn't take long (a day or so) for a couple of new occurrences of the crash to be reported with the heap dumps.  With the new info, I was quite surprised by what I saw (the new tests that the doctor ordered showed some data that both confirmed and disputed the diagnosis).  The crash was occurring in code that looked like the following:


    for (i = 0 ; i < class->cElements ; i += 1)
        class->GetElement(i, &class->_ValueArray[i]);
    x = class->_ValueArray[0];

    The crash was occurring when accessing _ValueArray[0].  The code was:

    mov ecx, [esi]+24
    mov eax, [ecx]

    The crash was occurring at the mov eax instruction; ecx was 0.  When I got the heap dumps, I saw that class->cElements was 8, and _ValueArray pointed to valid memory!  I looked at the code, the _ValueArray value was located 24 bytes from the start of the class, so the problem wasn't some weird compiler issue.  There was no question that the value was 0 at the time of the crash, but apparently the memory pointed to by ESI wasn't 0 (the test results were inconclusive - they didn't rule out the original diagnosis, but they didn't confirm it).

    So I went back for more information.  One of the things OCA lets you do is ask the customer to fill out a survey which can be used to help diagnose the problem.  I set up the crash bucket to ask the customers for a survey (the doctors went back and took a new version of the patient history).

    Unfortunately, even with all this data, we still didn't have confirmation that my original diagnosis was accurate (there was no additional information in the patient history).  Bummer.

    Fortunately, late on Thursday afternoon, I got an email from a tester in another part of the Windows organization.  She had gotten this crash running this one series of tests and was wondering if anyone on our team wanted to look at it (the patient's mother-in-law remembered something that was important).

    It turns out that she had hit exactly the same bug that the customers had, and she had a live debugger attached to the machine, which meant that I could diagnose the problem directly.  And on her machine, I saw the side effects I had expected to see in the crash dumps (the doctors eventually performed exploratory surgery and identified exactly the problem that was occurring, and saved the day).

    I then talked to the guys who are responsible for the OCA reports.  It turns out that the reason I didn't see the expected side effects on the crash dumps was because of other services that live in the same process as our service.  It turns out that because of those other services, the process of generating OCA crash dumps doesn't preserve the entire state of the process at the time of the crash - some threads continue to run after the crash occurred.  So while the information for the current thread is completely accurate as of the time of the crash, other information in the process may not reflect the state of the process at crash time (the patient had another symptom that masked the expected side effects, complicating what would normally be a simple diagnosis).


    Yeah, I know the House analogy is a bit tortured, but it was all I could think of while I was looking at the problem - "Darn it, my diagnosis is good, I know I found a problem, but I can't tell if it's the root cause or not".

  • Larry Osterman's WebLog



    Way back when NTFS was first being designed, the designers of the filesystem had a problem - what should they do with "." and ".."?  Traditionally, in *nix filesystems (and MS-DOS's FAT filesystem), "." and ".." were two hard links that were created by the mkdir command that included links to the current and parent directory.

    For subdirectories, "." and ".." posed no issues at all - NTFS could do exactly what *nix and FAT did - create links to the parent and current directory and be done with it.

    But there was a problem with the root.  You see, *nix just had a "." in the root, HPFS had "." in the root, but FAT didn't have any special characters in the root. 

    The NTFS guys decided to treat the root exactly the same as any other directory - the root would have a "." and a "..", both of which linked to the root directory.  That way, apps that traversed the directory tree should behave correctly.

    When they rolled this out, I raised some issues about appcompat to the NTFS team, and their answer was "Nah, it shouldn't be a problem, after all, the worst thing that can happen is if they traverse ".." through the root - no big deal."  And it made their design much cleaner - there was nothing special about the root directory, it was just another directory.



    Anyone who's read Raymond's blog knows exactly what happened next :).


    Yup, the bugs started coming in.  You see, applications that tried to present pretty dialogs for navigation (like the common file dialog) allowed the user to type in ".." into their control (or select the ".." entry).  Because they wanted to be helpful, they would also show the full path to the file, so when you clicked on ".." it removed the last path element from the file (so "C:\Users\LarryO\Pictures" would be displayed as "C:\Users\LarryO" in the title of the dialog box).  Of course, when you clicked on the ".." entry in the root, they immediately crashed because they tried to navigate through "C:" and couldn't find a \ to back up.
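
    To make the failure concrete, here is a rough sketch in C++ of a parent-path helper that treats the drive root as its own parent instead of hunting for a backslash that isn't there.  ParentPath is a hypothetical name, not anything taken from the common file dialog, and it assumes simple drive-letter paths (no UNC paths, no forward slashes).

```cpp
#include <string>

// Returns the parent of a Windows-style absolute path such as
// "C:\Users\LarryO".  A drive root ("C:\" or "C:") has no parent,
// so it is returned unchanged rather than treated as an error.
// Assumption: the input is a drive-letter path using backslashes.
std::string ParentPath(const std::string& path)
{
    std::string trimmed = path;

    // Drop a trailing backslash unless it belongs to the drive root ("C:\").
    if (trimmed.size() > 3 && trimmed.back() == '\\')
        trimmed.pop_back();

    const std::string::size_type lastSlash = trimmed.find_last_of('\\');

    // "C:" - there is no backslash to back up to, so stay where we are.
    if (lastSlash == std::string::npos)
        return trimmed;

    // "C:\" or "C:\Users" - the only backslash is the root's own,
    // so the answer is the root, including its backslash.
    if (lastSlash == 2)
        return trimmed.substr(0, 3);

    // Everything else: strip the last path element.
    return trimmed.substr(0, lastSlash);
}
```

    With this shape, selecting ".." on "C:\Users\LarryO\Pictures" yields "C:\Users\LarryO", while selecting it at the root simply leaves you at the root.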


    So the NTFS guys had to rethink their plans for the root directory.  They probably could have gotten away with having just "." in the root and no "..", but the developers decided to be safe - "If FAT doesn't put entries in the root, we won't."  And thus the root directory on NTFS partitions doesn't contain a "." or ".." entry.



    This post written with Windows Live Writer.

    Edit: Replaced symlinks with hard links.

