January, 2005

Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    Why is Control-Alt-Delete the secure attention sequence (SAS)?


    When we were designing NT 3.1, one of the issues that came up fairly early was the secure attention sequence - we needed to have a keystroke sequence that couldn't be intercepted by any application.

    So the security architect for NT (Jim Kelly) went looking for a keystroke sequence he could use.


    It turned out that the only keystroke combination that wasn't already being used by a shipping application was control-alt-del, because that was used to reboot the computer.

    And thus was born the Control-Alt-Del to log in.

    I've got to say that the first time the logon dialog went into the system, I pressed the sequence with a fair amount of trepidation - I'd been well trained that C-A-D rebooted the computer and....


  • Larry Osterman's WebLog

    Moore's Law Is Dead, Long Live Moore's law

    Herb Sutter has an insightful article that will be published in Dr. Dobb's in March, but he's been given permission to post it to the web ahead of time.  IMHO, it's an absolute must-read.

    In it, he points out that developers will no longer be able to count on the fact that CPUs are getting faster to cover their performance issues.  In the past, it was ok to have slow algorithms or bloated code in your application because CPUs got exponentially faster - if your app was sluggish on a 2GHz PIII, you didn't have to worry; the 3GHz machines would be out soon, and they'd be able to run your code just fine.

    Unfortunately, this is no longer the case - the CPU manufacturers have hit a wall, and are (for the foreseeable future) unable to make faster processors.

    What does this mean?  It means that (as Herb says) the free lunch is over. Intel (and AMD) isn't going to be able to fix your app's performance problems, you've got to fall back on solid engineering - smart and efficient design, extensive performance analysis and tuning.

    It means that using STL or other large template libraries in your code may no longer be acceptable, because they hide complexity.

    It means that you've got to understand what every line of code is doing in your application, at the assembly language level.

    It means that you need to investigate to discover if there is inherent parallelism in your application that you can exploit.  As Herb points out, CPU manufacturers are responding to the CPU performance wall by adding more CPU cores - this increases overall processor power, but if your application isn't designed to take advantage of it, it won't get any faster.

    Much as the financial world enjoyed a 20 year bull market that recently ended (ok, it ended in 1999), the software engineering world enjoyed a 20 year long holiday that is about to end. 

    The good news is that some things are still improving - memory bandwidth continues to improve, and hard disks are continuing to get larger (but not faster).  CPU manufacturers are also going to continue to add more L1 cache to their CPUs, which will help.

    Compiler writers are also getting smarter - they're building better and better optimizers, which can do some really quite clever analysis of your code to detect parallelisms that you didn't realize were there.  Extensions like OpenMP (in VS 2005) also help to improve this.

    But the bottom line is that the bubble has popped and now it's time to pay the piper (I'm REALLY mixing metaphors today).  CPUs aren't going to be getting any faster anytime soon, and we're all going to have to deal with it.

    This posting is provided "AS IS" with no warranties, and confers no rights.

  • Larry Osterman's WebLog

    Laptops and Kittens....

    I mentioned the other day that we have four cats currently.  Three of them are 18 month old kittens (ok, at 18 months, they're not kittens anymore, but we still refer to them as "the kittens").

    A while ago, one of them (Aphus, we believe) discovered that if they batted at Valorie's laptop, they could remove the keys from the laptop, and the laptop keys made great "chase" toys.  Valorie has taken to locking her laptop up in a tiny computer nook upstairs as a result, but even with that, they somehow made off with her "L" key.  We've not been able to find it even after six months of looking.  To get her computer up and running, we replaced the "L" key with the "windows" key. Fortunately she's a touch typist, and thus never looks at her keyboard - when she does, she freaks out.

    Last night, I left a build running on my laptop when I went to bed.  Valorie mentioned that it would probably be a bad idea to do this, since the kittens were on the loose.

    Since I couldn't close the laptop without shutting down the build, I hit on what I thought was a great solution.  I put the laptop in two plastic bags, one on each side of the laptop (sorry about the mess on the table :)):

    I went to bed confident that I'd outsmarted the kittens.  My laptop would remain safe.

    Well, this morning, I got up, and went downstairs (you can see Sharron's breakfast cereal on the table to the top right).  I asked the kids if there had been any problems, and Daniel, with his almost-teenager attitude said "Yeah, the kittens scattered the keys on your laptop all over the kitchen".

    I  figured he was just twitting me, until I went to check on the computer...

    Oh crud...

    There were the keys, sitting in a pile where Sharron had collected them...

    I love my cats, I really do...

    The good news is that I managed to find all the keys, although I was worried about the F8 key for a while.

  • Larry Osterman's WebLog

    What is localization anyway?

    I may be stomping on Michael Kaplan's toes with this one, but...

    I was reading the February 2005 issue of Dr. Dobbs Journal this morning and I ran into the article "Automating Localization" by Hew Wolff (you may have to subscribe to get access to the article).

    When I was reading the article, I was struck by the following comment:

     I didn't think we could, because the localization process is pretty straightforward. By "localization", I mean the same thing as "globalization" (oddly) or "internationalization." You go through the files looking for English text strings, and pull them into a big "language table," assigning each one a unique key.

    The first thing I thought was what an utterly wrong statement.  The author of the article is conflating five different concepts and calling them the same thing.  The five concepts are: localizability, translation, localization, internationalization, and globalization.

    What Hew's talking about is "localizability" - the process of making the product localizable.

    Given that caveat, he's totally right in his definition of localizability - localizability is the process of extracting all the language-dependent strings in your binary and putting them in a separate location that can be later modified by a translator.

    But he totally missed the boat on the rest of the concepts.

    The first three (localizability, translation, and localization) are about resources:

    • Localizability is about enabling translation and localization.  It's about ensuring that a translator has the ability to modify your application to work in a new country without recompiling your binary.
    • Translation is about converting the words in his big "language table" from one language to another.  Researchers love this one because they think that they can automate this process (see Google's language tools as an example of this).
    • Localization is the next step past translation.  As Yoshihiko Sakurai mentioned to Michael in a related discussion this morning, "[localization] is a step past translation, taking the certain communication code associated with a certain culture.  There are so many aspects you have to think about such as their moral values, working styles, social structures, etc... in order to get desired (or non-desired) outputs."  This is one of the big reasons that automated translation tools leave so much to be desired - humans know about the cultural issues involved in a language, computers don't.

    Internationalization is about code.  It's about ensuring that the code in your application can handle strings in a language sensitive manner.  Michael's blog is FULL of examples of internationalization.  Michael's article about Tamil numbers, or deeptanshuv's article about the four versions of "I" in Turkish are great examples of this.  Another example is respecting the date and time format of the user - even though users in the US and the UK both speak English (I know that the Brits reading this take issue with the concept of Americans speaking English, but bear with me here), they use different date formats.  Today is 26/01/2005 in Great Britain, but it's 01/26/2005 here in the US.  If your application displays dates, it should automatically adjust them.

    Globalization is about politics.  It's about ensuring that your application doesn't step on the policies of a country - So you don't ever highlight national borders in your graphics, because you might upset your customers living on one side or another of a disputed border. I do want to be clear that this isn't the traditional use of globalization, maybe a better word would be geopoliticization, but that's too many letters to type, even for me, and since globalization was almost always used as a synonym for internationalization, I figured it wouldn't mind being coopted in this manner :)

    Having said that, his article is an interesting discussion about localization and the process of localization.  I think that the process he went through was a fascinating one, with some merit.  But that one phrase REALLY stuck in my craw.

    Edit: Fixed incorrect reference to UK dates - I should have checked first :)  Oh, and it's 2005, not 2004.

    Edit2: Added Sakurai-san's name.

    Edit3: Added comment about the term "globalization"

  • Larry Osterman's WebLog

    Keeping kids safe on the internet


    Joe Wilcox over at Microsoft Monitor recently posted an article about keeping kids safe on the internet.

    It’s a good article, but I’d add one other thing to his suggestions:  If you’ve got more than one computer in your house, disable internet access to all but public computers.  And if you’ve only got one computer put it in a public location, like the kitchen.

    We’ve got six different computers in our household – each kid has their own, I’ve got two, Valorie's got one, and there’s a common computer in the kitchen.  Valorie's and my computers have internet access, as does the common computer, but none of the others are allowed to access the internet – we filter off their access at the firewall.

    The kids also have virus scanners on their computers (although their signatures get a smidge out-of-date).

    Once a month, after patch day, I manually enable internet access and go to Windows Update and ensure that they’re fully patched and their virus signatures are updated.  I know I could use SUS to roll my own update server, but it’s not that big a deal.  Similarly, I could set one of the internet-connected machines as the virus update location for the kids' computers, but again, it's not that big a deal.

    This works nicely for me, and the principles can be applied to anyone's computer, even without all the added hoopla I go through.  The first and most important part of the equation is that all internet browsing is done on a public computer, with Mom and Dad in the same room – that means the kids aren’t going to be sneaking around the darker corners of the internet.

    The other part of the equation is that all accounts on the public computer are LUA accounts, which adds an additional level of safety to browsing - nobody can accidentally install ActiveX controls or other software, which again adds a HUGE level of protection.  We have an admin account, but it's password protected and the kids don't know the password. 

    Edit: Addressed Michael Ruck's comment.


  • Larry Osterman's WebLog

    Bobs Math Question: The Official Answers


    EDIT: Please note: This is a single post explaining the answer to a question posted earlier on this blog. 

    This site is NOT intended as a general purpose site in which to get help with your math homework.

    If you're having problems with your math homework, then you should consider asking your parents for help, you're not likely to find it here, sorry about that.


    Ok, he's back :)  My last post was a math problem that the teacher in my wife's classroom gave to the students (mostly 11 and 12 year olds fwiw).

    Here's the official answer to the problem, the kids needed to show ALL the calculations (sorry for the word-junk):

    Pyramid L=W=2’ H² = 2² – 1² so H = 1.73

    V        =1/3*l*w*h

    = 1/3*2*2*1.73 = 2.31 cubic feet

    SA     = b² + 2bh

    = (2)² + 2*(2)*1.73

    = 4 + 6.92 = 10.92 square feet.



    V=B*h   SA = front + back + 3 sides

    = 2*(1/2*l*h) + 3* L*W

    Triangle #1 : L=8’, W=2’ H² = 8² – 4² so H = 6.93

    V = 1/2*8*6.93*2 = 55.44 cubic feet

    SA = 2(1/2*8*6.93) + 3*8*2 = 103.44 square feet


    Triangle #2 : L=9’, W=2’ H² = 9² – 4.5² so H = 7.79

    V = 1/2*9*7.79*2 = 70.11 cubic feet

    SA = 2(1/2*9*7.79) + 3*9*2 = 124.11 square feet


    Triangle #3 : L=10’, W=2’ H² = 10² – 5² so H = 8.66

    V = 1/2*10*8.66*2 = 86.6 cubic feet

    SA = 2(1/2*10*8.66) + 3*10*2 = 146.6 square feet


    Base of Tree: L=W=2’  H= 3’

    V = L*W*H = 2*2*3 = 12 cubic feet

    SA     = 2(L*H) + 2(W*H) + 2(L*W)

              = 2(2*3 + 2*3 + 2*2)

              = 2(6 + 6 + 4)

              = 32 square feet


    6 cones with H=1’, R=.5’, S= 1.12’

    V = 1/3*π*r²h = 1/3 * 3.14 * .5² * 1 = .26 cubic feet

    Total volume = 6*.26 = 1.56 cubic feet

    Volume before cutouts:

    Pyramid                    2.31

    Triangle #1           55.44

    Triangle #2           70.11

    Triangle #3           86.60

    Base                        12.00

    Cones                        1.56

    TOTAL                  228.02

                                 Cubic feet


     Surface Area before cutouts:

    Pyramid                   10.92

    Triangle #1           103.44

    Triangle #2           124.11

    Triangle #3           146.60

    Base                        32.00

    Cones                      15.30

    TOTAL                  432.37



    Cutout Calculations - Volume

    The volume of all the cutouts is subtracted from the total volume of the Christmas tree.


    There are 6 cylinders total.

    1 has r=1, h=2

    4 have r=1.5, h=2

    1 has r=2, h=2


    V = πr²h       SA = 2πr² + 2πrh

    V        = π*(1² + 4(1.5²) + 2²)*2

              = π*(1+9+4)*2

              = 3.14*14*2 = 87.92 cubic feet


    Small Triangular Prisms

    There are three triangular prisms.

    1 has L=B=1 and W = 2’

    H² = 1² – .5² so H = .87’

    2 have L=B=1.5 and W = 2

              H² = 1.5² – .75² so H = 1.69’


    V        = Bw where B=1/2*l*h

    V        = (1/2*1*.87*2) + 2*(1/2*1.5*1.69*2)

              = .87 + 5.07

              = 5.94 cubic feet


    Total volume to subtract:



    93.86 cubic feet


    Christmas tree volume minus cutouts:



    134.16 Cubic Feet total

    Cutout Calculations – SA

    The front and back SA’s are subtracted from the total SA of the Christmas Tree but the side SA’s are added to the total.



    Front and back SA = 2πr²

    Side SA = 2πrh

    Front and Back SA

              = 2π(1² + 4*1.5² + 2²)

              = 6.28 * (1+9+4)

              = 87.92 Square feet

    Side SA

              = 2πrh


              = 12.56 * 9 = 113.04 Square feet

    Small Triangular Prisms

    Front and Back SA

    = 2*1/2*b*h

    = b*h

    = 1*.87 + 2(1.5*1.69)

    = .87 + 5.07

    = 5.94 Square Feet


    Side SA

              = 3*b*w

              = 3*(1+1.5+1.5)*2

              = 24 square feet

    Twice the SA of top of Base

              =2(2*2)=8 Square Feet


    SA to Add:            137.04

    SA to Subtract:      101.86

    Total SA to add:      35.18


    Christmas Tree SA plus cutouts:



              467.55 Square Feet Total

    Edit: Reduced Google juice of this post by changing the title from "Bobs Math Answers" to something more accurate - this post isn't intended to be a Q&A for students who are having trouble with their math homework :)


  • Larry Osterman's WebLog

    Microsoft Anti-Spyware


    I don't normally do "Me Too" posts, and I know that this one will get a lot of coverage on the Microsoft blogs, but the Seattle PI blog just mentioned that the beta of Microsoft's new anti-spyware solution has been released to the web here.

    I installed it on my machines at work yesterday, and it seems pretty nice so far.  Of course I didn't have any spyware for it to find (because I'm pretty darned careful, and run as a limited user), but...  It'll be interesting to run it at home, especially since Valorie (and I) like playing some of the online games (like Popcap's) that get singled out as being spyware by some tools.


    I have no knowledge of their final product plans, so it's pointless to ask.  All I know about this is what I've read in the press.


  • Larry Osterman's WebLog

    What's wrong with this code, part 8 - Email Address Validation


    It's time for another "What's wrong with this code".

    Today's example is really simple, and hopefully easy.  It's a snippet of code I picked up from the net that's intended to validate an email address (useful for helping to avoid SQL injection attacks, for example).


        /// <summary>
        /// Validate an email address provided by the caller.
        /// Taken from http://www.codeproject.com/aspnet/Valid_Email_Addresses.asp
        /// </summary>
        public static bool ValidateEmailAddress(string emailAddress)
        {
            string strRegex = @"^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}" +
                              @"\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\" +
                              @".)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$";
            System.Text.RegularExpressions.Regex re = new System.Text.RegularExpressions.Regex(strRegex);
            if (re.IsMatch(emailAddress))
                return (true);
            return (false);
        }

    As always, my next post (Monday) will include the answers and kudos to all those who got it right (and who found the bugs I didn't notice).

  • Larry Osterman's WebLog

    Why does Win32 even have Fibers?

    Raymond's had an interesting series on fibers (starting here), and I thought I'd expand on it a bit.

    Fibers were added to Windows (in NT 3.51 SP3, IIRC) because some customers (not just SQL server) believed that they could improve the performance of their server applications if they had more control over their threading environment.

    But why on earth did these customers want fibers?

    Well, it's all about scalability, especially on MP systems.  On a multitasking system, it's easy to forget that a single processor can only do one thing at a time.

    The ramifications of this are actually quite profound.  If you've got two tasks currently running on your system, then your operating system will have to switch between each of them.  That switch is called a context switch, and it can be expensive (just for the sake of argument, let's say a context switch takes 5000 instructions).  In a context switch, the operating system has to (at a minimum):

    1. Enter Kernel Mode
    2. Save all the old thread's registers.
    3. Acquire the dispatch spinlock.
    4. Determine the next thread to run (if the next thread is in another process, this can get expensive)
    5. Release the dispatch spinlock.
    6. Swap the old thread's kernel state with the new thread's kernel state.
    7. Restore the new thread's registers.
    8. Leave Kernel Mode

    That's a fair amount of work to perform (not outrageous, but not trivial).

    The OS won't do this unless it has to.  In general, there are three things that will cause the OS to perform a context switch (there are others, like page faults, but these are the big ones):

    1. When your thread goes to sleep (either by calling Sleep() or calling WaitFor[Single|Multiple]Object[s])
    2. When your thread calls SwitchToThread() or Sleep(0) (this is a special case of the Sleep() API that is identical to SwitchToThread())
    3. When your thread's quantum elapses.

    A thread's quantum is essentially the amount of time that the OS will dedicate to a thread if there's another thread in the system that can also run.  A quantum is something like 5-10 ticks on a workstation and 10-15 on a server, and each tick is typically somewhere between 10 and 15 milliseconds, depending on the platform.  In general, your thread will get its full quantum unless there is a higher priority runnable thread in the system (please note: this is a grotesque simplification, but it's sufficient for the purposes of this discussion).

    The thing is, for a highly scalable application, context switches are BAD.  They represent CPU time that the application could be spending on working for the customer, but instead is spent doing what is essentially OS bookkeeping.  So a highly scalable application REALLY wants to reduce the number of context switches.  If you ever have a service that's performing poorly, one of the first things to look for is the number of context switches/second - if it's high (for some value of high), then there's invariably a scalability issue in the application that needs to be addressed.

    So why fibers?  Because for highly scalable applications, you want each of your threads to get their full quantum - in other words, you want the only reason for a context switch to be reason #3 above. 

    Remember the first cause of context switches: Calling WaitFor*Object.  What that means is that if you call EnterCriticalSection on a critical section with contention, then you're highly likely to cause a context switch. The same thing happens when you wait for an I/O to complete, etc.  You absolutely want to avoid calling any Win32 APIs that might block under the covers.

    So fibers were created to resolve this issue.  A fiber switch effectively removes steps 1, 3, 5 and 8 from the context switch steps above - switching from one fiber to another just saves the old register state and restores the new register state.  It's up to the application to determine which fiber runs next, but the application can make its own choices.  As a result, a server application could have a dozen or more "tasks" running on each thread, and it would radically reduce its context switch overhead, because saving and restoring registers is significantly faster than a full context switch.  The other thing that fibers allow is the ability to avoid the dispatcher spinlock (see John Vert's comment below about context switches being serialized across all processors).  Any global lock hurts your scalability, and fibers allow an application to avoid one of the global locks in the system.

    Ok, so why have fibers remained obscure?

    They've remained obscure first because of the reasons Raymond mentioned in his fibers caveat here - using fibers is an all-or-nothing thing, and it's not possible to use fibers from a shared library.  As Rob Earhart pointed out in this comment on Raymond's post, some of the idiosyncrasies of the fiber APIs have been resolved in the current versions of Windows.

    They're also HARD to deal with - you essentially have to write your own scheduler.

    Raymond also left off a couple of other gotchas: For example, if you're using fibers to improve your app's scalability, you can't call ANY Win32 APIs that might block (including filesystem APIs), because all the Win32 blocking APIs have thread affinity (not surprisingly :)).  So if you're running 20 fibers on a single thread, when any of the fibers blocks, your thread blocks (however, the fibers can be run from another thread, because fibers don't have thread affinity, so if you have a spare thread around, that thread can run the fibers).

    The other reason that fibers have remained obscure is more fundamental.  It has to do with Moore's law (there's a reason for the posts yesterday and the day before).

    Back when fibers were first implemented, CPUs were a lot slower.  Those 5000 instructions for the context switch (again, this is just a guess) took .05 milliseconds (assuming one cycle/instruction) to execute on a 100MHz machine (which would be a pretty fast machine in 1995).  Well, on a 2GHz machine, that .05 becomes .0025 milliseconds - more than an order of magnitude smaller.  The raw cost of a context switch has gone down dramatically.  In addition, there has been a significant amount of work in the base operating system to increase the scalability of the dispatcher spinlock - nowadays, the overhead of the dispatcher lock is essentially nonexistent on many MP machines (you start to see contention issues only on machines with large numbers of CPUs, for some value of "large").

    But there's another aspect of performance that has gone up dramatically, and that's the cost of blowing the CPU cache.

    As processors have gotten smarter, the performance of the CPU cache has become more and more critical to their speed - because main memory is painfully slow compared to the speed of the processor, if you're not getting your data from the CPU's cache, you're paying a huge hidden penalty.  And fibers don't fix this cost - when you switch from one fiber to another, you're going to blow the CPU cache.

    Nowadays, the cost of blowing the cache has leveled the playing field between OS context switches and fibers - these days, you don't get nearly the benefit from fibers that you did ten years ago.

    This isn't to say that fibers won't become useful in the future, they might.  But they're no longer as useful as they were.

    Btw, it's important to note that fibers aren't the ONLY solution to the thread quantization issue mentioned above.  I/O completion ports can also be used to limit context switches - the built-in Win32 thread pool uses them (that's also what I used in my earlier post about thread pools).  In fact, the recommendation is that instead of spending your time rewriting your app to use fibers (and it IS a rewrite), it's better to rearchitect your app to use a "minimal context" model - instead of maintaining the state of your server on the stack, maintain it in a small data structure, and have that structure drive a small one-thread-per-CPU state machine.  You'll still have the issue of unexpected blocking points (you call malloc and malloc blocks accessing the heap critical section), but that issue exists regardless of how your app's architected.

    If you're designing a scalable application, you need to architect it to minimize the number of context switches, so it's critical that you not add unnecessary context switches to your app (like queuing a request to a worker thread, then blocking on the request - which forces the OS to switch to the worker, then back to the original thread). 

    Significant Edit (1/10/2005): Fixed several issues pointed out by the base performance team.


  • Larry Osterman's WebLog



    I wasn't planning on writing about the disaster, since I figured that many people more eloquent than I had already covered it.

    And then I got an email from Will Poole, the Senior VP in charge of the Windows division.  Will was sea kayaking on Phuket at the time of the tsunami.  Fortunately, he and his family were safe (they were on the "unaffected" part of the island).

    Will wrote up a photostory with his pictures of the event, and posted it on the Nikon digital photography site here (if you're running XP SP2, the content's mislabeled, so you need to allow the content to be downloaded).  You'll need WM10 or Photostory3 to see it, since it's encoded with the Photostory codec (which dramatically reduces the size of the WMV file).  Will ends with an ad for Photostory3 which, IMHO, was unfortunate (since it detracts from his message), but...

    Anyway, the video's absolutely worth watching.

    And if you can somehow find the money, please, please give to one of the many charities helping out.  While the news reports currently indicate that the charities have more cash than they know what to do with, the reality is that too much infrastructure's been lost for them to begin spending the money - the need is still there.


  • Larry Osterman's WebLog

    18 years ago, Today.

    Editor's Note: You knew this was coming, Dear :)

    Eighteen years ago today, on January 17th, 1987, at the Scarsdale Synagogue - Tremont Temple, Valorie Lynne Holden and Lawrence William Osterman were married.  I still have a laminated copy of our wedding announcement from the local paper in Albany, NY:


      Plans for a Jan. 17 wedding are being made by Miss Valerie Lynn Holden and Lawrence William Osterman.  The bride-to-be is the daughter of Patricia Holden of Annapolis, Md and her fiancé is the son of Melvin H. Osterman, Darnley Greene, Delmar, and Elaine P. Osterman of Scarsdale, Westchester County.  Miss Holden is a student at George Mellon University, Pittsburgh, Pa.  Her fiancé is a senior computer programming engineer with Microsoft Corp., Redmond, Wash.

    Never mind the spelling errors (it's Valorie not Valerie) and the factual errors in the piece (there's no George Mellon University), I still treasure that scrap of paper.

    Valorie and I started dating back in October of 1982 (we came out of the closet and announced our couplehood while working as roadies on a Clash concert).

    Valorie's stuck with me for well over 20 years now, through business trips to Europe (I left on one 36 hours after we returned from our honeymoon), and six months where I commuted between Redmond and Austin, TX - Monday-Wednesday in Austin, Thursday-Sunday in Redmond.

    She's been there through the entire lifetime of NT 3.1, with all its ups and downs; she was there for Exchange 4.0, 5.0, 5.5, and 2000, and she's still there.

    She's been there through the birth of our children Daniel and Sharron.

    She's been there through nine different cats (four currently), countless tropical fish, a dog, and two horses.

    She's nursed me through a debilitating back injury.

    She's put her own career ambitions on hold for our family, only now, twelve years later, going back to school for her teacher's certificate.  She's spent countless hours in our children's classrooms making a difference for every child in those classrooms.

    She's been there for better, for worse, for richer, for poorer, for fatter, for thinner, in sickness and in health.

    'Til death do we part.


    Happy Anniversary, Valorie.

    I love you.


  • Larry Osterman's WebLog

    419 scams 'R' us..


    I'm a bit fragged today (up too late working on a school project with Daniel) so instead of something technical, I thought I'd share an email I just received...

    FROM: Sgt. Mark Ed
    Important Message
    To President / Managing Director..

    Good day,

    My name is Mark Ed, I am an American soldier, I am serving in the military of the 1st Armoured Division in Iraq, As you know we are being attacked by insurgents everyday and car bombs.We managed to move funds belonging to Saddam Hussien's family.

    We want to move this money to you, so that you may invest it for us and keep our share for banking.We will take 50%, my partner and I. You take the other 50%. no strings attached, just help us moveit out of Iraq, Iraq is a warzone. We plan on using diplomatic courier and shipping the money out in one large silver box, using diplomatic immunity.

    If you are interested I will send you the full details, my job is to find a good partner that we can trust and that will assist us. Can I !

    trust you? When you receive this letter,kindly send me an e-mail signifying your interest including your most confidential telephone/fax numbers for quick communication also your contact details. This business is risk free. The box can be shipped out in 48hrs.


    Sgt. Mark Ed

    you can EMAIL ME AT.mark_ed_solder@<removed to protect  others>

    Man, the nerve of some people.

    Oh, and for the sake of completeness, here are the email headers (edited somewhat):

    Microsoft Mail Internet Headers Version 2.0
    Received: from mrson2427.com ([]) by
     df-imc-01.exchange.corp.microsoft.com with Microsoft SMTPSVC(6.0.3790.1289);
      Tue, 18 Jan 2005 09:58:00 -0800
    From: "Sgt. Mark Ed" <mark_ed_solder@somewhere>
    Reply-To: mark_ed_solder@somewhere
    Date: Tue, 18 Jan 2005 21:58:00 +0400
    Subject: FROM Sgt. Mark Ed
    X-Priority: 1
    X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
    MIME-Version: 1.0
    Content-Type: text/plain; charset="us-ascii"
    Content-Transfer-Encoding: quoted-printable
    Return-Path: mark_ed_solder@somewhere
    Message-ID: <DF-IMC-015BBBVhyemY00002efb@df-imc-01.exchange.corp.microsoft.com>
    X-OriginalArrivalTime: 18 Jan 2005 17:58:00.0704 (UTC) FILETIME=[3FDC7000:01C4FD87]

    mrson2427.com isn't registered, and the IP address is actually owned by Microsoft, which implies that the real originating IP address got lost somewhere in Exchange - I'll have to follow up with the Exchange team to find out what happened to the rest of the headers.



  • Larry Osterman's WebLog

    Anatomy of a software bug, part 1 - the NT browser

    No, I don't mean that the NT browser's a software bug...

    Actually, Raymond's post this morning about the network neighborhood got me thinking about the NT browser and its design.  I've written about the NT browser before here, but never wrote up how the silly thing worked.  While reminiscing, I remembered a memorable bug I fixed back in the early 1990's that's worth writing up, because it's a great example of how strange behaviors and subtle issues can appear in peer-to-peer distributed systems (and why they're so hard to get right).

    Btw, the current design of the network neighborhood is rather different from this one - I'm describing code and architecture designed for systems 12 years ago; there have been a huge number of improvements to the system since then, and some massive architectural redesigns.  In particular, the "computer browser" service upon which all this depends is disabled in Windows XP SP2 due to attack surface reduction.  In current versions of Windows, Explorer uses a different mechanism to view the network neighborhood (at least on my machine at work).


    The actual original design of the NT browser came from Windows for Workgroups.  Windows for Workgroups was a peer-to-peer networking solution for Windows 3.1 (and continued to be the basis of the networking code in Windows 95).  As such, all machines in a workgroup needed to be visible to all the other machines in the workgroup.  In addition, since you might have different workgroups on your LAN, it needed to be able to enumerate all the workgroups on the LAN.

    One critical aspect of WfW is that it was designed for LAN environments - it was primarily based on NetBEUI, which was a LAN protocol designed by IBM back in the 1980's.  LAN protocols typically scale quite nicely to several hundred computers, after which they start to fall apart (due to collisions, etc).  For larger networks, you need a routable protocol like IPX or TCP, but at the time, it wasn't that big a deal (we're talking about 1991 here - way before the WWW existed).

    As I mentioned, WfW was a peer-to-peer product.  As such, everything about WfW had to be auto-configuring.  For LAN Manager, it was ok to designate a single machine in your domain to be the "domain controller" and others as "backup domain controllers", but for WfW, all that had to be automatic.

    To achieve this, the guys who designed the protocol for the WfW browser decided on a three tier design.  Most of the machines in the workgroup would be "potential browser servers".  Some of the machines in the workgroup would be declared "backup browser servers", and one machine in the workgroup was the "master browser server".

    Clients periodically (every three minutes) sent a datagram to the master browser server, and the master browser would record this in its server list.  If the master browser hadn't heard from a client for three announcement periods, it assumed that the client had been turned off and removed it from the list.  Backup browser servers would periodically (every 15 minutes) retrieve the browser list from the master browser.

    When a client wanted to browse the network, the client sent a broadcast datagram to the workgroup asking who the browser servers were on the workgroup.  One of the backup or master browser servers would respond after a random delay of up to several seconds.  The client would then ask that browser server for its list of machines, and would display that to the user.

    If none of the browser servers responded, then the client would force an "election".  When the potential browser servers received the election datagram, they each broadcast a "vote" datagram that described their "worth".  If they saw a datagram from another server that had more "worth" than they did, they silently dropped out of the election.

    A server's "worth" was based on a lot of factors - the system's uptime, the version of the software running, and its current role as a browser (backup browsers were better than potential browsers, master browsers were better than backup browsers).

    Once the master browser was elected, it nominated some number of potential browser servers to be backup browsers.

    This scheme worked pretty well - browsers tended to be stable, and the system was self healing.

    Now once we started deploying the browser in NT, we started running into problems that caused us to make some important design changes.  The biggest one related to performance.  It turns out that in a corporate environment, peer-to-peer browsing is a REALLY bad idea.  There's no way of knowing what's going on on another person's machine, and if the machine is really busy (like if it's running NT stress tests), it impacts the browsing behavior for everyone in the domain.  Since NT had the concept of domains (and designated domain controllers), we modified the election algorithm to ensure that NT server machines were "more worthy" than NT workstation machines; this solved that particular problem neatly.  We also biased the election algorithm towards NT machines in general, on the theory that NT machines were more likely to be more reliable than WfW machines.

    There were a LOT of other details about the NT browser that I've forgotten, but that's a really brief overview, and it's enough to understand the bug.  Btw, I'm the person who coined the term "Bowser" (as in "bowser.sys") during a design review meeting with my boss (who described it as a dog) :)

    Btw, Anonymous Coward's comment on Raymond's blog is remarkably accurate, and states many of the design criteria and benefits of the architecture quite nicely.  I don't know who AC is (my first guess didn't pan out), but I suspect that person has worked with this particular piece of code :)


  • Larry Osterman's WebLog

    What's the big deal with the Moore's law post?

    In yesterday's article, Jeff made the following comment:

    I don't quite get the argument. If my applications can't run on current hardware, I'm dead in the water. I can't wait for the next CPU.

    The thing is that that's the way people have worked for the past 20 years.  A little story goes a long way toward describing how that mentality works.

    During the NT 3.1 ship party, a bunch of us were standing around Dave Cutler, while he was expounding on something (aside: Have you ever noticed this phenomenon?  Where everybody at a party clusters around the bigwig?  Sycophancy at its finest).  The topic at hand at the time (1993) was Windows NT's memory footprint.

    When we shipped Windows NT, the minimum memory requirement for the system was 8M, the recommended was 12M, and it really shined at somewhere between 16M and 32M of memory.

    The thing was that Windows 3.1 and OS/2 2.0 were both targeted at machines with between 2M and 4M of RAM.  We were discussing why NT was so big.

    Cutler's response was something like "It doesn't matter that NT uses 16M of RAM - computer manufacturers will simply start selling more RAM, which will put pressure on the chip manufacturers to drive their RAM prices down, which will make this all moot".  And the thing is, he was right - within 18 months of NT 3.1's shipping, memory prices had dropped to the point where it was quite reasonable for machines to come out with 32M and more RAM.  Of course, the fact that we put NT on a severe diet for NT 3.5 didn't hurt (NT 3.5 was almost entirely about performance enhancements).

    It's not been uncommon for application vendors to ship applications that only ran well on cutting edge machines with the assumption that most of their target customers would be upgrading their machine within the lifetime of the application (3-6 months for games (games are special, since gaming customers tend to have bleeding edge machines since games have always pushed the envelope), 1-2 years for productivity applications, 3-5 years for server applications), and thus it wouldn't matter if their app was slow on current machines.

    It's a bad tactic, IMHO - an application should run well on both the current generation and the previous generation of computers (and so should an OS, btw).  I previously mentioned one tactic that was used (quite effectively) to ensure this - for the development of Windows 3.0, the development team was required to use 386/20's, even though most of the company was using 486s.

    But the point of Herb's article is that this tactic is no longer feasible.  From now on, CPUs won't continue to improve exponentially.  Instead, the CPUs will improve in power by getting more and more parallel (and by having more and more cache, etc).  Hyper-threading will continue to improve, and while the OS will be able to take advantage of this, applications won't unless they're modified.

    Interestingly (and quite coincidentally) enough, it's possible that this performance wall will affect *nix applications more than it will affect Windows applications (and it will especially affect *nix derivatives that don't have a preemptive kernel and fully asynchronous I/O like current versions of Linux do).  Since threading has been built into Windows from day one, most of the high concurrency application space is already multithreaded.  I'm not sure that that's the case for *nix server applications - for example, applications like the UW IMAP daemon (and other daemons that run under inetd) may have quite a bit of difficulty being ported to a multithreaded environment, since they were designed to be single threaded (other IMAP daemons (like Cyrus) don't have this limitation, btw).  Please note that platforms like Apache don't have this restriction since (as far as I know), Apache fully supports threads.

    This posting is provided "AS IS" with no warranties, and confers no rights.

  • Larry Osterman's WebLog

    Anatomy of a software bug, part 2 - the NT browser


    Yesterday, I talked about the design of the NT browser service.

    Today, I want to talk about a really subtle bug we ended up finding in the service (fixed long before we shipped NT 3.1).

    As a brief refresher from yesterday's post, the NT browser was effectively a distributed single-master database system which was designed to run completely without administration.  All the machines that participated in the browsing architecture were elected to their positions; the user wasn't involved in that process.

    The WfW browser used NetBIOS names to determine which machines had what role in the workgroup.  In general, the names followed a well established pattern of naming that was used for all the MS-NET products (since MS-NET 1.0 was introduced in 1983).  NetBIOS names are 16-byte flat names.  In the MS-NET naming scheme, the last byte of the name was used for a signature, the first <n> bytes of the name were used for the computer name, and the bytes between <n> and 15 were filled with 0x20 (space).  For example, the MS-NET server used <name>0x20 for the computer name.  MS-NET workstations used <name>0x00.

    NetBIOS names come in two flavors: Unique and Group.  Unique names are guaranteed to be associated with a single computer on the network.  Group names are shared between multiple machines on the network.  Unique names receive unicasts (directed traffic), Group names receive multicasts (broadcasts).

    For the browser, the master browser was identified because it had registered a NetBIOS name of <workgroup>0x1d.  The backup browsers and potential browsers all register the group name of <workgroup>0x1e.  When servers announce themselves, they send datagrams to <workgroup>0x1d.  There were other names used, and other functionality, but...

    Ok, that's enough background to describe the bug.

    As I mentioned yesterday, we cooked the browser election algorithm to ensure that an NT machine would always win the browser election.  Unfortunately, when we started wide deployment of NT machines on the corporate campus, this wasn't always the case.  We had tools that monitored the state of browsing in the most common domains on the network, and about once or twice a day, browsing would simply stop working on one or more of the domains.

    The maddening thing was that this behavior was totally unreproducible - all we knew was that there was a WfW machine that had held onto the master browser name, and this WfW machine was preventing the NT machine from becoming the new master browser.  The NT machine was trying, but the WfW machine kept holding onto the name.  The really annoying thing was that the WfW machine had apparently forgotten that it was a master browser (even though it was holding onto the master browser name).

    We gathered sniffs, we looked at code, we were clueless.

    Eventually, after talking to the WfW team, we discovered the WfW bug that was causing it to forget that it had had the master browser name - essentially there was a code path that would cause it to think it had won the election, and it started to become the master browser.  If, during the process of registering the NetBIOS name for the master browser, it received an election packet that would cause it to lose the election, it stopped functioning as the master browser, but it forgot to relinquish the NetBIOS name.  So the browser application on WfW didn't think that it owned the NetBIOS name, but the network transport on the WfW machine thought it owned the name.

    Ok, we'd found the bug, and it was the WfW team's bug.  Unfortunately, by this time, they'd already shipped, so they couldn't fix their code (and it wouldn't matter because there was a significant deployed base of WfW machines).  The thing is that they'd done a LOT of testing, and they'd never seen this problem.  So why was the NT browser exposing this?

    Well, we went back to the drawing boards.  We looked over the NT browser election logic.  And we looked at it again.

    And again.

    We stared at the code and we just didn't see it.

    And then one day, I printed out the code and looked at it one final time.

    And I saw what we'd missed in all those code reviews before.

    You see, there was one other aspect of the election process that I didn't mention before.  As a mechanism to ensure that elections damped down quickly (there's a phenomenon called "livelock" that occurs with distributed election systems that prevents elections from finishing), there were several timers associated with the election process.  Once an election request was received, the master browser would delay for 200ms before responding, backup browsers would delay for 400ms, and potential browsers would delay for 800ms.  This ordering ensured that master browsers would send their response first, thus ensuring that the election would finish quickly (because if there was a current master browser, it ought to continue to be the master browser).

    Well, the code in question looked something like this (we all used text editors at the time; there weren't any GUI editors available):

    When we did our code reviews, this is what we all saw (of course this isn't the real code, I just mocked it up for this article).

    If I'd looked at the entire line, I'd have seen this:

    [Figure: pseudo-browser source code showing incorrect timers for master browser elections]

    Note that the master browser case is using the backup browser timer, not the master browser timer.  It turns out that this was the ENTIRE root cause of the bug - because the master browser was delaying its election response for too long, the WfW machines thought they had won the election.  And they started to become masters, and during that process, they received the election packet from the master browser.  Which quite neatly exposed the bug in their code.  Even without the WfW bug, this bug would have been disastrous for the browsing system, because it would potentially cause the very livelock scenario the election algorithm was designed to remove.

    Needless to say, we quickly fixed this bug, and deployed it in the next NT build, and the problem was solved.

    So what are the lessons learned here?  Clearly the first is that code reviews have to be complete - if text is wrapping off the screen, it's not guaranteed to be correct.  Also, distributed systems misbehave in really subtle ways - a simple bug in timing of a single packet can cause catastrophic behaviors.

  • Larry Osterman's WebLog

    Threat Modeling, part 1

    One of the requirements for designing software at Microsoft these days is that every (and I do mean EVERY) feature in every product needs to have a threat model.

    Threat modeling isn't new, it's been a part of good design for years.  Heck, the IETF has a threat model for the entire internet :) 

    Note that I didn't say good security design - IMHO, threat modeling isn't necessarily about security design.  Threat modeling's really all about understanding your design, at a level that design documents don't necessarily cover.

    Threat modeling is almost more of a discipline than an analysis tool - if it's done right, it enables you to gain a much deeper understanding of your components and how they interact.  And by applying that methodology, you learn a great deal about your component.

    We're currently running through the threat modeling process in my group.  Since I'm the security nut on my team, I'm the one who volunteered to run the process.  And it's really been an interesting exercise..

    The big thing about writing the threat model for your feature (or component, or protocol, or whatever) is that it really forces you to understand the interactions between the various pieces of your feature.

    When threat modeling, you need to concentrate on a bunch of things.  First, there are your assets (also known as protected resources).  All threats relate to assets that need to be protected.  If you don't have assets, you don't have threats.  But sometimes the things that are your assets can be quite subtle.  For example, the audio samples being sent to a sound card might be an asset (especially if the audio samples came from DRM'ed files).  So might the privileges of the LocalSystem (or the current user) account.  The contents of a file on the disk might be an asset for a filesystem; similarly, the contents of an email message is an asset for an email system (like Exchange).  Understanding your assets is key - different assets tend to be attacked in different ways, and the attack strategies mounted against an email message probably won't work for transient data like an audio sample.

    The next thing you need to determine is the entry points to your components.  The entry points to your components are what the attacker is going to use to gain access to your component.

    Related to entry points, one other thing that you need to determine are the trust boundaries in your component.  A trust boundary is a boundary (physical or virtual) across which there is a varied level of trust.  For example, when you RPC from one machine to another (or one process to another), that forms a trust boundary.  Another example of a trust boundary is the network - almost by definition, "the wire" is a trust boundary.

    For some systems, there are no real trust boundaries - an interactive application running as the user that only outputs data may have no trust boundaries.  Similarly, an API that processes data and returns it to the caller may have no trust boundaries.

    Related to trust boundaries are trust levels - trust levels indicate how much you "trust" a portion of the system.  For instance, if a user is an administrator, they are trusted to do more than normal users.  When data flows between entities whose trust levels are different, by definition, there is a trust boundary between those two entities.

    Once you've identified your entry points, assets, trust boundaries, and trust levels, the next major part of a threat model is the data flow diagrams.  A data flow diagram indicates the flow of data into and out of the various parts of your components.

    All of this is the up-front work.  It lays the foundation for the "meat" of the threat model, which, of course, are the threats.  I'll write more about threats tomorrow.

    One of the things that I realized while writing down all the stuff above is that getting all this stuff written down provides a really cool backup for your design documents.  Much of the time, design documents don't necessarily include all the interactions of the various pieces of the system (often they do, but...).  Threat modeling forces you to write those aspects of the design down.  It also forces you to think about (and write down on paper) the design of your component in terms of the DATA manipulated by your component, instead of the CODE that makes up your component.


    Btw, Frank Swiderski and Window Snyder have an excellent book on threat modeling, it's a bit dry reading, but it's really invaluable for learning about the process.  Microsoft has also provided a threat modeling tool (written by Frank) here that can be used to help guide you through the process of developing the threat model.  Microsoft's been working hard internally at making the threat modeling process require less effort - the ultimate goal of the team is "Do the DFD, turn the crank!".

    There are also lots of fascinating resources on the web available, including Dana Epp's article here, and Krishnan Ranganathan's threat model checklist here.  In addition, Michael Howard's written a bunch about TM, here and here.

    Edit: Included some additional information and links for Michael Howard's TM articles.

    Edit2: Corrected possessive on Dana Epp's name :)

  • Larry Osterman's WebLog

    Transferring a pointer across processes

    I seem to be "Riffing on Raymond" more and more these days, I'm not sure why, but..

    Raymond Chen's post today on the type model for Win64 got me to thinking about one comment he made in particular:

    Notice that in these inter-process communication scenarios, we don't have to worry as much about the effect of a changed pointer size. Nobody in their right mind would transfer a pointer across processes: Separate address spaces mean that the pointer value is useless in any process other than the one that generated it, so why share it?

    Actually, there IS a really good reason for sharing handles across processes.  And the Win64 team realized that and built it into the product (both the base team and the RPC team).  Sometimes you want to allocate a handle in one process, but use that handle in another.  The most common case where this occurs is inheritance - when you allocate an inheritable handle in one process, then spawn a child process, that handle is created in the child process as well.  So if a Win64 process spawns a Win32 process, all the inheritable handles in the Win64 process will be duplicated into the Win32 process.

    In addition, there are sometimes reasons why you'd want to duplicate a handle from your process into another process.  This is why the DuplicateHandle API has an hTargetProcessHandle parameter.  One example of this is if you want to use a shared memory region between two processes.  One way of doing this would be to use a named shared memory region, and have the client open it.  But another is to have one process open the shared memory region, duplicate the handle to the shared memory region into the other process, then tell the other process about the new handle.

    In both of these cases (inheritable handles and DuplicateHandle), if the source process is a 64bit process and the target process is a 32bit process, then the resulting handle is appropriately sized to work in the 32bit process (the reverse also holds, of course).

    So we've established that there might be a reason to move a handle from one process to another.  And now, the RPC team's part of the solution comes into play.

    RPC (and by proxy DCOM) defines a data type called __int3264.  An __int3264 is functionally equivalent to the Win32 DWORD_PTR (and, in fact, the DWORD_PTR type is declared as an __int3264 when compiled for MIDL).

    An __int3264 value is an integer that's large enough to hold a pointer on the current platform.  For Win32, it's a 32 bit value; for Win64, it's a 64 bit value.  When you pass an __int3264 value from one process to another, it either gets truncated or extended (either signed or unsigned).

    __int3264 values are passed on the wire as 32bit quantities (for backwards compatibility reasons).

    So you can allocate a block of shared memory in one process, force dup the handle into another process, and return that new handle to the client in an RPC call.  And it all happens automagically.

    Btw, one caveat: In the current platform SDK, the HANDLE_PTR type is NOT RPC'able across pointer sizes - it's a 32bit value on 32bit platforms and a 64bit value on 64bit platforms, and it does NOT change size (like DWORD_PTR values do).  The SDK documentation on process interoperability is mostly correct, but somewhat misleading in this aspect.  It says "The 64-bit HANDLE_PTR is 64 bytes on the wire (not truncated) and thus does not need mapping" - I'm not going to discuss the "64 bytes on the wire" part, but most importantly it doesn't indicate that the 32-bit HANDLE_PTR is 32 bits on the wire.

    Edit: Removed HTML error that was disabling comments...


  • Larry Osterman's WebLog

    Another example of programming style


    Brad Abrams over on the CLR team just published an article containing the CLR team's internal coding conventions.  It makes an interesting counterpoint to my "what is programming style" series from back in November.

    I can't say I agree with everything they're doing (as I mentioned, personally I find it extremely useful to be able to visually discriminate between local variables, member variables and parameters, and their style doesn't allow for that) but it's a good example of a coding style document.


  • Larry Osterman's WebLog

    What does "Robust" mean?

    Back in the days of NT OS/2, one of the things that was absolutely drilled into the development team was robustness.  I even went so far as to write "Robustness" on my whiteboard in one-foot-high letters as a daily reminder.

    The team distributed mugs with "INDUSTRIAL STRENGTH" on them (to indicate that NT, unlike previous MS operating systems, needed to be robust enough to work in mission critical environments).

    One of the problems with this, of course, is that "robustness", like "policy" and "session", is one of those words that really has no meaning.  Or rather, it has so many meanings that it has no effective meaning.

    The problem with "robustness" is that defining what robustness is is situational - the very qualities that define the robustness of a system depend on how and where it's deployed.  Thus it is meaningless to consider robustness without first describing the scenario. 

    I first learned this lesson in college, in my software engineering class (15-413 at Carnegie-Mellon).  When the instructor (whose name is escaping me currently) was discussing the concept of "robust engineering" he gave the following examples.

    An ATM needs to be robust (ATMs were relatively new back then, so this was a 'trendy' example).  It would be VERY, VERY bad if an ATM were to drop a transaction - if you withdrew money and the machine crashed after updating your account ledger but before giving you the money, you would lose money.  Even worse, if the machine crashed after giving you the money, but before updating your account ledger, then the bank would be out the money.  So it's critical that for a given transaction, the ATM not crash.  On the other hand, it's not a big deal for an ATM to be down for days at a time - customers can simply find a new ATM.

    On the other hand, the phone network also needs to be robust (this was soon after the AT&T breakup, so once again, it was a 'trendy' example).  It's not a problem if the phone network drops a phone call, because the caller will simply reestablish the phone connection.  Btw, this doesn't hold true for some calls - for instance the robustness requirements for 911 are different from normal calls due to their critical nature.  On the other hand, it would be disastrous for the phone network to be down for more than a couple of minutes at a time.  Similarly, the land line to my house is externally powered - which means that even if the power grid goes down, I can still use the phone to call for help if I need it.

    So which is more robust?  The answer is that they BOTH are, given their operating environments.  The robustness criteria for each of these is orthogonally different - the criteria that define robustness for an ATM are meaningless for a phone network.

    I'm not even sure that there are any universal robustness principles - things like "don't crash" really are meaningless when you think about them - the reality is that EVERYTHING crashes - my VCR crashes, my stereo crashes, all electronics crash (given the right circumstances - most (but not all) of the electric appliances in my house don't work when the power to my house goes away).  Robustness in many ways is like the classic definition of pornography: "I shall not today attempt further to define the kinds of material... but I know it when I see it."

    The last time I tossed out a dilemma like this one (Measuring testers by test metrics doesn't) I got a smidge of flack for not proposing an alternative mechanism for providing objective measurements of a tester's productivity, so I don't want to leave this discussion without providing a definition for robustness that I think works...

    So, after some thought, I came up with this:

    A program is robust if, when operating in its expected use scenarios, it does not cause unexpected behaviors.  If the program DOES fail, it will not corrupt its operating data.

    I'm not totally happy with that definition, because it seems to be really wishy-washy, but I've not come up with anything better.  The caveat (when operating in its expected use scenarios) is necessary to cover the ATM and the phone network cases above - the ATM's expected use scenarios involve reliable transactions, but do not involve continual operation; the phone network's expected use scenarios are just the opposite - they involve continual operation but not reliable transactions (phone calls).  One of the other problems with it is "unexpected behaviors" - that's almost TOO broad - it covers things like UI glitches that might not properly be considered relevant from a robustness standpoint (but they might - if the application was a rendering application, then rendering issues affect robustness).

    The second sentence was added to cover the "don't force the user to run chkdsk (or fsck) on reboot" aspect - if you DO encounter a failure, your system will recover.  There are even weasel words in that clause - I'm saying it shouldn't corrupt "operating data" without defining operating data.  For example, NTFS considers its filesystem metadata to be operating data, but the user's file data isn't considered to be operating data.  Exchange, on the other hand, considers the user's email messages to be its operating data (as well as the metadata).  So the robustness criteria for Exchange (or any other email system) include the user's data, while the robustness criteria for NTFS don't.

  • Larry Osterman's WebLog

    Kitten Pictures...

    Yesterday a bunch of people asked for pictures of the kittens.  We don't have any current ones, but these are a couple I took a while ago...

    This is Spazz and Aphus in our master bathroom - Spazz figured out how to jump onto that ledge when he was about 8 months old, and he now includes it as a part of his territory.  None of the other kittens is brave (read: stupid) enough to jump up there...

    And the third of the kittens, Inque, is in this picture:

    Yes, he's hanging onto a window screen.

    Edit: Corrected Inque's name (sorry about that Inque)...  Sigh.


  • Larry Osterman's WebLog

    XP and Systems Programming


    In Raymond's blog post today, he mentioned that if you don't want GetQueuedCompletionStatus to return when a handle is set to the signaled state, you can set the bottom bit of the event handle - that suppresses the notification.

    The very first comment (from "Aaargh!") was that this was ugly.  And he's right.  Aaargh! suggested that, instead, a new "suppressCompletionNotifications" parameter be added to the functions that could use this mechanism, which would achieve the same goal.

    And that got me to thinking about XP and systems programming (XP as in eXtreme Programming, not as in Windows XP) in general.

    One of the core tenets of XP is refactoring - whenever you discover that your design isn't working, or if you discover an opportunity for code sharing, refactor the code to achieve that.

    So how does this work in practice when doing systems programming?

    I'd imagine that the dialog goes something like this:

    Program Manager: "Hmm.  We need to add a more scalable queue management solution because we're seeing lots of lock convoy issues in our OS."

    Architect: "Let me think about it...  We can do that - we'll add a new kernel synchronization structure that maintains the queue in the kernel, and add an API that returns when there's an item put onto the queue.  We'll then let that kernel queue be associated with file handles, so that when I/O completes on the file, the "wait for the item" API returns.  The really cool thing about this idea is that we just need to add a couple of new APIs, and we can hide all the work involved in the kernel so that no application needs to be modified unless it wants to use this new feature."

    Program Manager: "Sounds great!  Go for it"

    <Time Passes...  The feature gets designed and implemented>

    Tester, to Developer: "Hmm.  I was testing this new completion port mechanism you guys added.  I associated a completion port to a serial device, and I noticed that when I associated my file handle, my completion port was being signaled for every one of my calls.  That's really annoying.  I only want it to be signaled when a ReadFile or WriteFile call completes, I don't want it to be called when a call to DeviceIoControl completes, since I'm making the calls to DeviceIoControl out-of-band.  We need a mechanism to fix this."

    At this point, we have an interesting issue that shows up. Let's consider what happens when you apply XP as a solution...

    Developer, to Architect, sometime later: "Ya know, Tester's got a point.  This is clearly a case that we missed in our design, we need to fix this.  This is clearly an opportunity for refactoring, so we'll simply add a new "suppressCompletionNotifications" parameter to all the APIs that can cause I/O completions to be signaled."

    Architect: "Yup, you're right".

    Developer goes out and adds the new suppressCompletionNotifications parameter to all the APIs involved.  He changes the API signature for 15 or so different APIs, fixes the build breaks that this caused, rebuilds the system and hands it to the test team.

    Tester: "Wait a second.  None of my test applications work any more - you changed the function signature of WriteFile, and now my existing tools can't write data to the disk!"

    Ok, that was a stupid resolution, and no developer in their right mind would do that, because they know that adding a new parameter to WriteFile would break applications.  But XP says that you refactor when stuff like this happens.  Ok, so maybe you don't refactor the existing APIs.  What about adding new versions of the APIs?  Let's rewind the tape a bit and try again...

    Developer goes out and adds a new variant of each of the APIs involved with a "suppressCompletionNotifications" parameter.  In fact, he's even more clever - he adds a "flags" parameter to the API and defines "suppressCompletionNotifications" as one of the flags (thus future-proofing his change).  He adds 15 or so different APIs, and then he runs into WriteFileEx.  That's a version of WriteFile that adds a completion routine.  Crud.  Now he needs FOUR different variants of WriteFile - two that have the new flag, and two that don't.  But since refactoring is the way to go, he presses on, builds the system and hands it to the tester.

    Tester: "Hey, there are FOUR DIFFERENT APIs to write data to a file.  Talk about API bloat, how on earth am I supposed to be able to know which of the four different APIs to call?  Why can't you operating system developers just have one simple way of writing a byte to the disk?"

    Tester (muttering under his breath): "Idiots".

    Now let's rewind back to the starting point and reconsider the original problem.

    Developer, to Architect, sometime later: "Ya know, Tester's got a point.  This is clearly a case that we missed in our design, we need to fix this.  I wonder if there's some way that we could encode this desired behavior without changing any of our API signatures."

    Architect: "Hmm.  Due to the internal design of our handle manager, the low two bits of a handle are never set.  I wonder if we could somehow leverage these bits and encode the fact that you don't want the completion port to be fired in one of those bits..."

    Developer: "Hmm. That could work, let's try it."

    And that's how design decisions like this one get made - the alternative to exploiting the low bit of a handle is worse than exploiting the bit.
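    The underlying trick can be sketched portably: because handles (like pointers to aligned objects) always have their low bits clear, one of those bits can carry an out-of-band flag without widening any API signature.  The code below is my own illustration of the idea, not the actual NT handle manager; names like tag_handle and SUPPRESS_NOTIFICATION are hypothetical:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Aligned values never have their low bit set, so that bit can smuggle
     * a flag through an API without changing any function signature. */

    #define SUPPRESS_NOTIFICATION 0x1u

    /* Tag a handle value with the "suppress completion notification" flag. */
    static uintptr_t tag_handle(uintptr_t handle)
    {
        assert((handle & SUPPRESS_NOTIFICATION) == 0); /* must be aligned */
        return handle | SUPPRESS_NOTIFICATION;
    }

    /* Recover the real handle value, stripping the flag bit. */
    static uintptr_t untag_handle(uintptr_t handle)
    {
        return handle & ~(uintptr_t)SUPPRESS_NOTIFICATION;
    }

    /* Did the caller ask for notifications to be suppressed? */
    static int notifications_suppressed(uintptr_t handle)
    {
        return (handle & SUPPRESS_NOTIFICATION) != 0;
    }

    int main(void)
    {
        /* malloc returns suitably aligned storage, so the low bit is free. */
        void *event = malloc(sizeof(int));
        uintptr_t h = (uintptr_t)event;

        uintptr_t tagged = tag_handle(h);
        printf("suppressed: %d\n", notifications_suppressed(tagged));
        printf("roundtrip ok: %d\n", untag_handle(tagged) == h);

        free(event);
        return 0;
    }
    ```

    The consumer (the kernel, in the real design) strips the flag bit before using the handle, so existing callers that pass untagged handles are completely unaffected.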

    And it also points out another issue with XP: Refactoring isn't compatible with public interfaces - once you've shipped, your interfaces are immutable.  If you decide you need to refactor, you need a new interface, and you must continue to support your existing interfaces, otherwise you break clients.

    And when you're the OS, you can't afford to break clients.

    Refactoring can be good as an internal discipline, but once you've shipped, your interfaces are frozen.

  • Larry Osterman's WebLog

    Interesting commentary on Washington State's election...

    I don't normally do political commentary in this blog - it's a technical blog, not a political blog, and there are enough of those blogs out there anyway, but...

    For those of you not in Washington State, during the November election, Republican Dino Rossi was declared the victor by something like 240 votes.  This triggered a mandatory recount under state law, which reduced his lead to 42 votes.  The Democrats asked for a hand recount, and the hand recount declared Democrat Christine Gregoire Governor by a margin of 129 votes (out of 2.7 million cast).  That means that by any stretch of the imagination, this election was a tie.  The election might as well have been determined by the flip of a coin.

    The Republican party is challenging the results of the election, this morning they had a press announcement where they touted 737 votes that they claimed were illegal.

    They're using this as evidence of what they call a "fundamentally flawed" election process.  On the other hand, when I look at those numbers, I see that they're saying that about .03% of the votes (fewer than 3 in 10,000) were flawed.  To me, that is evidence of an extraordinarily well run election - it's an error rate of roughly 1 in 3,700 votes!

    I've recently been reading David Goldstein's blog at http://www.horsesass.org (named for his failed initiative to declare Tim Eyman a horse's ass).  Goldy's pretty liberal (ok, he's a flaming liberal), but he's made some excellent posts about Washington State politics recently (whether or not you agree with him).

    Today, Goldy put up a really insightful post about the very nature of elections, and especially close elections, that I thought was worth pointing out.

    Edit: Fixed link to Goldy's site, sorry about that (I should know better than to post without first checking links).


  • Larry Osterman's WebLog

    Wholey Google Juice, Batman!


    I've noticed a small, but ongoing stream of comments coming into my "Bobs Math Answers" post from the start of this year.  That post answered the math homework problem posed in my "just before the break" post here.  Most of those questions were things like "what's the answer to problem 12"...

    And then I realized what was happening.  People were googling for "Math Answers", and there was my post, at the bottom of the first page...  Apparently, just as Eric Lippert is the expert on knowing if girls like you, I've become the expert on answering math questions for kids.

    I think I need to be a bit more careful about my post topics to ensure that the google searches work more accurately :)


  • Larry Osterman's WebLog

    What's wrong with this code, part 8 - the answers.

    Yesterday's post was a snippet of code that was supposed to validate an email address provided by the caller.

    As such, it wasn't actually that bad, and for most email addresses, it does the right thing.  This example is actually better than many email address validators I've found (some don't allow forms like user@subdomain.domain.com, which is patently legal).

    But it doesn't really validate RFC 2822 email addresses.  Instead, it validates COMMON email addresses.  It turns out that it's not really possible to validate all the legal RFC 2822 email addresses with a regular expression.

    Here's the grammar for valid email addresses from RFC 2822 (ignoring the obsolete definitions).   Btw, this may not be complete - I tried to pull the relevant pieces from RFC2822, but...:

    addr-spec       =       local-part "@" domain
    local-part      =       dot-atom / quoted-string
    domain          =       dot-atom / domain-literal
    domain-literal  =       [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
    dcontent        =       dtext / quoted-pair
    dtext           =       NO-WS-CTL /     ; Non white space controls
                            %d33-90 /       ; The rest of the US-ASCII
                            %d94-126        ;  characters not including "[",
                                            ;  "]", or "\"
    atext           =       ALPHA / DIGIT / ; Any character except controls,
                            "!" / "#" /     ;  SP, and specials.
                            "$" / "%" /     ;  Used for atoms
                            "&" / "'" /
                            "*" / "+" /
                            "-" / "/" /
                            "=" / "?" /
                            "^" / "_" /
                            "`" / "{" /
                            "|" / "}" /
                            "~"
    atom            =       [CFWS] 1*atext [CFWS]
    dot-atom        =       [CFWS] dot-atom-text [CFWS]
    dot-atom-text   =       1*atext *("." 1*atext)
    text            =       %d1-9 /         ; Characters excluding CR and LF
                            %d11 /
                            %d12 /
                            %d14-127
    quoted-pair     =       ("\" text)
    NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                            %d11 /          ;  that do not include the
                            %d12 /          ;  carriage return, line feed,
                            %d14-31 /       ;  and white space characters
                            %d127

    The key thing to note in this grammar is that the local-part is almost free-form.  Characters like !, *, $, etc. are totally legal in the local part according to RFC 2822, but the validator rejects them.

    So the validation routine in question won't accept a perfectly legal email address like "Foo!Bar@MyDomain.Org".
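    To make that concrete, here's a minimal C sketch (my own code, not from the original post) of a local-part check built directly from the atext/dot-atom rules above.  Note that it happily accepts "Foo!Bar":

    ```c
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Returns nonzero if c is an RFC 2822 "atext" character -- a character
     * that's legal in an unquoted local-part, per the grammar above. */
    static int is_atext(unsigned char c)
    {
        return isalnum(c) || strchr("!#$%&'*+-/=?^_`{|}~", c) != NULL;
    }

    /* A dot-atom: one or more runs of atext separated by single dots,
     * with no leading, trailing, or doubled dots. */
    static int is_dot_atom(const char *s)
    {
        if (*s == '\0' || *s == '.')
            return 0;
        for (; *s; s++) {
            if (*s == '.') {
                if (s[1] == '\0' || s[1] == '.')
                    return 0;
            } else if (!is_atext((unsigned char)*s)) {
                return 0;
            }
        }
        return 1;
    }

    int main(void)
    {
        printf("Foo!Bar   -> %d\n", is_dot_atom("Foo!Bar"));   /* legal */
        printf("john.doe  -> %d\n", is_dot_atom("john.doe"));  /* legal */
        printf("bad..dots -> %d\n", is_dot_atom("bad..dots")); /* illegal */
        return 0;
    }
    ```

    Even this isn't a full RFC 2822 local-part check - it ignores quoted-strings and CFWS entirely - which is exactly the point: the full grammar doesn't fit comfortably in a regular expression.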

    Kudos: Uwe Keim was the first person to catch the big issue (although Denny came REALLY close), in the first two comments on the article.

    KCS also pointed out an inefficiency in the code - it uses "if (boolean) return true else return false", which is redundant.

    Non Bugs: Tad pointed out that the code doesn't check for a null emailAddress.  Anyone who's been reading my blog for long enough would realize that I don't consider issues like that to be bugs (and, in fact, I consider checking for those cases to be bugs).  The strongest change I'd make towards validating emailAddress is to put a Debug.Assert(emailAddress != null) in the code - it's a bug in the caller if they pass in a null emailAddress.

    philoserf pointed out that the code "is good enough", and he's right.  On the other hand, Aaron Robinson pointed out that people in the .info and .museum top level domains are gonna be pretty unhappy.

    Adi Oltean pointed out that V2 of the .Net framework contains the System.Net.MailAddress class which contains a built-in validator.  Very cool.

  • Larry Osterman's WebLog

    Threat Modeling, Part 2 - threats

    The first threat modeling post discussed the first part of threat modeling, which is defining your assets and understanding how the data flows through the design of your component.  Today, I want to talk about the threats themselves.

    One of the key aspects of threats is that for a given design, the threats are static.  Change the design, and you change the threats, but for a given design, the threats don't change. 

    This is actually a really profound statement, when you think about it.  For a given design, the threats against that design are unchanging.  The only thing that matters for a threat is whether or not the threat is mitigated (and the quality of the mitigation).

    Another important aspect of a threat is that a threat applies to an asset.  If there's no asset affected, then it's not a threat.  One implication of this is that if you DO decide that you have a threat, and you can't figure out what asset is being threatened, then clearly you've not identified your assets correctly, and you should go back and re-evaluate your assets - maybe the asset being protected is an intangible asset like a privilege.

    At Microsoft, each threat is classified based on the "STRIDE" classification scheme (for "S"poofing, "T"ampering, "R"epudiation, "I"nformation disclosure, "D"enial of service, and "E"levation of privilege).  The STRIDE classification describes the consequences of a particular threat (note that some versions of the Microsoft documentation use STRIDE as a methodology for threat determination; it's really better thought of as a classification mechanism).

    All threats are theoretical by their very nature.  The next part of threat analysis is to determine the attack vectors for that threat to ensure that there are no vulnerabilities associated with the threat.

    Attack vectors come in two flavors, mitigated and unmitigated.  All the unmitigated attack vectors are vulnerabilities.  When you have a vulnerability, that's when you need to get worried - in general, a threat where all the attack vectors are mitigated isn't a big deal. 

    Please note however: There may be unknown attack vectors, so you shouldn't feel safe just because you've not thought of a way of attacking the code. As Michael Howard commented in his article, that's why a good pentester (a tester skilled in gedanken experiments :)) is so valuable - they help to find vectors that have been overlooked.  In addition, mitigations may be bypassed.  Three years ago, people thought that moving their string manipulation from the stack to the heap would mitigate against buffer overruns associated with string data.  Then the hackers figured out that you could use heap overruns (and underruns) as attack vectors, which rendered the "move string manipulation to the heap" mitigation irrelevant.  So people started checking the lengths of their strings to mitigate against that threat, and the hackers started exploiting arithmetic overflows.  You need to continue to monitor the techniques used by hackers to ensure that your mitigations continue to be effective, because there are people out there who are continually trying to figure out how to bypass common mitigations.

    There's also an important corollary to this: If you can't mitigate a particular threat, then you need to either decide that the vulnerability isn't significant, or you need to change your design.  And some vulnerabilities aren't significant - it depends on the circumstances.  Here's a hypothetical: Let's say that your feature contains a "TypeFile()" API.  This is an API contained in a DLL that causes the contents of a file on the disk to be displayed on the console.  If the design of the API was that it would only work on files in the "My Documents" folder, but it contained a canonicalization flaw that allowed it to display any file, then that might not be a significant vulnerability - after all, you're not letting the user see anything they don't have access to.  On the other hand, that very same canonicalization flaw might be a critical system vulnerability if the TypeFile() API was called in the context of a web server.  It all depends on the expected use scenarios for the feature; each feature (and each vulnerability) is different.

    One really useful tool when trying to figure out the attack vectors against a particular threat is the "threat tree".  A threat tree (also known as an attack tree) allows you to measure the level of risk associated with a particular vulnerability.  Essentially you take a threat, and enumerate the attack vectors to that threat.  If the attack vector is a vulnerability (it's not directly mitigated), you enumerate all the conditions that could occur to exploit that vulnerability.  For each of those conditions, if they're mitigated, you're done; if they're not, once again, you look for the conditions that could exploit that vulnerability, and repeat.  This paper from CMU's SEI has a great example of a threat tree.  Bruce Schneier also had an article on them in the December 1999 Dr Dobb's Journal.  His article includes an example of an attack tree against a safe.

    The really cool thing that you get from a threat tree is that it gives you an ability to quantify the risk associated with a given attack vector, and thus gives you objective tools to use to determine where to concentrate your efforts in determining mitigations.  I found a GREAT slide deck here from a talk given by Dan Sellers on threat modeling, with a wonderful example of a threat tree on slide #68.
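    As a rough sketch of how a threat tree turns into numbers, loosely in the style of Schneier's cost model (this is my own simplification - real analyses also weight feasibility, required skill, and detectability): an OR node is as cheap to attack as its cheapest child, an AND node costs the sum of all its children, and a mitigated leaf costs "infinity":

    ```c
    #include <limits.h>
    #include <stdio.h>

    /* A toy attack-tree evaluator.  Each leaf carries an attacker "cost";
     * a mitigated leaf costs infinity.  OR nodes cost as little as their
     * cheapest child (the attacker picks the easiest vector); AND nodes
     * cost the sum of all children (every step must succeed). */

    #define INFINITE_COST INT_MAX
    #define MAX_CHILDREN  4

    enum node_kind { LEAF, AND_NODE, OR_NODE };

    struct node {
        enum node_kind kind;
        int cost;                               /* leaves only */
        const struct node *child[MAX_CHILDREN];
        int nchildren;
    };

    static int attack_cost(const struct node *n)
    {
        int i, total;
        switch (n->kind) {
        case LEAF:
            return n->cost;
        case OR_NODE: {
            int best = INFINITE_COST;
            for (i = 0; i < n->nchildren; i++) {
                int c = attack_cost(n->child[i]);
                if (c < best)
                    best = c;
            }
            return best;
        }
        case AND_NODE:
            total = 0;
            for (i = 0; i < n->nchildren; i++) {
                int c = attack_cost(n->child[i]);
                if (c == INFINITE_COST)
                    return INFINITE_COST; /* one mitigated step blocks the path */
                total += c;
            }
            return total;
        }
        return INFINITE_COST;
    }

    int main(void)
    {
        /* "Open the safe": pick the lock, OR (learn the combo AND get inside).
         * The combo leaf is mitigated, so that whole AND branch is blocked. */
        struct node pick   = { LEAF, 30, {0}, 0 };
        struct node combo  = { LEAF, INFINITE_COST, {0}, 0 };
        struct node entry  = { LEAF, 10, {0}, 0 };
        struct node inside = { AND_NODE, 0, { &combo, &entry }, 2 };
        struct node safe   = { OR_NODE, 0, { &pick, &inside }, 2 };

        printf("cheapest attack costs %d\n", attack_cost(&safe));
        return 0;
    }
    ```

    The cheapest remaining path is the one to concentrate mitigation effort on - which is exactly the prioritization the threat tree is meant to give you.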

    In addition to the articles I mentioned yesterday, I also ran into this rather fascinating slide deck about threat modeling.  Dana's also written about this here.
