March, 2004

Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    Did you know that OS/2 wasn't Microsoft's first non-Unix multitasking operating system?

    • 44 Comments

     Most people know Microsoft’s official timeline for its operating-system-like products:

    1.      Xenix - Microsoft’s first operating system, which was a version of UNIX that we did for microprocessors. 

    2.      MS-DOS/PC-DOS, a 16-bit operating system for the 8086 CPU.

    3.      Windows (not really an operating system, but it belongs in the timeline).

    4.      OS/2, a 16-bit operating system developed jointly with IBM.

    5.      Windows NT, a 32-bit operating system for the Intel i386 processor, the MIPS R4000, and the DEC Alpha.

    But most people don’t know about Microsoft’s other multitasking operating system, MS-DOS 4.0 (not to be confused with PC-DOS 4.0)

    MS-DOS 4.0 was actually a version of MS-DOS 2.0 that was written in parallel with MS-DOS 3.x (DOS 3.x shipped while DOS 4 was under development, which is why it skipped a version).

    DOS 4 was a preemptive real-mode multitasking operating system for the 8086 family of processors.  It had a boatload of cool features, including movable and discardable code segments and movable data segments (the Windows memory manager was a version of the DOS 4 memory manager).  It had the ability to switch screens dynamically – it would capture the foreground screen contents, save them away, and switch to a new window.

    Bottom line: DOS 4 was an amazing product.  In fact, for many years (up until Windows NT was stable), one of the DOS 4 developers continued to use DOS 4 on his desktop machine as his only operating system.

    We really wanted to turn DOS 4 into a commercial version of DOS, but...   Microsoft at the time was a 100% OEM shop – we didn’t sell operating systems to end users, we sold them to hardware vendors who bundled them with their hardware.  And in general the way the market worked in 1985 was that no computer manufacturer was interested in a version of DOS if IBM wasn’t interested.  And IBM wasn’t interested in DOS 4.  They liked the idea of multitasking, however, and they were very interested in working on that – in fact, one of their major new products was a product called “TopView”, which was a character-mode window manager much like Windows.  They wanted an operating system that had most of the capabilities of DOS 4, but that ran in protected mode on the 286 processor.  So IBM and Microsoft formed the Joint Development Program, which shared development resources between the two companies.  And the DOS 4 team went on to be the core of Microsoft’s OS/2 team.

    But what about DOS 4?  It turns out that there WERE a couple of OEMs that had bought DOS 4, and Microsoft was contractually required to provide the operating system to them.  So a skeleton crew was left behind to work on DOS and to finish it to the point where the existing DOS OEMs were satisfied with it.

     

    Edit: To fix the title which somehow got messed up.

     

  • Larry Osterman's WebLog

    So you need a worker thread pool...

    • 19 Comments

    And, for whatever reason, NT’s built-in thread pool API doesn’t work for you.

    Most people would write something like the following (error checking removed to reduce typing (and increase clarity)):

    class WorkItem
    {
        LIST_ENTRY m_listEntry;
            :
            :
    };

    class WorkerThreadPool
    {
        HANDLE m_heventThreadPool;
        CRITICAL_SECTION m_critsWorkItemQueue;
        LIST_ENTRY m_workItemQueue;

        void QueueWorkItem(WorkItem *pWorkItem)
        {
            //
            //   Insert the work item onto the work item queue.
            //
            EnterCriticalSection(&m_critsWorkItemQueue);
            InsertTailList(&m_workItemQueue, &pWorkItem->m_listEntry);
            LeaveCriticalSection(&m_critsWorkItemQueue);
            //
            //   Kick the worker thread pool.
            //
            SetEvent(m_heventThreadPool);
        }
        void WorkItemThread()
        {
            while (1)
            {
                //
                // Wait until we’ve got work to do.
                //
                WaitForSingleObject(m_heventThreadPool, INFINITE);
                //
                //  Remove the first item from the queue, if there is one.
                //
                WorkItem *workItem = NULL;
                EnterCriticalSection(&m_critsWorkItemQueue);
                if (!IsListEmpty(&m_workItemQueue))
                {
                    workItem = CONTAINING_RECORD(RemoveHeadList(&m_workItemQueue), WorkItem, m_listEntry);
                }
                LeaveCriticalSection(&m_critsWorkItemQueue);
                //
                // Process the work item if there is one.
                //
                if (workItem != NULL)
                {
                    <Process Work Item>
                }
            }
        }
    };

    I’m sure there are gobs of bugs here, but you get the idea.  Ok, what’s wrong with this code?  Well, it turns out that there’s a MASSIVE scalability problem in this logic.  The problem is the m_critsWorkItemQueue critical section.  It turns out that this code is vulnerable to a condition called “lock convoys” (also known as the “boxcar” problem).  Basically the problem occurs when more than one thread is waiting on the m_heventThreadPool event.  What happens when QueueWorkItem calls SetEvent on the thread pool event?  All the threads in the thread pool immediately wake up and block on the work queue critical section.  One of the threads will “win” and will acquire the critical section, pull the work item off the queue, and release the critical section.  All the other threads will then wake up, one will successfully acquire the critical section, and all the others will go back to sleep.  The one that did acquire it will see there’s no work to do and will go back to waiting on the thread pool event.  This will continue until all the worker threads have made it past the critical section.

    Essentially this is the same situation that you get when you have a bunch of boxcars in a trainyard.  The engine at the front of the cars starts to pull.  The first car moves a little bit, then it stops because the slack between its rear hitch and the front hitch of the second car is removed.  And then the second car moves a bit, then IT stops because the slack between its rear hitch and the front hitch of the 3rd car is removed.  And so forth – each boxcar moves a little bit and then stops.  And that’s just what happens to your threads.  You spend all your valuable CPU time executing context switches between the various threads, and none of the CPU time is spent actually processing work items.

    Now there are lots of band-aids that can be applied to this mechanism to make it smoother.  For example, the m_heventThreadPool event could be an auto-reset event, which means that only one thread would wake up for each work item.  But that’s only a temporary solution - if you get a flurry of requests queued to the work pool, you can still get multiple worker threads waking up simultaneously.
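    For reference, here’s a minimal sketch of that band-aid, reusing the member name from the example above (error checking still omitted):

    //
    // A minimal sketch of the auto-reset band-aid.  With bManualReset set to
    // FALSE, each SetEvent call releases at most one waiting thread instead of
    // waking up the entire pool.
    //
    m_heventThreadPool = CreateEvent(NULL,
                                     FALSE,    // bManualReset: auto-reset
                                     FALSE,    // bInitialState: not signaled
                                     NULL);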

    But the good news is that there’s an easier way altogether.  You can use NT’s built-in completion port logic to manage your work queues.  It turns out that NT exposes a really nifty API called PostQueuedCompletionStatus that essentially lets NT manage your worker thread queue for you!

    To use NT’s completion ports, you create the port with CreateIoCompletionPort, remove items from the completion port with GetQueuedCompletionStatus and add items (as mentioned above) with PostQueuedCompletionStatus.

    PostQueuedCompletionStatus takes three user-specified values, one of which can be used to hold a 32-bit integer (dwNumberOfBytesTransferred), and two of which can be used to hold pointers (dwCompletionKey and lpOverlapped).  The contents of these parameters can be ANY value; the API blindly passes them through to GetQueuedCompletionStatus.

    So, using NT’s completion ports, the worker thread class above becomes:

    class WorkItem
    {
            :
            :
    };

    class WorkerThreadPool
    {
        HANDLE m_hCompletionPort;

        void QueueWorkItem(WorkItem *pWorkItem)
        {
            PostQueuedCompletionStatus(m_hCompletionPort, 0, (ULONG_PTR)pWorkItem, NULL);
        }

        void WorkItemThread()
        {
            while (1)
            {
                DWORD numberOfBytes;
                ULONG_PTR completionKey;
                LPOVERLAPPED lpOverlapped;
                //
                // Wait for a work item; the kernel manages the queue for us.
                //
                GetQueuedCompletionStatus(m_hCompletionPort, &numberOfBytes, &completionKey, &lpOverlapped, INFINITE);
                WorkItem *pWorkItem = (WorkItem *)completionKey;
                //
                // Process the work item if there is one.
                //
                if (pWorkItem != NULL)
                {
                    <Process Work Item>
                }
            }
        }
    };

    Much simpler.  And as an added bonus, since NT manages the actual work queue in the kernel, it can eliminate the lock convoy we saw in the first example.

     

    [Insert std disclaimer: This posting is provided "AS IS" with no warranties, and confers no rights]

  • Larry Osterman's WebLog

    One in a million is next Tuesday

    • 9 Comments

    Back when I was a wee young lad, fresh from college, I thought I knew everything there was to know.

     

    I’ve since been disabused of that notion, rather painfully.

    One of the best lessons happened very early on, back when I was working on DOS 4.  We ran into some kind of problem (I’ll be honest and say that I don’t remember what it was). 

    I was looking into the bug with Gordon Letwin, the architect for DOS 4.  I looked at the code and commented “Maybe this is what was happening?  But if that were the case, it’d take a one in a million chance for it to happen”.

    Gordon’s response was simply: “In our business, one in a million is next Tuesday”.

    He then went on to comment that at the speeds at which modern computers operate (4.77 MHz, remember), things happened so quickly that something with a one in a million chance of occurrence is likely to happen in the next day or so.

    I’m not sure I’ve ever received better advice in my career. 

    It has absolutely stood the test of time – no matter how small the chance of something happening, with modern computers and modern operating systems, essentially every possible race condition or deadlock will be found within a reasonable period of time.

    And I’ve seen some absolute doozies in my time – race conditions on MP machines where a non-interlocked increment occurred (one variant of Michael Grier’s “i = i + 1” bug).   Data corruptions because of a single unprotected access to a data structure.  I’m continually amazed at the NT scheduler’s uncanny ability to context switch my application at just the right time to expose my data synchronization bug.  Or to show just how I can get my data structures deadlocked in hideous ways.
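    As an illustration (my own sketch, not MGrier’s actual code), the shape of that non-interlocked increment bug looks something like this:

    //
    // My own sketch of the non-interlocked increment bug.  The plain increment
    // compiles into a load, an add, and a store, so two threads on an MP
    // machine can interleave those steps and lose an update.
    //
    LONG g_counter = 0;

    void BuggyBump()
    {
        g_counter = g_counter + 1;          // Racy: load, add, store.
    }

    void SafeBump()
    {
        InterlockedIncrement(&g_counter);   // Atomic read-modify-write.
    }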

    So nowadays, whenever anyone comments on how unlikely it is for some event to occur, my answer is simply: “One in a million is next Tuesday”.

    Edit: To fix the spelling of MGrier's name.

    Edit:  My wife pointed out the following and said it belonged with this post: http://www.jumbojoke.com/000036.html

  • Larry Osterman's WebLog

    What's the difference between MS-DOS and PC-DOS?

    • 6 Comments

    In a recent comment, Louis Parks asked about MS-DOS/PC-DOS being a joint deal with IBM.

    Actually the answer’s a bit more complicated and has to do with the way the PC market worked back in the 1980s.

    People these days don’t remember what the world looked like back in the ‘80s as far as computers go.  A fancy high-end machine would cost $5,000 and came with a whopping 10M hard disk.  When the IBM PC/AT came out, it DOUBLED the hard disk space to 20M.  I distinctly remember one of the Xenix developers running down the hall holding up a shoebox sized hunk of metal crowing that it held SEVENTY MEGABYTES!  That was bigger than anything that anyone had seen before (at least for PCs).

    Back in 1984, when I started at Microsoft, Microsoft didn’t sell operating systems to end users (we did sell a product called MSX-DOS in Japan, and we sold a hardware product for the Apple II called the Microsoft Softcard that included a version of CP/M with it).  Heck, with the exception of languages, Microsoft didn’t sell ANY systems products at all.  Microsoft sold other products at retail (games, productivity tools, hardware, and languages) but not systems products.  Instead, all of Microsoft’s non-language systems products were sold to OEMs (Original Equipment Manufacturers, basically PC manufacturers), who then bundled the product with their hardware.

    And we had one very special OEM, IBM.  IBM set the direction in hardware for the PC industry, and one of their requirements was that every new IBM computer needed to have a version of PC-DOS for the computer.  So versions of PC-DOS were essentially tied to the computer.

    PC-DOS 1.0 came with the original IBM PC that supported 160KB floppies.
    PC-DOS 1.1 came with the IBM PC that supported 360KB floppies.

    PC-DOS 2.0 came with the IBM PC XT, and added support for the 10M hard drive that came with the XT.

    PC-DOS 2.1 came with the PC Jr.
    PC-DOS 3.0 came with the IBM PC AT.
    PC-DOS 3.1 added networking support and was shipped at the same time that the IBM PC Network product shipped.
    PC-DOS 3.3 came with the IBM PS/2 line of computers.

    Microsoft’s contribution to DOS was hardware agnostic – it didn’t have any utilities for supporting specific PC hardware.  The binaries we shipped included just DOS, the low-level drivers (called the BIOS), the command interpreter, and a couple of other utilities (join, print, subst, replace, etc.); IBM provided all the hardware-specific utilities (mode.com, etc.).

    The development agreement with IBM allowed Microsoft to sell its contributions to the PC-DOS product to other OEMs, and Microsoft shipped its contributions to OEMs as the MS-DOS product.  It came in a 3-ring binder and included all the code Microsoft wrote.

    This made sense even in 1984 because not every PC sold was 100% IBM compatible.  There were significant variations between PC hardware platforms. Back in those days, the Tandy 1000 was still a viable platform, even though its display subsystem was incompatible with IBM’s.

    However, as time went on, it became very clear that PCs needed to be 100% IBM compatible to survive, and given that their computers were completely compatible with IBM’s computers, the OEMs became dissatisfied with this process.  They wanted to have a shrink-wrapped product they could buy from Microsoft and just drop into their computer box on the assembly line.  Essentially they wanted a “packaged product” version of MS-DOS.  So Microsoft obliged them: for MS-DOS 3.2, we wrote a Microsoft version of all the utilities that IBM included in their PC-DOS.

    By the time that MS-DOS 3.3 came out, Microsoft and IBM were working under the Joint Development Agreement, and as a result of the JDA, Microsoft gained the right to redistribute IBM’s utilities (and I believe that IBM gained the rights to redistribute Microsoft’s).  So the packaged product version of MS-DOS 3.3 contained the real IBM utilities, which eased compatibility concerns on the part of our OEMs.

    So the simple answer: PC-DOS was the name of the version of DOS that was sold by IBM, and MS-DOS was the name of the version that was sold by OEMs.  But it wasn’t a part of the joint development operation until the DOS 3.3 timeframe (1987ish).

     

    I want to thank MarkZ for reviewing this for accuracy before I posted it; I really appreciate his review.

    Insert std disclaimer: This posting is provided "AS IS" with no warranties, and confers no rights

     

  • Larry Osterman's WebLog

    Why doesn't CTRL-C stop NET USE?

    • 6 Comments

    John Vert’s been griping about this issue to me for literally 14 years now.

    I do a NET USE * \\MYSERVER\SERVERSSHARE from the CMD.EXE prompt and the console hangs.  No amount of hitting CTRL-C will get it back until that silly application decides to give up.

    Why on earth is this?  Why can’t I just control-C and have my application stop?

    It turns out that this issue arises because a bunch of different behaviors combine to give a less-than-optimal user experience.

    The first is how CTRL-C is implemented in console applications.  When an application calls SetConsoleCtrlHandler, the console subsystem remembers the callback address.  When the user hits CTRL-C on the console, the console subsystem creates a brand new thread in the user’s application and calls into the user’s specified Ctrl-C handler.  If there are multiple processes in the console window (which happens when CMD.EXE launches a process), the console subsystem calls them in the order that they were registered, and doesn’t stop until one of the handlers returns TRUE (indicating that the handler’s dealt with the signal).

    If an app doesn’t call SetConsoleCtrlHandler, then CTRL-C is redirected to a handler that calls ExitProcess.
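    For reference, here’s a minimal sketch of what registering such a handler looks like (the handler name is mine, and error checking is omitted):

    //
    // A minimal sketch of a console Ctrl-C handler (the handler name is mine).
    // The handler runs on the new thread the console subsystem creates;
    // returning TRUE tells the subsystem the signal has been handled.
    //
    BOOL WINAPI MyCtrlHandler(DWORD dwCtrlType)
    {
        if (dwCtrlType == CTRL_C_EVENT)
        {
            // Do whatever cleanup is appropriate, then say "handled".
            return TRUE;
        }
        return FALSE;   // Let the next handler (or the default one) run.
    }

    // Somewhere during initialization:
    SetConsoleCtrlHandler(MyCtrlHandler, TRUE);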

    Now CMD.EXE has a CTRL-C handler, but NET.EXE (the external command executed for NET USE) doesn’t.  So the system calls ExitProcess on NET.EXE when you hit CTRL-C.  So far so good.

    But there’s a problem.  You see, the main thread of NET.EXE is blocked calling the WNetAddConnection2 API.  That API in turn is blocked issuing a synchronous IOCTL into the network filesystem, and the IOCTL’s blocked waiting on DNS name resolution.  And since ExitProcess guarantees that the process has cleaned up before it actually removes the process, it has to wait until that IOCTL completes.

    That seems silly – why doesn’t NT have a mechanism to cancel this outstanding I/O?  Well, it does.  That’s what the CancelIo API’s all about.  It takes a file handle and cancels all outstanding I/O’s for that handle.

    But if you look at the documentation for CancelIo carefully, it clearly says that CancelIo only cancels I/O’s that were initiated on the thread that called CancelIo.  And remember – the console subsystem created a brand new thread to execute the Ctrl-C handler.  There aren’t any I/O’s outstanding on that thread; all of the I/O’s in the application are outstanding on the main thread of the application.

    And that thread’s blocked on a synchronous IOCTL call into the network filesystem.  Which won’t complete until it’s done doing its work.  And that might take a while.
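    To make that thread-affinity issue concrete, here’s a hedged sketch (the handle name is hypothetical, and this is not what NET.EXE actually does):

    //
    // A sketch of why CancelIo can't help here (the handle name is mine).
    // This handler runs on the brand new thread the console subsystem
    // created, and CancelIo only cancels I/O issued by the calling thread -
    // so it finds nothing to cancel, and the synchronous IOCTL issued by the
    // main thread keeps right on waiting.
    //
    BOOL WINAPI NetUseCtrlHandler(DWORD dwCtrlType)
    {
        CancelIo(g_hRedirectorHandle);  // Cancels nothing: wrong thread.
        return TRUE;
    }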

    The really horrible thing is that there isn’t any good solution to the problem.  The I/O system has really good reasons for implementing I/O cancellation the way it does; they’re deeply embedded in the design of I/O completion.  The WNetAddConnection2 API is a synchronous system API (for ease of use), so it issues synchronous I/O’s to its driver.  And they can’t add a console Ctrl-C handler inside the WNetAddConnection2 API, because if they did, it would override the intentions of the application – what if the application had indicated to the system that it NEVER wanted to be terminated by CTRL-C?  If WNetAddConnection2 somehow managed to cancel its I/O when the user hit CTRL-C, then it would cause the application to malfunction.

    This problem could be handled by CMD.EXE, except CMD.EXE doesn’t get control when the user hits CTRL-C, since it’s not the foreground process (NET.EXE is).

    So you wait. And wait.  And every time John runs into me in the hall, he asks me when I’m going to fix CTRL-C.

     

     

  • Larry Osterman's WebLog

    So why does NT require such a wonking great big paging file on my machine?

    • 14 Comments

    UPDATE 5/5/2004: I posted a correction to this post here.  Sorry if there was any inconvenience.

     Raymond Chen posted a fascinating comment the other day on the dangers that paging causes for server applications.

    One of the people commenting asked one of the most common questions about NT:

    Why does the OS keep on asking for a wonking great big paging file?

    I'm not on the Memory Management team, but I've heard the answer enough times that I think I can answer it :)

    Basically the reason is that a part of the contract that NT maintains with applications guarantees that every page of virtual memory can be flushed from physical memory if necessary.

    If the page is backed by an executable (in other words it's code or static data), then the page will be reloaded from the executable file on disk. If the page isn't backed by an executable, then NT needs to have a place to put it.

    And that place is the paging file.

    You see, even if your machine has 16G of physical RAM, NT can't guarantee that you won't suddenly decide to open up 15 copies of Adobe Photoshop and start editing your digital photo album. Or that you won't all of a sudden decide to work on editing the movie collection you shot in ultra high-def. And when you do that, all of a sudden all that old data that Eudora had in RAM needs to be discarded to make room for the new applications that want to use it.

    The operating system has two choices in this case:

    1.      Prevent the new application from accessing the memory it wants. And this means that an old application that you haven’t looked at for weeks is stopping you from doing the work you want to do.

    2.      Page out the memory being used by the old application, and give the memory to the new application.

    Clearly #2's the way to go. But once again, there's a problem (why are there ALWAYS problems?)

    The first problem is that you need a place to put the old memory. Obviously it goes into the paging file, but what if there's no room in the paging file? Then you need to extend it. But what if there's not enough room on disk for the extension? That's "bad" - you have to fail Photoshop’s allocation.

    But again, there's a solution to the problem - what if you reserve the space in the paging file for Eudora's memory when Eudora allocates it? In other words, if the system guarantees that there's a place for the memory allocation in the paging file when memory's allocated, then you can always get rid of the memory. Ok, problem solved.

    So in order to guarantee that future allocations have a better chance of succeeding, NT guarantees that all non-code pages are pre-allocated in the paging file when the memory is allocated. Cool.
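    Here's a minimal sketch of what that guarantee looks like from an application's point of view (the size is arbitrary):

    //
    // A minimal sketch (the size is arbitrary): MEM_COMMIT charges the paging
    // file up front, even though no physical pages have been touched yet.  If
    // there isn't enough room left under the commit limit (RAM + paging file),
    // the allocation fails here instead of failing when the memory is used.
    //
    void *p = VirtualAlloc(NULL, 256 * 1024 * 1024, MEM_COMMIT, PAGE_READWRITE);
    if (p == NULL)
    {
        // Commit limit exhausted - fail gracefully now, not at first touch.
    }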

    But it doesn't explain why NT wants such a great big file at startup. I mean, why on earth would I want NT to use 4G of my hard disk just for a paging file if I never actually allocate a gigabyte of virtual memory?

    Well, again, there are two reasons. One has to do with the way that paging I/O works on NT, the second has to do with what happens when the system blue screens.

    Paging I/O in NT is special. All of the code AND data associated with the paging file must be in non-pageable memory (it's very very bad if you page out the pager). This includes all the metadata that's used to describe the paging file. And if your paging file is highly fragmented, then this metadata gets really big. One of the easiest ways of guaranteeing that the paging file isn't fragmented is to allocate it all in one great big honking chunk at system startup time. Which is what NT tries to do - it tries to allocate the full paging file up front, when the file is created, to help keep it contiguous. It doesn't always work, but...

    The other reason has to do with blue screens. When the system crashes, there's a bit of code that runs that tries to write out the state of RAM in a dump file that Microsoft can use to help diagnose the cause of the failure. But once again, it needs a place to hold the data. Well, if the paging file's as large as physical RAM, then it becomes a convenient place to write the data - the writes to that file aren't going to fail because your disk is full after all.

    Nowadays, NT doesn't always write the entire contents of memory out - it's controlled by a setting in the Startup and Recovery settings dialog on the Advanced tab of the System control panel applet. There are four choices: none, a small "minidump", a kernel memory dump, and a full memory dump. Only the full memory dump writes all of RAM; the others limit the amount of memory that's written out. But it still goes to the paging file.
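    For the curious, that dialog stores its choice under the CrashControl registry key. Here's a minimal sketch of reading it back (error handling elided; the value meanings in the comment are my understanding, so treat them as an assumption):

    //
    // A minimal sketch (error handling elided) of reading the crash dump
    // setting the Startup and Recovery dialog writes.  CrashDumpEnabled is
    // 0 for none, 1 for a complete dump, 2 for a kernel dump, and 3 for a
    // small (mini) dump.
    //
    HKEY hKey;
    DWORD dumpType = 0;
    DWORD cbData = sizeof(dumpType);
    RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                  "SYSTEM\\CurrentControlSet\\Control\\CrashControl",
                  0, KEY_QUERY_VALUE, &hKey);
    RegQueryValueExA(hKey, "CrashDumpEnabled", NULL, NULL,
                     (LPBYTE)&dumpType, &cbData);
    RegCloseKey(hKey);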

     

  • Larry Osterman's WebLog

    When security firms offer bad advice.

    • 7 Comments

    So I’m reading /. and I ran into the following article: http://slashdot.org/article.pl?sid=04/03/17/1942232&mode=nested&tid=126&tid=128&tid=172&tid=185&tid=190&tid=201

    In the article is a link to the “LURHQ Threat Intelligence Group”, who posted this analysis of the “Phatbot” trojan.

    I was fascinated by the capabilities of the Trojan, but thought very little of it, until I ran into the following in the alert:

    Manual Removal
    Look for the following registry keys:

     HKLM\Software\Microsoft\Windows\CurrentVersion\Run\Generic Service Process 
     HKLM\Software\Microsoft\Windows\CurrentVersion\RunServices\Generic Service Process 

    The associated binary may be srvhost.exe, svrhost.exe or a variation of the same. Kill the associated process in the Task Manager, then remove the "Generic Service Process" registry key. Remove the executable from the Windows system directory.

    Here’s the problem.  Windows has an internal component called “svchost.exe”, which is known as the “Generic Host Process for Win32 Services”.  A naive user looking to see if their system is infected with this Trojan would see the 6 or so copies of svchost.exe running on their system and assume that they were infected.

    And the next thing they’d do is to kill those processes, just like the advisory says.  Well, what are some of the services they’d be killing?

    ·         AUDIOSRV – the Windows audio service.  If this goes, bye-bye audio.

    ·         DHCP – the Dynamic Host Configuration Protocol client.  Say good-bye to your TCP/IP networking.

    ·         LanmanServer – the file and print server.  If you’ve got a networked printer on your machine, nobody’s printing on it any more.

    ·         LanmanWorkstation – the CIFS client.  If that one goes, you’re not accessing remote file and print services.

    ·         ShellHWDetection – this blows away autorun.

    ·         Spooler – you’re not printing any more.

    And there’s a lot more, those are just the highlights.

    One of the more insidious parts of this problem is that even if the user’s machine survives killing all the svchost processes, the next thing the advisory tells the user to do is to delete the file.

    But Windows has this really cool feature that’s intended to prevent you from messing up your machine called “Windows File Protection”.  In a nutshell, this feature automatically restores critical system files if they’re deleted or overwritten.  And, you guessed it – svchost.exe is a critical system file.

    So here’s the user following the advice from the security company who removes svchost.exe.  And 30 seconds later, the file’s right back where it was!

    So what is the ONLY interpretation that they could have?  Remember – they believe that this file is a Trojan horse and it’s endangering their system.  The only interpretation they could possibly have is that the Trojan has somehow REINFECTED their machine.  They try to delete the file again and again and again.  And they never get anywhere.  So they do one of two things:

    1)      They call Product Support and spend lots of money to discover that there’s no real problem, or…

    2)      They write up an email about this hideous Trojan horse called svchost.exe that’s installed on their machine that they can’t remove and asking their friends for help.

    And thus another JDBGMGR.EXE or SULFNBK.EXE hoax is born.  Only this time the component IS a critical Windows component instead of a relatively minor unused system utility.

    Sigh.

     

  • Larry Osterman's WebLog

    Things not to do when writing software

    • 2 Comments

    I figured I’d start off with an old war story.  A REALLY old story.  From back in the Windows 1.0 days.  One that could never, ever happen these days. 

    It’s about 2 months before Windows 1.0 is scheduled to ship (so it’s sometime around August/September 1985).  Microsoft had announced Windows 18 months before this point, and we were getting HAMMERED in the press about vaporware.  So the team was under a HUGE amount of pressure to ship Windows as fast as humanly possible with the maximum feature set. 

    Anyway, as I said, it’s about 2 months before ship time.  And the developer responsible for the Windows memory manager comes in on Monday morning and announces to the team that he’s just checked in a new version of the memory manager that supports swapping movable data segments to disk (up until that point, Windows had the ability to discard code segments and reload them, but no ability to reload data).

    Steve Ballmer, who was the development lead for the project at that point, had only one comment: “Ok, I want to fire the SOB.  I REALLY want to fire the SOB, but we need this feature”.

     

    Things not to do when writing software, part 2 (I thought about putting this in a separate post but I figured I’d include this one in it since it shows the other side of the coin).

    Also a Windows 1.0 story.  This one’s about the printing subsystem.  The developer for the printing subsystem, apparently inspired by the example above, decided to rewrite it 2 weeks before Windows was scheduled to ship.

    Now most of the time, the way that the developers tested printing in Windows was to load up a file in Notepad, print it on the Epson MX-80 dot matrix printer in their office, and if it worked, they checked in the change.  Now Windows 1.0 supported a lot of font features that Notepad didn’t.  Things like variable width fonts.  And boldfaced and italic text.  And underlining.  None of which were usually tested.

    Well, given the previous track record with printing, Linne’ Puller, the test lead for Windows printing, absolutely freaked out when she heard that he had made the change, and begged for the opportunity to run a test pass before he checked it in.  The developer involved agreed, but he didn’t see what the point was, since he had already thoroughly tested his change.

    Linne and Valorie Holden (the other Windows printing tester) worked through the night running tests and come the next morning, they deposited a stack of papers over a foot high onto the developer’s desk.

    “So those are the results of the test run? Man, that’s a lot of paper; could you whittle it down to just the failures?”

    “No, those ARE just the failures”.

    “Oh”.

    Needless to say, the feature didn’t get checked in.  Once again, testers saved development’s hindquarters.

    And now for the full disclaimers before I get flamed and quoted on /. as proof positive that Microsoft can’t code its way out of a paper bag: this is an old war story.  It’s almost 20 years old at this point.  The Windows team back in those days was known as a bunch of cowboys (that’s actually why Steve Ballmer ended up being in charge of the team). 

    Microsoft has gotten a whole lot better at software engineering in the past twenty years.  Even at that point it was really clear that both of these examples were things that people shouldn’t do, not examples of the way that changes should happen.

    These days, long before a project gets to the state that Windows was in, the project gets locked down TIGHT.  All new features are reviewed by a DCR (Design Change Request) review board, which consists of development, program management, AND test.  And all three need to sign off before coding even STARTS on the new feature.

    In addition, the test team is intimately involved in the development of new features.  At a minimum, before any significant change gets checked in, the process that Linne and Valorie went through has been formalized into what we now call “smoke tests”.  Each group at Microsoft has their own version of them (some call them buddy build, some call them private tests), but they all have some form of process that allows the test team to veto/comment on the quality of a change.  At some point I’ll write about what a “normal” day looks like from the inside...

     

    Obligatory personal note: Valorie Holden was a summer intern whose internship stretched into December – she was my college girlfriend who had come out to work at Microsoft as a tester for the summer.   We got engaged in December of 1985.  And we’ve been happily married for the past 17 years.

     

  • Larry Osterman's WebLog

    Howdy!

    • 2 Comments

    Ok, so they've finally convinced me to go live with something that resembles my own thoughts (instead of just making snippy comments about other people’s posts :))...

    So who am I and why do I think that anyone in their right mind would care about my ramblings?

    Well, I’ve been here for a really long time (see my bio for details: Larry Osterman's bio).  And I’ve got really strong opinions about lots of stuff, and I figured that other people should be able to share in my accumulated wisdom :)

     

    And yes, I do have a weird sense of humor and a totally over-exaggerated sense of my own abilities.  But heck, if you can’t be arrogant, why bother to work for Microsoft?  :)

     

    The odds are good that I’ll be posting about just about anything.  From Exchange security internals, to working with ATL or COM, to my limited embedded systems experience, to hoary old war stories about working for “a small software company near Seattle”. 

     

  • Larry Osterman's WebLog

    Non Technical: She came, She saw, She kicked hiney!

    • 2 Comments

    So Sharron competed in her first dressage competition over the weekend.  And she utterly and absolutely rocked!

    Sharron competed in two tests, the intro level A and B tests, she got 2nd place in the level A test, and 1st place in the level B test (a much harder test).  Her first score was a 54% and her second was 67%, which is a HUGE improvement across the board.

    A quick aside about dressage scores:  As my wife put it this morning: “Dressage is scored on a scale of 1-10, where nobody gets above an 8 (except MAYBE in Olympic competition)”.  Intro level riders are supposed to be held to the exact same standards as the most advanced riders (that’s not always the case for schooling shows, though).  So these results are just absolutely wonderful.

    Sharron riding on Oliver

    Anyway, I’m the proud daddy currently :)

     

  • Larry Osterman's WebLog

    Problems...

    • 5 Comments

    Donald Rumsfeld made the concept of “unknown unknowns” popular recently, but I actually first heard the comment back in 1985ish.  One of the Microsoft program managers I worked with was discussing his concept of a “hierarchy of problems”:

    Basically his postulate was that all problems that can be solved fall into three categories:

    1)      Known problems.

    2)      Known Unknown problems.

    3)      Unknown Unknown problems.

    Known problems are the kind of problems where you know the answer.  These are problems like “How do I fix this bug”.  They’re also problems like “We need to rebuild the Exchange security infrastructure to use NT ACLs”.  The problems are pretty well scoped and the mechanisms to implement the feature are well understood – it just takes time to figure them out.

    Known Unknown problems are harder.  We know that we don’t know the answer to the problem, but we have some ideas on how hard it will be to solve them, and we have some techniques for solving them.  These are the problems like “The NT networking system needs to be faster”.  Or “We need to add diagnostics to help the user”.  These are no less important than the Known problems, but their requirements are typically vague in scope.  Another characteristic of Known Unknown problems is that sometimes their scope is much broader than expected - “Add filesystem access to Exchange”.

    And then there are the Unknown Unknowns.  These are the REALLY hard problems.  These are the ones that make you cringe.  They’re so hard that you don’t know enough to make them known unknowns.  Some famous “Unknown Unknown” problems are:  “Design a computer program to translate between English and German”.  And of course the classic: “Design a computer program that will tell if the face of the person checking in at the airport reservation counter is the face of a known terrorist.”

    And then there’s this classic:  “Write a windowed operating environment on top of MS-DOS that would allow for multiple graphical applications to run simultaneously.  Oh, and you’ve got to make it work on a 4.77MHz 8088 processor with 256K of RAM on a computer platform with a graphics adapter whose highest supported resolution is 640x200 with 2 colors (black and white).”

     

  • Larry Osterman's WebLog

    Non Technical: So What's Larry going to be doing this weekend?

    • 2 Comments

    For the past two years, my daughter Sharron’s been taking dressage lessons at a local dressage training facility, and she’s gotten pretty darned good at it.

    Back in October, Sharron really wanted to start competing in dressage, and we were told that one of the prerequisites was that Sharron had to own a pony, since dressage competitions judge both the horse and the rider.  So last fall, we purchased a pony named Oliver (known in the Osterman family as “Oliver the wonder pony”) for her.  Sharron has been training him exclusively since December (Oliver does much better with her as a trainer than with Sharron’s normal instructor).

    Well, this weekend is Sharron’s very first dressage competition; she and Oliver are going to be in the Whidbey Equestrian Center’s spring schooling show testing the USDF Dressage intro level 1 and 2 tests.

    Needless to say, we’re all a smidge nervous. J  But it’ll be exciting.

  • Larry Osterman's WebLog

    Exchange 5.5's access check logic

    • 0 Comments

    I just noticed that KC posted the first of the blog entries I wrote for the Exchange team last week; check it out!

    http://blogs.msdn.com/exchange/archive/2004/03/17/91454.aspx

     

     
