Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    Fun with names

    • 10 Comments
    The other day, someone sent an email to an internal mailing list asking about a "typo" in the eventvwr.

    It seems they noticed a number of events coming from the "bowser" event source, and they were convinced that it had to be a typo.

     

    Well, it's not :)  The name of the component is bowser, and I wrote it back in NT 3.1...

     

The bowser is actually the kernel mode portion of the Computer browser service.  It also receives broadcast mailslot messages and handles them.  When I originally described the functionality, my boss at the time (who was rather opinionated) said "What a dog!  Why don't we call it the bowser?" 

    For various technical reasons we didn't want to call the kernel component browser.sys (because it messed up the debugger to have two components with the same name), so the name bowser just stuck.

    Thus was born the name of the "misspelled" system component.  Nowadays the bowser is essentially gone (for instance, I can't find it on my XP SP2 installation), but the name lives on in eventlogs everywhere...

     

  • Larry Osterman's WebLog

    Error Code Paradigms

    • 33 Comments

    At some point when I was reading the comments on the "Exceptions as repackaged error codes" post, I had an epiphany (it's reflected in the comments to that thread but I wanted to give it more visibility).

    I'm sure it's just an indication of just how slow my mind is working these days, but I just realized that in all the "error code" vs. "exception" discussions that seem to go on interminably, there are two UNRELATED issues being discussed.

    The first is about error semantics - what information do you hand to the caller about what failed.  The second is about error propagation - how do you report the failure to the caller.

    It's critical for any discussion about error handling to keep these two issues separate, because it's really easy to commingle them.  And when you commingle them, you get confusion.

    Consider the following example classes (cribbed in part from the previous post):

    class Win32WrapperException
    {
        // Returns a handle to the open file.  If an error occurs, it throws an object derived from
        // System.Exception that describes the failure.
        HANDLE OpenException(LPCWSTR FileName)
        {
            HANDLE fileHandle;
            fileHandle = CreateFile(FileName, xxxx);
            if (fileHandle == INVALID_HANDLE_VALUE)
            {
                throw (System.Exception(String.Format("Error opening {0}: {1}", FileName, GetLastError())));
            }
            return fileHandle;
        };
        // Returns a handle to the open file.  If an error occurs, it throws the Win32 error code that describes the failure.
        HANDLE OpenError(LPCWSTR FileName)
        {
            HANDLE fileHandle;
            fileHandle = CreateFile(FileName, xxxx);
            if (fileHandle == INVALID_HANDLE_VALUE)
            {
                throw (GetLastError());
            }
            return fileHandle;
        };
    };

    class Win32WrapperError
    {
        // Returns either NULL if the file was successfully opened or an object derived from System.Exception on failure.
        System.Exception OpenException(LPCWSTR FileName, OUT HANDLE *FileHandle)
        {
            *FileHandle = CreateFile(FileName, xxxx);
            if (*FileHandle == INVALID_HANDLE_VALUE)
            {
                return new System.Exception(String.Format("Error opening {0}: {1}", FileName, GetLastError()));
            }
            else
            {
                return NULL;
            }

        };
        // Returns either NO_ERROR if the file was successfully opened or a Win32 error code describing the failure.
        DWORD OpenError(LPCWSTR FileName, OUT HANDLE *FileHandle)
        {
            *FileHandle = CreateFile(FileName, xxxx);
            if (*FileHandle == INVALID_HANDLE_VALUE)
            {
                return GetLastError();
            }
            else
            {
                return NO_ERROR;
            }
        };
    };

    I fleshed out the example from yesterday and broke it into two classes to more clearly show what I'm talking about.  I have two classes that perform the same operation.  Win32WrapperException is an example of a class that solves the "How do I report a failure to the caller" problem by throwing exceptions.  Win32WrapperError is an example that solves the "How do I report a failure to the caller" problem by returning an error code.

    Within each class are two different methods, each of which solves the "What information do I return to the caller" problem - one returns a simple numeric error code, the other returns a structure that describes the error.  I used System.Exception as the error structure, but it could have just as easily been an IErrorInfo class, or any one of a bazillion other ways of reporting errors to callers.

    But looking at these examples, it's not clear which is better.  If you believe that reporting errors by exceptions is better than reporting by error codes, is Win32WrapperException::OpenError better than Win32WrapperError::OpenException?  Why? 

    If you believe that reporting errors by error codes is better, then is Win32WrapperError::OpenError better than Win32WrapperError::OpenException?  Why?

    When you look at the problem in this light (as two unrelated problems), it allows you to look at the "exceptions vs. error codes" debate in a rather different light.  Many (most?) of the arguments that I've read in favor of exceptions as an error propagation mechanism concentrate on the additional information that the exception carries along with it.  But those arguments ignore the fact that it's totally feasible (and in fact reasonable) to define an error code based system that provides the caller with exactly the same level of information that is provided by exceptions.
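    To make that concrete, here is a minimal sketch of what such a system could look like.  This is my own illustration, not code from the post: it uses portable C stdio and errno instead of Win32, and the `ErrorInfo` struct and `OpenFile` function are hypothetical names.  The error is propagated as a scalar return code, yet the caller receives the same context an exception object would carry:

    ```cpp
    #include <cerrno>
    #include <cstdio>
    #include <cstring>
    #include <string>

    // A rich error record: carries the same context an exception object
    // would, but is propagated by return value rather than by throwing.
    struct ErrorInfo {
        int code = 0;            // 0 means success
        std::string operation;   // what we were doing when it failed
        std::string detail;      // human-readable description with context
    };

    // Error-code-style propagation with exception-quality semantics:
    // returns 0 on success, or a nonzero code on failure, and fills *info
    // with the same information an exception message would have carried.
    int OpenFile(const char* fileName, std::FILE** handle, ErrorInfo* info)
    {
        *handle = std::fopen(fileName, "rb");
        if (*handle == nullptr) {
            info->code = errno;
            info->operation = "open";
            info->detail = std::string("Error opening ") + fileName +
                           ": " + std::strerror(errno);
            return info->code;
        }
        info->code = 0;
        return 0;
    }
    ```

    The caller still writes ordinary `if (OpenFile(...) != 0)` checks, but on failure it has full diagnostic context in hand without any throw being involved.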

    These two problems are equally important when dealing with errors.  The mechanism for error propagation has critical ramifications for all aspects of engineering - choosing one form of error propagation over another can literally alter the fundamental design of a system.

    And the error semantic mechanism provides critical information for diagnosability - both for developers and for customers.  Everyone HATES seeing a message box with nothing but "Access Denied" and no additional context.

     

    And yes, before people complain, I recognize that none of the common error code returning APIs today provide the same quality of error semantics that System.Exception does as first class information - the error return information is normally hidden in a relatively unsophisticated scalar value.  I'm just saying that if you're going to enter into a discussion of error codes vs. exceptions, from a philosophical point of view, then you need to recognize that there are two separate problems being discussed, and differentiate between them. 

    In other words, are you advocating exceptions over error codes because you like how they solve the "what information do I return to the caller?" problem, or are you advocating them because you like how they solve the "how do I report errors?" problem?

    Similarly, are you denigrating exceptions because you don't like their solution to the "how do I report errors?" problem and ignoring the "what information do I return to the caller?" problem?

    Just some food for thought.

  • Larry Osterman's WebLog

    Ok, what the heck IS the windows audio service (audiosrv) anyway?

    • 12 Comments

    This morning, Dmitry asked what the heck was the audio service for anyway.

    That's actually a really good question.

    For Windows XP, the most common benefit of the audiosrv service is that if the audiosrv service didn't exist, applications that linked with winmm.dll would also get setupapi.dll in their address space.  This is a bad thing, since setupapi is relatively large, and 99% of the apps that use winmm.dll (usually to call PlaySound) don't need it until they actually start playing sounds (which is often never). 

    As a part of this, audiosrv monitors for plug and play notifications (again, so the app doesn't have to) and allows the application to respond to plug and play changes without having to burn a thread (and a window pump) just to detect when the user plugs in their USB speakers.  All that work's done in audiosrv.

    There's a bunch of other stuff, related to global audio digital signal processing that audiosrv manages, and some stuff to manage user audio preferences, but offloading the PnP functionality is the "big one".  Before Windows XP, this functionality was actually a part of csrss.exe (the windows client/server runtime subsystem), but in Windows XP it was broken out into its own service.

    For Longhorn, Audiosrv will be doing a lot more, but unfortunately, I can't talk about that :(  Sorry. 

    I really do want to be able to talk about the stuff we're doing, but unfortunately none of it's been announced yet, and since none of it's been announced yet...

    Edit: Corrected title.  Also added a little more about longhorn.

  • Larry Osterman's WebLog

    Why do people think that a server SKU works well as a general purpose operating system?

    • 70 Comments

    Sometimes the expectations of our customers mystify me.

     

    One of the senior developers at Microsoft recently complained that the audio quality on his machine (running Windows Server 2008) was poor.

    To me, it’s not surprising.  Server SKUs are tuned for high performance in server scenarios; they’re not configured for desktop scenarios.  That’s the entire POINT of having a server SKU – one of the major differences between server SKUs and client SKUs is that the client SKUs are tuned to balance the OS in favor of foreground responsiveness, while the server SKUs are tuned in favor of background responsiveness (after all, it’s a server; there’s usually nobody sitting at the console, so there’s no point in optimizing for the console).

     

    In this particular case, the documentation for the MMCSS service describes a large part of the root cause for the problem:  The MMCSS service (which is the service that provides glitch resilient services for Windows multimedia applications) is essentially disabled on server SKUs.  It’s just one of probably hundreds of other settings that are tweaked in favor of server responsiveness on server SKUs. 

     

    Apparently we’ve got a bunch of support requests coming in from customers who are running server SKUs on their desktop and are upset that audio quality is poor.  And this mystifies me.  It’s a server operating system – if you want client operating system performance, use a client operating system.

     

     

    PS: To change the MMCSS tuning options, you should follow the suggestions from the MSDN article I linked to above:

    The MMCSS settings are stored in the following registry key:

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile

    This key contains a REG_DWORD value named SystemResponsiveness that determines the percentage of CPU resources that should be guaranteed to low-priority tasks. For example, if this value is 20, then 20% of CPU resources are reserved for low-priority tasks. Note that values that are not evenly divisible by 10 are rounded up to the nearest multiple of 10. A value of 0 is also treated as 10.

    For Vista, this value is set to 20; for Server 2008, the value is set to 100 (which disables MMCSS).
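    As a quick illustration of the rounding rule described above, here is a small helper that models only the documented behavior (a hypothetical sketch, not actual MMCSS code):

    ```cpp
    // Models the documented handling of the SystemResponsiveness value:
    // values not evenly divisible by 10 are rounded up to the nearest
    // multiple of 10, and a value of 0 is treated as 10.
    // (Illustrative sketch only -- not actual MMCSS code.)
    int EffectiveSystemResponsiveness(int regValue)
    {
        if (regValue == 0)
        {
            return 10;                      // 0 is treated as 10
        }
        return ((regValue + 9) / 10) * 10;  // round up to a multiple of 10
    }
    ```

    So a registry value of 14 behaves like 20, and the Server 2008 default of 100 reserves all CPU resources for low-priority tasks, which is what disables MMCSS.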

  • Larry Osterman's WebLog

    WMDG loses one of its own.

    • 18 Comments
    So often, you don't hear about the developers who work behind the curtains here at Microsoft.  Today I'd like to talk a bit about one of them.

    One of the key developers on Windows Multimedia at Microsoft is Syon Bhattacharya.  Syon was responsible for many of the internal pieces of the multimedia work on windows, much of the core code was written by him.  If you've ever watched an AVI file, or seen a windows media player visualization, you've been running his code.

    He started at Microsoft in June of 1995, straight out of college, and worked in the multimedia group his entire career at Microsoft.  Coincidentally, he also came from Carnegie-Mellon University (I know a bunch of the other developers from CMU that came at the same time as Syon, but didn't know him until I joined this group).

    Syon was an extraordinary developer, he had an encyclopedic knowledge of the internals of the multimedia code.  When we were doing the code reviews for XP SP2, when I'd see something that I thought was a vulnerability, I'd wander over to Syon's office to ask him.  Syon not only knew the code I was looking at, but he was able to reconstruct (from his head) all of the code paths in which the potentially vulnerable routine was called.  He's the person that the multimedia team went to when they had tough problems - there didn't seem to be a problem that he couldn't solve.

    In addition to being a complete technical wizard, Syon was one of the nicest persons I've ever worked with, his unflagging good humor throughout development cycles was legend around this group.

    Two years ago, Syon was diagnosed with stomach cancer.

    He continued to work, although it was clear that the treatments were taking their toll on him - I often saw him walking down the hall looking horrible, I'm sure that the treatments were hideously uncomfortable, but he pressed on.

    Over the summer, Syon took a leave of absence to concentrate his energies on fighting the cancer that was eating away at him.

    Unfortunately, yesterday he lost that battle, he passed away at a hospice in Seattle.  His family and friends were with him at the end, and it was apparently very peaceful.  He was 30 years old.

    We will all miss him, the world is a smaller place without him.

    Edit: I'll be adding recollections to this post as they come in...

    I've asked my group (and others) to collect their memories of Syon, here's what they wrote (in no particular order):

    Ji Ma:

    I have known Syon since I was transferred to the DirectShow group almost 6 years ago.  He was a hard-working soul, a low-profile person, and easy to talk to.  He had an amazing ability to solve very tough computer problems and was always willing to go the extra mile to help others; he knew so much, and so deeply, about computers and programming that he could solve almost anything that no one else could.  I found him to be an indispensable, dependable source of technical knowledge and ideas.  Whenever I had a difficulty or lacked ideas, I would look to him for help, and he always lent me a hand.

    He was easy-going and willing to listen, and we had a very good working relationship, since much of the work he developed was tested by me; we interacted a lot and were a truly perfect match for each other, a great team.  Many times I just went to his office and we chatted about many things in life; he was always willing to listen and provide valuable comments, and he genuinely enjoyed and appreciated the conversation.  That is very rare in a working environment.  I will always remember him for that.  

    I have several fruit trees and some vegetables growing in our garden.  When it was harvest time, I brought some in to our group and passed them out to many colleagues, including him.  He always admired and was grateful for whatever he got, whether an apple or a tomato.  I could tell he truly enjoyed and appreciated the friendship that we had. 

    He talked very little about his personal life, so it was a mystery to me; I only knew that he lived around the Green Lake area. 

    Syon, may you rest in peace and we will always remember you.

    Tracy Shew:

    When I first came to Microsoft as a contractor five years ago, it was sometimes difficult and daunting to work with developers.  These were, after all, the people who had written the code for Windows.  Many of them sometimes acted as if they were aware of this fact, and of the distinction between their station and mine – a mere software tester.  The tester – developer relationship can be antagonistic at times, particularly if I had the gall to find a bug or regression in “their” code.  Sometimes, some developers had little time for my questions, and acted as if my concerns were unimportant.  This was discouraging for me, and made me question why I was working at Microsoft at times.

    Syon, more than anyone else, gave me encouragement to continue.  He was a developer, and he was brilliant, but he never – and I mean never – acted as if my concerns were unimportant.  His door was always open, and he always seemed to have the answer ready – or if not, he knew the person to go to.  And he never made me feel ignorant or inferior to him for having to answer a question.  I quickly learned that Syon was a valuable resource, a wealth of information.  But it was much more than that.  Syon taught me, through his example, that I was not a “mere tester” – that I was making an equally valuable contribution to the product.  This encouraged me to continue at Microsoft, eventually becoming a full-time employee in test.

    I had the pleasure to work closely with Syon for almost four years, being the main tester responsible for checking his code.  Syon’s skill was unquestionable; problems were very rare, and, if one was encountered, Syon was extraordinary at quickly locating the difficulty – even if it was outside his area.  I do not know the number of times he has trudged over to the lab to look at one of our machines.  “Why is it doing that?” we would ask, looking at a bizarre error message or a garbled, incomprehensible stack trace.  I sometimes felt that we took advantage of his openness and generosity – not many developers will “dirty their feet” by coming into the lab to look at a sick computer, unless you can first prove it is their code at fault – they would rather have a remote, at the very least, or have you port the bug off to the “owner” – something which is sometimes difficult to determine.  I tried to use Syon as an “avenue of last resort,” lest we overuse the resource – if we absolutely couldn’t determine the issue, and no one else knew what was happening, only then would we bring in Syon.  And, in four years of steady work, day to day, I can count on one hand the number of times we managed to stump him.  And never, not on a single occasion, did Syon refuse help because he was too busy, or because it was not his area, or for any reason at all for that matter.

    Since Syon’s illness took him away from work, there hasn’t been a week go by that this resource hasn’t been missed.  Very frequently, an issue will come up, and someone will say, “If Syon were here, we could figure this out.”  His combination of knowledge, intuition towards problems, and plain generosity in sharing what he knew is unequalled.  People often use the word “irreplaceable” when they lose a colleague, but for us there is no degree of exaggeration in applying it.

    For me, though, Syon was more than a resource.  He demonstrated to me the value that I was contributing to Microsoft, and a vision of the partnership that should exist between development and test, and between teams, where “ownership” should not be used either as a dividing line to avoid issues, nor as a way of assigning responsibility or blame.  Syon simply loved making the best code he could, and he loved solving problems, so he saw all of our contributions, whether development or test, assisting in this process.  He encouraged everyone around him to do their best, and to be excellent.  I wished I could have known him better – losing him is a tremendous blow, certainly professionally, but also personally.  Even though we had a professional rather than social relationship – you would have to call us colleagues rather than friends – I am grateful to him for many different things, and especially for the encouragement he gave.

    Eric Rudolph:

    Syon always was a team player, and he ended up being the backbone of the DirectShow product at Microsoft. After many other people had been reorganized, or had moved on, Syon stuck with DirectShow and not only supported it, but he also supported its customers and all the accompanying hassles. Not only did he do this really well, but he did it with a gracefulness and humbleness that made it seem easy. Syon knew everything about everything, he was the go-to guy when it came to something that nobody else knew. I don't know a single person at Microsoft (myself included) who wouldn't use that kind of responsibility as a bargaining chip to further their career, but not Syon. When I asked him, "why don't you try and promote yourself more?" He would say, "oohhhh, I guess I'm just lazy." But Syon was anything _but_ lazy. Maybe unmotivated for self gain, but that was one of the things that was cool about him. On a personal note, Syon wasn't easy to get too close to, but I'm proud to count myself among his friends, and he was always up for doing anything. He was my personal movie critic, if I wanted to know if a movie was good, Syon was the first person I would ask. He was an amazing guy, and the effects of who he was and what he did to help people, will ripple outwards forever. I respect him immensely, he taught me many things while I had the chance to work with him.

    Martin Puryear (Dev Manager for WMDG): 

    Syon was one of those rare selfless people that willingly took on any task without a complaint, regardless of the task.  Sometimes the most important tasks are the most tedious as well - ensuring that myriad far-flung fixes were ported back and forth between different OSes; painstakingly crawling through very old Windows source code looking for security vulnerabilities.  I'm fairly certain that I never heard him ever utter a complaint - if he did, then I'm sure it was accompanied with a smile that seemed to say "well, these things happen." 

     Syon was a sterling example of the phrase "still waters run deep."  Over the years he built up considerable expertise in the multimedia arena, but you might not know it from watching his actions.  He always made time to help others, answering even the most basic questions.  Upon asking, one quickly discovered that he understood the overall system and how your question related - and he usually knew the technical details that you needed as well.  After RobinSp himself (overall architect for quartz/ActiveMovie/DirectShow), SyonB was the one to which we repeatedly went with hard problems facing that architecture. 

     Syon didn't have the "rough edges" sometimes found in SW engineers (including the stereotypical MS developer).  If you were wrong, he would couch his words with a soft-spoken "I believe the way it works is…."  He didn't have an egotistical bone in his body - in fact it was understood among managers that we needed to make sure that he got the recognition he deserved. 

     Syon was a class act - in this day and age, the industry needs more like him.  Truly, the world needs more like him.  He will be sorely missed as a coworker and a friend. 

    Steve Rowe:

    When I think of Syon, three adjectives come to mind:  quiet, helpful, and intelligent.  Syon was always soft spoken.  I never saw him get angry or snap at anyone.  He was always calm and collected.  Unlike some people who are good at what they do, he didn’t need to prove it.  He didn’t need the limelight.  Syon was always willing to help.  I never asked him a question he didn’t know the answer to and no matter how busy I’m sure he was, he always took the time to answer my questions.  All you had to do was ask and whatever small feature or tool or tweak you needed would be added.  I recall one time I stopped by to ask him to mock up a fix for a particular issue.  We didn’t need the full implementation, just a simple version to prove it would work.  The next day I had the complete version on my desk.  Syon was extremely intelligent.  He knew the system forwards and backwards.  He rarely had to consult the code, he just knew the answer.  We’ll miss his expertise around here but more importantly, we’ll miss him as a person.  It is rare someone so kind, so willing to help, and so smart comes along.

    Tuan Le:

    Hi Larry, please post another one from me.

     

    At work, Syon is simply brilliant. Syon will take on any task, big or small, challenging or tedious, with the same level of enthusiasm (in his own quiet, pleasant demeanor), and always come through with amazing execution. It is obvious that Syon takes pride in what he works on and sets a very high bar for himself. As a person, Syon is a confident, generous, patient, gentle, and thoughtful person. Syon is simply wonderful to have as a co-worker, and a friend.

     

    Syon is someone I instinctively trust and often share thoughts on things with. My kids love him! Syon always has things or toys to entertain them whenever they stop by his office. It’s hard for us to accept the fact that Syon has moved on; we often talk about Syon as if he is still with us. Of the many things that Syon enjoys, food and speed are high on his list. We talked often about different cuisines / food blogs / car racing / driving school / traveling / etc., and we would go out and try a new restaurant whenever we got a chance. Syon enjoys trying and doing new things; he is always eager to join and share with us. We are very fortunate to have Syon in our lives, and we will miss all the good times we have with him.

    Savvy Dani:

    My first encounter with Syon’s hard-core technical skills was soon after I joined the group. There were some 20-odd high-priority non-trivial bugs that needed immediate attention on a Friday afternoon. I didn’t know the team well enough, but there were many strong voices saying ‘Give it to Syon’ and I decided to play along. I understood why when I came back on Monday and all issues were resolved. When I tried to praise him, he just shrugged it off with a gentle, self-deprecating smile. I became a Syon fan after that. Time only added good things to my list - extremely smart, dedicated, gentle, compassionate, unruffled, good sense of humor and on and on. I don’t think anybody ever found anything negative in their interactions with him unless he was too good to be real.

    But Syon was real enough when I got to know him better. What stands out for me during the two years I have worked with Syon are my 1:1s with him. I usually started my Fridays with his meeting. Since he was very quiet, we could not go beyond 15-20 minutes initially and that with me doing most of the talking. Since technical issues were a no-brainer for him, our meetings dwindled into silence soon. I told him frankly that we have got to do better, so we came up with this idea to talk about personal things and get to know each other in non-work related ways as well. Syon accepted this gamely and we went on for a year or more. There was a lot of laughing and a good number of discussions during this time. We talked about his love for car racing and taking his Audi for a spin on the safe track (?). We would catch up on the latest movies, good restaurants, his unsuccessful experiments with Indian recipes, my fluctuating aspirations to be a literary fiction writer etc. I suddenly realized this summer that our meetings had gotten longer and that he was doing most of the talking. We would go past the slotted hour and then walk down to lunch. When we exchanged hugs as he went on leave, I knew I was going to miss my friend.

    Syon is not a typical Indian name and I asked him about it once. I believe there are two stories behind his name. a) He was named after Sayanacharya, a great Indian philosopher who lived in the 14th century A.D whose commentaries apparently defined the speed of light to be pretty close to the numbers we have today. b) He was born in London close to Syon park and his parents shortened his name to Sayana and then morphed it to Syon. ‘Acharya’ literally means Master, so Syon definitely lived up to his name.

    Bertrand Lee:

    Syon was from my ECE '95 cohort at CMU, and I remember seeing his name in the CMU newsgroups when he participated in various technical discussions.

     

    However, I only got to know him a bit better when I worked with him in WMDG, and he struck me as one of the most knowledgeable engineers I have ever had the privilege to work with. As many would attest to, he was _the_ DirectShow guru, and any time I had some intractable DirectShow bug that I was making no headway into, I would consult Syon and he would very willingly come over and help me to debug the cause of the problem, which due to his deep expertise took hardly any time at all, even for the most complex problems.

     

    More importantly however, he was one of the most gentle-natured and helpful people I've ever known, and I will always remember and miss him as a great person, coworker and friend.

     Steve Ball:

    Hey Larry -
    Although I barely knew Syon, and only had a very small set of direct interactions with him around DShow, I do have a few small observations from my experience in working with and near him over these past three years. 
     
    Syon was like a Zen master.
     
    While I run around like a headless chicken most of the time, being with Syon in a meeting or even simply passing him in the hall was always like being in the presence of a great Master.  His pace, his interactions, his movements were always intentional, methodical, calming, even charming.   He set an example just in his being who and how he was: collected, positive, responsive, and ready to embrace and solve even the toughest problems. 
     
    Just being near him was calming.  His presence, sincere smile, and the peaceful look in his eyes often felt to me like a gift and provided a simple and wonderful reminder to slow down, collect myself, and be thankful for the amazing resources and opportunities we have at our fingertips everyday.
     
    His very presence was a gift, and his absence touches me deeply.
     
    With best wishes to his closest friends and family,
    -Steve
     
    A Co-Worker:

    I think you're experiencing what a lot of other people have: Syon was such a quiet, unassuming guy who didn't really like to talk about himself much that it wasn't easy to get to know him; he would probably have been embarrassed by all this attention.  But everyone who came into contact with him remembers him as the kind, helpful, thoughtful person that he was.  As news of his passing has spread, we've been amazed at the number of people who've come forward with stories about Syon.  Some didn't even know he had been so sick - it just wasn't in his nature to talk about himself.  He died the way he lived: peacefully, with his quiet, inner strength shining through.

    Alex Wetmore (friend from college):

    Syon was always really quiet.  On our last visit together we were trying to remember how we met, but I'm not really sure.  From my freshman to junior years at CMU he spent a lot of time hanging out at my dorm room (my 4th year, his last year, I moved off campus and it wasn't as easy to do so). 
     
    Recollections are hard.  He was so quiet, but with a great sense of humor.  He never wanted to be a burden on anyone.  In his freshman year he had a collapsed lung and didn't even tell anyone -- I saw him every day and never learned about it until I didn't see him for two days and his roommate found out where he was.  He loved food which I think made the stomach cancer even harder.  He and my wife Christine used to go hunting around Seattle for the best fried chicken, hamburgers, or other comfort/junk food.  He also loved really good food and knew all of the best restaurants in town (but was quiet about it...not the normal belltown foodie type).  I think he was social at heart and liked to be around people, but had a hard time opening up.
     
    He was at CMU from 91 to 95, EE/CE

    Alok Chakrabarti:

    The most I remember about Syon:
    1.  Whenever I had a stress issue to debug involving a whole bunch of threads and random locks taken by components such as DDraw, I would pull my hair out for a while, narrow it down a bit, and then get totally stuck.  The next step was to walk over to Syon's office, ask him to connect the remote debugger (mostly wdeb in those days of Win9x) and go through all those threads, finally figuring out what the problem was and what to do about it -- mostly assign the bug to someone appropriate.  It became a common thing almost every day, and I am sure he had enough work of his own to do, but he never stopped taking the time to help out.
     
    2.  His calmness and that smile: I never saw Syon get upset about anything.  He was always so calm.  And that slice of smile he had on his face -- I still remember it so vividly.
     
    3.  His typing speed: That was unbelievable!!! I still can't really type after working with PCs for about 19 years, but his typing was just out of this world.  And he could think even faster than he could type.
     
    I will always remember him as a person I wished I could be somewhat like, but knew I didn't even have a chance.  He was much younger than me, but still my hero -- and not just today; I always thought that way.  Such a brilliant but unassuming person, so helpful and nice.  Syon was truly unique.
    Wendy Liu:

    When I first joined MS, Syon came highly recommended as the go-to guy for any technical question. He was one of the gurus on DSHOW.

     Syon was not the kind of person who liked to talk much at work. He always had a cool style and you never saw him rushing in the hallway. When you chatted with him, he always spoke warmly and slowly and carried a smile; when you asked him for help, you never expected him to say no. From time to time, he brought fresh bagels and cream to put outside his office to share with us.

    Over the last four or so years, I got a lot of help from Syon. I still clearly remember once asking him to take a look at a bug which I had been working on for the whole day. As usual, he didn’t talk much; he just sat in front of the machine, his fingers moving quickly on the keyboard, his mind completely on the bug. He tried various ways to poke at it, and when more than half an hour had passed and we still didn’t have any clue, I began to apologize for the time and asked him to stop there. Still in deep thought, he kept working. Then he said “Let’s try this.” Bingo! We found the problem.

    Syon was one of the few people I have known who never showed any impatience or frustration. Even when he was telling me that his family doctor had mistaken his symptoms for years, he just sounded unhappy about it; there was no anger in his tone like the anger I felt at that moment. That was the only time I heard him complain, and it was in such a cool way.

    It is a great loss for all of us to lose such a good colleague. We will always remember him.

    Wenhong Liu

    Brent Mills:

    While Syon and I were not close buddies, I always felt comfort in speaking with him, and he always made me curious enough to ask what he’d been up to and how he was doing (before he was sick).  I have not met a more genuinely nice man…Ever! 

    After he left MS, I exchanged a few mails with him and he seemed positive as usual, but I couldn’t help thinking that bad news might be on the horizon….I don’t shed tears easily or often, but I remember thinking to myself that no person, especially not one as good as he, should have to go through something like this; the tears flowed.

    I have and will continue to miss Syon and I hope he is in a better place.     

     Ted Youmans:

    Six years (I think) and I have no anecdotes or stories. What is so surprising about this is that I liked Syon quite a bit. He was one of the nicest and most intelligent people I have had the pleasure of working with here at MS. When I actually came up against something in DShow that I couldn’t find an answer to, he was usually the only one that could answer it. I truly wish I had something to offer for your LJ or for the memory book, but I am drawing a complete blank. Maybe it’s because I don’t take enough notice of day-to-day happenings, or maybe it’s because the extraordinary was an everyday occurrence for Syon and none of it sticks out any more than any other day. What I can say is that he will be sorely missed, and this place hasn’t really been the same since he left.

    Penelope Broomer:

    Other than building checkins for Syon during Win2K (he was the point person for the multimedia team), I never got to work with him directly; I therefore consider myself to have been one of Syon’s friends rather than a colleague.  Syon came to our home two or three times.  We love our curries, and he was very polite about the home-made curry we inflicted upon him during his last visit!

     

    Like many, I have fond memories of Syon.  One that springs to mind is the time that he rescued Stephen (stestrop) from the car park at Barnes & Noble in Bellevue.  I was working in the build lab at the time and it was my turn to be on the ‘late shift’.  Stephen, facing another night in on his own, went off to Barnes & Noble to pass some time.  He must have had a lot on his mind, as it wasn’t until he was in the store browsing through the computer books that he realized that he didn’t have his car keys.  Concerned, as he thought he’d locked them in the car, he returned to the vehicle only to discover that he’d not only left them in the car but that the car was still running!  He called me in the build lab in a state of panic asking me to go and rescue him – this was as we were coming up to shipping Win2K – it was late in the evening, I was on my own, and I couldn’t leave the build lab.  Several calls later, Stephen asked me to try calling Syon in his office.  Syon was still at work and without hesitation agreed to go to Stephen’s rescue; that’s just the kind of person he was.

    Soccer Liu:

    I remember him as a soft-spoken, sharp-thinking gentleman. I worked with him on only a few occasions and had a couple of conversations with him. I really miss him.

    Robin Speed (and Eric Rudolph):

    I guess this old email from Eric sums up Syon rather nicely, work-wise.

     He was also a really nice guy – sounds bland but in this case it is true.  He never pushed himself forward – almost to a frustrating level - but always had time for everyone.  People all over knew and respected him.  Someone the word humble truly applied to.  What an unfair world.

     Robin

     _____________________________________________
    From: Eric Rudolph
    Sent: Tuesday, May 04, 1999 8:53 PM
    To: Robin Speed
    Subject: Syon B, Master Brain

     Whatever we're paying Syon, it's not enough. He always knows exactly how to fix any weird compiler, linker, or base class problems I'm having. The man's a genius.

     

  • Larry Osterman's WebLog

    Does Visual Studio make you stupid?

    • 43 Comments

    I know everyone's talking about this, but it IS a good question...

    Charles Petzold recently gave this speech to the NYC .NET users group.

    I've got to say, having seen Daniel's experiences with Visual Basic, I can certainly see where Charles is coming from.  Due partly to the ease of use of VB, and (honestly) a lack of desire to dig deeper into the subject, Daniel's really quite ignorant of how these "computer" thingies work.  He can use them just fine, but he has no understanding of what's happening.

    More importantly, he doesn't understand how to string functions/procedures together to build a coherent whole - if it can't be implemented with a button or image, it doesn't exist...

     

    Anyway, what do you think?

     

  • Larry Osterman's WebLog

    What does style look like, part 7

    • 37 Comments
    Over the course of this series on style, I've touched on a lot of different aspects; today I want to discuss aspects of C and C++ style specifically.

    One of the things about computer languages in general is that there are often a huge number of options available to programmers to perform a particular task.

    And whenever there's a choice to be made while writing programs, style enters into the picture.  As does religion - whenever the language allows for ambiguity, people tend to get pretty religious about their personal preferences.

    For a trivial example, consider the act of incrementing a variable.  C provides several different forms that can be used to increment a variable. 

    There's:

    • i++,
    • ++i,
    • i+=1, and
    • i=i+1.

    These are all semantically identical; the code generated by the compiler should be the same regardless of which you choose as a developer (this wasn't always the case, btw - the reason that i++ exists as a language construct in the first place is that the original C compiler wasn't smart enough to take advantage of the PDP-11's auto-increment addressing mode, and i++ allowed a programmer to write code that used it).

    The very first time I posted a code sample, I used my personal style, of i+=1 and got howls of agony from my readers.  They wanted to know why on EARTH I would use such a stupid construct when i++ would suffice.  Well, it's a style issue :)

    There are literally hundreds of these language specific style issues.  For instance, the syntax of an if statement (or a for statement) is:

    if (conditional) statement

    where statement could be either a single line statement or a compound statement.  This means that it's totally legal to write:

    if (i < 10)
        i = 0;

    And it's totally legal to write

    if (i < 10)
    {
        i = 0;
    }

    The statements are utterly identical from a semantic point of view.  Which of the two forms you choose is a style issue.  Now, in this case, there IS a fairly strong technical reason to choose the second form over the first - by putting the braces in always, you reduce the likelihood that a future maintainer of the code will screw up and add a second line to the statement.  It also spaces out the code (which is a good thing IMHO :) (there's that personal style coming back in again)).

    Other aspects of coding that ultimately devolve to style choices are:

    if (i == 10)

    vs

    if (10 == i)

    In this case, the second form is often used to prevent the assignment within an if statement problem - it's very easy to write:

    if (i = 10)

    which is unlikely to be what the developer intended.  Again, this is a style issue - by putting the constant on the left of the expression, you cause the compiler to generate an error when you make this programming error.  Of course, the compiler has a warning, C4706, to catch exactly this situation, so...

    Another common stylistic convention that's often found is:

    do {
        < some stuff >
    } while (false);

    This one exists to allow the programmer to avoid using the dreaded "goto" statement.  By putting "some stuff" inside the loop, it enables the use of the break statement to exit the "loop".  Personally, I find this rather unpleasant; a loop should be a control construct, not syntactic sugar to avoid language constructs.

    Speaking of goto...

    This is another language construct that people either love or hate.  In many ways, Edsger Dijkstra was totally right about goto - it is entirely possible to utterly misuse it. On the other hand, goto can be a boon for improving code clarity.  

    Consider the following code:

    HRESULT MyFunction()
    {
        HRESULT hr;

        hr = myCOMObject->Method1();
        if (hr == S_OK)
        {
            hr = myCOMObject->Method2();
            if (hr == S_OK)
            {
                hr = myCOMObject->Method3();
                if (hr == S_OK)
                {
                    hr = myCOMObject->Method4();
                }
                else
                {
                    hr = myCOMObject->Method5();
                }
            }
        }
        return hr;
    }

    In this really trivial example, it's vaguely clear what's going on, but it suffices.  One common change is to move the check for hr outside and repeatedly check it for each of the statements, something like:

        hr = myCOMObject->Method1();
        if (hr == S_OK)
        {
            hr = myCOMObject->Method2();
        }
        if (hr == S_OK)
     

    What happens when you try that alternative implementation?

    HRESULT MyFunction()
    {
        HRESULT hr;

        hr = myCOMObject->Method1();
        if (hr == S_OK)
        {
            hr = myCOMObject->Method2();
        }
        if (hr == S_OK)
        {
            hr = myCOMObject->Method3();
            if (hr == S_OK)
            {
                hr = myCOMObject->Method4();
            }
            else
            {
                hr = myCOMObject->Method5();
            }
        }
        return hr;
    }

    Hmm.  That's not as nice - some of it's been cleaned up, but the Method4/Method5 check still requires that you indent an extra level.

    Now consider what happens if you can use gotos:

    HRESULT MyFunction()
    {
        HRESULT hr;

        hr = myCOMObject->Method1();
        if (hr != S_OK)
        {
            goto Error;
        }
        hr = myCOMObject->Method2();
        if (hr != S_OK)
        {
            goto Error;
        }
        hr = myCOMObject->Method3();
        if (hr == S_OK)
        {
            hr = myCOMObject->Method4();
        }
        else
        {
            hr = myCOMObject->Method5();
        }
        if (hr != S_OK)
        {
            goto Error;
        }
    Cleanup:
        return hr;
    Error:
        goto Cleanup;
    }

    If you don't like seeing all those gotos, you can use a macro to hide them:

    #define IF_FAILED_JUMP(hr, tag) if ((hr) != S_OK) goto tag
    HRESULT MyFunction()
    {
        HRESULT hr;

        hr = myCOMObject->Method1();
        IF_FAILED_JUMP(hr, Error);

        hr = myCOMObject->Method2();
        IF_FAILED_JUMP(hr, Error);

        hr = myCOMObject->Method3();
        if (hr == S_OK)
        {
            hr = myCOMObject->Method4();
            IF_FAILED_JUMP(hr, Error);
        }
        else
        {
            hr = myCOMObject->Method5();
            IF_FAILED_JUMP(hr, Error);
        }

    Cleanup:
        return hr;
    Error:
        goto Cleanup;
    }

    Again, there are no right answers or wrong answers, just choices.

    Tomorrow, wrapping it all up.

  • Larry Osterman's WebLog

    Life in a faraday cage

    • 32 Comments

    There was an internal discussion about an unrelated topic recently, and it reminded me of an early experience in my career at Microsoft.

    When I started, my 2nd computer was a pre-production PC/AT (the first was an XT). The AT had been announced by IBM about a week before I started, so our pre-production units were allowed to be given to other MS employees (since I had to write the disk drivers for that machine, it made sense for me to own one of them).

    Before I got the machine, however, it was kept in a room that we semi-affectionately called "the fishtank" (it was the room where we kept the Salmons (the code name for the PC/AT)).

    IBM insisted that we keep all the pre-production computers we received from them in this room - why?

    Two reasons.  The first was that there was a separate lock on the door that would limit access to the room.

    The other reason was that IBM had insisted that we build a faraday cage around the room.  They were concerned that some ne'er-do-well would use the RF emissions from the computer (and monitor) to read the contents of the screen and RAM.  I was told that they had technology that would allow them to read the contents of an individual screen from across the street, and they were worried about others being able to do the same thing.

    Someone at work passed along this link to a research paper by Wim van Eck that discusses the technical details behind the technology.

     

  • Larry Osterman's WebLog

    It's the platform, Silly!

    • 69 Comments

    I’ve been mulling writing this one for a while, and I ran into the comment below the other day which inspired me to go further, so here goes.

    Back in May, Jim Gosling was interviewed by Asia Computer Weekly.  In the interview, he commented:

    One of the biggest problems in the Linux world is there is no such thing as Linux. There are like 300 different releases of Linux out there. They are all close but they are not the same. In particular, they are not close enough that if you are a software developer, you can develop one that can run on the others.

    He’s completely right, IMHO.  Just like the IBM PC’s documented architecture meant that people could create PC’s that were perfect hardware clones of IBM’s PCs (thus ensuring that the hardware was the same across PCs), Microsoft’s platform stability means that you could write for one platform and trust that it works on every machine running on that platform.

    There are huge numbers of people who’ve forgotten what the early days of the computer industry were like.  When I started working, most software was custom, or was tied to a piece of hardware.  My mother worked as the executive director for the American Association of Physicists in Medicine.  When she started working there (in the early 1980’s), most of the word processing was done on old Wang word processors.  These were dedicated machines that did one thing – they ran a custom word processing application that Wang wrote to go with the machine.  If you wanted to computerize the records of your business, you had two choices: You could buy a minicomputer and pay a programmer several thousand dollars to come up with a solution that exactly met your business needs.  Or you could buy a pre-packaged solution for that minicomputer.  That solution would also cost several thousand dollars, but it wouldn’t necessarily meet your needs.

    A large portion of the reason that these solutions were so expensive is that the hardware cost was so high.  The general purpose computers that were available cost tens or hundreds of thousands of dollars and required expensive facilities to manage.  So there weren’t many of them, which means that companies like Unilogic (makers of the Scribe word processing software, written by Brian Reid) charged hundreds of thousands of dollars for installations and tightly managed their code – you bought a license for the software that lasted only a year or so, after which you had to renew it (it was particularly ugly when Scribe’s license ran out (it happened at CMU once by accident) – the program would delete itself off the hard disk).

    PC’s started coming out in the late 1970’s, but there weren’t that many commercial software packages available for them.  One problem developers encountered was that the machines had limited resources, but beyond that, software developers had to write for a specific platform – the hardware was different for all of these machines, as was the operating system, and introducing a new platform linearly increases the amount of testing required.  If it takes two testers to test for one platform, it’ll take four testers to test two platforms, six testers to test three platforms, etc. (this isn’t totally accurate, there are economies of scale, but in general the principle applies – the more platforms you support, the more test resources you require).

    There WERE successful business solutions for the early PCs, Visicalc first came out for the Apple ][, for example.  But they were few and far between, and were limited to a single hardware platform (again, because the test and development costs of writing to multiple platforms are prohibitive).

    Then the IBM PC came out, with a documented hardware design (it wasn’t really open like “open source”, since only IBM contributed to the design process, but it was fully documented).  And with the IBM PC came a standard OS platform, MS-DOS (actually IBM offered three or four different operating systems, including CP/M and the UCSD P-system, but MS-DOS was the one that took off).  In fact, Visicalc was one of the first applications ported to MS-DOS, btw - it was ported to DOS 2.0.  But it wasn’t until 1983ish, with the introduction of Lotus 1-2-3, that the PC was seen as a business tool and people flocked to it. 

    But the platform still wasn’t completely stable.  The problem was that while MS-DOS did a great job of virtualizing the system storage (with the FAT filesystem), keyboard, and memory, it did a lousy job of providing access to the screen and printers.  The only built-in support for the screen was a simple teletype-like console output mechanism.  The only way to get color output or the ability to position text on the screen was to load a replacement console driver, ANSI.SYS.

    Obviously, most ISVs (like Lotus) weren’t willing to live with the performance limitations of console output through ANSI.SYS, so they started writing directly to the video hardware.  On the original IBM PC, that wasn’t that big a deal – there were two choices, CGA or MDA (Color Graphics Adapter and Monochrome Display Adapter).  Two choices, two code paths to test.  So the test cost was manageable for most ISVs.  Of course, the hardware world didn’t stay still.  Hercules came out with their graphics adapter for the IBM monochrome monitor.  Now we have three paths.  Then IBM came out with the EGA and VGA.  Now we have FIVE paths to test.  Most of these were compatible with the basic CGA/MDA, but not all, and they all had different ways of providing their enhancements.  Some had “unique” hardware features, like the write-only hardware registers on the EGA.

    At the same time as these display adapter improvements were coming, disks were also improving – first 5 ¼ inch floppies, then 10M hard disks, then 20M hard disks, then 30M.  And system memory increased from 16K to 32K to 64K to 256K to 640K.  Throughout all of it, the MS-DOS filesystem and memory interfaces continued to provide a consistent API to code to.  So developers continued to write to the MS-DOS filesystem APIs and grumbled about the costs of testing the various video combinations.

    But even so, vendors flocked to MS-DOS.  The combination of a consistent hardware platform and a consistent software interface to that platform was an unbelievably attractive combination.  At the time, the major competition to MS-DOS was Unix and the various DR-DOS variants, but none of them provided the same level of consistency.  If you wanted to program to Unix, you had to choose between Solaris, 4.2BSD, AIX, IRIX, or any of the other variants, each of which was a totally different platform.  Solaris’ signals behaved subtly differently from AIX’s, etc.  Even though the platforms were ostensibly the same, there were enough subtle differences that you either wrote for only one platform, or you took on the burden of running the complete test matrix on EVERY version of the platform you supported.  If you ever look at the source code to an application written for *nix, you can see this quite clearly – there are literally dozens of conditional compilation options for the various platforms.

    On MS-DOS, on the other hand, if your app worked on an IBM PC, your app worked on a Compaq.  Because of the effort put forward to ensure upwards compatibility of applications, if your application ran on DOS 2.0, it ran on DOS 3.0 (modulo some minor issues related to FCB I/O).  Because the platforms were almost identical, your app would continue to run.   This commitment to platform stability has continued to this day – Visicalc from DOS 2.0 still runs on Windows XP.

    This meant that you could target the entire ecosystem of IBM PC compatible hardware with a single test pass, which significantly reduced your costs.   You still had to deal with the video and printer issue however.

    Now along came Windows 1.0.  It virtualized the video and printing interfaces providing, for the first time, a consistent view of ALL the hardware on the computer, not just disk and memory.  Now apps could write to one API interface and not worry about the underlying hardware.  Windows took care of all the nasty bits of dealing with the various vagaries of hardware.  This meant that you had an even more stable platform to test against than you had before.  Again, this is a huge improvement for ISV’s developing software – they no longer had to wonder about the video or printing subsystem’s inconsistencies.

    Windows still wasn’t an attractive platform to build on, since it had the same memory constraints as DOS had.  Windows 3.0 fixed that, allowing for a consistent API that finally relieved the 640K memory barrier.

    Fast forward to 1993 – NT 3.1 comes out providing the Win32 API set.  Once again, you have a consistent set of APIs that abstracts the hardware and provides a constant API set.  Win9x, when it came out, continued the tradition.  Again, the API is consistent.  Apps written to Win32s (the subset of Win32 intended for Windows 3.1) still run on Windows XP without modification.  One set of development costs, one set of test costs.  The platform is stable.  With the Unix derivatives, you still had to either target a single platform or bear the costs of testing against all the different variants.

    In 1995, Sun announced its new Java technology would be introduced to the world.  Its biggest promise was that it would, like Windows, deliver platform independent stability.  In addition, it promised cross-operating system stability.  If you wrote to Java, you’d be guaranteed that your app would run on every JVM in the world.  In other words, it would finally provide application authors the same level of platform stability that Windows provided, and it would go Windows one better by providing the same level of stability across multiple hardware and operating system platforms.

    In Jim Gosling’s post, he’s just expressing his frustration with the fact that Linux isn’t a completely stable platform.  Since Java is supposed to provide a totally stable platform for application development, Java needs to smooth out the differences between operating systems, just as Windows needs to smooth out the differences between the hardware on the PC.

    The problem is that Linux platforms AREN’T totally stable.  While the kernel might be the same on all distributions (and it’s not, since different distributions use different versions of the kernel), the other applications that make up the distribution might not be.  Java needs to be able to smooth out ALL the differences in the platform, since its bread and butter is providing a stable platform.  If some Java facilities require things outside the basic kernel, then they’ve got to deal with all the vagaries of the different versions of the external components.  As Jim commented, “They are all close, but not the same.”  These differences aren’t that big a deal for someone writing an open source application, since the open source methodology fights against packaged software development.  Think about it: How many non-open-source software products can you name that are written for open source operating systems?  What distributions do they support?  Does Oracle support Linux distributions other than Red Hat Enterprise?  The reason that there are so few is that the cost of development for the various “Linux” derivatives is close to prohibitive for most shrink-wrapped software vendors; instead they pick a single distribution and use that (thus guaranteeing a stable platform).

    For open source applications, the cost of testing and support is pushed from the developer of the package to the end-user.  It’s no longer the responsibility of the author of the software to guarantee that their software works on a given customer’s machine, since the customer has the source, they can fix the problem themselves.

    In my honest opinion, platform stability is the single biggest thing that Microsoft’s monoculture has brought to the PC industry.  Sure, there’s a monoculture, but that means that developers only have to write to a single API.  They only have to test on a single platform.  The code that works on a Dell works on a Compaq, works on a Sue’s Hardware Special.  If an application runs on Windows NT 3.1, it’ll continue to run on Windows XP.

    And as a result of the total stability of the platform, a vendor like Lotus can write a shrink-wrapped application like Lotus 1-2-3 and sell it to hundreds of millions of users and be able to guarantee that their application will run the same on every single customer’s machine. 

    What this does is to allow Lotus to reduce the price of their software product.  Instead of a software product costing tens of thousands of dollars, software products costs have fallen to the point where you can buy a fully featured word processor for under $130.  

    Without this platform stability, the testing and development costs go through the roof, and software costs escalate enormously.

    When I started working in the industry, there was no volume market for fully featured shrink wrapped software, which meant that it wasn’t possible to amortize the costs of development over millions of units sold. 

    The existence of a stable platform has allowed the industry to grow and flourish.  Without a stable platform, development and test costs would rise and those costs would be passed onto the customer.

    Having a software monoculture is NOT necessarily an evil. 

  • Larry Osterman's WebLog

    Anatomy of a software bug, part 1 - the NT browser

    • 20 Comments
    No, I don't mean that the NT browser's a software bug...

    Actually, Raymond's post this morning about the network neighborhood got me thinking about the NT browser and its design.  I've written about the NT browser before here, but never wrote up how the silly thing worked.  While reminiscing, I remembered a memorable bug I fixed back in the early 1990's that's worth writing up, because it's a great example of how strange behaviors and subtle issues can appear in peer-to-peer distributed systems (and why they're so hard to get right).

    Btw, the current design of the network neighborhood is rather different from this one - I'm describing code and architecture designed for systems 12 years ago; there have been a huge number of improvements to the system since then, and some massive architectural redesigns.  In particular, the "computer browser" service upon which all this depends is disabled in Windows XP SP2 due to attack surface reduction.  In current versions of Windows, Explorer uses a different mechanism to view the network neighborhood (at least on my machine at work).

     

    The actual original design of the NT browser came from Windows for Workgroups.  Windows for Workgroups was a peer-to-peer networking solution for Windows 3.1 (and continued to be the basis of the networking code in Windows 95).  As such, all machines in a workgroup needed to be visible to all the other machines in the workgroup.  In addition, since you might have different workgroups on your LAN, it needed to be able to enumerate all the workgroups on the LAN.

    One critical aspect of WfW is that it was designed for LAN environments - it was primarily based on NetBEUI, which was a LAN protocol designed by IBM back in the 1980's.  LAN protocols typically scale quite nicely to several hundred computers, after which they start to fall apart (due to collisions, etc).  For larger networks, you need a routable protocol like IPX or TCP, but at the time, it wasn't that big a deal (we're talking about 1991 here - way before the WWW existed).

    As I mentioned, WfW was a peer-to-peer product.  As such, everything about WfW had to be auto-configuring.  For Lan Manager, it was ok to designate a single machine in your domain to be the "domain controller" and others as "backup domain controllers", but for WfW, all that had to be automatic.

    To achieve this, the guys who designed the protocol for the WfW browser settled on a three-tier design.  Most of the machines in the workgroup would be "potential browser servers", some of the machines would become "backup browser servers", and one machine in the workgroup was the "master browser server".

    Clients periodically (every three minutes) sent a datagram to the master browser server, and the master browser would record this in its server list.  If the master browser hadn't heard from a client for three announcement periods, it assumed that the client had been turned off and removed it from the list.  Backup browser servers would periodically (every 15 minutes) retrieve the browser list from the master browser.

    When a client wanted to browse the network, it sent a broadcast datagram to the workgroup asking who the browser servers on the workgroup were.  One of the backup or master browser servers would respond (after a random delay of up to several seconds).  The client would then ask that browser server for its list of machines and display that list to the user.

    If none of the browser servers responded, then the client would force an "election".  When the potential browser servers received the election datagram, they each broadcast a "vote" datagram that described their "worth".  If they saw a datagram from another server that had more "worth" than they did, they silently dropped out of the election.

    A server's "worth" was based on a lot of factors - the system's uptime, the version of the software it was running, and its current role as a browser (backup browsers were better than potential browsers, and master browsers were better than backup browsers).
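    The "worth" comparison can be sketched in C.  The field names and the exact precedence below are illustrative guesses on my part, not the real election-packet layout - the point is just that comparison is lexicographic across several criteria:

```c
#include <assert.h>

/* Illustrative sketch of the election "worth" comparison described above;
 * field names and precedence are invented, not the real packet layout. */
struct BrowserWorth {
    int      OsVersion;   /* newer software beats older */
    int      Role;        /* 2 = master, 1 = backup, 0 = potential browser */
    unsigned UpTimeSecs;  /* uptime acts as a tie-breaker */
};

/* Returns nonzero if 'a' outranks 'b'.  A server that sees a vote datagram
 * from a more worthy server silently drops out of the election. */
static int MoreWorthy(const struct BrowserWorth *a, const struct BrowserWorth *b)
{
    if (a->OsVersion != b->OsVersion) return a->OsVersion > b->OsVersion;
    if (a->Role      != b->Role)      return a->Role      > b->Role;
    return a->UpTimeSecs > b->UpTimeSecs;
}
```

    Because every criterion is compared the same way on every machine, all the potential browsers independently converge on the same winner.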

    Once the master browser was elected, it nominated some number of the potential browser servers to be backup browsers.

    This scheme worked pretty well - browsers tended to be stable, and the system was self healing.

    Once we started deploying the browser in NT, we started running into problems that caused us to make some important design changes.  The biggest one related to performance.  It turns out that in a corporate environment, peer-to-peer browsing is a REALLY bad idea.  There's no way of knowing what's going on on another person's machine, and if that machine is really busy (like if it's running NT stress tests), it impacts the browsing behavior for everyone in the domain.  Since NT had the concept of domains (and designated domain controllers), we modified the election algorithm to ensure that NT Server machines were "more worthy" than NT Workstation machines; this solved that particular problem neatly.  We also biased the election algorithm towards NT machines in general, on the theory that NT machines were likely to be more reliable than WfW machines.

    There were a LOT of other details about the NT browser that I've forgotten, but that's a really brief overview, and it's enough to understand the bug.  Btw, I'm the person who coined the term "Bowser" (as in "bowser.sys") during a design review meeting with my boss (who described it as a dog) :)

    Btw, Anonymous Coward's comment on Raymond's blog is remarkably accurate, and states many of the design criteria and benefits of the architecture quite nicely.  I don't know who AC is (my first guess didn't pan out), but I suspect that person has worked with this particular piece of code :)

     

  • Larry Osterman's WebLog

    Psychic Perf Analysis, or "RegFlushKey actually DOES flush the registry key"

    • 19 Comments
    One of Raymond's more endearing features is what he calls "Psychic Debugging" - it even made his Wikipedia entry (wow, he even has a Wikipedia entry, complete with picture :))

    There's a variant of Psychic Debugging called "Psychic Perf Analysis".  It works like this:

    I get an IM from Ryan, one of the perf guys. 

    Ryan: "Hey Larry, we just found a great perf bug that caused a 3 second slowdown in Windows boot time"

    Me: "Let me guess, they were calling RegFlushKey in a service startup path."

    <long pause>

    Ryan: "Who told you?"

     

    One of the things people don't realize about RegFlushKey is that it actually flushes the data that backs the registry key (doh!).  Well, flushing the data means that you need to write it to disk, and the semantics of RegFlushKey ensure that the data's actually been committed - in other words, the RegFlushKey API is going to block until all the disk writes needed to ensure that the data backing the key is physically on the disk.  This can take hundreds and hundreds of milliseconds.
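    The cost here is the same one fsync() imposes on an ordinary file.  A portable C sketch of the "write, then force it to disk" pattern (this is an analogy, not a Windows API):

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>

/* Analogy for RegFlushKey's cost: fsync() blocks until the data is
 * physically on the media, not merely sitting in the OS's lazy
 * write-behind cache. */
static int WriteDurably(const char *Path, const void *Data, size_t Len)
{
    int Fd = open(Path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (Fd < 0) return -1;
    if (write(Fd, Data, Len) != (ssize_t)Len) { close(Fd); return -1; }
    /* The expensive part: like RegFlushKey, fsync does not return until
     * the writes are committed, which can take hundreds of milliseconds. */
    if (fsync(Fd) < 0) { close(Fd); return -1; }
    return close(Fd);
}
```

    Skip the fsync (or the RegFlushKey) and the OS's lazy writer gets the data to disk on its own schedule, without stalling your startup path.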

    In Ryan's case, it was complicated by the fact that the service was calling RegFlushKey from a DllMain function (Doh!), which held the loader lock.  That meant that all the other services in that process were blocked, and there were other services that depended on those services, and...  You get the picture; it REALLY wasn't pretty.

    The documentation for RegFlushKey explicitly says that "In general, RegFlushKey rarely, if ever, need be used", and it's right.

    Why did I know that this was a problem?  Well, when we first deployed the new audio stack into Vista, we were blocked from RI'ing into winmain because the audio service degraded the boot time of Windows by 3/4 of a second (yes, we measure boot time performance to the millisecond, and changes that degrade the system boot performance aren't allowed in).  When I looked at the perf logs of the boot process, I noticed a significant number of writes occurring during the start of the audiosrv service.  I chased it down further, and realized that the writes correlated almost perfectly with some code that was modifying the registry.  I dug deeper and discovered a call to RegFlushKey that we had mistakenly added.  Removing the call to RegFlushKey fixed the problem.

  • Larry Osterman's WebLog

    Riffing on Raymond - FindFirst/FindNext

    • 16 Comments

    As I mentioned, I've been Riffing on Raymond a lot - yesterday's post from Raymond got me thinking about FindFirst and FindNext in MS-DOS.

    As Raymond pointed out:

    That's because the MS-DOS file enumeration functions maintained all their state in the find structure. The FAT file system was simple enough that the necessary search state fit in the reserved bytes and no external locking was necessary to accomplish the enumeration. (If you did something strange like delete a directory while there was an active enumeration on it, then the enumeration would start returning garbage. It was considered the program's responsibility not to do that. Life is a lot easier when you are a single-tasking system.)

    The interesting thing about the fact that MS-DOS kept its state in the reserved bytes of the find structure was that a bunch of apps figured this out.  And then they realized that they could suspend and resume their searches by simply saving away the 21 reserved bytes at the start of the structure and later stuffing them back into a fresh FindFirst structure.

    So a program would do a depth-first traversal of the tree, and at each level of the tree, instead of saving the entire 43-byte FindFirst structure, it could save 22 bytes per level of the hierarchy by just saving the first 21 bytes of the structure.  In fact, some apps were even more clever: they realized that they could save just the part of the reserved area that they thought was important (something like 8 bytes/level).

    And that's just what they did...
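    The trick can be sketched in C.  The 43-byte total and the 21 reserved bytes are the documented MS-DOS layout; the field names and helper functions are mine:

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

#pragma pack(push, 1)
/* The 43-byte MS-DOS FindFirst/FindNext structure.  The fields after the
 * reserved area are the documented layout; the 21 reserved bytes held the
 * undocumented FAT search state. */
struct DosFindData {
    uint8_t  Reserved[21];  /* undocumented search state */
    uint8_t  Attributes;
    uint16_t WriteTime;
    uint16_t WriteDate;
    uint32_t FileSize;
    char     FileName[13];  /* 8.3 name, NUL terminated */
};
#pragma pack(pop)

/* What the clever apps did: instead of keeping the whole 43-byte structure
 * per directory level, squirrel away only the reserved search state... */
static void SaveSearchState(uint8_t Saved[21], const struct DosFindData *Find)
{
    memcpy(Saved, Find->Reserved, sizeof Find->Reserved);
}

/* ...and later splat it back into a fresh structure before calling
 * FindNext (INT 21h, AH=4Fh) to resume the walk at that level. */
static void RestoreSearchState(struct DosFindData *Find, const uint8_t Saved[21])
{
    memcpy(Find->Reserved, Saved, sizeof Find->Reserved);
}
```

    It worked - right up until the contents of those reserved bytes had to change.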

    Needless to say, that caused kittens when the structures used for search had to change - these apps had looked into the internal data structures and assumed they knew what they meant...

     

  • Larry Osterman's WebLog

    Keeping kids safe on the internet

    • 26 Comments

    Joe Wilcox over at Microsoft Monitor recently posted an article about keeping kids safe on the internet.

    It’s a good article, but I’d add one other thing to his suggestions: if you’ve got more than one computer in your house, disable internet access on all but the public computers.  And if you’ve only got one computer, put it in a public location, like the kitchen.

    We’ve got six different computers in our household – each kid has their own, I’ve got two, Valorie's got one, and there’s a common computer in the kitchen.  Valorie's and my computers have internet access, as does the common computer, but none of the others are allowed to access the internet – we filter off their access at the firewall.

    The kids also have up-to-date virus scanners on their computer (although their signatures get a smidge out-of-date).

    Once a month, after patch day, I manually enable internet access and go to Windows Update to ensure that they’re fully patched and their virus signatures are updated.  I know I could use SUS to roll my own update server, but it’s not that big a deal.  Similarly, I could set one of the internet-connected machines as the virus update location for the kids' computers, but again, it's not that big a deal.

    This works nicely for me, and the principles can be applied to anyone's computers, even without all the added hoopla I go through.  The first and most important part of the equation is that all internet browsing is done on a public computer – that means they’re not going to be sneaking around the darker corners of the internet with Mom and Dad in the same room.  

    The other part of the equation is that all accounts on the public computer are LUA accounts, which adds an additional level of safety to browsing - nobody can accidentally install ActiveX controls or other software, which again adds a HUGE level of protection.  We have an admin account, but it's password protected and the kids don't know the password. 

    Edit: Addressed Michael Ruck's comment.

     

  • Larry Osterman's WebLog

    More proof that crypto should be left to the experts

    • 41 Comments

    Apparently two years ago, someone ran a memory-analysis tool named "Valgrind" against the source code to OpenSSL in the Debian Linux distribution.  The Valgrind tool reported an issue with the OpenSSL package distributed by Debian, so the Debian team decided that they needed to fix this "security bug".

     

    Unfortunately, the solution they chose to implement apparently removed all entropy from the OpenSSL random number generator.  As the OpenSSL team comments "Had Debian [contributed the patches to the package maintainers], we (the OpenSSL Team) would have fallen about laughing, and once we had got our breath back, told them what a terrible idea this was."

     

    And it IS a terrible idea.  It means that for the past two years, all crypto done on Debian Linux distributions (and Debian derivatives like Ubuntu) has been done with a weak random number generator.  While this might seem to be geeky and esoteric, it's not.  It means that every cryptographic key that has been generated on a Debian or Ubuntu distribution needs to be recycled (after you pick up the fix).  If you don't, any data that was encrypted with the weak RNG can be easily decrypted.

     

    Bruce Schneier has long said that cryptography is too important to be left to amateurs (I'm not sure of the exact quote, so I'm using a paraphrase).  That applies to all aspects of cryptography (including random number generators) - even tiny changes to algorithms can have profound effects on the security of the algorithm.   He's right - it's just too easy to get this stuff wrong.

     

    The good news is that there IS a fix for the problem; users of Debian or Ubuntu should read the advisory and take whatever actions are necessary to protect their data.

  • Larry Osterman's WebLog

    No sound on a Toshiba M7 after a Vista install (aka: things that make you go "Huh?")

    • 31 Comments

    We recently had a bug reported to us internally.  The user of a Toshiba M7 had installed Vista on his machine (which was previously running XP) and discovered that he didn't get any more sounds from his machine after the upgrade.

    We tried everything we could to figure out his problem - the audio system was sending samples to the sound card, the sound card was updating its internal position register, everything looked great.

    Usually, at this point, we start asking the impolitic questions, like:

    "Sometimes some dirt collects between the plug and the internal connectors on the sound card - could you please unplug the speakers and plug them back in?" (this is the polite way of asking "Did you remember to plug your speakers in?").

    "Sometimes a set of speakers only turn on the speaker when they detect a signal being sent to them, could you try wiggling the volume knob to see if it fixes the problem?" (I actually have one of these in my office, it's excruciatingly annoying).

    "Is it possible there's an external volume control on your speakers?  What's it set to?" (this is the polite question that catches the people who accidentally hit the mute button on their speakers or turned the volume down - we get a surprising number of these).

    Unfortunately, in this case none of these worked.  So we had to dig deeper.  For some reason (I'm not sure why), someone asked the user to boot back to XP and see if he could get sound working on XP.  He booted back to XP and it worked.  He then booted back to Vista, and...

    The sounds worked!

    He mentioned to us that when he'd booted back to XP, the sound driver reported that the volume control was muted, so he un-muted it before booting to Vista.  Just for grins, we asked him to mute the volume control on XP and boot into Vista and yup, the problem had reappeared.  Somehow muting the sound card on XP caused it to be muted in Vista.

    We got on the horn with the manufacturer of the system and the manufacturer of the sound card, and they informed us that for various and sundry reasons, the XP audio driver twiddled some hardware registers that were hidden from the OS to mute the sound card.  The Vista driver for the sound card didn't know about those special hardware registers, so it had no idea that the card was muted - and thus neither did Vista.

    Needless to say, this is quite annoying - the design of the XP driver for this machine made it really easy for the customer to have a horrible experience when running Vista, which is never good.  It's critical that the OS know what's going on in the hardware (in other words, back doors are bad).  When a customer has this experience, they don't blame their system vendor or their audio driver, they blame Vista.

     

    The good news is that there’s a relatively easy workaround for people with an M7 – make sure that your machine is un-muted before you upgrade.  The bad news is that this is a relatively popular computer (at least at Microsoft), and enough people have discovered the problem that it’s made one of our internal FAQs.

  • Larry Osterman's WebLog

    Windows Error Reporting and Online Crash Analysis are your friends.

    • 31 Comments

    I normally don’t do “me too” posts, since I figure that most of the people reading my blog are also looking at the main weblogs.asp.net/blogs.msdn.com feed, but I felt obliged to chime in on this one.

    A lot of people on weblogs.msdn.com have been posting this, but I figured I’d toss in my own version.

    When you get a “your application has crashed, do you want to let Microsoft know about it?” dialog, then yes, please send the crash report in.  We’ve learned a huge amount about where we need to improve our systems from these reports.  I know of at least three different bug fixes that I’ve made in the audio area that came directly from OCA (Online Crash Analysis) reports.  Even if the bugs are in drivers that we didn’t write (Jerry Pisk commented about Creative Labs' drivers here, for example), we still pass the info on to the driver authors.

    In addition, we do data mining to see if there are common mistakes made by different driver authors and we use these to improve the driver verifier – if a couple of driver authors make the same mistake, then it makes sense for us to add tests to ensure that the problems get fixed on the next go-round.

    And we do let 3rd party vendors review their data.  There was a chat about this in August of 2002 where Greg Nichols and Alther Haleem discussed how it’s done.  The short answer is you go here and follow the instructions.  You have to have a Verisign Class 3 code-signing ID to participate, though.

    Bottom line: Participate in WER/OCA – Windows gets orders of magnitude more stable because of it.  As Steve Ballmer said:

    About 20 percent of the bugs cause 80 percent of all errors, and — this is stunning to me — one percent of bugs cause half of all errors.

    Knowing where the bugs are in real-world situations allows us to catch the high visibility bugs that plague our users that we’d otherwise have no way of discovering.

  • Larry Osterman's WebLog

    Things you shouldn't do, part 1 - DllMain is special

    • 5 Comments

    A lot of people have written about things not to do in your DllMain.  Like here, and here and here.

    One other thing not to do in your DllMain is to call LoadLibraryEx.  As others have written, DllMain’s a really special place to be.  If you do anything more complicated than initializing critical sections, or allocating thread local storage blocks, or calling DisableThreadLibraryCalls, you’re potentially asking for trouble.

    Sometimes, however, the interaction is much more subtle.  For example, if your DLL uses COM, you might be tempted to call CoInitializeEx in your DllMain.  The problem is that under certain circumstances, CoInitializeEx can call LoadLibraryEx.  And calling LoadLibraryEx is one of the things that is EXPLICITLY forbidden during DllMain ("You must not call LoadLibrary in the entry-point function").

     

  • Larry Osterman's WebLog

    What comes after Quaternary?

    • 21 Comments

    Valorie asked me this question today, and I figured I'd toss it out to everyone who runs across this post.

    She works in a 5/6 split class, and they're working on a unit on patterns and functions.  They're ordering the data into columns, each of which is derived from the information in the previous column.

    The question is: What do they label the columns?

    The first couple are obvious: Primary, and Secondary.

    Third and fourth: Tertiary, Quaternary.

    But what's the label for the fifth and subsequent columns?

    AskOxford.com suggested that they use quinary (5), senary (6), septenary (7), octonary (8), nonary (9) and denary (10), using the Latin roots.

    But the teacher in the class remembers a different order and thinks that the next one (5) should be cinquainary (using the same root as the poetry form cinquains).

    Valorie also pointed to  http://mathforum.com/dr.math/faq/faq.polygon.names.html for a 2-page history lesson. Coolest fact she found: the "gon" part of the word means "knee" and the "hedron" means "seats" so a polygon means "many knees" and polyhedra means "many seats".

    So does anyone have any suggestions?

     

  • Larry Osterman's WebLog

    Anatomy of a software bug, part 2 - the NT browser

    • 17 Comments

    Yesterday, I talked about the design of the NT browser service.

    Today, I want to talk about a really subtle bug we ended up finding in the service (fixed long before we shipped NT 3.1).

    As a brief refresher from yesterday's post, the NT browser was effectively a distributed single-master database system that was designed to run completely without administration.  All the machines that participated in the browsing architecture were elected to their positions; the user wasn't involved in that process.

    The WfW browser used NetBIOS names to determine which machines had which role in the workgroup.  In general, the names followed a well-established naming pattern that was used for all the MS-NET products (since MS-NET 1.0 was introduced in 1983).  NetBIOS names are 16-byte flat names; in the MS-NET naming scheme, the last byte of the name was used as a signature, the first <n> bytes of the name were used for the computer name, and the bytes between <n> and 15 were filled with 0x20 (space).  For example, the MS-NET server used <name>0x20 for the computer name.  MS-NET workstations used <name>0x00.

    NetBIOS names come in two flavors: Unique and Group.  Unique names are guaranteed to be associated with a single computer on the network.  Group names are shared between multiple machines on the network.  Unique names receive unicasts (directed traffic), Group names receive multicasts (broadcasts).

    For the browser, the master browser was identified because it had registered a NetBIOS name of <workgroup>0x1d.  The backup browsers and potential browsers all register the group name of <workgroup>0x1e.  When servers announce themselves, they send datagrams to <workgroup>0x1d.  There were other names used, and other functionality, but...
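    Building one of these names is simple enough to sketch in C - the base name, space-padded out to 15 bytes, with the one-byte signature in the 16th byte (error handling here is minimal):

```c
#include <string.h>
#include <assert.h>

/* Build an MS-NET-style NetBIOS name as described above: the computer (or
 * workgroup) name padded with 0x20 to 15 bytes, plus a signature byte. */
static void MakeNetbiosName(char Name[16], const char *Base, unsigned char Signature)
{
    size_t Len = strlen(Base);
    if (Len > 15)
        Len = 15;                 /* longer names are truncated */
    memset(Name, 0x20, 15);       /* pad with spaces */
    memcpy(Name, Base, Len);
    Name[15] = (char)Signature;   /* e.g. 0x1d = master browser */
}
```

    So the master browser for workgroup "ACCOUNTING" would hold the unique name "ACCOUNTING     " followed by the byte 0x1d.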

    Ok, that's enough background to describe the bug.

    As I mentioned yesterday, we cooked the browser election algorithm to ensure that an NT machine would always win the election.  Unfortunately, when we started wide deployment of NT machines on the corporate campus, that wasn't always the case.  We had tools that monitored the state of browsing in the most common domains on the network, and about once or twice a day, browsing would simply stop working on one or more of the domains.

    The maddening thing was that this behavior was totally unreproducible - all we knew was that there was a WfW machine that had held onto the master browser name, and this WfW machine was preventing the NT machine from becoming the new master browser.  The NT machine was trying, but the WfW machine kept holding onto the name.  The really annoying thing was that the WfW machine had apparently forgotten that it was a master browser (even though it was holding onto the master browser name).

    We gathered sniffs, we looked at code, we were clueless.

    Eventually, after talking to the WfW team, we discovered the WfW bug that was causing it to forget that it had had the master browser name - essentially there was a code path that would cause it to think it had won the election, and it started to become the master browser.  If, during the process of registering the NetBIOS name for the master browser, it received an election packet that would cause it to lose the election, it stopped functioning as the master browser, but it forgot to relinquish the NetBIOS name.  So the browser application on WfW didn't think that it owned the NetBIOS name, but the network transport on the WfW machine thought it owned the name.

    Ok, we'd found the bug, and it was the WfW team's bug.  Unfortunately, by this time, they'd already shipped, so they couldn't fix their code (and it wouldn't matter because there was a significant deployed base of WfW machines).  The thing is that they'd done a LOT of testing, and they'd never seen this problem.  So why was the NT browser exposing this?

    Well, we went back to the drawing boards.  We looked over the NT browser election logic.  And we looked at it again.

    And again.

    We stared at the code and we just didn't see it.

    And then one day, I printed out the code and looked at it one final time.

    And I saw what we'd missed in all those code reviews before.

    You see, there was one other aspect of the election process that I didn't mention before.  As a mechanism to ensure that elections damped down quickly (there's a phenomenon called "livelock" that occurs in distributed election systems and prevents elections from finishing), there were several timers associated with the election process.  Once an election request was received, the master browser would delay for 200ms before responding, backup browsers would delay for 400ms, and potential browsers would delay for 800ms.  This ordering ensured that the master browser would send its response first, which in turn ensured that the election would finish quickly (because if there was a current master browser, it ought to continue to be the master browser).

    Well, the code in question looked something like this (we all used text editors at the time; there weren't any GUI editors available).  When we did our code reviews, the truncated view on screen was what we all saw (of course this isn't the real code, I just mocked it up for this article).  If I'd looked at the entire line, however, I'd have seen this:

    [Pseudo-browser source code showing incorrect timers for master browser elections]
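    In the same mocked-up spirit, a C sketch of the pattern (all names invented, not the real browser source): the offending case sat on one long line, and the part that wrapped off the right edge of the editor picked the wrong constant.

```c
#include <assert.h>

/* Invented names -- a sketch of the timer-selection bug, not the real code. */
#define MASTER_ELECTION_DELAY    200   /* ms: masters must respond first */
#define BACKUP_ELECTION_DELAY    400
#define POTENTIAL_ELECTION_DELAY 800

enum BrowserRole { RoleMaster, RoleBackup, RolePotential };

static int ElectionDelay(enum BrowserRole Role)
{
    switch (Role) {
    /* Reviewers saw only "case RoleMaster:" and the start of the statement;
     * the tail of the line, off screen, used the WRONG constant: */
    case RoleMaster:                                                                        return BACKUP_ELECTION_DELAY; /* BUG: should be MASTER_ELECTION_DELAY */
    case RoleBackup:    return BACKUP_ELECTION_DELAY;
    default:            return POTENTIAL_ELECTION_DELAY;
    }
}
```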

    Note that the master browser case is using the backup browser timer, not the master browser timer.  It turns out that this was the ENTIRE root cause of the bug - because the master browser was delaying its election response for too long, the WfW machines thought they had won the election.  So they started to become masters, and during that process, they received the election packet from the master browser.  Which quite neatly exposed the bug in their code.  Even without the WfW bug, this bug would have been disastrous for the browsing system, because it could cause the very livelock scenario the election algorithm was designed to prevent.

    Needless to say, we quickly fixed this bug, and deployed it in the next NT build, and the problem was solved.

    So what are the lessons learned here?  Clearly the first is that code reviews have to be complete - if text is wrapping off the edge of the screen, you can't assume the part you can't see is correct.  Also, distributed systems misbehave in really subtle ways - a simple bug in the timing of a single packet can cause catastrophic behaviors.

  • Larry Osterman's WebLog

    The Windows command line is just a string...

    • 30 Comments

    Yesterday, Richard Gemmell left the following comment on my blog (I've trimmed to the critical part):

    I was referring to the way that IE can be tricked into calling the Firefox command line with multiple parameters instead of the single parameter registered with the URL handler.

    I saw this comment and was really confused for a second, until I realized the disconnect.  The problem is that *nix and Windows handle command line arguments totally differently.  On *nix, you launch a program using the execve API (or its cousins execv, execvp, execl, execlp, and execle).  The interesting thing about these APIs is that they allow the caller to specify each of the command line arguments individually - the signature for execve is:

    int execve(const char *filename, char *const argv [], char *const envp[]);

    In *nix, the shell is responsible for turning the string provided by the user into the argv parameter to the program[1].

     

    On Windows, the command line doesn't work that way.  Instead, you launch a new program using the CreateProcess API, which takes the command line as a single string (the lpCommandLine parameter to CreateProcess).  It's considered the responsibility of the newly started application to call the GetCommandLine API to retrieve that command line and parse it (possibly using the CommandLineToArgvW helper function).

    So when Richard talked about IE "tricking" Firefox by calling it with multiple parameters, he was apparently thinking about the *nix model, where an application launches a new application with multiple command line arguments.  But that isn't the Windows model - in the Windows model, the application is responsible for parsing its own command line arguments, and thus IE can't "trick" anything - it's just asking the shell to pass a string to the application, and it's the application's job to figure out how to handle that string.
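    A much-simplified C sketch of the splitting the spawned program has to do itself (the real Windows rules for backslash/quote combinations are considerably more intricate - this handles only plain double quotes, and the function name is mine):

```c
#include <string.h>
#include <assert.h>

/* Split a single command-line string, in place, into an argv-style array:
 * whitespace separates arguments, and a double-quoted run is one argument.
 * Returns the argument count.  A sketch of what CommandLineToArgvW does
 * for Windows programs, with greatly simplified quoting rules. */
static int SplitCommandLine(char *CmdLine, char *Argv[], int MaxArgs)
{
    int Argc = 0;
    char *p = CmdLine;
    while (*p) {
        while (*p == ' ' || *p == '\t') p++;   /* skip separators */
        if (!*p || Argc == MaxArgs) break;
        if (*p == '"') {
            Argv[Argc++] = ++p;                /* argument starts after quote */
            while (*p && *p != '"') p++;
        } else {
            Argv[Argc++] = p;
            while (*p && *p != ' ' && *p != '\t') p++;
        }
        if (*p) *p++ = '\0';                   /* terminate this argument */
    }
    return Argc;
}
```

    The point of the sketch: the split happens inside the launched program, by whatever rules that program chooses, not in the program that called CreateProcess.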

    We can discuss the relative merits of that decision, but it was a decision made over 25 years ago (in MS-DOS 2.0).

     

    [1] Yes, I know that execl() looks like it takes a command line, but execl() simply gathers its individually-specified arguments into an argv array before calling execve - no command line string is ever parsed.

  • Larry Osterman's WebLog

    Everyone wants a shiny new UI

    • 55 Comments

    Surfing around the web, I often run into web sites that contain critiques of various aspects of Windows UI.

    One of the most common criticisms on those sites is "old style" dialogs.  In other words, dialogs that don't have the most up-to-date theming.  Here's an example I ran into earlier today:

    [Screenshot: an old-style AutoComplete dialog]

    Windows has a fair number of dialogs like this - they're often fairly old dialogs that were written before new theming elements were added (or contain animations that predate newer theming options).  They all work correctly but they're just ... old.

    Usually the web site wants the Windows team to update the dialog to match the newest styling, because the dialog is "wrong".

    Whenever someone asks (or more often insists) that the Windows team update their particular old dialog, I sometimes want to turn around and ask them a question:

    "You get to choose: You can get this dialog fixed OR you can cut a feature from Windows, you can't get both.  Which feature in Windows would you cut to change this dialog?"

    Perhaps an automotive analogy would help explain my rather intemperate reaction:

    One of the roads near my house is a cement road, and the road is starting to develop a fair number of cracks in it.  The folks living near the road got upset at the condition of the road and started a petition drive to get the county to repair it.  Their petition worked, and the county came out a couple of weeks later, inspected the road, and rendered its verdict on the repair (paraphrasing):  We've looked at the road surface and it is 60% degraded.  The threshold for immediate repairs on county roads is 80% degradation.  Your road was built 30 years ago, and cement roads in this area have a 40-year expected lifespan.  Since the road doesn't meet our threshold for immediate repair and hasn't reached the end of its lifespan, we can't justify moving this section of road ahead of the hundreds of other sections of road that need immediate repair.

    In other words, the county had a limited budget for road repairs and there were a lot of other sections of road in the county that were in a lot worse shape than the one near my house.

    The same thing happens in Windows - there are thousands of features in Windows and a limited number of developers who can change those features.   Changing a dialog does not happen for free.  It takes time for the developers to fix UI bugs.  As an example, I just checked in a fix for a particularly tricky UI bug.  I started working on that fix in early October and it's now January.

    Remember, this dialog works just fine; it's just a visual inconsistency.  But it's going to take a developer some amount of time to fix the dialog.  Maybe it's only one day.  Maybe it's a week.  Maybe the fix requires coordination between multiple people (for example, changing an icon usually requires the time of both a developer AND a graphic designer).  That time could be spent working on fixing other bugs.  Every feature team goes through a triage process on incoming bugs to decide which bugs they should fix.  They make choices based on their limited budget (there are n developers on the team, there are m bugs to fix, and each bug takes t time to fix on average; that means it will take (m*t)/n time to fix them all before we can ship).

    Fixing a theming bug like this takes time that could be spent fixing other bugs.  And (as I've said before) the dialog does work correctly; it's just outdated.

    So again I come back to the question: "Is fixing a working but ugly dialog really more important than all the other bugs?"  It's unfortunate but you have to make a choice.

     

    PS: Just because we have to make choices like this doesn't mean that you shouldn't send feedback like this.   Just like the neighbors complaining to the county about the road, it helps to let the relevant team know about the issue. Feedback like this is invaluable for the Windows team (that's what the "Send Feedback" link is there for after all).  Even if the team decides not to fix a particular bug in this release it doesn't mean that it won't be fixed in the next release.

  • Larry Osterman's WebLog

    Resilience is NOT necessarily a good thing

    • 66 Comments

    I just ran into this post by Eric Brechner who is the director of Microsoft's Engineering Excellence center.

    What really caught my eye was his opening paragraph:

    I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it's better to crash and let Watson report the error than it is to catch the exception and try to correct it.

    Wow.  I'm not going to mince words: What a profoundly stupid assertion to make.  Of course it's better to crash and let the OS handle the exception than to try to continue after an exception.

     

    I have a HUGE issue with the concept that an application should catch exceptions[1] and attempt to correct them.  In my experience, handling exceptions and attempting to continue is a recipe for disaster.  At best, it turns an easily debuggable problem into one that takes hours of debugging to resolve.  At its worst, exception handling can either introduce security holes or render security mitigations irrelevant.

    I have absolutely no problems with fail fast (which is what Eric suggests with his "Restart" option).  I think that restarting a process after the process crashes is a great idea (as long as you have a way to prevent crashes from spiraling out of control).  In Windows Vista, Microsoft built this functionality directly into the OS with the Restart Manager: if your application calls the RegisterApplicationRestart API, the OS will offer to restart your application if it crashes or becomes unresponsive.  This concept also shows up in the service restart options in the ChangeServiceConfig2 API (if a service crashes, the OS will restart it if you've configured it to do so).

    I also agree with Eric's comment that asserts that cause crashes have no business living in production code, and I have no problems with asserts logging a failure and continuing (assuming that someone is actually going to look at the log and can understand its contents - otherwise the logs just consume disk space). 

     

    But I simply can't wrap my head around the idea that it's ok to catch exceptions and continue to run.  Back in the days of Windows 3.1 it might have been a good idea, but after the security fiascos of the early 2000s, any notion that you could safely continue to run after an exception has been thrown should have been put to rest forever.

    The bottom line is that when an exception is thrown, your program is in an unknown state.  Attempting to continue in that unknown state is pointless and potentially extremely dangerous - you literally have no idea what's going on in your program.  Your best bet is to let the OS exception handler dump core and hopefully your customers will submit those crash dumps to you so you can post-mortem debug the problem.  Any other attempt at continuing is a recipe for disaster.

     

    -------

    [1] To be clear: I'm not necessarily talking about C++ exceptions here, just structured exceptions.  For some C++ and C# exceptions, it's ok to catch the exception and continue, assuming that you understand the root cause of the exception.  But if you don't know the exact cause of the exception you should never proceed.  For instance, if your binary tree class throws a "Tree Corrupt" exception, you really shouldn't continue to run, but if opening a file throws a "file not found" exception, it's likely to be ok.  For structured exceptions, I know of NO circumstance under which it is appropriate to continue running.

     

    Edit: Cleaned up wording in the footnote.

  • Larry Osterman's WebLog

    Concurrency, part 11 - Hidden scalability issues

    • 21 Comments
    So you're writing a server.  You've done your research, and you've designed your system to be as scalable as you possibly can.

    All your linked lists are interlocked lists, your app uses only one thread per CPU core, you're using fibers to manage your scheduling so that you make full use of your quanta, you've set each thread's processor affinity so that it's locked to a single CPU core, etc.

    So you're done, right?

    Well, no.  The odds are pretty good that you've STILL got concurrency issues.  But they were hidden from you because the concurrency issues aren't in your application, they're elsewhere in the system.

    This is what makes programming for scalability SO darned hard.

    So here are some of the common issues where scalability issues are hidden.

    The biggest one (from my standpoint, although the relevant people on the base team get on my case whenever I mention it) is the NT heap manager.  When you create a heap with HeapCreate, unless you specify the HEAP_NO_SERIALIZE flag, the heap will have a critical section associated with it (and the process heap is a serialized heap).

    What this means is that every time you call LocalAlloc() (or HeapAlloc, or HeapFree, or any other heap APIs), you're entering a critical section.  If your application performs a large number of allocations, then you're going to be acquiring and releasing this critical section a LOT.  It turns out that this single critical section can quickly become the hottest critical section in your process.   And the consequences of this can be absolutely huge.  When I accidentally checked in a change to the Exchange store's heap manager that reduced the number of heaps used by the Exchange store from 5 to 1, the overall performance of the store dropped by 15%.  That 15% reduction in performance was directly caused by serialization on the heap critical section.

    The good news is that the base team knows that this is a big deal, and they've done a huge amount of work to reduce contention on the heap.   For Windows Server 2003, the base team added support for the "low fragmentation heap", which can be enabled with a call to HeapSetInformation.  One of the benefits of switching to the low fragmentation heap (along with the obvious benefit of reducing heap fragmentation) is that the LFH is significantly more scalable than the base heap.

    And there are other sources of contention that can occur below your application.  In fact, many of the base system services have internal locks and synchronization structures that could cause your application to block - for instance, if you didn't open your file handles for overlapped I/O, then the I/O subsystem acquires an auto-reset event across all file operations on the file.  This is done entirely under the covers, but can potentially cause scalability issues.

    And there are scalability issues that come from physics as well.  For example, yesterday, Jeff Parker asked about ripping CDs from Windows Media Player.  It turns out that there's no point in dedicating more than one thread to reading data from the CD, because the CDROM drive has only one head - it can't read from two locations simultaneously (and on CDROM drives, head motion is particularly expensive).  The same laws of physics hold true for all physical media - I touched on this in the answers to the What's wrong with this code, part 9 post - you can't speed up hard disk copies by throwing more threads or overlapped I/O at the problem, because file copy speed is ultimately limited by the physical speed of the underlying media - and with only one spindle, the drive can perform only one read or write operation at a time.

    But even if you've identified all the bottlenecks in your application, and added disks to ensure that your I/O is as fast as possible, there STILL may be bottlenecks that you've not yet seen.

    Next time, I'll talk about those bottlenecks...

  • Larry Osterman's WebLog

    Why is it FILE_SHARE_READ and FILE_SHARE_WRITE anyway?

    • 19 Comments

    Raymond’s post about FILE_SHARE_* bits reminded me of the story about why the bits are FILE_SHARE_READ in the first place.

    MS-DOS had the very same file sharing semantics as NT does (ok, NT adds FILE_SHARE_DELETE, more on that later).  But on MS-DOS, the file sharing semantics were optional – you had to load in the share.com utility to enable them.  This was because on a single tasking operating system, there was only ever going to be one application running, so the sharing semantics were considered optional.  Unless you were running a file server, in which case Microsoft strongly suggested that you should load the utility.

    On MS-DOS, the sharing mode was controlled by the three “sharing mode” bits.  The legal values for “sharing mode” were:

                000 – Compatibility mode.  Any process can open the file any number of times with this mode.  Fails if the file’s opened in any other sharing mode.
                001 – Deny All.  Fails if the file has been opened in compatibility mode or for read or write access, even by the current process.
                010 – Deny Write.  Fails if the file has been opened in compatibility mode or for write access by any other process.
                011 – Deny Read.  Fails if the file has been opened in compatibility mode or for read access by any other process.
                100 – Deny None.  Fails if the file has been opened in compatibility mode by any other process.

    Coupled with the “sharing mode” bits are the four “access code” bits.  There were only three values defined for them: Read, Write, and Both (Read/Write).

    The original designers of the Win32 API set (in particular, the designer of the I/O subsystem) took one look at these permissions and threw up his hands in disgust.  In his opinion, there are two huge problems with these definitions:

    1) Because the sharing bits are defined as negatives, it’s extremely hard to understand what’s going to be allowed or denied.  If you open a file for write access in deny read mode, what happens?  What about deny write mode – does it allow reading or not?

    2) Because the default is “compatibility” mode, most applications can’t ensure the integrity of their data.  Instead of your data being secure by default, you need to take special actions to guarantee that nobody else messes with it.

    So the I/O subsystem designer proposed that we invert the semantics of the sharing mode bits.  Instead of the sharing rights denying access, they GRANT access.  Instead of the default access mask being to allow access, the default is to deny access.  An application needs to explicitly decide that it wants to let others see its data while it’s manipulating the data.

    This inversion neatly solves a huge set of problems that existed when running multiple MS-DOS applications – while one application was running, another application could corrupt the data underneath it.

    We can easily explain FILE_SHARE_READ and FILE_SHARE_WRITE as being cleaner and safer versions of the DOS sharing functionality.  But what about FILE_SHARE_DELETE?  Where on earth did that access right come from?  Well, it was added for Posix compatibility.  Under the Posix subsystem, as on *nix, a file can be unlinked while it’s still open.  In addition, when you rename a file on NT, the rename operation opens the source file for delete access (a rename operation, after all, is the creation of a new file in the target directory and the deletion of the source file).

    But DOS applications don’t expect that files can be deleted (or renamed) out from under them, so we needed to have a mechanism in place to prevent the system from deleting (or renaming) files if the application cares about them.  So that’s where the FILE_SHARE_DELETE access right comes from – it’s a flag that says to the system “It’s ok for someone else to delete or rename this file while I have it open”. 

    The NT loader takes advantage of this – when it opens DLLs or programs for execution, it specifies FILE_SHARE_DELETE.  That means that you can rename the executable of a currently running application (or DLL).  This can come in handy when you want to drop in a new copy of a DLL that’s being used by a running application.  I do this all the time when working on winmm.dll.  Since winmm.dll is used by lots of processes in the system, including some that can’t be stopped, I can’t stop all the processes that reference the DLL; instead, when I need to test a new copy of winmm, I rename winmm.dll to winmm.old, copy in a new copy of winmm.dll, and reboot the machine.

     

  • Larry Osterman's WebLog

    Concurrency, Part 2 - Avoiding the problem

    • 27 Comments
    Yesterday's article on concurrency discussed the basic concepts of concurrency.  Now I'd like to start talking about how you deal with concurrency...

    The first, and most important, thing to realize about concurrent programming is that it's all about two things: your data and your threads.  If you only have one thread, then you don't have to worry about concurrency issues.  If you have more than one thread, then you only have to worry about concurrency issues if more than one of those threads can simultaneously access that data.  And that's my first principle of concurrent programming:  If your data is never accessed on more than one thread, then you don't have to worry about concurrency.   Again, the guys who get concurrency are cringing at this principle - the reality is (of course) more complicated than that; I'll get back to why it's more complicated later (I need to introduce some more concepts beforehand).

    In Win32, in general, there are three ways that you can guarantee that your thread is the only one accessing your data.

    The first is your stack.  On Win32, the data on your stack is owned by the thread (this might not be true for other architectures, I don't know :().  Unless you explicitly pass pointers to your stack to another thread, you don't have to worry about other threads messing with your stack data, so you don't need to protect it.

    The second way of ensuring that only one thread can access your data is to use ThreadLocalStorage, or TLS.  The idea behind TLS is that when your process starts, it allocates a "slot" in TLS.  That allocation returns you an index into a table, and you can stick whatever value you want to into that table.  When your thread starts up, you can allocate a block of memory, stick it into the table, and then, later on during the execution of the thread, you can go back and query the value of that block.  The block remains per-thread, and can be accessed without protecting the data.  This allows you to maintain per-thread context blocks which can be used to hold state that's more global than the stack.  Btw, the C runtime library allows you to declare variables in TLS by simply decorating them with __declspec(thread) - there are some caveats about using this, but the facility is available...

    The third way of ensuring that only one thread can access your data is simply to be careful in how you write your code.  As an example, in my last "What's wrong with this code" article, I purposely allocated the FileCopyBlock structures in one thread, put them on a queue and executed them in worker threads.  As a result, I didn't have to protect the FileCopyBlock fields - since only one thread could ever access the data at a time, they didn't need to be protected.  Now, more than one thread accessed the data (the block was constructed on the main thread and destructed on the worker threads), but at any given time, the blocks weren't accessed by more than one thread.  This principle can be applied in a number of different ways - my example was quite simple, but it wouldn't be difficult to imagine an FSM where the state was kept in a block that was enqueued and dequeued based on state transitions - the block would only ever be accessed by one thread at a time and thus wouldn't have to be protected.

     

    It turns out that you can write some fairly sophisticated multithreaded code without ever having to worry about synchronizing your shared data - just by being careful and setting up your data structures appropriately, you can do pretty amazing things.

    But, of course, there are times that you can't avoid having more than one thread accessing your data.  Tomorrow, I'll talk about some of the ways around that problem.

    Edit: Principal->Principle (thanks Mike :))
