Larry Osterman's WebLog

Confessions of an Old Fogey

This one's for you John. The core OS team didn't forget you

Way back when, back in the very early days of this blog (actually it was the 3rd post to my blog), I wrote a story about John Vert complaining about CTRL-C not working on network commands.

Well, yesterday I got a piece of email from one of the developers in COSD.  I've sanitized it a bit, but here's the important part:

Microsoft Windows [Version 6.0.<build>]
(C) Copyright 1985-2005 Microsoft Corp.

 d:\>dir \\<server>\dfg
The I/O operation has been aborted because of either a thread exit or an applica
tion request.

d:\>dir \\<server>\dfg
The I/O operation has been aborted because of either a thread exit or an applica
tion request. 

So John, this one's for you. Even though it's been 13 years since I worked on that code, your complaint wasn't ignored, and it's finally been fixed.

I have no idea what build will contain the fix, or even if the fix will make the final product, but it's getting there.

As I type this, I can just imagine the /. headline: "Microsoft takes 13 years to fix a bug".  The reality is WAY more complicated than that.  To actually make this fix work required a significant amount of change to the I/O subsystem and a number of changes to the way that I/O cancellation works. The biggest piece of the picture is the new CancelSynchronousIo API that was added to Vista to handle just this situation. Without that support (as mentioned in my original article), it wouldn't have been possible to fix the problem.
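
To make the idea concrete, here is a minimal toy model in Python. It is not the Windows implementation and does not call the real API. It assumes a hypothetical `BlockingOperation` class standing in for a synchronous network call that would otherwise block until a long timeout expires; a second thread plays the role of the CTRL-C handler and cancels the wait, so the blocked call returns early with an "aborted" result, much as the `dir` command above does. The class, the function names and the timeout values are all invented for illustration.

```python
import threading
import time


class OperationAborted(Exception):
    """Raised in the blocked thread when its pending operation is cancelled."""


class BlockingOperation:
    """Hypothetical stand-in for a synchronous network call with a long timeout."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._cancelled = threading.Event()

    def run(self):
        # Block until either the simulated network timeout expires
        # or another thread cancels the operation.
        if self._cancelled.wait(self.timeout_seconds):
            raise OperationAborted("operation aborted by application request")
        return "timed out waiting for server"

    def cancel(self):
        # Called from a different thread, e.g. a CTRL-C handler.
        self._cancelled.set()


def run_with_cancel(timeout_seconds, cancel_after_seconds):
    """Start a blocking operation on a worker thread, then cancel it."""
    op = BlockingOperation(timeout_seconds)
    outcome = {}

    def worker():
        try:
            outcome["result"] = op.run()
        except OperationAborted as exc:
            outcome["result"] = f"aborted: {exc}"

    thread = threading.Thread(target=worker)
    started = time.monotonic()
    thread.start()
    time.sleep(cancel_after_seconds)  # the user waits a moment, then hits CTRL-C
    op.cancel()
    thread.join()
    outcome["elapsed"] = time.monotonic() - started
    return outcome


if __name__ == "__main__":
    print(run_with_cancel(timeout_seconds=30.0, cancel_after_seconds=0.1))
```

The point of the model is only that the blocked thread itself gets unblocked and sees the cancellation as a failure of the call it was making, instead of sitting out the full timeout.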

  • I hope 6.0 will finally allow you to "net stop rdr" shortly (!) after last mounted network resource (drive-letter) was "dismounted" ("net use x: /d") without either hanging it in an undefined state forever (stopping, but never stops, and can't be started) or simply BSODing the box. That last point also displays a really, *really* unhealthy inbreeding between the redirector ("Workstation") service and the kernel component(s). A simple user-mode app (even if it in this case _is_ the redirector service) being able to BSOD the system isn't exactly painting a flattering picture of Microsoft.

    Letting it silently time out (somewhere between 2 and 10 minutes I think) it works, but who counts the seconds for MS timing-related bugs when time-to-fix can be measured in decades. :->

    Not only on slashdot! :-)

    That MS needed over a decade to introduce even the concept of CSQ, not to mention how many years apparently were needed to actually start to use them, is IMHO more telling about priorities than a potential slashdot story about "Microsoft has after 15 years finally got CTRL+C working!".

    Could you after this manage to beat Creative (*) and their drivers into submission, I think you might be on to something. ;-)

    (*) There are other vendors just as bad, many worse if you count the abomination of NICs as USB devices, but due to market penetration Creative has a place of its own I think.
  • Not to create the time machine to go back 13 years with a list of bugs.

    Mike
  • Mike, I've never seen the redirector take more than a couple of seconds to stop (I actually do that every few days, go figure), and I've never seen it CRASH.

    If you've reported the crashes to MS, the redirector team should have your crash data and can figure out what went wrong.

    And the quality of 3rd party drivers is a significant issue.
  • > If you've reported the crashes to MS

    I've had around a dozen kernel crashes (BSODs) where Windows didn't offer to report the crashes to Microsoft because whatever bug caused the network connection to not work also prevented reports of its own crash.

    In user mode I've had a few hundred process crashes where dumpprep.exe and another Dr. Watson process were executing and nearly hanging the CPU but they never offered to send crash reports because whatever bug caused the network connection to not work also prevented reports of its own crash.  These didn't cause BSODs but still the only way out was to reboot.
  • The behavior that annoys me the most about this is when I try to use tab-completion:

    > dir \\misspelled-server\share\<hits tab> <curses for 30 seconds>

    Will I be able to stop that?
  • Does anybody know how long it took other OSes to implement the ability to cancel synchronous IOs? Maybe I just don't know what to look for, but I couldn't find any other OS that implements it. All I can find are calls to cancel async IO (Solaris, Linux, VMS).

    It really is a shame that other systems don't implement synch IO cancellation, because it's really annoying when your whole group of Unix systems goes down due to a single NFS hard mount failure.
  • > The reality is WAY more complicated than that.

    Not really.  The richest software company in the world, with the best engineers money can buy, takes 13 years to make control-C work.  I think that's pretty simple actually.

    You can change the statement to say that your engineers were so incompetent they designed things so poorly that it took 13 years of valiant redesign and effort to fix the bug, but I'd argue that's even worse.
  • Vince,

    Have you ever had a class or read an in-depth book on operating systems engineering?

    Didn't think so.

    James
  • Vince -

    Something that Larry entertainingly points out in many of his blog posts is the simple fact that Writing Software is Hard.  Not everybody can do it, and of those that can, even fewer can do it well.

    The technical reasons behind getting this working, amidst all of the other complexities involved in writing an *operating system*, are undoubtedly good ones.  The decisions were also almost certainly colored by weighing the impact of not fixing the bug versus the risk and technical difficulty of fixing it.  Larry has given us a glimpse of the latter, and for understanding his reasons, I am a little bit better coder.

    Microsoft has a lot of resources, as you rightly point out, but they are not unlimited, and indeed, spread over all of the products that Microsoft creates, they are not overly large.  Microsoft is not in the business of writing flawless software (an impossible goal), they are in the business of shipping products (an achievable goal).  To do that, some hard decisions have to be made.  Not all bugs can be fixed.

    Writing Software is Hard.  If you honestly think that you can do better, then by all means do so, and compete in the marketplace.  However, your comment suggests that you do not have the first clue about how real software is made, and if all you can contribute to the conversation is uninformed anti-MS spew, please take it back to slashdot.
  • Why couldn't the problem have been solved earlier by calling TerminateProcess in the CTRL-C handler?
  • Why do people assume you have to hang out on Slashdot to be anti-MS?  I've been anti-MS since before slashdot was a glimmer in Rob Malda's eye.

    Taking 13 years to fix a bug like this is inexcusable.  If a company is going to hide its code and development processes and not have an open bug tracking system, then it will be judged by what info is released.

    How am I supposed to believe all of those smarmy "you can do anything with our software" MS ads if all I wanted to do was get control-C to cancel some IO?  In any case the 13 year bug will likely be a 15 or more one, because I am sure it's not going to be fixed in any sort of release any time soon.
  • Vince, you're right.  13 years ago, a REALLY stupid decision was made by Microsoft.  We decided that it was reasonable to allow Windows to access a LAN.

    This decision is the root cause of this problem.  The problem is that the timeouts that are appropriate for networked devices are totally unacceptable for human beings.

    It took us a while to realize this. There has actually been a steady stream of fixes in every single release of the OS since NT 3.1 that combined to improve the situation (for instance, in NT 3.1, you couldn't ctrl-c the NET USE command; in NT4 (I believe) support was added to allow that (it might have been Win2K)).

    The final piece of the puzzle was the CancelSynchronousIo API that was added for Vista.  Due to the way that cancellation was implemented in NT 3.1, it required a significant amount of effort to ensure that it worked correctly and reliably with existing drivers.

    Vince, I don't know you or your experience, but it's clear to me from your comments that you've never ever written software for a platform with widespread use.  As Ryan mentions above, this stuff is HARD, especially if you want to get it right.

    For instance, the command interpreter guys could have executed their command interpreter on a different thread than the UI thread and just abandoned the operation on ctrl-c.  But that would have introduced even more problems (what do you do when that abandoned operation completes, and what if there were side effects of that abandoned operation).

    But they didn't do that because it was more important to fix it CORRECTLY than it was to hack around the problem.
  • "Writing Software is Hard.  If you honestly think that you can do better, then by all means do so, and compete in the marketplace."

    In this day and age and after all of the anti-competitive and unlawful things MS has done to competitors why would I want to compete with them?  Just to give them things to copy for 'innovation' and add to the OS so I can go out of business?


    The Justice Department really screwed up what they had.  MS should have been split up and the OS made a regulated public service.  Then their apps would have had to stand on their own two legs without being propped up for years with the cash cow of the OS.  How many of them would have survived in that environment?
  • > Vince, I don't know you or your experience, but it's
    > clear to me from your comments that you've never
    > ever written software for a platform with widespread
    > use.

    I like how people can somehow analyze my software experience from a few posts I make on a blog.

    If by "widespread use" you mean code that is in Windows, well of course not.

    If you mean "is currently running on millions of computers", then yes.  Code of mine is included in the Linux kernel.  You're free to download the Linux source and view it, critique it all you want.  

    I'll notice I can't view any kernel code that you've written, or for that matter any of the kernel code your company produces.  So I'm the one who's at a disadvantage when considering your programming skills.

    >  As Ryan mentions above, this stuff is HARD,
    > especially if you want to get it right.

    Well of course, if you want to be whiny about it.  Honestly, all programming is hard.  That's no excuse.
  • What vince[sic] is missing here is that we're talking about behavior on builtin commands to the shell.  Maybe we can debate whether "dir" should be builtin or not but it is and as such, this isn't just a "simple" decision to terminate a process.

    As Gabe points out above, (just about) nobody else has support for cancelling in-flight synchronous I/O.  (VMS had it indirectly since all sync I/O was actually async I/O followed by an EF wait but I'm not sure that sys$waitef actually was interruptible...)

    Oh, wait, that's right.  Don't feed the troll.  Someday I'll learn.
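
The alternative Larry describes in the thread above (run the command on a second thread and simply abandon it on CTRL-C) can be sketched in Python as a toy. This is a hypothetical illustration, not how any real command interpreter is written; `slow_command`, `run_and_abandon` and the `side_effects` list are made-up names. It shows the problem he raises: the prompt comes back right away, but the abandoned operation keeps running and its side effect still lands later.

```python
import threading
import time


def slow_command(duration_seconds, side_effects):
    """Hypothetical command that blocks, then changes shared state."""
    time.sleep(duration_seconds)              # stands in for a slow network call
    side_effects.append("command completed")  # lands even if nobody is waiting


def run_and_abandon(duration_seconds, give_up_after_seconds):
    """Run the command on a worker thread and stop waiting after a deadline."""
    side_effects = []
    worker = threading.Thread(
        target=slow_command, args=(duration_seconds, side_effects), daemon=True
    )
    worker.start()
    worker.join(give_up_after_seconds)   # "CTRL-C": give up and return to the prompt
    abandoned = worker.is_alive()        # the operation is still in flight
    seen_at_prompt = list(side_effects)  # what the shell can see at this moment
    worker.join()                        # only so the demo can observe the ending
    return abandoned, seen_at_prompt, side_effects


if __name__ == "__main__":
    print(run_and_abandon(duration_seconds=0.3, give_up_after_seconds=0.05))
```

Nothing in this sketch stops the work itself. It only stops waiting for it, which is why the side effect shows up after the "prompt" has already returned.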