Larry Osterman's WebLog

Confessions of an Old Fogey

AARDvarks in your code.



If there was ever a question that I’m a glutton for punishment, this post should prove it.

We were having an email discussion the other day, and someone asked:

Isn't there a similar story about how DOS would crash when used with [some non-MS thing] and only worked with [some MS thing]? I don't remember what the "thing" was though =)

Well, the only case I could think of that fit was the old AARD code in Windows.  Andrew Schulman wrote a great article on it back in the early 1990s, which dissected the code pretty thoroughly.

The AARD code in Windows was code to detect when Windows was running on a cloned version of MS-DOS, and to disable Windows on that cloned operating system.  By the time that Windows 3.1 shipped, it had been pulled from Windows, but the vestiges of the code were left behind.  As Andrew points out, the code was obfuscated, and had debugger-hiding logic, but it could be reverse engineered, and Andrew did a great job of doing it.

I can’t speak as to why the AARD code was obfuscated; I have no explanation for that, and it seems totally stupid to me.  But I’ve got to say that I totally agree with the basic concept of Windows checking for an alternative version of MS-DOS and refusing to run on it.

The thing is that the Windows team had a problem to solve, and they didn’t care how they solved it.  Windows decided that it owned every part of the system, including the internal data structures of the operating system.  It knew where those structures were located, it knew what the size of those data structures was, and it had no compunction against replacing those internal structures with its own version.  Needless to say, from a DOS developer’s standpoint, keeping Windows working was an absolute nightmare.

As a simple example, when Windows started up, it increased the size of MS-DOS’s internal file table (the SFT, the table created by the FILES= line in config.sys).  It did that to allow more than 20 files to be opened on the Windows system (a highly desirable goal for a multi-tasking operating system).  But it did that by using an undocumented API call, which returned a pointer to a set of “interesting” pointers in MS-DOS.  It then indexed a known offset relative to that pointer, and replaced the value of the master SFT table with its own version of the SFT.

When I was working on MS-DOS 4.0, we needed to support Windows.  Well, it was relatively easy to guarantee that our SFT was at the location that Windows was expecting.  But the problem was that the MS-DOS 4.0 SFT was 2 bytes larger than the MS-DOS 3.1 SFT.  In order to get Windows to work, I had to change the DOS loader to detect when win.com was being loaded.  If it was, I looked at the code at an offset relative to the base code segment, and if it was a “MOV” instruction, and the amount being moved was the old size of the SFT, I patched the instruction in memory to reflect the new size of the SFT!  Yup, MS-DOS 4.0 patched the running Windows binary to make sure Windows would still continue to work.
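
To make the loader trick concrete, here is a rough sketch in Python of the kind of check described above.  It is not the actual MS-DOS 4.0 code: the offset, the two SFT sizes, and the choice of register are all made-up values for illustration.  The only thing taken from the real 8086 instruction set is that 0xB9 is the opcode for "MOV CX, imm16", followed by a little-endian 16-bit immediate.

```python
import struct

# Every constant here is hypothetical and chosen only for illustration.
OLD_SFT_SIZE = 0x35   # assumed size the Windows binary was built against
NEW_SFT_SIZE = 0x37   # assumed new size, 2 bytes larger
PATCH_OFFSET = 0x10   # assumed offset of the instruction in the loaded image
MOV_CX_IMM16 = 0xB9   # 8086 opcode for "MOV CX, imm16"


def patch_sft_size(image: bytearray, program_name: str) -> bool:
    """Patch the SFT size immediate in place. Return True if a patch was applied."""
    # Only touch the one program we know makes this assumption.
    if program_name.lower() != "win.com":
        return False
    # The opcode byte plus a 16-bit immediate must fit inside the image.
    if len(image) < PATCH_OFFSET + 3:
        return False
    opcode = image[PATCH_OFFSET]
    (immediate,) = struct.unpack_from("<H", image, PATCH_OFFSET + 1)
    # Patch only when the bytes look exactly like the instruction we expect.
    if opcode != MOV_CX_IMM16 or immediate != OLD_SFT_SIZE:
        return False
    struct.pack_into("<H", image, PATCH_OFFSET + 1, NEW_SFT_SIZE)
    return True


# Build a fake image with the expected instruction at the expected offset.
fake_image = bytearray(0x20)
fake_image[PATCH_OFFSET] = MOV_CX_IMM16
struct.pack_into("<H", fake_image, PATCH_OFFSET + 1, OLD_SFT_SIZE)
patched = patch_sft_size(fake_image, "WIN.COM")
```

The point of checking both the opcode and the immediate before writing is that a different build of the binary is left untouched rather than corrupted.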

Now then, considering how sleazy Windows was about MS-DOS, think about what would happen if Windows ran on a clone of MS-DOS.  It’s already groveling internal MS-DOS data structures.  It’s making assumptions about how our internal functions work, when it’s safe to call them (and which ones are reentrant and which are not).  It’s assuming all SORTS of things about the way that MS-DOS’s code works.

And now we’re going to run it on a clone operating system.  Which is different code.  It’s a totally unrelated code base.

If the clone operating system isn’t a PERFECT clone of MS-DOS (not a good clone, a perfect clone), then Windows is going to fail in mysterious and magical ways.  Your app might lose data.  Windows might corrupt the hard disk.
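
Here is a toy illustration, in Python, of why "good" isn't good enough when the caller hard-codes an offset into someone else's data structure.  The block layouts, field names, and pointer values are all invented for the example; they are not the real MS-DOS or any clone's layout.

```python
import struct

# Offset the caller hard-codes (an assumption made for this example).
SFT_POINTER_OFFSET = 4


def build_info_block(fields):
    """Pack a list of (name, pointer) pairs as consecutive 32-bit values."""
    return b"".join(struct.pack("<I", pointer) for _name, pointer in fields)


def read_sft_pointer(block):
    """Read 'the SFT pointer' the way a caller with a baked-in offset would."""
    return struct.unpack_from("<I", block, SFT_POINTER_OFFSET)[0]


# The layout the caller was written and tested against.
original = build_info_block([("dpb", 0x1000), ("sft", 0x2000), ("clock", 0x3000)])

# A clone exposing the same information, with two fields in a different order.
clone = build_info_block([("dpb", 0x1000), ("clock", 0x3000), ("sft", 0x2000)])

on_original = read_sft_pointer(original)  # the SFT pointer, as intended
on_clone = read_sft_pointer(clone)        # silently the clock pointer instead
```

Nothing fails at the point of the read.  The caller just gets a valid-looking pointer to the wrong thing and carries on, which is exactly the kind of failure that shows up later as lost data.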

Given the degree to which Windows performed extreme brain surgery on the innards of MS-DOS, it’s not unreasonable for Windows to check that it was operating on the correct patient.

 

Edit: Given that most people aren't going to click on the link to the Schulman article, it makes sense to describe what the AARD check was :)

Edit: Fixed typo, thanks KC

  • "Same way that the onus is on AMD to keep compatibility with Intel," In light of current events, this could be amended to "on each to adopt each other's standards" now that Intel has bowed and implemented the AMD64's extended instruction set. Pedantic, I know. ;p

    Of course, you remember, copy-protection schemes were often far less consumer-friendly a decade or two ago. Who remembers non-standard disk sectors that sometimes worked and sometimes didn't? (Okay, so the RIAA resurrected that one.) There's a good chance someone told the coders that they had to make it only work on MSDOS, without telling them why, and they assumed it was to reduce piracy, so they threw in a bunch of code to make that more difficult. Someone just decided it wasn't worth testing clones for complete compatibility with subtle MSDOS quirks, especially in beta versions.
  • In the base note:

    > If the clone operating system isn’t a
    > PERFECT clone of MS-DOS [...]
    > Windows might corrupt the hard disk.

    Interesting observation. Now can you say why Windows 95 and Windows 2000 corrupt the hard disk when there's no worry about MS-DOS clones?

    For IDE disks for Windows 95A there was the patch REMIDEUP. Windows 95B didn't need this patch.

    For SCSI disks for Windows 95A and Windows 95B there was no patch. It took months and 11,000 yen just for train tickets to vendors in order to track this down, but now it takes just 10 minutes to reproduce. Why do all the fdisk commands of various Windows 95 versions create overlapping partitions on SCSI disks?
    (Exception: if the SCSI adapter has a BIOS, i.e. if it's a desktop machine and the SCSI adapter has a BIOS, then fdisk doesn't always corrupt the drive.)

    With Windows 2000, it's external IDE disks connected through PCMCIA IDE adapters. If the PCMCIA card and the drive are connected before Windows 2000 boots, then Windows 2000 detects phantom corruption during booting, runs CHKDSK, and proceeds to create real corruption. Why?
    (This doesn't happen if booting is completed before connecting the devices.)

    In one case there's some interaction with 16-bit code but it's not a DOS clone, it's an integral part of Windows 95. In the other case there's no 16-bit code. Why weren't measures taken to prevent disk corruption?
  • Norman,
    Neither Win95 nor Win2000 messes with the internals of the underlying operating system.

    If there's a corruption issue in the operating system that's caused by a bug in the OS, then that happens. It's not good, but it happens.

    If there's a corruption issue in the operating system that's caused because a user application messed up an OS internal data structure, that's significantly different.

    In the Win95/Win2000 corruption, was it an end-user application that was corrupting the system, or was it a hardware combination that wasn't tested? It sounds like the latter, not the former.
  • "The Windows 3.1 team COULD have worked to ensure that Windows ran on all MS-DOS clones, you're right. But we're talking about an OS designed to run on machines with significantly less than 1M of RAM, it was far easier to just test with MS-DOS and just say that MS-DOS was required to run Windows."

    Don't get me wrong, I do understand what you're saying, about why it might technically make sense.

    What seems inconsistent to me (aside from the encryption, etc.) is that this doesn't match Microsoft's long established practices with respect to supporting 3rd party software. Despite targeting relatively small machines, even Windows 3.1 had compatibility flags to turn on buggy behavior on a per module basis. ( http://support.microsoft.com/default.aspx?scid=kb;en-us;82860&Product=win ) Despite the fact that it's lower level, adding and testing special case code for DR-DOS doesn't seem that much more significant than the compatibility flags, particularly given that the problems with DR-DOS were understood. Thus, we're left with the fact that DR-DOS, unlike all the software listed on the web page above, is unique in that it had a Microsoft engineer spend time specifically developing tricky, hidden code to display a strange error, rather than spending time on actually working around the problem.

    If the distinction is that DR-DOS is an "Operating System" and Lotus Notes was an "Application" then it's just another example of using market leverage to control competition in particular ways.
  • 8/16/2004 7:53 AM Larry Osterman

    > Neither Win95 nor Win2000 messes with the
    > internals of the underlying operating
    > system.

    Huh?

    In the W95 case, maybe you mean that the protected mode part doesn't mess with the real mode part, but that is still wrong. Microsoft released patches for IDE disks for W95A because the protected mode part DOES mess with the real mode part. Microsoft didn't release patches for SCSI disks for any W95 version because, well, we'll get to that.

    In the W2000 case, what can you possibly mean?

    > In the Win95/Win2000 corruption, was it an
    > end-user application that was corrupting the
    > system,

    No end-user application is involved at all. A fresh installation of W95 on any vendor's PC, attach a PCMCIA SCSI adapter made by any vendor, and attach a SCSI hard disk made by any vendor. If the PCMCIA SCSI adapter is made by anyone other than Adaptec then the vendor's SCSI driver has to be installed; Adaptec's driver was built into W95. The fdisk command is broken. The solution is to use the vendor's partitioner utility that the vendor intended for use under Windows 3.1. Too many vendors didn't know that W95's fdisk was broken so they told customers to use fdisk instead of the vendors' own utilities under W95.

    With a fresh installation of W2000, it isn't even necessary to use a vendor's driver. Windows 2000's built-in IDE driver handles PCMCIA IDE adapters. But something in W2000 is badly broken during boot time. Interesting that for Windows XP Microsoft provided a downloadable patch containing the characters W2K in its name, but for Windows 2000 Microsoft did not provide one. But I didn't test the thing under Windows XP because fortunately I was able to return the device during its warranty period, which was before XP came out.

    > or was it a hardware combination that wasn't
    > tested?

    No kidding. Microsoft built some drivers (and the fdisk command) into its OSes, didn't test them, and produced disastrous results.

    Here in your blog entry you point out the possible disastrous effects of mixing earlier Windows systems with other vendors' DOS clones, so I ask why Microsoft didn't test Microsoft's own OSes in order to avoid the exact same disastrous effects?
  • Norman, there IS no underlying operating system when running Win2K and Win95. They are complete solutions, from the disk drivers to the user interface.

    Windows 3.1 was not a complete solution. It relied on the OS for file services and other stuff (memory management, etc).

  • 8/16/2004 7:53 AM Larry Osterman

    > Neither Win95 nor Win2000 messes with the
    > internals of the underlying operating
    > system.

    8/17/2004 8:07 AM Larry Osterman

    > Norman, there IS no underlying operating
    > system when running Win2K and Win95.

    Right. Now that this is out of the way, can you say why Windows 95 and Windows 2000 were not tested to prevent corruption of hard disks? Your base note explained how bad things could be if Windows 3.1 were combined with another vendor's DOS instead of yours. Things could have been exactly as bad as things actually were (and are) with your company's Windows 95 or Windows 2000 all by themselves. You KNOW how bad disk corruption is. One question is why weren't these tested, but a bigger question is why weren't patches released after customers got to do your testing for you?
  • Norman, that's a truly silly question, IMHO.

    Of course the OS was tested for disk corruption. Not every possible scenario was tested, and obviously a bug was missed. But the testing was done.



  • "Norman, there IS no underlying operating system when running Win2K and Win95. "

    Windows 95 depends just as much on DOS as does WfWG3.11+Win32s, despite the fact that it packaged DOS 7.0 with Windows 4.0 in the same box. Even file system interrupts can get routed through to real mode DOS under the right circumstances (a real mode CD-ROM driver, for example, and I suppose a real mode DOS-based network stack too).

    Windows 2000 is totally different, of course.
  • That's true only because Win95 booted with DOS 7.0 - once it came up, all the DOS code was thrown away (unless there were 3rd party drivers, like the CD-ROM driver mentioned, present).
  • 8/17/2004 5:26 PM Larry Osterman

    > Norman, that's a truly silly question,
    > IMHO.

    Compare it to the silliness of this. A retrospective essay on the AARD code asserted that there were valid technical reasons for the AARD code. One of the asserted reasons is that a mixture of code from Microsoft and other vendors might have been as disastrous as Microsoft's own code actually was all by itself.

    I agree that the possibility was there, the effects would have been disastrous, and Microsoft's code was (and likely still is in Windows 2000) disastrous.

    But I think I'm not going to believe that this was one of the reasons for the AARD code.

    > Of course the OS was tested for disk
    > corruption. Not every possible scenario was
    > tested, and obviously a bug was missed.

    In Windows 95 obviously several bugs were missed, because there was more than one downloadable patch for W95A for bugs causing IDE disk corruption (in fact I probably named the wrong one a few days ago). Why was there no downloadable patch for the bug causing SCSI disk corruption? Microsoft did not remain permanently ignorant about this bug.

    In Windows 2000 maybe there were only one or two missed bugs, maybe. They are more subtle than the Windows 95 case.

    > But the testing was done.

    Indeed there are additional clues besides your statement, demonstrating that testing was done. One SCSI card vendor had performed testing before I got hit by it, but unfortunately the first four SCSI cards I experimented with were from vendors who didn't know about it. The fact became more widely known later, other SCSI card vendors tested, AND Microsoft almost surely tested. Why didn't Microsoft provide a downloadable patch?

    (Actually I have seen clues about an answer to this too, your company knowingly and willfully cared not a fig for the amount of damage your company did to end users outside of the US. But if you have any more useful answers, please let's hear them.)

    For Windows 2000 and PCMCIA IDE hard disks, observe that testing was done for Windows XP while Windows 2000 was still on the market. Why was a fix released for XP and not for 2000?

    There is one case that might or might not be a Windows 2000 bug. If the user attaches a SCSI disk that had been fdisk'ed by Windows 95, and opens Windows 2000 disk manager, Windows 2000 accurately determines that one of the logical partitions is corrupt but fails to determine that a second one is corrupt. But if the user tells Windows 2000 disk manager to delete the corrupt logical partition, it actually deletes both, without warning. It is understandable how this might have been missed in testing. I don't know if this testing was actually done or not. I almost want to ask if you know, but then you'll answer this instead of the much higher priority questions that I asked above. Please, the others are far more important.
  • Sorry for two in a row. This appeared while I was editing my previous posting.

    8/18/2004 9:29 AM Larry Osterman replied to mschaef

    > That's true only because Win95 booted with
    > DOS 7.0 - once it came up, all the DOS code
    > was thrown away (unless there were 3rd party

    Some of the DOS code was not thrown away. The protected mode and real mode code interacted with each other, resulting in at least two bugs for which Microsoft provided downloadable fixes and at least one for which Microsoft didn't. In some cases 3rd party drivers interacted too but they were not at fault. The same bugs were manifest when the user tried devices whose drivers were all built into Windows 95.
