Larry Osterman's WebLog

Confessions of an Old Fogey

AARDvarks in your code.

If there was ever a question that I’m a glutton for punishment, this post should prove it.

We were having an email discussion the other day, and someone asked:

Isn't there a similar story about how DOS would crash when used with [some non-MS thing] and only worked with [some MS thing]? I don't remember what the "thing" was though =)

Well, the only case I could think of that fit was the old AARD code in Windows. Andrew Schulman wrote a great article on it back in the early 1990’s, which dissected the code pretty thoroughly.

The AARD code in Windows was code to detect when Windows was running on a cloned version of MS-DOS, and to disable Windows on that cloned operating system.  By the time that Windows 3.1 shipped, it had been pulled from Windows, but the vestiges of the code were left behind.  As Andrew points out, the code was obfuscated, and had debugger-hiding logic, but it could be reverse engineered, and Andrew did a great job of doing it.

I can’t speak as to why the AARD code was obfuscated. I have no explanation for that; it seems totally stupid to me. But I’ve got to say that I totally agree with the basic concept of Windows checking for an alternative version of MS-DOS and refusing to run on it.

The thing is that the Windows team had a problem to solve, and they didn’t care how they solved it.  Windows decided that it owned every part of the system, including the internal data structures of the operating system.  It knew where those structures were located, it knew what the size of those data structures was, and it had no compunction against replacing those internal structures with its own version.  Needless to say, from a DOS developer’s standpoint, keeping Windows working was an absolute nightmare.

As a simple example, when Windows started up, it increased the size of MS-DOS’s internal file table (the SFT, the table that was created by the FILES= line in config.sys). It did that to allow more than 20 files to be opened on the Windows system (a highly desirable goal for a multi-tasking operating system). But it did that by using an undocumented API call, which returned a pointer to a set of “interesting” pointers in MS-DOS. It then indexed a known offset relative to that pointer, and replaced the value of the master SFT with its own version of the SFT.

When I was working on MS-DOS 4.0, we needed to support Windows. Well, it was relatively easy to guarantee that our SFT was at the location that Windows was expecting. But the problem was that the MS-DOS 4.0 SFT was 2 bytes larger than the MS-DOS 3.1 SFT. In order to get Windows to work, I had to change the DOS loader to detect when win.com was being loaded. If it was, I looked at the code at an offset relative to the base code segment, and if it was a “MOV” instruction, and the amount being moved was the old size of the SFT, I patched the instruction in memory to reflect the new size of the SFT! Yup, MS-DOS 4.0 patched the running Windows binary to make sure Windows would still continue to work.
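The loader check described above can be sketched roughly as follows. This is a Python simulation over a byte buffer, not the MS-DOS loader code. The patch offset, the choice of the MOV CX, imm16 form (opcode 0xB9), and both size values are placeholders invented for illustration, since the post does not give the real ones. Only the shape of the logic comes from the description: match the program name, match the instruction, match the old size, then rewrite the immediate in memory.

```python
# Sketch of "patch the MOV immediate if it still holds the old SFT size".
# Every constant below is a hypothetical placeholder, not a value from MS-DOS.

MOV_CX_IMM16 = 0xB9               # x86 "MOV CX, imm16"; which MOV form Windows used is assumed
PATCH_OFFSET = 0x10               # hypothetical offset from the base of the code segment
OLD_SFT_SIZE = 0x0040             # hypothetical "old" size baked into win.com
NEW_SFT_SIZE = OLD_SFT_SIZE + 2   # the post says the newer SFT was 2 bytes larger


def patch_sft_size(image: bytearray, program_name: str) -> bool:
    """Patch the loaded image in place; return True only if a patch was applied."""
    # Only the one known program is ever touched.
    if program_name.lower() != "win.com":
        return False
    # The image must be long enough to hold the opcode plus its 16-bit operand.
    if len(image) < PATCH_OFFSET + 3:
        return False
    # The byte at the known offset must be the expected MOV opcode.
    if image[PATCH_OFFSET] != MOV_CX_IMM16:
        return False
    # The immediate operand (little-endian) must be exactly the old size.
    operand = slice(PATCH_OFFSET + 1, PATCH_OFFSET + 3)
    if int.from_bytes(image[operand], "little") != OLD_SFT_SIZE:
        return False
    # All checks passed: rewrite the immediate to the new size.
    image[operand] = NEW_SFT_SIZE.to_bytes(2, "little")
    return True
```

Checking the opcode and the operand before writing means that a build of win.com that does not match the expected pattern is left untouched rather than corrupted.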

Now then, considering how sleazy Windows was about MS-DOS, think about what would happen if Windows ran on a clone of MS-DOS.  It’s already groveling internal MS-DOS data structures.  It’s making assumptions about how our internal functions work, when it’s safe to call them (and which ones are reentrant and which are not).  It’s assuming all SORTS of things about the way that MS-DOS’s code works.

And now we’re going to run it on a clone operating system.  Which is different code.  It’s a totally unrelated code base.

If the clone operating system isn’t a PERFECT clone of MS-DOS (not a good clone, a perfect clone), then Windows is going to fail in mysterious and magical ways.  Your app might lose data.  Windows might corrupt the hard disk.   

Given the degree to which Windows performed extreme brain surgery on the innards of MS-DOS, it’s not unreasonable for Windows to check that it was operating on the correct patient.

 

Edit: Given that most people aren't going to click on the link to the Schulman article, it makes sense to describe what the AARD check was :)

Edit: Fixed typo, thanks KC

  • Windows worked on my copy of DRDOS for starters.

    But anyway, you have this all wrong. It's not your job to ensure that you are running on the right version of DOS, it's the responsibility of the clone OS (DR DOS) to make sure that they behave the same way that you do.

    Same way that the onus is on AMD to keep compatibility with Intel, and if you want to code something that's highly specific to Intel's architecture, that's wholly up to you, but please don't then police the architecture you run on.

    The most you have a right to do is say 'this program might not work correctly'.
  • matthew, you're right, I forgot to mention that it was disabled in the final release.

    And in principle, you're right, it would be dr-dos's responsibility.

    But there's a philosophical shift that has to occur.

    The original philosophy: "If we trash the user's machine because we're running on a non-MSDOS system, it's our fault, because we're sleazy."

    The new philosophy: "If we trash the user's system because we're running on a non-MSDOS system, it's the non-MSDOS system's fault, because they're not compatible enough".

    Both are valid, one puts the onus on Windows, the other on the clone manufacturer. Eventually they decided to put the onus on the clone instead of checking for the clone. The right decision IMHO. But I still understand (and agree with) the reasons behind the original decision.
  • Are there known cases where a customer had a problem (say corrupted data) because of running with DR-DOS? If MS didn't make it clear that only MS-DOS was supported, and a customer got burned because of it... well, then the original idea might have been better.
  • No, I'm not aware of cases where a customer had a problem. On the other hand, I wouldn't be aware of those cases, since I wasn't on the Windows team at the time.

    This is a REALLY fine line that you need to walk, as Raymond's (and some of my) posts have shown. The user's apps worked just fine with DR-DOS. Then they installed Windows and their system became unstable. Whose fault is it? DR-DOS's or Windows?

    The customer's typically going to blame the most recent piece of software installed. And believe me, there are REALLY subtle ways that Windows could have gone wrong.
  • My how times have changed. Ten years ago (or whenever it was) this was the central core of one of the Microsoft anti-trust investigations. Page after page was written about how this proved that Microsoft was anti-competitive.

    Today? Someone defending what Microsoft did at the time barely gets any comments at all.

    Can I apply this same principle to the current set of anti-trust investigations? Will it come out in 10 years that what Microsoft is accused of doing actually made sense?

    Seems to me that fame (or infamy in this case) is transient. And that's a good thing...



  • If it's reasonable for Windows to refuse to run on DR-DOS because it isn't MS-DOS, then it's also reasonable for Windows to:
    i) Say so, clearly. "Non-fatal error detected: error #2726" is hardly a model of straightforwardness.
    ii) Check for DR-DOS the documented way, using INT 21h AX=4452h.

    If, on the other hand, the AARD code is actually checking for behaviour that is known to cause problems and/or data corruption (rather than just a non-MS DOS which might or might not work) then presumably someone at Microsoft knows what those problems are, and could divulge them.
  • John,
    Checking INT 21/4452h works for DR-DOS. Does it work for other MS-DOS clones (there were a couple at the time, DR-DOS was the most popular)? It's easier to check for things that you know are true about MS-DOS than to look for a specific clone operating system.


  • "Will it come out in 10 years that what Microsoft is accused of doing actually made sense? "

    You're implying that the AARD code made sense? Nothing about it made sense. The fact it was encrypted and debugger protected didn't make any sense, and neither does Mr. Osterman's explanation, even though it's made with the benefit of 10 years of 20/20 hindsight.

    I see several main arguments against justifying AARD by saying it "protected the customer". The first is that if it was truly done with the intent of protecting the customer, why was it so heavily guarded? Why was the error message so cryptic? Second, if it was removed in the name of avoiding anti-trust charges, and there truly was a benefit to it, why weren't the specific benefits made known to customers anyway? Third, as Raymond Chen has repeatedly blogged, Microsoft's long-standing policy towards "protecting the customer" from buggy software has classically been to bend over backwards to design their code to work with legacy software. AARD represents the entirely opposite approach.

    I'm not saying that Microsoft is necessarily the same company that they were 10 years ago, but AARD surely represents an organizational low point.
  • mschaef, my explanation wasn't made with 15 years of hindsight, I knew why the AARD code was a good idea at the time.

    I do NOT understand why it was encrypted. That makes absolutely no sense whatsoever, IMHO, and only feeds the suspicion that it was anti-competitive (if it wasn't anti-competitive, there would be no reason to hide the check).

    And Microsoft DID say that it was to protect the customer at the time. The public just didn't buy it. It didn't explain that the AARD code existed because Windows was a sleazy application that trampled on MS-DOS's data structures, all it said was that Windows was tightly coupled with MS-DOS.

    It was pulled because of the firestorm of criticism about it (not the first time that's happened, remember Intel's CPU ID?). But that doesn't mean that the idea wasn't a good one.

    I actually wrote this article because AARD is (in my opinion) a shining example of Microsoft going out of its way to protect the customer.

    Think about it: The customer has a system that worked perfectly running DR-DOS. Then they installed Windows. And all of a sudden, their hard disk got corrupted. Whose fault is that? It's WINDOWS' fault - it was the last application running on the machine.

    The Windows team put the AARD check in to PROTECT customers. Not to hurt them. They pulled it out because people said it was anti-competitive (and a strong argument could be made to say that it was), but pulling it out left customers at risk of data corruption.

    The Windows 3.1 team COULD have worked to ensure that Windows ran on all MS-DOS clones, you're right. But we're talking about an OS designed to run on machines with significantly less than 1M of RAM; it was far easier to just test with MS-DOS and just say that MS-DOS was required to run Windows. The test effort to get the OS working on cloned platforms would be significant.

    Let's give a modern example. Is it the responsibility of the Microsoft Word team to ensure that their application works under WINE?

    After all, WINE is a clone of the Win32 environment, so if Word doesn't work in that cloned environment, it's Word's fault, not the fault of the cloned environment. Right?

    Nope. It's the WINE team's responsibility to ensure that their platform is compatible enough with the Windows platform to ensure that Word works.

    Similarly, it was DR-DOS's responsibility to ensure that DR-DOS was compatible enough to guarantee that Windows (don't forget: Windows was "just" another DOS application) ran on DR-DOS. The AARD check was a safety belt to attempt to ensure that Windows only ran on operating systems that were tested with it. When the Windows team pulled it out, they moved the onus of supporting cloned versions of MS-DOS from Windows to the clone OS vendor.
  • "Checking INT 21/4452h works for DR-DOS. Does it work for other MS-DOS clones"?

    I don't think so. The only other one I can think of that might have been around then was RxDOS, and that can be detected by INT 21/3000h returning BL=5Eh.

    "The AARD check was a safetybelt to attempt to ensure that Windows only ran on operating systems that were tested with it"

    And Microsoft did all they could to make sure that DR DOS was not tested with it.
    <http://www.kegel.com/remedy/archive/final4.html>: "Microsoft had excluded DRI from beta testing DR DOS with the new Windows version 3.1 before its public release"

    So if Novell had been willing to put in the "significant" test effort to get the Windows beta working on DR DOS, they didn't get a chance to.

    "AARD is (in my opinion) a shining example..."

    I think you've picked a pretty poor example. Assuming for the purpose of argument that the idea is a good one, the implementation is shockingly bad.

    As for shining - I think the XOR business and debugger-jimmying hid it under rather too big a bushel for it to shine properly.
  • I completely agree that the XOR business and debugger-jimmying (good turn of phrase btw) was unconscionable. Btw, you're quoting an allegation of DRI in the comment "Microsoft had excluded DRI" - of course DRI claimed that Microsoft was being malicious, that's a court document where DRI is the plaintiff and Microsoft's the defendant. Microsoft DID prevent ANY 3rd party OS vendors from seeing prereleases of Win 3.1; the judge indicated that if Microsoft had a version of MS-DOS that was designed to run with Windows it would be ok.

    And the conclusion of the discussion was that Microsoft DID have technical issues with DR-DOS. From the same document you cite:

    "I tracked down a serious incompatibility with DR-DOS 6 - They don't use the 'normal' device driver interface for >32M partitions. Instead of setting the regular START SECTOR field to 0ffffh and then using a brand new 32-bit field the way MS-DOS has always done, they simply extended the start sector field by 16 bits.

    This seems like a foolish oversight on their part and will likely result in extensive incompatibilities when they try to run with 3rd party device drivers"

    Microsoft's request for a summary judgement was denied not because Microsoft had technical reasons for detecting DR-DOS, but because of other non technical issues (which were significant, don't get me wrong).

    The other thing to keep in mind is that the ruling cited was a motion denying a request for summary judgement - in other words, the judge was denying Microsoft's requests to throw the claims out, not ruling that Microsoft's defenses were without merit.
  • "Btw, you're quoting an allegation of DRI"

    The statement that Microsoft didn't give betas to competing OS vendors isn't preceded by "it is alleged" or "Caldera asserts" like the rest of the paragraph, and it gets repeated at the start of the next section; so I think we can believe it.

    "And the conclusion of the discussion was that Microsoft DID have technical issues with DR-DOS"

    - and had come up with a workaround.

    'It is possible to make Bambi work, assuming we can come up with a reasonably safe method for detecting DRD6.'
  • The version of the quote I found was a "Caldera Alleges" indicating that Microsoft had singled out Caldera. Later on in the ruling the judge makes it clearer that ALL non MS-DOS OS's were excluded from the beta program, not just DR-DOS.

    It's possible to make Microsoft's MS-DOS drivers work on a non MS-DOS system. But is it necessary to make them work?

    An interesting question actually.

    One thing to keep in mind in reading the document (and I read most of it) is that there are two sets of voices quoted. One is the technical team (Phil Barrett, Aaron Reynolds, etc). And there's the management team (BradSi, etc).

    The technical team is almost entirely pointing out valid technical reasons why running Windows on DR-DOS is a bad idea. The management team's the people who are trying to find non-technical reasons to kill DR-DOS.

    I'm a technical guy, not a management guy (as if it's not obvious at this point). I see the technical reasons and understand them. I don't necessarily see (and I don't agree with) the non-technical reasons.

    The good news is that this was 15 years ago, the world and this company are very different places now.
  • "It's possible to make Microsoft's MS-DOS drivers work on a non MS-DOS system. But is it necessary to make them work?"

    That depends if it could have been fixed sufficiently reliably that it didn't fall over if confronted with a subsequent version of DR DOS without this feature. Unlike, say, those weird workarounds that MS-DOS 6 does if it doesn't like the OEM ID on a hard drive partition, which go funny if the last-but-one character of the OEM ID isn't a dot.
    This is digressing rather; but I suspect that the reason DR DOS did the >32M partition support differently from MS-DOS is that it was copying Compaq DOS 3.31, which does the same thing and (of course) predates MS-DOS 4.

    "The good news is that this was 15 years ago, the world and this company are very different places now. "

    The sheer *number* of jokey replies I could make to that...

    I think I'll settle for my 4th choice - "So AARD is the next CPLed project to go on Sourceforge then?"