December, 2004

  • The Old New Thing

    Sometimes people don't like it when you enforce a standard

    • 50 Comments

    Your average computer user wouldn't recognize a standards document if they were hit in the face with it.

    I'm reminded of a beta bug report back in 1996 regarding how Outlook Express (then called "Microsoft Internet Mail and News") handled percent signs in email addresses (I think). The way Outlook Express did it was standards-conformant, and I sent the relevant portion of the RFC to the person who reported the bug. Here's what I got back:

    I have never read the RFC's (most people, I'm sure, haven't) but I know when something WORKS in one mail reader (Netscape) and DOESN'T WORK in another (MSIMN).

    The problem, restated to comply with your RFC:

    MS Internet Mail and News DO NOT HANDLE PERCENT SIGNS like the RFC says.

    That first sentence pretty much captures the reaction most of the world has to standards documents: They are meaningless. If Outlook Express doesn't behave the same way as Netscape, then it's a bug in Outlook Express, regardless of what the standards documents say.

    There are many "strangenesses" in the way Internet Explorer handles certain aspects of HTML when you don't run it in strict mode. For example, did you notice that the font you set via CSS for your BODY tag doesn't apply to tables? Or that invoking the submit method on a form does not fire the onsubmit event? That's because Netscape didn't do it either, and Internet Explorer had to be bug-for-bug compatible with Netscape because web sites relied on this behavior.

    The last paragraph in the response is particularly amusing. The person is using the word "RFC" as a magic word, not knowing what it means. Apparently if you want to say that something doesn't work as you expect, you say that it doesn't conform to the RFC. Whether your expectation agrees with the RFC is irrelevant. (By his own admission, the person who filed the bug didn't even read the RFC.)

  • The Old New Thing

    Don't save anything you can recalculate

    • 15 Comments

    Nowadays, a major barrier to performance for many classes of programs is paging. We saw earlier this year that paging can kill a server. Today, another example of how performance became tied to paging.

    The principle is "Don't save anything you can recalculate." This, of course, seems counterintuitive: Shouldn't you save the answer so you don't have to recalculate it?

    The answer is, "It depends."

    If recalculating the answer isn't very expensive and has good data locality, then you may be better off recalculating it than saving it, especially if saving it reduces locality. For example, if the result is stored in a separate object, you now have to touch a second object—risking a page fault—to get the saved answer.
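    As a minimal sketch of the locality argument (hypothetical names, not from the original post): a value that is cheap to derive from fields the caller is already touching costs nothing extra to recalculate, whereas caching it in a separately allocated object means a second object to fault in.

    ```c
    #include <stdio.h>

    /* Hypothetical example: the area is cheap to recalculate from fields
       the caller is already touching.  Storing it in a separately
       allocated cache object would just add another page to fault in. */
    typedef struct Widget {
        int width;
        int height;
    } Widget;

    /* Recalculate on demand: good data locality, no extra allocation. */
    static int WidgetArea(const Widget *w)
    {
        return w->width * w->height;
    }

    int main(void)
    {
        Widget w = { 6, 7 };
        printf("area = %d\n", WidgetArea(&w));  /* prints "area = 42" */
        return 0;
    }
    ```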

    Last time, we saw how Windows 95 applied this principle so that rebasing a DLL didn't thrash your machine. I'm told that the Access team used this principle to reap significant performance gains. Instead of caching results, they just threw them away and recalculated them the next time they were needed.

    Whether this technique works for you is hard to predict. If your program is processor-bound, then caching computations is probably a good idea. But if your program is memory-bound, then you may be better off getting rid of the cache, since the cache is just creating more memory pressure.

  • The Old New Thing

    How did Windows 95 rebase DLLs?

    • 23 Comments

    Windows 95 handled DLL-rebasing very differently from Windows NT.

    When Windows NT detects that a DLL needs to be loaded at an address different from its preferred load address, it maps the entire DLL as copy-on-write, fixes it up (causing all pages that contain fixups to be dumped into the page file), then restores the read-only/read-write state to the pages. (Larry Osterman went into greater detail on this subject earlier this year.)

    Windows 95, on the other hand, rebases the DLL incrementally. This is another concession to Windows 95's very tight memory requirements. Remember, it had to run on a 4MB machine. If it fixed up DLLs the way Windows NT did, then loading a 4MB DLL and fixing it up would consume all the memory on the machine, pushing out all the memory that was actually worth keeping!

    When a DLL needed to be rebased, Windows 95 would merely make a note of the DLL's new base address, but wouldn't do much else. The real work happened when the pages of the DLL ultimately got swapped in. The raw page was swapped off the disk, then the fix-ups were applied on the fly to the raw page, thereby relocating it. The fixed-up page was then mapped into the process's address space and the program was allowed to continue.

    This method has the advantage that the cost of fixing up a page is not paid until the page is actually needed, which can be a significant savings for large DLLs of mostly-dead code. Furthermore, when a fixed-up page needed to be swapped out, it was merely discarded, because the fix-ups could just be applied to the raw page again.

    And there you have it, demand-paging rebased DLLs instead of fixing up the entire DLL at load time. What could possibly go wrong?

    Hint: It's a problem that is peculiar to the x86.

    The problem is fix-ups that straddle page boundaries. This happens only on the x86 because the x86 architecture is the weirdo, with variable-length instructions that can start at any address. If a page contains a fix-up that extends partially off the start of the page, you cannot apply it accurately until you know whether the part of the fix-up you can't see generated a carry. If it did, then you have to add one to your partial fix-up.

    To record this information, the memory manager associates a flag with each page of a relocated DLL that indicates whether the page contained a carry off the end. This flag can have one of three states:

    • Yes, there is a carry off the end.
    • No, there is no carry off the end.
    • I don't know whether there is a carry off the end.

    To fix up a page that contains a fix-up that extends partially off the start of the page, you check the flag for the previous page. If the flag says "Yes", then add one to your fix-up. If the flag says "No", then do not add one.

    But what if the flag says "I don't know"?

    If you don't know, then you have to go find out. Fault in the previous page and fix it up. As part of the fix-up computations, the flag gets set to indicate whether there is a carry out the end. Once the previous page has been fixed up, you can check the flag (which will no longer say "I don't know"), and that tells you whether or not to add one to the current page.

    And there you have it, demand-paging rebased DLLs instead of fixing up the entire DLL at load time, even in the presence of fix-ups that straddle page boundaries. What could possibly go wrong?

    Hint: What goes wrong with recursion?

    The problem is that the previous page might itself have a fix-up that straddled a page boundary at its start, and the flag for the page two pages back might be in the "I don't know" state. Now you have to fault in and fix up a third page.

    Fortunately, in practice this doesn't go beyond three fix-ups. Three pages of chained fix-ups was the record.

    (Of course, another way to stop the recursion is to do only a partial fix-up of the previous page, applying only the straddling fix-up to see whether there is a carry out and not attempting to fix up the rest. But Windows 95 went ahead and fixed up the rest of the page because it figured, hey, I paid for this page, I may as well use it.)
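    The chained carry logic above can be sketched in plain C. This is a hypothetical toy model (tiny 4-byte pages, a 16-bit fix-up straddling every page boundary, invented names), not the actual Windows 95 memory-manager code:

    ```c
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE   4   /* toy page size */
    #define NPAGES 4

    /* Three-state carry flag, one per page; UNKNOWN is the zero default. */
    enum Carry { CARRY_UNKNOWN, CARRY_NO, CARRY_YES };

    static uint8_t image[NPAGES * PAGE];   /* the "DLL image" */
    static enum Carry carry_out[NPAGES];
    static int paged_in[NPAGES];

    /* Relocation delta to add to each straddling 16-bit value:
       low byte lives at the end of page p, high byte at the start of p+1. */
    static const uint8_t delta_lo = 0xFF;
    static const uint8_t delta_hi = 0x01;

    /* Demand-fault page p and apply its share of the fix-ups. */
    static void FixUpPage(int p)
    {
        if (paged_in[p]) return;
        paged_in[p] = 1;

        if (p > 0) {
            /* The fix-up's high byte is on this page; we need the carry
               from the low byte, which lives on the previous page. */
            if (carry_out[p - 1] == CARRY_UNKNOWN)
                FixUpPage(p - 1);                /* recursive fault-in */
            image[p * PAGE] += delta_hi;
            if (carry_out[p - 1] == CARRY_YES)
                image[p * PAGE] += 1;            /* propagate the carry */
        }

        if (p < NPAGES - 1) {
            /* The low byte of the next straddling fix-up ends this page;
               record whether adding the delta carried out. */
            uint8_t *lo = &image[p * PAGE + PAGE - 1];
            unsigned sum = *lo + delta_lo;
            *lo = (uint8_t)sum;
            carry_out[p] = (sum > 0xFF) ? CARRY_YES : CARRY_NO;
        }
    }

    int main(void)
    {
        /* Seed each straddling value so every addition carries. */
        for (int p = 0; p < NPAGES - 1; p++)
            image[p * PAGE + PAGE - 1] = 0x01;   /* 0x01 + 0xFF carries */

        FixUpPage(NPAGES - 1);   /* fault in only the last page... */

        for (int p = 0; p < NPAGES; p++)  /* ...yet the whole chain got fixed */
            printf("page %d: paged_in=%d carry=%d\n",
                   p, paged_in[p], (int)carry_out[p]);
        return 0;
    }
    ```

    Here the toy deliberately makes every page's carry flag start out "I don't know," so faulting in the last page walks the whole chain; as the post notes, real DLLs never went past three.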

    What was my point here? I don't think I have one. It was just a historical tidbit that I hoped somebody might find interesting.

  • The Old New Thing

    Excellent blog about Windows and Unicode

    • 23 Comments

    Michael Kaplan has probably forgotten more about Unicode than most people know. He knows about the mysterious placement of the Won character in the Korean character set (and likewise the Japanese Yen character), what the invariant locale is, why Korean text sorts strangely if you pass the NORM_IGNORENONSPACE flag, and other strange and wonderful dark corners of keyboard layouts, character sets, and collation.

    Around Microsoft, Michael is the local authority on Unicode. It's great that he's sharing his deep knowledge with the rest of us. (Note: I said "local" authority. Just because he's our main guy doesn't mean that he's your primary contact, too.)

  • The Old New Thing

    Optimization is often counter-intuitive

    • 25 Comments

    Anybody who's done intensive optimization knows that optimization is often counter-intuitive. Things you think would be faster often aren't.

    Consider, for example, the exercise of obtaining the current instruction pointer. There's the naïve solution:

    #include <intrin.h>   // declares the _ReturnAddress intrinsic
    
    __declspec(noinline)
    void *GetCurrentAddress()
    {
      return _ReturnAddress();
    }
    
    ...
    void *currentInstruction = GetCurrentAddress();
    

    If you look at the disassembly, you'll get something like this:

    GetCurrentAddress:
        mov eax, [esp]
        ret
    
    ...
        call GetCurrentAddress
        mov [currentInstruction], eax
    

    "Pah," you say to yourself, "look at how inefficient that is. I can reduce that to two instructions. Watch:

    void *currentInstruction;
    __asm {
    call L1
    L1: pop currentInstruction
    }
    

    That's half the instruction count of your bloated girly-code."

    But if you sit down and race the two code sequences, you'll find that the function-call version is faster by a factor of two! How can that be?

    The reason is the "hidden variables" inside the processor. All modern processors contain much more state than you can see from the instruction sequence. There are TLBs, L1 and L2 caches, all sorts of stuff that you can't see. The hidden variable that is important here is the return address predictor.

    The more recent Pentium (and I believe also Athlon) processors maintain an internal stack that is updated by each CALL and RET instruction. When a CALL is executed, the return address is pushed onto the real stack (the one that the ESP register points to) and also onto the internal return address predictor stack; a RET instruction pops the top address from both the return address predictor stack and the real stack.

    The return address predictor stack is used when the processor decodes a RET instruction. It looks at the top of the return address predictor stack and says, "I bet that RET instruction is going to return to that address." It then speculatively executes the instructions at that address. Since programs rarely fiddle with return addresses on the stack, these predictions tend to be highly accurate.

    That's why the "optimization" turns out to be slower. Let's say that at the point of the CALL L1 instruction, the return address predictor stack looks like this:

    Return address
    predictor stack:
      caller1 -> caller2 -> caller3 -> ...
    Actual stack:   caller1 -> caller2 -> caller3 -> ...

    Here, caller1 is the function's caller, caller2 is the function's caller's caller, and so on. So far, the return address predictor stack is right on target. (I've drawn the actual stack below the return address predictor stack so you can see that they match.)

    Now you execute the CALL instruction. The return address predictor stack and the actual stack now look like this:

    Return address
    predictor stack:
      L1 -> caller1 -> caller2 -> caller3 -> ...
    Actual stack:   L1 -> caller1 -> caller2 -> caller3 -> ...

    But instead of executing a RET instruction, you pop off the return address. This removes it from the actual stack, but doesn't remove it from the return address predictor stack.

    Return address
    predictor stack:
      L1 -> caller1 -> caller2 -> caller3 -> ...
    Actual stack:   caller1 -> caller2 -> caller3 -> caller4 -> ...

    I think you can see where this is going.

    Eventually your function returns. The processor decodes your RET instruction and looks at the return address predictor stack and says, "My predictor stack says that this RET is going to return to L1. I will begin speculatively executing there."

    But oh no, the value on the top of the real stack isn't L1 at all. It's caller1. The processor's return address predictor predicted incorrectly, and it ended up wasting its time studying the wrong code!

    The effects of this bad guess don't end there. After the RET instruction, the return address predictor stack looks like this:

    Return address
    predictor stack:
      caller1 -> caller2 -> caller3 -> ...
    Actual stack:   caller2 -> caller3 -> caller4 -> ...

    Eventually your caller returns. Again, the processor consults its return address predictor stack and speculatively executes at caller1. But that's not where you're returning to. You're really returning to caller2.

    And so on. By mismatching the CALL and RET instructions, you managed to make every single return address prediction on the stack wrong. Notice in the diagram that, unless somebody plays further games with the return address predictor stack, not one of the remaining predictions will be correct: none of the predicted return addresses matches an actual return address.

    Your peephole optimization has proven to be shortsighted.

    Some processors expose this predictor more explicitly. The Alpha AXP, for example, has several types of control flow instructions, all of which have the same logical effect, but which hint to the processor how it should maintain its internal predictor stack. For example, the BR instruction says, "Jump to this address, but do not push the old address onto the predictor stack." On the other hand, the JSR instruction says, "Jump to this address, and push the old address onto the predictor stack." There is also a RET instruction that says, "Jump to this address, and pop an address from the predictor stack." (There's also a fourth type that isn't used much.)

    Moral of the story: Just because something looks better doesn't mean that it necessarily is better.

  • The Old New Thing

    How to get more hits on Google than even Steve Ballmer

    • 36 Comments

    A German blogger named Tony wrote that Robert Scoble is gaining on Steve Ballmer. In other words, a Google search for "Robert Scoble" finds about 172,000 pages, while a Google search for "Steve Ballmer" shows about 302,000 pages. He asked whether anyone can find another Microsoft employee who gets more Google results than Robert Scoble.

    Sure, that's quite easy, and I'll let you in on the secret.

    Search for "David Smith". Goodness gracious! 869,000 results! Even more than Steve Ballmer!

    You just have to find a person who has a very common name.

  • The Old New Thing

    The hunt for a faster syscall trap

    • 14 Comments

    The performance of the syscall trap gets a lot of attention.

    I was reminded of a meeting that took place between Intel and Microsoft over fifteen years ago. (Sadly, I was not at this meeting myself, so the story is second-hand.)

    Since Microsoft is one of Intel's biggest customers, their representatives often visit Microsoft to show off what their latest processor can do, lobby the kernel development team to support a new processor feature, and solicit feedback on what sort of features would be most useful to add.

    At this meeting, the Intel representatives asked, "So if you could ask for only one thing to be made faster, what would it be?"

    Without hesitation, one of the lead kernel developers replied, "Speed up faulting on an invalid instruction."

    The Intel half of the room burst out laughing. "Oh, you Microsoft engineers are so funny!" And so the meeting ended with a cute little joke.

    After returning to their labs, the Intel engineers ran profiles against the Windows kernel and lo and behold, they discovered that Windows spent a lot of its time dispatching invalid instruction exceptions. How absurd! Was the Microsoft engineer not kidding around after all?

    No, he wasn't.

    It so happens that on the 80386 chip of that era, the fastest way to get from V86-mode into kernel mode was to execute an invalid instruction! Consequently, Windows/386 used an invalid instruction as its syscall trap.

    What's the moral of this story? I'm not sure. Perhaps it's that when you create something, you may find people using it in ways you had never considered.

  • The Old New Thing

    This Game Boy won't hurt a bit, just help the Powerpuff Girls count backwards from ten

    • 6 Comments

    Is there nothing a Game Boy can't do? We already learned that it can be played like a musical instrument. Now we discover that letting children play with a Game Boy before surgery is more effective than tranquilizers or a parent's hand at keeping them calm.

    You know when you go to the dentist and she asks you, "What flavor fluoride rinse would you like?" Perhaps someday the anaesthetist will ask you, "Do you want a Game Boy or a Nintendo?"

    (Drat, scooped by Slashdot again.)

  • The Old New Thing

    Why do dialog editors start assigning control IDs with 100?

    • 13 Comments

    When you use a dialog editor and insert new controls, they typically are assigned control IDs starting at around 100. Why?

    Because the small numbers are already taken.

    /*
     * Dialog Box Command IDs
     */
    #define IDOK                1
    #define IDCANCEL            2
    #define IDABORT             3
    #define IDRETRY             4
    #define IDIGNORE            5
    #define IDYES               6
    #define IDNO                7
    #define IDCLOSE             8
    #define IDHELP              9
    #define IDTRYAGAIN         10
    #define IDCONTINUE         11
    

    The dialog manager knows about these special values and assumes that if your dialog box has a control whose ID matches one of these special values, then it also behaves in a certain way.

    The dialog manager assumes that a control whose ID is IDOK is an OK button. If the user hits Enter, the default button will be pushed; if no default button can be found, then the OK button is pushed. Similarly, a control whose ID is IDCANCEL is assumed to be a Cancel button. If the user hits ESC or clicks the X button in the corner, then the Cancel button is pushed.

    If your dialog box has OK and Cancel buttons, make sure to give them the IDOK and IDCANCEL control IDs so that they act like proper OK and Cancel buttons. Conversely, any control with those IDs had better be proper OK and Cancel buttons.
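    For example, a dialog template typically declares its buttons with these well-known IDs. (This is a hypothetical resource-script fragment; IDD_SAMPLE and the coordinates are invented for illustration.)

    ```rc
    // Hypothetical dialog template; OK and Cancel get the well-known IDs.
    IDD_SAMPLE DIALOGEX 0, 0, 200, 80
    STYLE DS_MODALFRAME | WS_POPUP | WS_CAPTION | WS_SYSMENU
    CAPTION "Sample"
    BEGIN
        DEFPUSHBUTTON   "OK",     IDOK,      88, 58, 50, 14
        PUSHBUTTON      "Cancel", IDCANCEL, 142, 58, 50, 14
    END
    ```

    Because the buttons use IDOK and IDCANCEL, the dialog manager routes Enter to the default (OK) button and ESC or the X button to the Cancel button; the dialog procedure just handles the resulting WM_COMMAND messages.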

  • The Old New Thing

    Scientists come one step closer to the perfect poppy-seed bagel

    • 10 Comments

    It's easy to distribute points evenly across a flat surface, but doing so over a curved surface is a much more complicated problem. Even spheres are hard. NPR's Scott Simon interviews mathematician Ed Saff who with colleague Doug Hardin has developed a new method of attacking this complex problem.

    Press release from Vanderbilt University. You can also download the paper (in PDF form) from Dr. Saff's home page, if you think you're smart enough to understand it. (Don't ask me for help. I have two degrees in mathematics and was in over my head by halfway through page 2. I couldn't even make it out of the Introduction.)
