Holy cow, I wrote a book!
Nowadays, a major barrier to performance for many classes of programs is paging. We saw earlier this year that paging can kill a server. Today, another example of how performance can become tied to paging.
The principle is "Don't save anything you can recalculate." This, of course, seems counterintuitive: Shouldn't you save the answer so you don't have to recalculate it?
The answer is, "It depends."
If recalculating the answer isn't very expensive and has good data locality, then you may be better off recalculating it than saving it, especially if saving it reduces locality. For example, if the result is stored in a separate object, you now have to touch a second object—risking a page fault—to get the saved answer.
Last time, we saw how Windows 95 applied this principle so that rebasing a DLL didn't thrash your machine. I'm told that the Access team used this principle to reap significant performance gains. Instead of caching results, they just threw them away and recalculated them the next time they were needed.
Whether this technique works for you is hard to predict. If your program is processor-bound, then caching computations is probably a good idea. But if your program is memory-bound, then you may be better off getting rid of the cache, since the cache is just creating more memory pressure.
Windows 95 handled DLL-rebasing very differently from Windows NT.
When Windows NT detects that a DLL needs to be loaded at an address different from its preferred load address, it maps the entire DLL as copy-on-write, fixes it up (causing all pages that contain fixups to be dumped into the page file), then restores the read-only/read-write state to the pages. (Larry Osterman went into greater detail on this subject earlier this year.)
Windows 95, on the other hand, rebases the DLL incrementally. This is another concession to Windows 95's very tight memory requirements. Remember, it had to run on a 4MB machine. If it fixed up DLLs the way Windows NT did, then loading a 4MB DLL and fixing it up would consume all the memory on the machine, pushing out all the memory that was actually worth keeping!
When a DLL needed to be rebased, Windows 95 would merely make a note of the DLL's new base address, but wouldn't do much else. The real work happened when the pages of the DLL ultimately got swapped in. The raw page was swapped off the disk, then the fix-ups were applied on the fly to the raw page, thereby relocating it. The fixed-up page was then mapped into the process's address space and the program was allowed to continue.
This method has the advantage that the cost of fixing up a page is not paid until the page is actually needed, which can be a significant savings for large DLLs of mostly-dead code. Furthermore, when a fixed-up page needed to be swapped out, it was merely discarded, because the fix-ups could just be applied to the raw page again.
And there you have it, demand-paging rebased DLLs instead of fixing up the entire DLL at load time. What could possibly go wrong?
Hint: It's a problem that is peculiar to the x86.
The problem is fix-ups that straddle page boundaries. This happens only on the x86 because the x86 architecture is the weirdo, with variable-length instructions that can start at any address. If a page contains a fix-up that extends partially off the start of the page, you cannot apply it accurately until you know whether or not the part of the fix-up you can't see generated a carry. If it did, then you have to add one to your partial fix-up.
To record this information, the memory manager associates a flag with each page of a relocated DLL that indicates whether the page contained a carry off its end. This flag can have one of three states: "Yes, there was a carry," "No, there was no carry," and "I don't know" (the page hasn't been fixed up yet).
To fix up a page that contains a fix-up that extends partially off the start of the page, you check the flag for the previous page. If the flag says "Yes", then add one to your fix-up. If the flag says "No", then do not add one.
But what if the flag says "I don't know?"
If you don't know, then you have to go find out. Fault in the previous page and fix it up. As part of the computations for the fix-up, the flag will get set to indicate whether there is a carry out of the end. Once the previous page has been fixed up, you can check the flag (which will no longer be "I don't know"), and that will tell you whether or not to add one to the current page.
And there you have it, demand-paging rebased DLLs instead of fixing up the entire DLL at load time, even in the presence of fix-ups that straddle page boundaries. What could possibly go wrong?
Hint: What goes wrong with recursion?
The problem is that the previous page might itself have a fix-up that straddled a page boundary at its start, and the flag for the page two pages back might be in the "I don't know" state. Now you have to fault in and fix up a third page.
Fortunately, in practice this doesn't go beyond three fix-ups. Three pages of chained fix-ups was the record.
(Of course, another way to stop the recursion is to do only a partial fix-up of the previous page, applying only the straddling fix-up to see whether there is a carry out and not attempting to fix up the rest. But Windows 95 went ahead and fixed up the rest of the page because it figured, hey, I paid for this page, I may as well use it.)
What was my point here? I don't think I have one. It was just a historical tidbit that I hoped somebody might find interesting.
Michael Kaplan has probably forgotten more about Unicode than most people know. He knows about the mysterious placement of the Won character in the Korean character set (and likewise the Yen character in the Japanese one), what the invariant locale is, why Korean text sorts strangely if you pass the NORM_IGNORENONSPACE flag, and other strange and wonderful dark corners of keyboard layouts, character sets, and collation.
Around Microsoft, Michael is the local authority on Unicode. It's great that he's sharing his deep knowledge with the rest of us. (Note: I said "local" authority. Just because he's our main guy doesn't mean that he's your primary contact, too.)
Anybody who's done intensive optimization knows that optimization
is often counter-intuitive.
Things you think would be faster often aren't.
Consider, for example, the exercise of obtaining the current instruction pointer.
There's the naïve solution:
void *currentInstruction = GetCurrentAddress();
If you look at the disassembly, you'll get something like this:

    GetCurrentAddress proc
        mov eax, [esp]
        ret
    GetCurrentAddress endp

        call GetCurrentAddress
        mov [currentInstruction], eax
"Pah," you say to yourself, "look at how inefficient that is.
I can reduce that to two instructions. Watch:
        call L1
    L1: pop currentInstruction
That's half the instruction count of your bloated solution.
But if you sit down and race the two code sequences, you'll find
that the function-call version is faster by a factor of two!
How can that be?
The reason is the "hidden variables" inside the processor.
All modern processors contain much more state than you can see
from the instruction sequence. There are TLBs, L1 and L2
caches, all sorts of stuff that you can't see.
The hidden variable that is important here is the
return address predictor.
The more recent Pentium (and I believe also Athlon)
processors maintain an internal
stack that is updated by each CALL
and RET instruction.
When a CALL is executed, the return address is pushed both onto
the real stack (the one that the ESP register points to) as well
as to the internal return address predictor stack; a RET
instruction pops the top address of the return address predictor stack
as well as the real stack.
The return address predictor stack is used when the processor
decodes a RET instruction.
It looks at the top of the return address predictor stack and
says, "I bet that RET instruction is going to return to that address."
It then speculatively executes the instructions at that address.
Since programs rarely fiddle with return addresses on the stack,
these predictions tend to be highly accurate.
That's why the "optimization" turns out to be slower.
Let's say that at the point of the CALL L1
instruction, the return address predictor stack looks like this:

    Return address predictor:  caller1  caller2  caller3  caller4
    Actual stack:              caller1  caller2  caller3  caller4  ...
Here, caller1 is the function's caller,
caller2 is the function's caller's caller,
and so on. So far, the return address predictor stack is right on target.
(I've drawn the actual stack below the return address predictor stack
so you can see that they match.)
Now you execute the CALL instruction.
The return address predictor stack and the actual stack
now look like this:

    Return address predictor:  L1  caller1  caller2  caller3
    Actual stack:              L1  caller1  caller2  caller3  caller4  ...
But instead of executing a RET instruction,
you pop off the return address. This removes it from the
actual stack, but doesn't remove it from the return address predictor stack.
I think you can see where this is going.
Eventually your function returns. The processor decodes your
RET instruction and looks at the
return address predictor stack and says,
"My predictor stack says that this RET is going
to return to L1. I will begin speculatively executing there."
But oh no, the value on the top of the real stack
isn't L1 at all.
It's caller1. The processor's return address predictor
predicted incorrectly, and it ended up wasting its time studying
the wrong code!
The effects of this bad guess don't end there.
After the RET instruction, the return address
predictor stack looks like this:

    Return address predictor:  caller1  caller2  caller3
    Actual stack:              caller2  caller3  caller4  ...
Eventually your caller returns. Again, the processor consults its
return address predictor stack and speculatively executes
at caller1. But that's not where you're returning
to. You're really returning to caller2.
And so on. By mismatching the
CALL and RET instructions,
you managed to cause every single return address prediction on the stack
to be wrong. Notice in the diagram that,
in the absence of somebody playing games with the return address
predictor stack of the type that created the problem initially,
not a single prediction on the return address
predictor stack will be correct.
None of the predicted return addresses match up with actual return addresses.
Your peephole optimization has proven to be shortsighted.
Some processors expose this predictor more explicitly.
The Alpha AXP, for example, has several types of control flow
instructions, all of which have the same logical effect,
but which hint to the processor how it should maintain its
internal predictor stack.
For example, the
BR instruction says, "Jump to this address, but do not
push the old address onto the predictor stack."
On the other hand, the
JSR instruction says, "Jump to this address, and push
the old address onto the predictor stack."
There is also a RET instruction that says,
"Jump to this address, and pop an address from the predictor stack."
(There's also a fourth type that isn't used much.)
Moral of the story: Just because something looks better
doesn't mean that it necessarily is better.
A German blogger notes that Robert Scoble is gaining on Steve Ballmer.
In other words, a Google search for "Robert Scoble" finds about 172,000 pages, while a Google search for "Steve Ballmer" shows about 302,000 pages. He asked whether anybody can find another Microsoft employee who gets more Google results than Robert Scoble.
Sure, that's easy, and I'll let you in on the secret.
Oh my goodness! 869,000 results!
Even more than Steve Ballmer!
You just have to find a person who has a very common name.
The performance of the syscall trap gets a lot of attention.
I was reminded of a meeting that took place between Intel and
Microsoft over fifteen years ago. (Sadly, I was not myself at this
meeting, so the story is second-hand.)
Since Microsoft is one of
Intel's biggest customers, their representatives often
visit Microsoft to show off what their latest processor can do,
lobby the kernel development team to support a new processor feature,
and solicit feedback on what sort of features would be most useful.
At this meeting, the Intel representatives asked,
"So if you could ask for only one thing to be made faster,
what would it be?"
Without hesitation, one of the lead kernel developers replied,
"Speed up faulting on an invalid instruction."
The Intel half of the room burst out laughing.
"Oh, you Microsoft engineers are so funny!"
And so the meeting ended with a cute little joke.
After returning to their labs, the Intel engineers ran profiles
against the Windows kernel and lo and behold, they discovered
that Windows spent a lot of its time dispatching invalid instruction
exceptions. How absurd! Was the Microsoft engineer not kidding
around after all?
No, he wasn't.
It so happens that on the 80386 chip of that era, the fastest
way to get from V86-mode into kernel mode was to execute an invalid
instruction! Consequently, Windows/386 used an invalid instruction
as its syscall trap.
What's the moral of this story? I'm not sure.
Perhaps it's that when you create something,
you may find people using it in ways you had never considered.
Is there nothing a Game Boy can't do? We already learned that it can be played like a musical instrument. Now we discover that letting children play with a Game Boy before surgery is more effective than tranquilizers or a parent's hand at keeping them calm.
You know when you go to the dentist and she asks you, "What flavor fluoride rinse would you like?" Perhaps someday the anaesthetist will ask you, "Do you want a Game Boy or a Nintendo?"
(Drat, scooped by Slashdot again.)
When you use a dialog editor and insert new
controls, they typically are assigned control IDs
starting at around 100. Why?
Because the small numbers are already taken.
/* Dialog Box Command IDs */
#define IDOK 1
#define IDCANCEL 2
#define IDABORT 3
#define IDRETRY 4
#define IDIGNORE 5
#define IDYES 6
#define IDNO 7
#define IDCLOSE 8
#define IDHELP 9
#define IDTRYAGAIN 10
#define IDCONTINUE 11
The dialog manager knows about these special values
and assumes that if your dialog box has a control whose
ID matches one of these special values, then it also
behaves in a certain way.
The dialog manager assumes that
a control whose ID is IDOK is an OK button.
If the user hits Enter, the default button will be pushed;
if no default button can be found, then the OK button is pushed.
Similarly, a control whose ID is IDCANCEL is assumed to be
a Cancel button.
If the user hits ESC or clicks the X button in the corner,
then the Cancel button is pushed.
If your dialog box has OK and Cancel buttons, make sure to
give them the IDOK and IDCANCEL control IDs so that they
act like proper OK and Cancel buttons. Conversely, any
control with those IDs had better be proper OK and Cancel buttons.
It's easy to distribute points evenly across a flat surface, but doing so over a curved surface is a much more complicated problem. Even spheres are hard. NPR's Scott Simon interviews mathematician Ed Saff who with colleague Doug Hardin has developed a new method of attacking this complex problem.
The Canadian Medical Association Journal traditionally runs an offbeat research paper in their Christmas edition, for which there is apparently huge competition. This year, Tintin goes to the neurologist. The feedback is fun to read too. (External news coverage here and here.)
My first exposure to Tintin was—of course—in Sweden. (Why "of course"? Because it seems that everything I do ties back to Sweden somehow...)
While browsing through a music store's clearance bin, I found an audio dramatization of Den svarta ön. I recognized "Tintin" as the name of a popular children's character, though I myself had never read any of the stories.
I started listening to the CD and found the story amazingly dull. However, I chalked this up to my bad Swedish listening comprehension, figuring that if only I understood more of it, the story would be more enjoyable.
Some months later, I tested this theory: I went to the library, found a copy of The Black Island in English translation, and read it.
It was an amazingly dull story.
During my most recent trip to Taiwan, the person I was telling this story to couldn't figure out what children's character I was talking about. We happened to be in a bookstore and I stumbled across a copy of the same story in Chinese translation. (The Chinese translation of Tintin's name is 丁丁, dīng-dīng, in case anybody else finds themselves in the same jam.) Of course, having found the book, I had to buy it; it's sort of become a collection now. Someday I'll try to read it, but not quite yet. My Chinese is barely at phrase-book level right now.
It has been pointed out that even though Tintin is ostensibly a journalist, over his 45-year career he filed but one story. You'd think his editor would be kind of upset by now.
I also have copies of Harry Potter and the Philosopher's Stone in all (but one) of the various languages I know or am trying to learn. And I'm counting the American and British English versions as different. Because they are.