Holy cow, I wrote a book!
Although the International Electronic Commission established the
term kibibyte for 1024 bytes, with the abbreviation KiB,
Windows Explorer continues to use the abbreviation KB.
Why doesn't Explorer get with the program?
Because nobody else is on the program either.
If you look around you, you'll find that nobody (to within experimental error)
uses the terms kibibyte and KiB.
When you buy computer memory, the amount is specified in megabytes
and gigabytes, not mebibytes and gibibytes.
The storage capacity printed on your blank CD is indicated in
Every document on the Internet
(to within experimental error)
which talks about memory and storage
uses the terms kilobyte/KB, megabyte/MB, gigabyte/GB, etc.
You have to go out of your way
to find people who use
the terms kibibyte/KiB, mebibyte/MiB, gibibyte/GiB, etc.
In other words, the entire computing industry has ignored
the guidance of the IEC.
Explorer is just following existing practice.
Everybody (to within experimental error)
refers to 1024 bytes as a kilobyte, not a kibibyte.
If Explorer were to switch to the term kibibyte,
it would merely be
showing users information in a form they cannot understand,
and for what purpose?
So you can feel superior because you know what
that term means and other people don't.
For an explanation of other storage units, you can consult
this helpful chart from xkcd.
When I discussed years ago
why operating system files tend to follow the old 8.3 file name convention,
I neglected to mention why the old MS-DOS filename convention was 8.3
and not, say, 11.2 or 16.16.
It's a holdover from CP/M.
As I noted when I
discussed the old MS-DOS wildcard matching rules,
worked hard at being compatible with CP/M.
And CP/M used 8.3 filenames.
Why did CP/M use 8.3 filenames?
I don't know.
There's nothing obvious in the
CP/M directory format that explains why those two reserved
bytes couldn't have been used to extend the file name to 10.3.
But maybe they figured that eight was a convenient number.
One of the nasty gotchas of the CreateProcess function is
that the lpCommandLine parameter must be a mutable pointer.
If you pass a pointer to memory that cannot be written to
(for example, a pointer to a page that is marked
PAGE_READONLY), then you might crash.
wonders why this parameter is so weird.
Basically, somebody back in the 1980's wanted to avoid allocating memory.
(Another way of interpreting this is that somebody tried to be a bit too
The CreateProcess temporarily modifies the
string you pass as the lpCommandLine in its attempt to figure
out where the program name ends and the command line arguments begin.
Now, it could have made a copy of the string and made its temporary
modifications to the copy,
but hey, if you modify the input string directly, then you save yourself
an expensive memory allocation operation.
Back in the old days,
people worried about avoiding memory allocations,
so this class of micro-optimization is the sort of thing people worried
about as a matter of course.
Of course, nowadays, it seems rather antiquated.
Now, there may also be good technical reasons
(as opposed to merely performance considerations)
for avoiding allocating memory on the heap.
When a program crashes, the
just in time debugger is launched
with the CreateProcess function,
and you don't want to allocate memory on the heap if the reason the
program crashed is that the heap is corrupted.
Otherwise, you can get yourself into a recursive crash loop:
While trying to launch the debugger, you crash, which means you try
to launch the debugger to debug the new crash, which again crashes,
and so on.
The original authors of
the CreateProcess function were careful to avoid
allocating memory off the heap, so that in the case the function is
being asked to launch the debugger, it won't get waylaid by a corrupted heap.
Whether these concerns are still valid today I am not sure,
but it was those concerns that influenced the original design and
therefore the interface.
Why is it that only the Unicode version is affected?
Well, because the ANSI version of the function just converts its
strings to Unicode and the calls the Unicode version of the function.
Consequently, the ANSI version of the function
happens to implement the workaround as a side effect of its original
The string passed to the Unicode version of the function is a temporary
Why is it okay for the ANSI version of the CreateProcess
to allocate a temporary string from the heap when the Unicode function
One of the
five things every Win32 programmer needs to know
is that memory latency
can throw your big-O computations out the window.
Back in 2003, I ran into a concrete example of this.
Somebody started with the algorithm presented in
Fast Algorithms for Sorting and Searching Strings
Jon L. Bentley
implemented it in C#, and compared the performance
against a HashTable, TernaryTree
Surprisingly, the hash table won on insertion and retrieval of
tens of thousands of randomly-generated strings.
Remember, big-O notation hides the constants,
and those constants can get pretty big.
What's more, the impact of those constants is critical
for normal-sized workloads.
The big-O notation allows you to compare algorithms
when the data sets become extremely large,
but you have to keep the constants in mind
to see when the balance tips.
The central point of my presentation at the PDC was
that complexity analysis typically ignores memory bandwidth effects
and assumes that all memory accesses perform equally.
This is rarely true in practice.
As we saw, leaving L2 is a big hit on performance,
and accessing the disk is an even greater hit.
The tree doesn't rebalance,
so inserting strings in alphabetical order
will result in a bad search tree.
(They do call this out in their paper.)
To locate a k-character string,
Bentley-Sedgewick traverses at least k pointers, usually more.
("How much more" depends on how many prefixes are shared.
More shared prefixes = more pointers traversed.)
It also requires k(4p) bytes of memory to store that string,
where p is the size of a pointer.
Remember those pesky constants.
High constant factor overhead starts to kill you
when you have large datasets.
More on those constants:
Complexity analysis assumes that
an add instruction executes in the same amount of time as a memory access.
This is not true in practice,
but the difference is a bounded constant factor,
so it can be ignored for big-O purposes.
Note, however, that that constant often exceeds
one million if you take a page fault.
One million is a big constant.
Going back to memory bandwidth effects:
At each node, you access one character and one pointer.
So you use only 6 bytes out of a 64-byte cache line.
You're wasting 90% of your bus bandwidth and you will certainly fall out of L2.
Bentley-Sedgewick says that this is beaten out
by not traversing the entire string being searched for in the case of a miss.
I.e., their algorithm is tuned for misses.
If you expect most of your probes to be misses, this can be a win.
(The entire string is traversed on a hit,
of course, so there is no gain for hits.)
Note also that this "cost" for traversing the string on a miss
is overstated due to memory bandwidth effects.
The characters in a string are contiguous,
so traversing the string costs you only L/64 cache lines,
where L is the length of the string,
and one potential page fault,
assuming your string is less than 4KB.
Traversing the tree structure costs you at least L cache lines
and probably more depending on your branching factor,
as well as L potential page faults.
Let's look at the testing scenario again.
The testing was only on hits,
so the improved performance on misses was overlooked entirely.
the algorithm takes advantage of strings with common prefixes,
but the testing scenario used randomly-generated strings,
which generates a data set opposite from the one the algorithm
was designed for,
since randomly-generated strings are spread out across the problem space
instead of being clustered with common prefixes.
Those are some general remarks; here are some performance notes
specific to the CLR.
I don't know whether it does or not, but
I would not be surprised if System.String.GetHashCode
caches the hash value in the string,
which would mean that the cost of computing the hash
is shared by everybody who uses it in a hashing operation.
(Therefore, if you count the cost incurred only by the algorithm,
hashing is free.)
Note also that Bentley-Sedgewick's insert() function
stores the object back into the tree in the recursive case.
Most of the time, the value being stored is the same
as the value that was already there.
This dirties the cache line for no reason
(forcing memory bandwidth to be wasted on a flush)
painful for the CLR—you hit
and end up dirting a whole boatload of
A very small change avoids this problem:
p->eqkid = insert(p->eqkid, ++s);
Tptr newkid = insert(p->eqkid, ++s);
if (p->eqkid != newkid) p->eqkid = newkid;
(and similarly in the other branches).
"This code is short but subtle, and worth careful study." How very true.
Note also that if you use their "sleazy hack" of
coercing a string to a Tptr,
you had to have changed the type of eqkid from
Tptr to object.
This introduces a CLR type-check into the inner loop.
Congratulations, you just tubed the inner loop performance.
Now go to the summary at the end of the article.
Notice that the article doesn't claim that ternary trees
are faster than hashing for successful searches.
So if that's all you're testing, you're testing the wrong thing.
One of the big benefits of ternary trees
is the new operations available (4 and 5),
but if you're not going to perform those operations,
then you ended up paying for something you don't use.
There are several morals of the story.
Mind you, Bentley and Sedgewick are not morons. They know all this.
[Typo fixed 11am, thanks Nathan_works and Jachym Kouba.]
the question isn't whether blogs.msdn.com site is under
attack but rather how bad the attack is right now.
There are a number of regular culprits, like
but those sites tend to focus on the most recent few articles.
A new category of trackback spammer is here:
The I'm going to scrape your entire site and create a trackback
for every article trackback spammer.
I'm pretty sure this will continue for at least the next week.
I think I'm going to have to write a script that auto-deletes
all these bogus trackbacks.
If you spend time in kernel mode, you're accustomed to seeing functions
with two-letter (or occasionally, three-letter) prefixes
that indicate which component they belong to.
What does the "Zw" mean?
The people who chose the letters wanted to pick something that
was unlikely to collide with anything.
Perhaps they had a prior bad experience with having chosen
a prefix, only to find that somebody ahead of them claimed it already?
if there's an easy way to consume all the virtual address space below 4GB,
short of, well, actually allocating it.
"It seems like there should be a cleaner way to do this."
If you want to consume all the virtual address space, then call
VirtualAlloc until you turn blue.
Programs shouldn't care what address they get back from a memory
they should handle values below 2GB and
with equal facility.
It's not like there's a
(Is there a
What would be the point of such a function?
Once you call it, you have run out of memory!
If Mr. Chaboud is talking about keeping programs away from
bottom 4GB of virtual address space on a 64-bit machine,
then a much easier way to do this is to
set the AllocationPreference
configuration setting to specify that memory should be allocated
from high addresses first.
(But I don't think that's the scenario that prompted the original
question, because on 64-bit Windows,
the default heap is above the 4GB boundary,
so there would be no need to exhaust the heap in order to consume
the memory at virtual addresses below 4GB.)
Pavel Lebedinsky points out
that the default heap is below 4GB on 64-bit machines.
It used to be above the 4GB boundary on earlier versions of 64-bit
Windows, but I guess they changed it.
NPR's Monkey See blog for finding this:
These fancy-dancy IP phones never cease to make me wonder
what our world has come to.
I remember when a telephone was a bucket of carbon granules
and a momentary switch,
and the rotary dial was just a mechanism for going on and off hook
at a precise frequency.
(Indeed, sometimes for fun, I'll pulse-dial a phone
by tapping on the hook.)
The other day, somebody sent out an email message:
I'm amusing myself watching the "CPU Load" graph on the phone.
Then again, I'm easily amused.
Naturally, the CPU meter is useless,
because you can only switch to it when you're not using the phone.
Once you pick up the handset, the CPU meter dismisses itself
so you can see information about the call you're on.
Oh wait, you can go through the menus to switch back to the
CPU meter after it auto-dismisses itself.
Woo-hoo, now I can tell people,
"Sorry, can you talk slower?
My phone's CPU is maxing out."
This what-barely-qualifies-as-amusement didn't last long.
A year later, the units were replaced with a different
model phone that didn't have a CPU meter.
I wrote some time ago that
the window manager generates spurious WM_MOUSEMOVE
in order to let programs know that the mouse has entered the window
even if the reason was a window hierarchy reshuffle rather than
a physical motion of the pointing device.
But some people have noticed that
that explanation fails to account for all the
WM_MOUSEMOVE messages that are being delivered.
In particular, the reasoning fails to explain why a stream of
WM_MOUSEMOVE messages is being generated.
So where is this infinite stream of WM_MOUSEMOVE messages
coming from, even when the window hierarchy is stable?
They're most likely
coming from some third party so-called enhancement software.
The Windows performance and mobility teams keep a close eye on these
sort of continuous phenomena.
The performance folks are interested
because this continuous stream of messages
suck away resources that could be used for something more productive.
It's not just the cost in CPU.
The memory in the message handling code path can't be paged out
since it's being hit all the time.
The context switches force CPU caches to be flushed.
Algorithms which had been tuned to reside entirely within the L2 cache
now find themselves going out to slower main memory.
Meanwhile, the mobility team is concerned because all this continuous
activity runs down your battery,
prevents the CPU from going into a low-power state,
prevents your screen saver from kicking in.
If you find a continuous stream of WM_MOUSEMOVE messages,
then there is some continuous activity going on that is causing it.
It might be some software that is polling the mouse in order to
provide "extra value".
For example, they might check the mouse cursor position and try to guess
what it is positioned over,
but they never realize that "You know, if nothing has changed,
the answer is probably the same as it as last time you checked."
Or, as the author of linked posting above eventually figured out,
it might be a buggy wireless mouse which not only is sucking down your CPU
and preventing your screen saver from running, but is also draining
your wireless mouse battery!