A bunch of us were going through some Windows crashes
that people sent in by clicking the "Send Error Report" button
in the crash dialog.
And there were huge numbers of them that made no sense whatsoever.
For example, there would be code sequences like this:
mov ecx, dword ptr [someValue]
mov eax, dword ptr [otherValue]
cmp ecx, eax
Yet when we looked at the error report,
which this code should produce only when the two values differ,
the ecx and eax registers were equal!
There were other crashes of a similar nature,
where the CPU simply lost its marbles and
did something "impossible".
We had to mark these crashes as "possibly hardware failure".
Since the crash reports are sent anonymously,
we have no way of contacting the submitter to ask them
follow-up questions.
(The failures my group was investigating
had been hit only once or twice, but were
deemed worthy of close investigation
because of the types of errors they uncovered.)
One of my colleagues had a large collection of failures
where the program crashed at the instruction
xor eax, eax
How can you crash on an instruction that simply sets a register to zero?
And yet there were hundreds of people crashing in precisely this way.
He went through all the published errata to see whether any of them
would affect an "xor eax, eax" instruction. Nothing.
He sent email to some Intel people he knew to see if they could
think of anything.
[Aside from overclocking, of course.
- Added because people apparently take my stories hyperliterally
and require me to spell out the tiniest detail,
even the stuff that is so obvious that it should go without saying.
I didn't want to give away the story's punch line too soon!]
They said that the only [other] thing they could think of was that perhaps
somebody had mis-paired RAM on their motherboard, but their
description of what sorts of things go wrong when you mis-pair
didn't match this scenario.
Since the failure rate for this particular error was comparatively
high (certainly higher than the one or two I was getting for
the failures I was looking at),
he requested that the next ten people to encounter this error
be given the opportunity to leave their email address and telephone
number so that he could call them and ask follow-up questions.
Some time later, he got word that ten people took him up on this
offer, and he sent each of them email asking them various questions
about their hardware configurations, including whether they were
overclocking.
[- Continuing from above aside:
See? Obviously overclocking was considered as a possibility.]
Five people responded saying, "Oh, yes, I'm overclocking.
Is that a problem?"
The other half said, "What's overclocking?"
He called them and walked them through some configuration information
and was able to conclude that they were indeed all overclocked.
But these people were not overclocking on purpose.
The computer was already overclocked
when they bought it.
These "stealth overclocked" computers
came from small, independent "Bob's Computer Store"-type shops,
not from one of the major computer manufacturers or retailers.
For both groups, he suggested that they stop overclocking or at least
not overclock as aggressively.
And in all cases, the people reported that their computer, which used
to crash regularly, now ran smoothly.
Moral of the story:
There's a lot of overclocking out there,
and it makes Windows look bad.
I wonder if it'd be possible to detect overclocking from software
and put up a warning in the crash dialog,
"It appears that your computer is overclocked.
This may cause random crashes.
Try running the CPU at its rated speed to improve stability."
But it takes only one false positive to get people saying,
"Oh, there goes Microsoft
blaming other people for its buggy software again."
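To make the false-positive worry concrete, here is a minimal sketch of what such a check might boil down to. Everything in it is an assumption for illustration: the function name, the 5% tolerance, and above all the premise that software can get a trustworthy measured clock speed and a trustworthy rated clock speed in the first place. It is not how Windows does (or would do) anything.

```python
def looks_overclocked(measured_mhz: float, rated_mhz: float,
                      tolerance: float = 0.05) -> bool:
    """Hypothetical check: is the measured clock well above the rated clock?

    measured_mhz and rated_mhz are assumed to come from somewhere reliable;
    obtaining them is the hard part and is not shown here.
    tolerance is the fraction above the rated speed that we forgive
    before deciding to warn (0.05 means 5%).
    """
    if rated_mhz <= 0:
        raise ValueError("rated_mhz must be positive")
    # Warn only when the measurement exceeds the rating by more than
    # the allowed margin.
    return measured_mhz > rated_mhz * (1.0 + tolerance)


if __name__ == "__main__":
    print(looks_overclocked(3000.0, 2400.0))  # far above the rated speed
    print(looks_overclocked(2410.0, 2400.0))  # within the margin
```

The tolerance is where the false positives live. Set it too tight and ordinary measurement jitter puts up the warning on a machine running at its rated speed; set it too loose and the mildly overclocked machines slip through.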