September, 2004

  • The Old New Thing

    A visual history of spam (and virus) email

    • 156 Comments

    I have kept every single piece of spam and virus email since mid-1997. Occasionally, it comes in handy, for example, to add a naïve Bayesian spam filter to my custom-written email filter. And occasionally I use it to build a chart of spam and virus email.

    The following chart plots every single piece of spam and virus email that arrived at my work email address since April 1997. Blue dots are spam and red dots are email viruses. The horizontal axis is time, and the vertical axis is size of mail (on a logarithmic scale). Darker dots represent more messages. (Messages larger than 1MB have been treated as if they were 1MB.)

    Note that this chart is not scientific. Only mail that makes it past the corporate spam and virus filters shows up on the chart.

    Why does so much spam and virus mail get through the filters? Because corporate mail filters cannot take the risk of accidentally classifying valid business email as spam. Consequently, the filters remove a message only if they have extremely high confidence that it is unwanted.

    Okay, enough dawdling. Let's see the chart.

    Overall statistics and extrema:

    • First message in chart: April 22, 1997.
    • Last message in chart: September 10, 2004.
    • Smallest message: 372 bytes, received March 11, 1998.
      From: 15841.
      To: 15841.
      Subject: About your account...
      Content-Type: text/plain; charset=ISO-8859-1
      Content-Transfer-Encoding: 7bit
      
      P
      
    • Largest message: 1,406,967 bytes, received January 8, 2004. HTML mail with a lot of text including 41 large images. A slightly smaller version was received the previous day. (I guess they figured that their first version wasn't big enough, so they sent out an updated version the next day.)
    • Single worst spam day by volume: January 8, 2004. That one monster message sealed the deal.
    • Single worst spam day by number of messages: August 22, 2002. 67 pieces of spam. The vertical blue line.
    • Single worst virus day: August 24, 2003. This is the winner both by volume (1.7MB) and by number (49). The red splotch.
    • Totals: 227.6MB of spam in roughly 19,000 messages. 61.8MB of viruses in roughly 3500 messages.

    Things you can see on the chart:

    • Spam went ballistic starting in 2002. You could see it growing in 2001, but 2002 was when it really took off.
    • Vertical blue lines are "bad spam days". Vertical red lines are "bad virus days".
    • Horizontal red lines let you watch the lifetime of a particular email virus. (This works only for viruses with a fixed-size payload. Viruses with variable-size payload are smeared vertically.)
    • The big red splotch in August 2003 around the 100K mark is the Sobig virus.
    • The horizontal line in 2004 that wanders around the 2K mark is the Netsky virus.
    • For most of this time, the company policy on spam filtering was not to filter it out at all, because all the filters they tried had too high a false-positive rate. (I.e., they were rejecting too many valid messages as spam.) You can see that in late 2003, the blue dot density diminished considerably. That's when mail administrators found a filter whose false-positive rate was low enough to be acceptable.

    As a comparison, here's the same chart based on email received at one of my inactive personal email addresses.

    This particular email address has been inactive since 1995; all the mail it gets is therefore from harvesting done prior to 1995. (That's why you don't see any red dots: None of my friends have this address in their address book since it is inactive.) The graph doesn't go back as far because I didn't start saving spam from this address until late 2000.

    Overall statistics and extrema:

    • First message in chart: September 2, 2000.
    • Last message in chart: September 10, 2004.
    • Smallest message: 256 bytes, received July 24, 2004.
      Received: from dhcp065-025-005-032.neo.rr.com ([65.25.5.32]) by ...
               Sat, 24 Jul 2004 12:30:35 -0700
      X-Message-Info: 10
      
    • Largest message: 3,661,900 bytes, received April 11, 2003. Mail with four large bitmap attachments, each of which is a Windows screenshot of Word with a document open, each bitmap showing a different page of the document. Perhaps one of the most inefficient ways of distributing a four-page document.
    • Single worst spam day by volume: April 11, 2003. Again, the monster message drowns out the competition.
    • Single worst spam day by number of messages: October 3, 2003. 74 pieces of spam.
    • Totals: 237MB of spam in roughly 35,000 messages.

    I cannot explain the mysterious "quiet period" at the beginning of 2004. Perhaps my ISP instituted a filter for a while? Perhaps I didn't log on often enough to pick up my spam and it expired on the server? I don't know.

    One theory is that the lull was due to uncertainty created by the CAN-SPAM Act, which took effect on January 1, 2004. I don't buy this theory since there was no significant corresponding lull at my other email account, and follow-up reports indicate that CAN-SPAM was widely disregarded. Even in its heyday, compliance was only 3%.

    Curiously, the trend in spam size for this particular account has been downward since 2002, whereas the previous chart showed a clear upward trend since 1997. My theory is that because this second dataset is more focused on current trends, it missed the growth trend of the late 1990s and instead captures the shift in spam from text to <IMG> tags.

  • The Old New Thing

    Why does my mouse/touchpad sometimes go berserk?

    • 67 Comments

    Each time you move a PS/2-style mouse, the mouse sends three bytes to the computer. For the sake of illustration, let's say the three bytes are x, y, and buttons.

    The operating system sees this byte stream and groups them into threes:

    x y b x y b x y b x y b

    Now suppose the cable is a bit jiggled loose and one of the "y"s gets lost. The byte stream loses an entry, but the operating system doesn't know this has happened and keeps grouping them in threes.

    x y b x b x y b x y b x

    The operating system is now out of sync with the mouse and starts misinterpreting all the data. It receives a "y b x" from the mouse and treats the y byte as the x-delta, the b byte as the y-delta, and the x byte as the button state. Result: A mouse that goes crazy.
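
    To make the failure concrete, here is a small standalone C sketch (byte values invented for illustration) that applies the same blind grouping-into-threes to a stream that has lost a byte:

    #include <stdio.h>

    /* Decode a PS/2-style byte stream by blindly grouping it into
       three-byte packets: x-delta, y-delta, buttons. */
    static void decode(const unsigned char *stream, size_t len)
    {
        size_t i;
        for (i = 0; i + 2 < len; i += 3)
            printf("  x=%3u  y=%3u  buttons=%u\n",
                   stream[i], stream[i + 1], stream[i + 2]);
    }

    int main(void)
    {
        /* What the mouse sent: x, y, b repeated. */
        unsigned char sent[]    = { 5, 3, 1,   2, 7, 0,   4, 1, 1 };
        /* The same stream with the second y (the 7) lost in transit. */
        unsigned char garbled[] = { 5, 3, 1,   2, 0,   4, 1, 1 };

        puts("what the mouse meant:");
        decode(sent, sizeof(sent));
        puts("what the OS decodes after the dropped byte:");
        decode(garbled, sizeof(garbled));
        return 0;
    }

    Once the dropped byte shifts the grouping, button states get parsed as motion deltas and motion deltas as button states, which is exactly the berserk behavior.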

    Oh wait, then there are mice with wheels.

    When the operating system starts up, it tries to figure out whether the mouse has a wheel and convinces it to go into wheel mode. (You can influence this negotiation from Device Manager.) If both sides agree on wheeliness, then the mouse generates four bytes for each mouse motion, which therefore must be interpreted something like this:

    x y b w x y b w x y b w x y b w

    Now things get really interesting when you introduce laptops into the mix.

    Many laptop computers have a PS/2 mouse port into which you can plug a mouse on the fly. When this happens, the built-in pointing device is turned off and the PS/2 mouse is used instead. This happens entirely within the laptop's firmware. The operating system has no idea that this switcheroo has happened.

    Suppose that when you turned on your laptop, there was a wheel mouse connected to the PS/2 port. In this case, when the operating system tries to negotiate with the mouse, it sees a wheel and puts the mouse into "wheel mode", expecting (and fortunately receiving) four-byte packets.

    Now unplug your wheel mouse so that you revert to the touchpad, and let's say your touchpad doesn't have a wheel. The touchpad therefore spits out three-byte mouse packets when you use it. Uh-oh, now things are really messed up.

    The touchpad is sending out three-byte packets, but the operating system thinks it's still talking to the mouse that was plugged in originally and continues to expect four-byte packets.

    You can imagine the mass mayhem that ensues.

    Moral of the story: If you're going to hot-plug a mouse into your laptop's PS/2 port, you have a few choices.

    • Always use a nonwheel mouse, so that you can plug and unplug with impunity, since the nonwheel mouse and the touchpad both use three-byte packets.
    • If you turn on the laptop with no external mouse, then you can go ahead and plug in either a wheeled or wheel-less mouse. Plugging in a wheel-less mouse is safe because it generates three-byte packets just like the touchpad. And plugging in a wheeled mouse is safe because the wheeled mouse was not around for the initial negotiation, so it operates in compatibility mode (i.e., it pretends to be a wheel-less mouse). In this case, the mouse works, but you lose the wheel.
    • If you turn on the laptop with a wheel mouse plugged in, never unplug it, because once you do, the touchpad will take over and send three-byte packets, and things will go berserk.

    Probably the easiest way out is to avoid the PS/2 mouse entirely and just use a USB mouse. This completely sidesteps the laptop's PS/2 switcheroo.

  • The Old New Thing

    The x86 architecture is the weirdo

    • 67 Comments

    The x86 architecture does things that almost no other modern architecture does, but due to its overwhelming popularity, people think that the x86 way is the normal way and that everybody else is weird.

    Let's get one thing straight: The x86 architecture is the weirdo.

    The x86 has a small number (8) of general-purpose registers; the other modern processors have far more. (PPC, MIPS, and Alpha each have 32; ia64 has 128.)

    The x86 uses the stack to pass function parameters; the others use registers.

    The x86 forgives access to unaligned data, silently fixing up the misalignment. The others raise a misalignment exception, which the supervisor can optionally handle by emulating the access, at an amazingly huge performance penalty.
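
    For instance, this sketch runs fine on an x86 but traps on an alignment-strict machine; copying through memcpy is the portable idiom:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned char buffer[8] = { 0, 0x12, 0x34, 0x56, 0x78, 0, 0, 0 };
        unsigned int value;

        /* Unaligned read: buffer + 1 is not 4-byte aligned.  The x86
           quietly fixes this up; alignment-strict architectures raise
           an exception instead (which the supervisor may emulate,
           very slowly). */
        value = *(unsigned int *)(buffer + 1);
        printf("direct:  %08x\n", value);

        /* Portable version: memcpy makes no alignment assumptions. */
        memcpy(&value, buffer + 1, sizeof(value));
        printf("memcpy:  %08x\n", value);
        return 0;
    }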

    The x86 has variable-sized instructions. The others use fixed-sized instructions. (PPC, MIPS, and Alpha each have fixed-sized 32-bit instructions; ia64 has fixed-sized 41-bit instructions. Yes, 41-bit instructions.)

    The x86 has a strict memory model, in which external memory accesses occur in the order the code stream issues them. The others have weak memory models, which require explicit memory barriers to ensure that accesses reach the bus (and complete) in a specific order.
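
    As a sketch of what this means in practice, here is the classic message-passing pattern written with C11 atomics; on the x86 the release and acquire operations are essentially free, while on weakly-ordered machines the compiler must emit explicit barrier instructions:

    #include <stdatomic.h>

    int payload;              /* data being handed off */
    atomic_int ready;         /* flag saying the data is in place */

    void producer(void)
    {
        payload = 42;
        /* Release: everything written before this store must become
           visible before the flag does. */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        /* Acquire: if we see the flag, we also see the payload. */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;  /* spin */
        return payload;
    }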

    The x86 supports atomic load-modify-store operations. None of the others do.

    The x86 passes function return addresses on the stack. The others use a link register.

    Bear this in mind when you write what you think is portable code. Like many things, the culture you grow up with is the one that feels "normal" to you, even if, in the grand scheme of things, it is one of the more bizarre ones out there.

  • The Old New Thing

    Why isn't the original window order always preserved when you undo a Show Desktop?

    • 52 Comments

    A commenter asked why the original window order is not always preserved when you undo a Show Desktop.

    The answer is "Because the alternative is worse."

    Guaranteeing that the window order is restored can result in Explorer hanging.

    When you undo a Show Desktop, Explorer goes through and asks each window that it had minimized to restore itself. If each window is quick to respond, then the windows are restored and the original order is preserved.

    However, if there is a window that is slow to respond (or even hung), then it loses its chance and Explorer moves on to the next window in the list. That way, a hung window doesn't cause Explorer to hang, too. But it does mean that the windows restore out of order.
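
    Explorer's actual code isn't reproduced here, but the trade-off can be sketched with SendMessageTimeout; the window list, the restore message, and the timeout below are illustrative assumptions, not the real implementation:

    #include <windows.h>

    /* Sketch: restore a remembered list of windows in their original
       order, without letting any one hung window stall the rest. */
    void RestoreMinimizedWindows(HWND *windows, int count)
    {
        int i;
        for (i = 0; i < count; i++) {
            DWORD_PTR result;
            /* SMTO_ABORTIFHUNG plus a short timeout: if the window
               doesn't answer promptly, skip it and move on.  The
               skipped window restores whenever it wakes up, which is
               why the order can come out wrong. */
            SendMessageTimeout(windows[i], WM_SYSCOMMAND, SC_RESTORE, 0,
                               SMTO_ABORTIFHUNG, 500 /* ms */, &result);
        }
    }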

  • The Old New Thing

    How to find the Internet Explorer binary

    • 45 Comments

    For some reason, some people go to enormous lengths to locate the Internet Explorer binary so they can launch it with some options.

    The way to do this is not to do it.

    If you just pass "IEXPLORE.EXE" to the ShellExecute function, it will go find Internet Explorer and run it.

    ShellExecute(NULL, "open", "iexplore.exe",
                 "http://www.microsoft.com", NULL,
                 SW_SHOWNORMAL);
    

    The ShellExecute function gets its hands dirty so you don't have to.

    (Note: If you just want to launch the URL generically, you should use

    ShellExecute(NULL, "open", "http://www.microsoft.com",
                 NULL, NULL, SW_SHOWNORMAL);
    

    so that the web page opens in the user's preferred web browser. Forcing Internet Explorer should be avoided under normal circumstances; we force it here only because the action is presumably being taken in response to an explicit request to open the web page specifically in Internet Explorer.)

    If you want to get your hands dirty, you can of course do it yourself. It involves reading the specification from the other side, this time the specification on how to register your program's name and path ("Registering Application Path Information").

    The document describes how a program should enter its properties into the registry so that the shell can launch it. To read it backwards, then, interpret this as a list of properties you (the launcher) need to read from the registry.

    In this case, the way to run Internet Explorer (or any other program) the same way ShellExecute does is to look in HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\App Paths\IEXPLORE.EXE (substituting the name of the program if it's not Internet Explorer you're after). The default value is the full path to the program, and the "Path" value specifies a custom path that you should prepend to the environment before launching the target program.

    When you do this, don't forget to call the ExpandEnvironmentStrings function if the registry value's type is REG_EXPAND_SZ. (Lots of people forget about REG_EXPAND_SZ.)
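
    Putting those pieces together, the lookup reads something like this sketch (error handling trimmed; GetAppPath is a made-up helper name):

    #include <windows.h>

    /* Sketch: resolve "IEXPLORE.EXE" (or any registered program) to a
       full path the way the App Paths registration scheme describes. */
    BOOL GetAppPath(const char *exeName, char *buf, DWORD cch)
    {
        char keyName[MAX_PATH], raw[MAX_PATH];
        HKEY hkey;
        DWORD type, size = sizeof(raw);
        LONG rc;

        wsprintfA(keyName,
            "Software\\Microsoft\\Windows\\CurrentVersion\\App Paths\\%s",
            exeName);
        if (RegOpenKeyExA(HKEY_LOCAL_MACHINE, keyName, 0,
                          KEY_QUERY_VALUE, &hkey) != ERROR_SUCCESS)
            return FALSE;

        /* The default value holds the full path to the program. */
        rc = RegQueryValueExA(hkey, NULL, NULL, &type, (BYTE *)raw, &size);
        RegCloseKey(hkey);
        if (rc != ERROR_SUCCESS) return FALSE;

        if (type == REG_EXPAND_SZ)   /* the step people forget */
            ExpandEnvironmentStringsA(raw, buf, cch);
        else
            lstrcpynA(buf, raw, (int)cch);
        return TRUE;
    }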

    Of course, my opinion is that it's much easier just to let ShellExecute do the work for you.

  • The Old New Thing

    Why does Windows keep your BIOS clock on local time?

    • 44 Comments

    Even though Windows NT uses UTC internally, the BIOS clock stays on local time. Why is that?

    There are a few reasons. One is a chain of backwards compatibility.

    In the early days, people often dual-booted between Windows NT and MS-DOS/Windows 3.1. MS-DOS and Windows 3.1 operate on local time, so Windows NT followed suit so that you wouldn't have to keep changing your clock each time you changed operating systems.

    As people upgraded from Windows NT to Windows 2000 to Windows XP, this choice of local time had to be preserved so that people could dual-boot between their previous operating system and the new operating system.

    Another reason for keeping the BIOS clock on local time is to avoid confusing people who set their time via the BIOS itself. If you hit the magic key during the power-on self-test, the BIOS will go into its configuration mode, and one of the things you can configure here is the time. Imagine how confusing it would be if you set the time to 3pm, and then when you started Windows, the clock read 11am.

    "Stupid computer. Why did it even ask me to change the time if it's going to screw it up and make me change it a second time?"

    And if you explain to them, "No, you see, that time was UTC, not local time," the response is likely to be "What kind of totally propeller-headed nonsense is that? You're telling me that when the computer asks me what time it is, I have to tell it what time it is in London? (Except during the summer in the northern hemisphere, when I have to tell it what time it is in Reykjavik!?) Why do I have to remember my time zone and manually subtract four hours? Or is it five during the summer? Or maybe I have to add. Why do I even have to think about this? Stupid Microsoft. My watch says three o'clock. I type three o'clock. End of story."

    (What's more, some BIOSes have alarm clocks built in, where you can program them to have the computer turn itself on at a particular time. Do you want to have to convert all those times to UTC each time you want to set a wake-up call?)
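
    For what it's worth, you can observe the UTC-inside/local-outside split from code: GetSystemTime reports the internal UTC clock, and GetLocalTime applies the current time zone to it.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SYSTEMTIME utc, local;
        GetSystemTime(&utc);    /* what Windows NT keeps internally   */
        GetLocalTime(&local);   /* what your watch (and the BIOS) say */
        printf("UTC:   %02u:%02u\n", utc.wHour, utc.wMinute);
        printf("Local: %02u:%02u\n", local.wHour, local.wMinute);
        return 0;
    }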

  • The Old New Thing

    Sometimes the bug isn't apparent until late in the game

    • 42 Comments

    I didn't debug it personally, but I know the people who did. During Windows XP development, a bug report arrived for a computer game that crashed only after you got to one of the higher levels.

    After many saved and restored games, the problem was finally identified.

    The program does its video work in an offscreen buffer and transfers it to the screen when it's done. When it draws text with a shadow, it first draws the text in black, offset down one and right one pixel, then draws it again in the foreground color.

    So far so good.

    Except that it didn't check whether moving down and right one pixel was going to go beyond the end of the screen buffer.

    That's why it took until one of the higher levels before the bug manifested itself. Not until then did you accomplish a mission whose name contained a lowercase letter with a descender! Shifting the descender down one pixel caused the bottom row of pixels in the character to extend past the video buffer and start corrupting memory.
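
    In spirit, the bug looks like this sketch (buffer layout and names invented for illustration): the shadow pass writes one pixel down and one right without checking the buffer edge.

    /* Sketch of the bug: an offscreen buffer, one byte per pixel. */
    #define WIDTH  640
    #define HEIGHT 480

    unsigned char screen[HEIGHT][WIDTH];

    void shadow_pixel_buggy(int x, int y, unsigned char black)
    {
        /* Shadow pass: draw at (x+1, y+1) -- but nothing checks whether
           y+1 == HEIGHT.  A descender on the bottom row of text writes
           one row past the end of the buffer, corrupting whatever lives
           after it in memory. */
        screen[y + 1][x + 1] = black;
    }

    void shadow_pixel_fixed(int x, int y, unsigned char black)
    {
        if (x + 1 < WIDTH && y + 1 < HEIGHT)
            screen[y + 1][x + 1] = black;
    }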

    Once the problem was identified, fixing it was comparatively easy. The application compatibility team has a bag of tricks, and one of them is called "HeapPadAllocation". This particular compatibility fix adds padding to every heap allocation so that when a program overruns a heap buffer, all that gets corrupted is the padding. Enable that fix for the bad program (specifying the amount of padding necessary, in this case, one row's worth of pixels), and run through the game again. No crash this time.

    What made this interesting to me was that you had to play the game for hours before the bug finally surfaced.

  • The Old New Thing

    How does Windows exploit hyperthreading?

    • 42 Comments

    It depends which version of Windows you're asking about.

    For Windows 95, Windows 98, and Windows Me, the answer is simple: Not at all. These are not multiprocessor operating systems.

    For Windows NT and Windows 2000, the answer is "It doesn't even know." These operating systems are not hyperthreading-aware because they were written before hyperthreading was invented. If you enable hyperthreading, then each of your CPUs looks like two separate CPUs to these operating systems. (And will get charged as two separate CPUs for licensing purposes.) Since the scheduler doesn't realize the connection between the virtual CPUs, it can end up doing a worse job than if you had never enabled hyperthreading to begin with.

    Consider a dual-hyperthreaded-processor machine. There are two physical processors A and B, each with two virtual hyperthreaded processors, call them A1, A2, B1, and B2.

    Suppose you have two CPU-intensive tasks. As far as the Windows NT and Windows 2000 schedulers are concerned, all four processors are equivalent, so the scheduler figures it doesn't matter which two it uses. And if you're unlucky, it'll pick A1 and A2, forcing one physical processor to shoulder two heavy loads (each of which will probably run at something between half speed and three-quarter speed) while leaving physical processor B idle, completely unaware that it could have done a better job by putting one task on A1 and the other on B1.
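
    On those older schedulers, a program that knows the machine's topology can sidestep the bad pairing itself by setting thread affinities. Here is a sketch that assumes the four virtual processors enumerate as A1, A2, B1, B2 in that order; real enumeration order varies, so treat the masks as illustrative:

    #include <windows.h>

    /* Sketch: pin two CPU-bound worker threads to one virtual processor
       on each physical package, so a non-hyperthreading-aware scheduler
       cannot pile both onto the same physical processor. */
    void SpreadAcrossPhysicalProcessors(HANDLE worker1, HANDLE worker2)
    {
        SetThreadAffinityMask(worker1, 1 << 0);  /* A1 */
        SetThreadAffinityMask(worker2, 1 << 2);  /* B1 */
    }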

    Windows XP and Windows Server 2003 are hyperthreading-aware. When faced with the above scenario, those schedulers will know that it is better to put one task on one of the A's and the other on one of the B's.

    Note that even with a hyperthreading-aware scheduler, you can concoct pathological scenarios where hyperthreading ends up a net loss. (For example, if you have four tasks, two of which rely heavily on the L2 cache and two of which don't, you'd be better off putting the two L2-intensive tasks on separate physical processors, since the L2 cache is shared by the two virtual processors. Putting them both on the same physical processor would result in a lot of L2 cache misses as the two tasks fight over L2 cache slots.)

    When you go to the expensive end of the scale (the Datacenter Servers, the Enterprise Servers), things get tricky again. I refer still-interested parties to the Windows Support for Hyper-Threading Technology white paper.

    Update 06/2007: The white paper appears to have moved.

  • The Old New Thing

    Why is the page size on ia64 8K?

    • 31 Comments

    On x86 machines, Windows chooses a page size of 4K because that was the only page size supported by that architecture at the time the operating system was designed. (4MB pages were added to the CPU later, in the Pentium as I recall, but clearly that is too large for everyday use.)

    For the ia64, Windows chose a page size of 8K. Why 8K?

    It's a balance between two competing objectives. Larger page sizes allow more efficient I/O, since you read twice as much data in one go as you would with 4K pages. However, larger page sizes also increase the likelihood that the extra I/O you perform is wasted because of poor locality.

    Experiments were run on the ia64 with various page sizes (even with 64K pages, which were seriously considered at one point), and 8K provided the best balance.

    Note that changing the page size creates all sorts of problems for compatibility. There are large numbers of programs out there that blindly assume that the page size is 4K. Boy are they in for a surprise.
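
    The assumption is also completely unnecessary, since asking the system is a one-liner:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        /* 4096 on x86, 8192 on ia64 -- never hard-code it. */
        printf("page size: %lu bytes\n", si.dwPageSize);
        return 0;
    }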

  • The Old New Thing

    Even in computing, simultaneity is relative

    • 31 Comments

    Einstein discovered that simultaneity is relative. This is also true of computing.

    People will ask, "Is it okay to do X on one thread and Y on another thread simultaneously?" Here are some examples:

    • X = "close a handle" and Y = "use that handle".
    • X = "call UnregisterWaitForSingleObject on a handle", Y = "call UnregisterWaitForSingleObject on that same handle".

    You can answer this question knowing nothing about the internal behavior of those operations. All you need to know are some physics and the answers to much simpler questions about what is valid sequential code.

    Let's do a thought experiment with simultaneity.

    Since simultaneity is relative, any code that does X and Y simultaneously can be observed to have performed X before Y or Y before X, depending on your frame of reference. That's how the universe works.

    So if it were okay to do them simultaneously, then it must also be okay to do them one after the other, since they do occur one after the other if you walk past the computer in the correct direction.

    Is it okay to use a handle after closing it? Is it okay to unregister a wait event twice?

    The answer to both questions is "No," and therefore it isn't okay to do them simultaneously either.

    If you don't like using physics to solve this problem, you can also do it from a purely technical perspective.

    Invoking a function is not an atomic operation. You prepare the parameters, you call the entry point, the function does some work, it returns. Even if you somehow manage to get both threads to reach the function entry point simultaneously (even though as we know from physics there is no such thing as true simultaneity), there's always the possibility that one thread will get pre-empted immediately after the "call" instruction has transferred control to the first instruction of the target function, while the other thread continues to completion. After the second thread runs to completion, the pre-empted thread gets scheduled and begins execution of the function body.

    In that situation, you effectively called the two functions one after the other, despite all your efforts to call them simultaneously. Since you can't prevent this scenario from occurring, you have to code for the possibility that it might actually happen.
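
    To pin down the close-a-handle example from the list above, here is a minimal (hypothetical) program with the race. Since either order of execution can occur, the code must be valid in both orders, and one of those orders is a plain use-after-close bug:

    #include <windows.h>

    HANDLE g_event;  /* shared between the two threads */

    DWORD WINAPI UserThread(LPVOID unused)
    {
        (void)unused;
        /* If this thread runs after CloserThread, it waits on a closed
           (or even recycled) handle -- the sequential bug. */
        WaitForSingleObject(g_event, INFINITE);
        return 0;
    }

    DWORD WINAPI CloserThread(LPVOID unused)
    {
        (void)unused;
        CloseHandle(g_event);
        return 0;
    }

    int main(void)
    {
        HANDLE t[2];
        g_event = CreateEventA(NULL, TRUE, FALSE, NULL);
        /* "Simultaneous" at the source level, but the machine executes
           them in some order -- and you don't get to pick which. */
        t[0] = CreateThread(NULL, 0, UserThread, NULL, 0, NULL);
        t[1] = CreateThread(NULL, 0, CloserThread, NULL, 0, NULL);
        WaitForMultipleObjects(2, t, TRUE, INFINITE);
        return 0;
    }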

    Hopefully this second explanation will satisfy the people who don't believe in the power of physics. Personally, I prefer using physics.
