June, 2009

  • The Old New Thing

    Happening to be at the same post-concert restaurant as symphony performers

    • 13 Comments

    Getting out of the parking garage immediately after the symphony ends is usually a bit of an ordeal, so my symphony group tends to linger downtown for some coffee or dessert, allowing time for the post-symphony traffic jams to clear out. The other night, we went to the Union restaurant just a block from Benaroya Hall. (Verdict: Not a good dessert restaurant.)

    After we placed our orders, violinists Elisa Barston (my current "favorite symphony performer") and Mikhail Shmidt came in and were seated at the table next to us. I tried not to stare, but even though I've sat quite close to Elisa Barston before, this time I was even closer. (There were two other people at their table, but I couldn't tell who they were because they were facing away from me.)

    As another coincidence, it was during the concert we had just attended that Mikhail Shmidt dropped his bow. I didn't see it occur; my guess is that it fell off his lap during a pizzicato section. It was during a relatively loud part of the piece so you had to be pretty close to the stage to hear it fall. I heard the clatter, looked over, and saw the violinist calmly bend over and pick it up. He and his standmate played the next few bars with smiles on their faces, clearly amused by the minor snafu, something to make the performance of an orchestral warhorse a bit more memorable. (I have to admit, I was amused too.)

    One thing I noticed was that whereas my group was out for dessert and coffee, the symphony performers were having dinner. Makes sense. During a performance, you don't want to be overcome by indigestion, a food coma, or an urgent call of nature. (Or receive a phone call for that matter.)

    As we got up to leave, a member of our group struck her head on a low-hanging lamp. In the brief hubbub, Elisa Barston reassured us, "That's okay, I hit my head on that lamp all the time." Which means either that she comes to Union often, or (more likely) she just said that to make us feel less embarrassed.

    This was the third time we found ourselves at the same post-concert restaurant as a symphony performer.

    0. I saw Joshua Roman heading into Wild Ginger.
    1. Gerard Schwarz was having drinks in the lounge at Union.
    2. Maria Larionoff, Gerard Schwarz, the evening's guest soloist, and a number of other people sat at a large round table near ours at The Capital Grill.
    3. Elisa Barston, Mikhail Shmidt, and two others having dinner at Union.

    I guess if I see Susan Gulkis Assadi at a post-symphony restaurant, I'll have completed the entire set of string section principals. (I'm ignoring principal bass, because, well, everybody does...)

    I counted Joshua Roman as a zero in the list above, since we saw him go in, but we were heading to Purple, so it didn't count as a "same restaurant" encounter.

    Bonus chatter 1: I happened to have attended the open rehearsal for this same concert earlier in the week, so I got to hear the second piece on the program performed two-and-a-half times: The first time was by the orchestra without soloist. Then with the soloist. The -and-a-half was all the fragments the conductor rehearsed repeatedly. This was an interesting way to be exposed to the piece for the first time: Hearing the orchestral part first biased my reaction to the piece. Instead of hearing it as a violin concerto, I perceived it more as an orchestral work with solo violin accompaniment. Other interesting things from the rehearsal:

    • String section leaders have much better access to the conductor than other sections. During the rehearsal, the concertmaster and principal second had brief chats with the conductor without even leaving their seats. By comparison, the piccolo player had to hang around near the podium to catch the conductor's attention as he returned from a break.
    • It's an interesting contrast seeing the soloist turn up for the rehearsal in jeans, pink shirt, ponytail, and glasses. Whereas for the performance, she's in a blue dress and wearing contacts. (But she still had the ponytail.)
    • The rehearsal ends promptly at 12:30. Right in the middle of a phrase, boom, everything stops. Union rules.

    Bonus chatter 2: I was tempted to title this entry When out on the stage there arose such a clatter, but the dropped bow was such a small part of the article, and the clatter was relatively insignificant, so it would have been a bit of a misrepresentation.

  • The Old New Thing

    Why do some file operations take file names and others take handles?

    • 29 Comments

    Commenter Brian Friesen asks why some functions (like SetFileAttributes) take a file name, while others (like SetFileTime) take a handle and why we can't have two versions of every API, one for each pattern.

    Second question first: No need to wait for the kernel folks to write such a function; you can already do it yourself!

    // Following "pidgin Hungarian" naming convention, which appears
    // to be dominant in <winbase.h>...
    BOOL SetFileTimesByNameW(LPCWSTR lpFileName,
                             CONST FILETIME *lpCreationTime,
                             CONST FILETIME *lpLastAccessTime,
                             CONST FILETIME *lpLastWriteTime)
    {
      BOOL fRc = FALSE;
      HANDLE hFile = CreateFileW(lpFileName,
                                 FILE_WRITE_ATTRIBUTES,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE |
                                 FILE_SHARE_DELETE,
                                 NULL, OPEN_EXISTING,
                                 FILE_ATTRIBUTE_NORMAL, NULL);
      if (hFile != INVALID_HANDLE_VALUE) {
        fRc = SetFileTime(hFile, lpCreationTime, lpLastAccessTime,
                          lpLastWriteTime);
        CloseHandle(hFile);
      }
      return fRc;
    }
    

    Actually, there wouldn't be two versions of every API but three: (1) handle-based, (2) ANSI, (3) Unicode. (Unless you decide to follow the lead of the NLS team and simply stop adding new ANSI interfaces, in which case you're back to two.)

    Back to the first question: Why are there some functions that take file names and some functions that take handles? Because that's how MS-DOS did it back in 1983! The function to get file attributes took a file name, but the function to modify the file times took a handle. This convention carried over to Windows because Windows preserved the old MS-DOS calling convention for performing disk operations, even though by the time Windows for Workgroups was released in 1992, you weren't really talking to MS-DOS any more. All that was left was the interface.

    The NT team rewrote the operating system from scratch, and under the NT model, practically nothing was done by name. Nearly everything was done by handle. For example, to delete a file, you opened a handle to it and set the "delete on close" attribute, and then you closed the handle.
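
    To make that handle-based pattern concrete, here is a minimal sketch of deleting a file by opening a handle and setting the delete-on-close attribute. It goes through the Win32-level SetFileInformationByHandle function (a later addition to Win32, not the native NT call), and the wrapper name is made up for illustration:

    BOOL DeleteFileByHandleW(LPCWSTR lpFileName)
    {
      BOOL fRc = FALSE;
      // Open a handle with DELETE access.
      HANDLE hFile = CreateFileW(lpFileName,
                                 DELETE,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE |
                                 FILE_SHARE_DELETE,
                                 NULL, OPEN_EXISTING,
                                 FILE_ATTRIBUTE_NORMAL, NULL);
      if (hFile != INVALID_HANDLE_VALUE) {
        // Set the "delete on close" attribute on the handle.
        FILE_DISPOSITION_INFO info = { TRUE };
        fRc = SetFileInformationByHandle(hFile, FileDispositionInfo,
                                         &info, sizeof(info));
        // The file actually goes away when the last handle to it closes.
        CloseHandle(hFile);
      }
      return fRc;
    }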

    Of course, when NT became Windows NT, they had to adapt their interface to the old Windows way of doing things, although (whisper) under the covers, they just open the file and perform the operation on the handle, like our wrapper function above does.

  • The Old New Thing

    Fortune cookie fortunes are getting less and less interesting all the time

    • 42 Comments

    I remember reading a story about the history of fortune cookie fortunes, and how the pool of fortunes has been getting smaller because people keep complaining about them. In the article, they gave as an example that the fortune "You will meet a stranger" was removed from the fortune library because somebody complained that it was too scary.

    The effects of this trend toward meaningless fortunes continue to be felt. A few years ago, I opened a fortune cookie and looked at the slip of paper inside. It read simply

    Someone will give you something.

  • The Old New Thing

    A concrete illustration of practical running time vs big-O notation

    • 29 Comments

    One of the five things every Win32 programmer needs to know is that memory latency can throw your big-O computations out the window. Back in 2003, I ran into a concrete example of this.

    Somebody started with the algorithm presented in Fast Algorithms for Sorting and Searching Strings by Jon L. Bentley and Robert Sedgewick, implemented it in C#, and compared the performance against a Hashtable, a TernaryTree, and a SortedList. Surprisingly, the hash table won on insertion and retrieval of tens of thousands of randomly-generated strings. Why?

    Remember, big-O notation hides the constants, and those constants can get pretty big. What's more, the impact of those constants is critical for normal-sized workloads. The big-O notation allows you to compare algorithms when the data sets become extremely large, but you have to keep the constants in mind to see when the balance tips.

    The central point of my presentation at the PDC was that complexity analysis typically ignores memory bandwidth effects and assumes that all memory accesses perform equally. This is rarely true in practice. As we saw, leaving L2 is a big hit on performance, and accessing the disk is an even greater hit.

    The tree doesn't rebalance, so inserting strings in alphabetical order will result in a bad search tree. (They do call this out in their paper.) To locate a k-character string, Bentley-Sedgewick traverses at least k pointers, usually more. ("How much more" depends on how many prefixes are shared. More shared prefixes = more pointers traversed.) It also requires k(4p) bytes of memory to store that string, where p is the size of a pointer. Remember those pesky constants. High constant factor overhead starts to kill you when you have large datasets.
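
    For reference, the node structure in the paper is essentially the following (transcribed from memory, so the field names may be slightly off). With one-byte characters and 4-byte pointers, alignment rounds each node up to 16 bytes, which is where the k(4p) figure comes from:

    typedef struct tnode *Tptr;
    typedef struct tnode {
        char splitchar;            // one character of the stored string
        Tptr lokid, eqkid, hikid;  // three child pointers
    } Tnode;                       // 1 char + 3 pointers, padded to 4p bytes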

    More on those constants: Complexity analysis assumes that an add instruction executes in the same amount of time as a memory access. This is not true in practice, but the difference is a bounded constant factor, so it can be ignored for big-O purposes. Note, however, that that constant often exceeds one million if you take a page fault. One million is a big constant.

    Going back to memory bandwidth effects: At each node, you access one character and one pointer. So you use only 6 bytes out of a 64-byte cache line. You're wasting 90% of your bus bandwidth and you will certainly fall out of L2.

    Bentley-Sedgewick says that this is beaten out by not traversing the entire string being searched for in the case of a miss. I.e., their algorithm is tuned for misses. If you expect most of your probes to be misses, this can be a win. (The entire string is traversed on a hit, of course, so there is no gain for hits.)
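
    Here is roughly what the search loop looks like, adapted from the paper (the function name is mine). The point is that a miss bails out as soon as it hits a null child pointer, often before examining the whole query string:

    int tstsearch(Tptr root, const char *s)
    {
        Tptr p = root;
        while (p) {
            if (*s < p->splitchar)
                p = p->lokid;              // branch left
            else if (*s > p->splitchar)
                p = p->hikid;              // branch right
            else {
                if (*s++ == 0) return 1;   // matched the terminating NUL: hit
                p = p->eqkid;              // advance to the next character
            }
        }
        return 0;                          // fell off the tree: miss
    }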

    Note also that this "cost" for traversing the string on a miss is overstated due to memory bandwidth effects. The characters in a string are contiguous, so traversing the string costs you only L/64 cache lines, where L is the length of the string, and one potential page fault, assuming your string is less than 4KB. Traversing the tree structure costs you at least L cache lines and probably more depending on your branching factor, as well as L potential page faults.

    Let's look at the testing scenario again. The testing was only on hits, so the improved performance on misses was overlooked entirely. What's more, the algorithm takes advantage of strings with common prefixes, but the testing scenario used randomly-generated strings, which generates a data set opposite from the one the algorithm was designed for, since randomly-generated strings are spread out across the problem space instead of being clustered with common prefixes.

    Those are some general remarks; here are some performance notes specific to the CLR.

    I don't know whether it does or not, but I would not be surprised if System.String.GetHashCode caches the hash value in the string, which would mean that the cost of computing the hash is shared by everybody who uses it in a hashing operation. (Therefore, if you count the cost incurred only by the algorithm, hashing is free.)

    Note also that Bentley-Sedgewick's insert() function stores the object back into the tree in the recursive case. Most of the time, the value being stored is the same as the value that was already there. This dirties the cache line for no reason (forcing memory bandwidth to be wasted on a flush) and—particularly painful for the CLR—you hit the write barrier and end up dirtying a whole boatload of cards. A very small change avoids this problem: Change

        p->eqkid = insert(p->eqkid, ++s); 
    

    to

        Tptr newkid = insert(p->eqkid, ++s);
        if (p->eqkid != newkid) p->eqkid = newkid;
    

    (and similarly in the other branches). "This code is short but subtle, and worth careful study." How very true.

    Note also that if you use their "sleazy hack" of coercing a string to a Tptr, then you must have changed the type of eqkid from Tptr to object. This introduces a CLR type-check into the inner loop. Congratulations, you just tubed the inner loop performance.

    Now go to the summary at the end of the article.

    1. "Ternary trees do not incur extra overhead for insertion or successful searches." I'm not sure what "extra" means here, but hash tables have the same behavior.
    2. "Ternary trees are usually substantially faster than hashing for unsuccessful searches." Notice that they are optimized for misses.
    3. "Ternary trees gracefully grow and shrink; hash tables need to be rebuilt after large size changes." True, but the CLR hashtable does this so you don't have to. Somebody wrote it for you.
    4. "Ternary trees support advanced searches such as partial-match and near-neighbor search." These operations weren't tested.
    5. "Ternary trees support many other operations, such as traversal to report items in sorted order." These operations weren't tested either.

    Notice that the article doesn't claim that ternary trees are faster than hashing for successful searches. So if that's all you're testing, you're testing the wrong thing. One of the big benefits of ternary trees is the new operations available (4 and 5), but if you're not going to perform those operations, then you ended up paying for something you don't use.

    There are several morals to this story.

    1. Constants are important.
    2. Memory bandwidth is important.
    3. Performance optimizations for unmanaged code do not necessarily translate to managed code.
    4. What are you really testing?

    Mind you, Bentley and Sedgewick are not morons. They know all this.

    [Typo fixed 11am, thanks Nathan_works and Jachym Kouba.]

  • The Old New Thing

    Spam trackback attack week 2 statistics

    • 17 Comments

    The trackback spam attack is well into its second week now. The people who run blogs.msdn.com blocked all access from the IP address block, which not only blocks trackbacks but also prevents them from reading the content (and therefore prevents them from scraping).

    Undaunted, the sites just moved to a new IP address.

    Site | From | To | Count | Rate (pings/hr)
    asp-net-hosting.simplynetdev.com | 6/02/2009 07:32 AM | 6/02/2009 07:32 AM | 1 |
    microsoft-sharepoint.simplynetdev.com | 6/02/2009 07:42 AM | 6/02/2009 07:42 AM | 1 |
    outdoorceilingfansite.info | 6/02/2009 10:31 AM | 6/02/2009 11:07 AM | 18 | 28
    woodtvstand.info | 6/02/2009 02:18 PM | 6/02/2009 05:50 PM | 188 | 53
    dsecure.net | 6/02/2009 06:46 PM | 6/02/2009 06:46 PM | 1 |
    patiochairsite.info | 6/02/2009 07:28 PM | 6/02/2009 07:50 PM | 17 | 44
    hammockstandsite.info | 6/02/2009 09:22 PM | 6/02/2009 09:35 PM | 10 | 42
    indoorgrillsrecipes.info | 6/02/2009 11:04 PM | 6/02/2009 11:19 PM | 8 | 28
    portablegreenhousesite.info | 6/03/2009 12:58 AM | 6/03/2009 01:30 AM | 24 | 43
    888phonecards.com | 6/03/2009 02:53 AM | 6/03/2009 02:53 AM | 1 |
    baby-parenting.co.uk | 6/03/2009 03:29 AM | 6/03/2009 03:29 AM | 1 |
    uniformstores.info | 6/03/2009 03:23 AM | 6/03/2009 03:51 AM | 24 | 49
    asp-net-hosting.simplynetdev.com | 6/03/2009 07:12 AM | 6/03/2009 07:12 AM | 1 |
    microsoft-sharepoint.simplynetdev.com | 6/03/2009 07:16 AM | 6/03/2009 07:16 AM | 1 |
    castironbakeware.info | 6/03/2009 12:44 PM | 6/03/2009 12:45 PM | 2 | 60
    backyardshed.info | 6/03/2009 12:47 PM | 6/03/2009 12:47 PM | 1 |
    mesotheliomalawyer4u.com | 6/04/2009 11:35 PM | 6/04/2009 11:35 PM | 1 |
    catastrophiceffects.blog-giant.com | 6/06/2009 01:21 AM | 6/06/2009 01:21 AM | 1 |
    weakbladder.info | 6/07/2009 06:15 PM | 6/07/2009 06:55 PM | 40 | 59
    greenteafeatburner.info | 6/07/2009 06:56 PM | 6/07/2009 07:39 PM | 125 | 173
    besteyecreamsite.info | 6/07/2009 07:40 PM | 6/07/2009 08:30 PM | 53 | 62
    jointpainreliefs.info [sic] | 6/08/2009 11:17 AM | 6/08/2009 12:00 PM | 41 | 56
    insomniacuresite.info | 6/08/2009 02:48 PM | 6/08/2009 03:51 PM | 155 | 147
    menopausereliefsite.info | 6/08/2009 05:00 PM | 6/08/2009 05:15 PM | 45 | 176
    cellulitecreamsite.info | 6/08/2009 07:31 PM | 6/08/2009 08:17 PM | 118 | 153
    toenailfungusite.info [sic] | 6/08/2009 10:44 PM | 6/08/2009 11:23 PM | 62 | 94
    hairgrowthproducts.info | 6/09/2009 01:13 AM | 6/09/2009 01:48 AM | 59 | 99
    quickdietsite.info | 6/09/2009 03:47 AM | 6/09/2009 04:43 AM | 94 | 100
    weakbladder.info | 6/09/2009 11:15 AM | 6/09/2009 12:00 PM | 60 | 79
    greenteafatburner.info | 6/09/2009 12:01 PM | 6/09/2009 12:44 PM | 135 | 187
    besteyecreamsite.info | 6/09/2009 12:44 PM | 6/09/2009 01:05 PM | 51 | 143
    jointpainreliefs.info [sic] | 6/09/2009 01:06 PM | 6/09/2009 01:11 PM | 7 | 72
    jointpainreliefs.info [sic] | 6/09/2009 04:48 PM | 6/09/2009 05:08 PM | 48 | 141
    insomniacuresite.info | 6/09/2009 05:11 PM | 6/09/2009 06:05 PM | 167 | 184
    menopausereliefsite.info | 6/09/2009 06:04 PM | 6/09/2009 06:18 PM | 33 | 137
    cellulitecreamsite.info | 6/09/2009 06:18 PM | 6/09/2009 06:59 PM | 133 | 193
    toenailfungusite.info [sic] | 6/09/2009 06:59 PM | 6/09/2009 07:36 PM | 77 | 123
    hairgrowthproducts.info | 6/09/2009 07:37 PM | 6/09/2009 08:01 PM | 64 | 158
    quickdietsite.info | 6/09/2009 08:02 PM | 6/09/2009 08:46 PM | 133 | 180

  • The Old New Thing

    Why does Explorer use the term KB instead of KiB?

    • 74 Comments

    Although the International Electrotechnical Commission (IEC) established the term kibibyte for 1024 bytes, with the abbreviation KiB, Windows Explorer continues to use the abbreviation KB. Why doesn't Explorer get with the program?

    Because nobody else is on the program either.

    If you look around you, you'll find that nobody (to within experimental error) uses the terms kibibyte and KiB. When you buy computer memory, the amount is specified in megabytes and gigabytes, not mebibytes and gibibytes. The storage capacity printed on your blank CD is indicated in megabytes. Every document on the Internet (to within experimental error) which talks about memory and storage uses the terms kilobyte/KB, megabyte/MB, gigabyte/GB, etc. You have to go out of your way to find people who use the terms kibibyte/KiB, mebibyte/MiB, gibibyte/GiB, etc.

    In other words, the entire computing industry has ignored the guidance of the IEC.

    Explorer is just following existing practice. Everybody (to within experimental error) refers to 1024 bytes as a kilobyte, not a kibibyte. If Explorer were to switch to the term kibibyte, it would merely be showing users information in a form they cannot understand, and for what purpose? So you can feel superior because you know what that term means and other people don't.

    For an explanation of other storage units, you can consult this helpful chart from xkcd.

  • The Old New Thing

    Foreign languages can be used as a secret code, but it's not always a good secret code

    • 41 Comments

    Some years ago, I went out to dinner with a group of friends to a Chinese restaurant, and when the check was delivered to the table, one of my friends looked at it and handed it to me. "It appears to be written in some sort of secret code."

    It was written in Chinese.

    I pointed out that they probably chose the worst possible code in the world, seeing as they chose something known to over a billion people.

    Using a foreign language as a secret code is common when walking around out in public. You can say whatever you want about the people around you, and you can have a certain degree of confidence that your secret code will not be broken if you choose a language sufficiently obscure relative to the situation.

    A colleague of mine told me that in his college days, he went on a trip to Germany with a few friends, one of whom spoke German. (I'm led to believe that knowing German comes in handy in Germany.) They boarded a train and introduced themselves to a pair of German students who shared the cabin with them. The German students spoke English, so there was some opportunity for small talk, but eventually the conversation petered out and the two groups conversed among themselves.

    The German students started talking to each other about their cabinmates, saying things that were not entirely complimentary. The German-speaking member of the other group leaned over to another member of his group and announced in a stage whisper, "They think we don't understand German."

    The German students promptly shut up.

  • The Old New Thing

    Why does MS-DOS use 8.3 filenames instead of, say, 11.2 or 16.16?

    • 45 Comments

    When I discussed years ago why operating system files tend to follow the old 8.3 file name convention, I neglected to mention why the old MS-DOS filename convention was 8.3 and not, say, 11.2 or 16.16.

    It's a holdover from CP/M.

    As I noted when I discussed the old MS-DOS wildcard matching rules, MS-DOS worked hard at being compatible with CP/M. And CP/M used 8.3 filenames.

    Why did CP/M use 8.3 filenames? I don't know. There's nothing obvious in the CP/M directory format that explains why those two reserved bytes couldn't have been used to extend the file name to 10.3. But maybe they figured that eight was a convenient number.
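
    For the curious, here's the classic CP/M 2.2 directory entry layout as I understand it; the field names are my own, but the sizes are the standard ones. The two reserved bytes in question are the ones labeled s1 and s2:

    struct cpm_dirent {            // 32 bytes per directory entry
        unsigned char st;          // status / user number (0xE5 = unused)
        char          name[8];     // file name, space-padded
        char          type[3];     // file type ("extension"), space-padded
        unsigned char extent;      // extent number
        unsigned char s1, s2;      // reserved
        unsigned char rc;          // records used in this extent
        unsigned char al[16];      // allocation: block numbers of the data
    };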

  • The Old New Thing

    Mixed messages from the IT department regarding email safety

    • 20 Comments

    TipTalk wrote some time ago about the urban legend of the Reading Pane. In the conclusion, the article mentions the "read all standard mail in plain text" setting. And that reminded me of a story.

    Some time ago, the IT department sent out a message to all users on the subject of email safety. The gist of the message was that for increased safety, everybody should go to that options dialog and check the box to set Outlook to read email in plain text mode.

    It's not clear whether they expected anybody to take their message seriously, though: They sent the message in HTML format.

  • The Old New Thing

    On the importance of sanity-checking values where money is involved

    • 13 Comments

    Last year, one of my colleagues noticed that one particular company's stock, which normally trades in the mid- to upper-$20 range, showed one day of extremely unusual results, very similar to the last example in this series of funny screenshots.

    XYZ Company (NASD:XYZ) - Delayed quote data
    28.06 Daily Range 27.92 – 100000.00
    1:18PM ET

    Closer inspection revealed that there was an order to buy 28 shares at $100,000.

    Obviously, somebody got the "number of shares" and "bid price" fields backwards and ended up losing over two million dollars for the error. If this were an NYSE stock, this error would have been caught because trades on the New York Stock Exchange are still executed by human beings on a trading floor, so somebody would have said, "Wait, the two fields are clearly in reverse order" and fixed the error or at least double-checked with the buyer before executing it. (At least I hope so. Maybe the people on the floor don't care.)

    But nope, this is NASDAQ, and some lucky market maker walked off with over two million dollars. Maybe the unlucky trader's software will now display a warning if the price entered is more than, say, twice the current market price. (With appropriate exceptions for penny stocks.)
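
    A sanity check along those lines could be as simple as the following sketch; the two-times threshold comes from the previous paragraph, and the penny-stock cutoff is pulled out of thin air. Any real trading system would want something far more careful:

    #include <stdbool.h>

    // Warn if a limit price is wildly above the last traded price.
    // Thresholds here are illustrative only.
    bool PriceLooksSuspicious(double enteredPrice, double lastPrice)
    {
        if (lastPrice < 1.00) return false;    // penny stocks swing wildly
        return enteredPrice > 2.0 * lastPrice; // warn above twice the market
    }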

    In response to the unusual price activity, another colleague chimed in, "Rats! This always happens to me. I had a limit order at $100,001."
