• The Old New Thing

    Counting array elements which are below a particular limit value using SSE

    Some time ago, we looked at how doing something can be faster than not doing it. That is, we observed the non-classical effect of the branch predictor. I took the branch out of the inner loop, but let's see how much further I can push it.

    The trick I'll employ today is using SIMD in order to operate on multiple pieces of data simultaneously. Take the original program and replace the count­them function with this one:

    int countthem(int boundary)
    {
     __m128i xboundary = _mm_cvtsi32_si128(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000; i++) {
      __m128i value =  _mm_cvtsi32_si128(array[i]);
      __m128i test = _mm_cmplt_epi32(value, xboundary);
      count = _mm_sub_epi32(count, test);
     }
     return _mm_cvtsi128_si32(count);
    }
    

    Now, this program doesn't actually use any parallel operations, but it's our starting point. For each 32-bit value, we load it, compare it against the boundary value, and accumulate the result. The _mm_cmplt_epi32 function compares the four 32-bit integers in the first parameter against the four 32-bit integers in the second parameter, producing four new 32-bit integers. Each of the new 32-bit integers is 0xFFFFFFFF if the corresponding lane of the first parameter is less than that of the second, or 0x00000000 if it is greater than or equal.

    In this case, we load the value we care about and compare it against the boundary value. The result of the comparison is either 32 zero bits (for false) or 32 one bits (for true), so this merely sets test equal to 0xFFFFFFFF if the value is less than the boundary, and to 0x00000000 otherwise. Since 0xFFFFFFFF is the same as a 32-bit -1, we subtract the test result so that the count goes up by 1 if the value is less than the boundary.

    Finally, we convert back to a 32-bit integer and return it.
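
    To see the mask-and-subtract trick in isolation, here is a minimal sketch of a single loop iteration. The helper names count_one and lane0_mask are made up for illustration and are not part of the original program; the sketch assumes an x86 compiler with SSE2 available.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics
#include <cstdint>

// Hypothetical helper: the raw lane-0 result of the comparison,
// 0xFFFFFFFF if value < boundary, otherwise 0x00000000.
uint32_t lane0_mask(int value, int boundary)
{
    __m128i xvalue    = _mm_cvtsi32_si128(value);     // value in lane 0, zeros elsewhere
    __m128i xboundary = _mm_cvtsi32_si128(boundary);  // boundary in lane 0
    __m128i test      = _mm_cmplt_epi32(xvalue, xboundary);
    return static_cast<uint32_t>(_mm_cvtsi128_si32(test));
}

// Hypothetical helper: one iteration of the counting loop.
// Subtracting the mask (-1 or 0) from zero yields 1 or 0.
int count_one(int value, int boundary)
{
    __m128i xvalue    = _mm_cvtsi32_si128(value);
    __m128i xboundary = _mm_cvtsi32_si128(boundary);
    __m128i test      = _mm_cmplt_epi32(xvalue, xboundary);
    __m128i count     = _mm_sub_epi32(_mm_setzero_si128(), test);
    return _mm_cvtsi128_si32(count);
}
```

    Calling count_one(3, 5) yields 1 and count_one(7, 5) yields 0. The comparison is signed, so negative values behave as you would expect.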

    With this change, the running time drops from 2938 time units to 2709, an improvement of 8%.

    So far, we have been using only the bottom 32 bits of the 128-bit XMM registers. Let's turn on the parallelism.

    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i++) {
      __m128i value = _mm_loadu_si128(&xarray[i]);
      __m128i test = _mm_cmplt_epi32(value, xboundary);
      count = _mm_sub_epi32(count, test);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return _mm_cvtsi128_si32(count);
    }
    

    We take our 32-bit integers and put them in groups of four, so instead of thinking of them as 10000 32-bit integers, we think of them as 2500 128-bit blocks, each block containing four lanes, with each lane holding one 32-bit integer.

                  Lane 3        Lane 2        Lane 1        Lane 0
    xarray[0]     array[3]      array[2]      array[1]      array[0]
    xarray[1]     array[7]      array[6]      array[5]      array[4]
    ...           ...           ...           ...           ...
    xarray[2499]  array[9999]   array[9998]   array[9997]   array[9996]

    Now we can run our previous algorithm in parallel on each lane.

                    Lane 3                   Lane 2                   Lane 1                   Lane 0
    xboundary       boundary                 boundary                 boundary                 boundary

    test            array[3] < boundary      array[2] < boundary      array[1] < boundary      array[0] < boundary
    test            array[7] < boundary      array[6] < boundary      array[5] < boundary      array[4] < boundary
    ...             ...                      ...                      ...                      ...
    test            array[9999] < boundary   array[9998] < boundary   array[9997] < boundary   array[9996] < boundary

    count = Σtest   Lane 3 totals            Lane 2 totals            Lane 1 totals            Lane 0 totals

    The xboundary variable contains a copy of the boundary in each of the four 32-bit lanes. We load the values from the array four at a time¹ and compare them (in parallel) against the boundary, then we tally them (in parallel). The result of the loop is that each lane of count holds the tally for its lane.

    After we complete the loop, we combine the parallel results by adding the lanes together. We do this by shuffling the values around and performing more parallel adds. The _mm_shuffle_epi32 function lets you rearrange the lanes of an XMM register. The _MM_SHUFFLE macro lets you specify how you want the shuffle to occur. For example, _MM_SHUFFLE(1, 0, 3, 2) says that we want lanes 1, 0, 3 then 2 of the original value. (You can shuffle a value into multiple destination lanes; for example, _MM_SHUFFLE(0, 0, 0, 0) says that you want four copies of lane 0. That's how we created xboundary.)

                        Lane 3              Lane 2              Lane 1              Lane 0
    count               Lane 3 totals       Lane 2 totals       Lane 1 totals       Lane 0 totals
    shuffle1            Lane 1 totals       Lane 0 totals       Lane 3 totals       Lane 2 totals

    count += shuffle1   Lane 3 + Lane 1     Lane 2 + Lane 0     Lane 1 + Lane 3     Lane 0 + Lane 2
    shuffle2            Lane 2 + Lane 0     Lane 3 + Lane 1     Lane 0 + Lane 2     Lane 1 + Lane 3

    count += shuffle2   Lane 3 + Lane 1 +   Lane 2 + Lane 0 +   Lane 1 + Lane 3 +   Lane 0 + Lane 2 +
                        Lane 2 + Lane 0     Lane 3 + Lane 1     Lane 0 + Lane 2     Lane 1 + Lane 3

    At the end of the shuffling and adding, we have calculated the sum of all four lanes. (For style points, I put the answer in all the lanes.)
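
    The shuffle-and-add reduction can be pulled out into a helper of its own. This is a minimal sketch; horizontal_sum is a name I made up for illustration, and the two shuffles are the same ones used in the function above.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; also brings in _MM_SHUFFLE

// Hypothetical helper: returns the sum of the four 32-bit lanes of v.
int horizontal_sum(__m128i v)
{
    // Swap the upper and lower pairs of lanes: (3,2,1,0) -> (1,0,3,2).
    __m128i swapped_pairs = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    v = _mm_add_epi32(v, swapped_pairs);      // lanes: 3+1, 2+0, 1+3, 0+2

    // Swap neighboring lanes: (3,2,1,0) -> (2,3,0,1).
    __m128i swapped_neighbors = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1));
    v = _mm_add_epi32(v, swapped_neighbors);  // every lane now holds the full sum

    return _mm_cvtsi128_si32(v);              // read lane 0
}
```

    For example, horizontal_sum(_mm_set_epi32(4, 3, 2, 1)) adds up lanes 3 through 0 and produces 10.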

    This new version runs in 688 time units, or 3.9 times faster than the previous one. This makes sense because we are counting four values at each iteration. The overall improvement is 4.3×.

    Let's see if we can reduce the loop overhead by doing some unrolling.

    #define GETVALUE(n) __m128i value##n = _mm_loadu_si128(&xarray[i+n])
    #define GETTEST(n) __m128i test##n = _mm_cmplt_epi32(value##n, xboundary)
    #define GETCOUNT(n)  count = _mm_sub_epi32(count, test##n)
    
    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 4) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return _mm_cvtsi128_si32(count);
    }
    

    We unroll the loop fourfold. At each iteration, we load 16 values from memory, and then accumulate the totals. We fetch all the memory values first, then do the comparisons, then accumulate the results. If we had written it as GETVALUE immediately followed by GETTEST, then the _mm_cmplt_epi32 would have stalled waiting for the result to arrive from memory. By interleaving the operations, we get some work done instead of stalling.

    This version runs in 514 time units, an improvement of 33% over the previous version and an overall improvement of 5.7×.

    Can we unroll even further? Let's try fivefold.

    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary = _mm_set1_epi32(boundary);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 5) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return _mm_cvtsi128_si32(count);
    }
    

    Huh? This version runs marginally slower, at 528 time units. So I guess further unrolling won't help any more. (For example, if you unroll a loop so much that you have more live variables than registers, the compiler will need to spill registers to memory. The x86 has eight XMM registers available, so you can easily cross that limit.)

    But wait, there's still room for tweaking. We have been using _mm_cmplt_epi32 to perform the comparison, expecting the compiler to generate code like this:

        ; suppose xboundary is in xmm0 and count is in xmm1
        movdqu   xmm2, xarray[i] ; xmm2 = value
        pcmpltd  xmm2, xmm0      ; xmm2 = test
        psubd    xmm1, xmm2
    

    If you crack open your Intel manual, you'll see that there is no PCMPLTD instruction. The compiler intrinsic is emulating the instruction by flipping the parameters and using PCMPGTD.

    _mm_cmplt_epi32(x, y) ↔ _mm_cmpgt_epi32(y, x)
    

    But the PCMPGTD instruction writes the result back into the first parameter. In other words, it always takes the form

    y = _mm_cmpgt_epi32(y, x);
    

    In our case, y is xboundary, but we don't want to modify xboundary. As a result, the compiler needs to introduce a temporary register:

        movdqu   xmm2, xarray[i] ; xmm2 = value
        movdqa   xmm3, xmm0      ; xmm3 = copy of xboundary
        pcmpgtd  xmm3, xmm2      ; xmm3 = test
        psubd    xmm1, xmm3
    

    We can take an instruction out of the sequence by switching to _mm_cmpgt_epi32 and adjusting our logic accordingly, taking advantage of the fact that

    x < y ⇔ ¬(x ≥ y) ⇔ ¬(x > y − 1)
    

    assuming the subtraction does not underflow. Fortunately, it doesn't in our case since boundary ranges from 0 to 10, and subtracting 1 does not put us in any danger of integer underflow.
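
    A scalar sketch makes the identity and its underflow caveat concrete. The function names are hypothetical, and the third function models the wraparound that 32-bit lane arithmetic would perform if the boundary were the most negative integer (something our program never does).

```cpp
#include <cstdint>

// The comparison we actually want.
bool less_direct(int32_t x, int32_t y)
{
    return x < y;
}

// The rewritten form. Assumption: y > INT32_MIN, so y - 1 does not underflow.
bool less_via_greater(int32_t x, int32_t y)
{
    return !(x > y - 1);
}

// The rewritten form with y - 1 computed modulo 2^32, the way a 32-bit
// lane would compute it. For y == INT32_MIN, y - 1 wraps to INT32_MAX.
bool less_via_greater_wrapped(int32_t x, int32_t y)
{
    int32_t y1 = static_cast<int32_t>(static_cast<uint32_t>(y) - 1u);
    return !(x > y1);
}
```

    For boundaries in the range 0 to 10, less_via_greater agrees with less_direct for every x. The wrapped version shows what goes wrong otherwise: with y equal to INT32_MIN it reports that 0 is less than y, which is false.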

    With this rewrite, we can switch to using _mm_cmpgt_epi32, which is more efficient for our particular scenario. Since we are now counting the values which don't meet our criteria, we need to take our final result and subtract it from 10000.

    #define GETTEST(n) __m128i test##n = _mm_cmpgt_epi32(value##n, xboundary1)
    
    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary1 = _mm_set1_epi32(boundary - 1);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 5) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return 10000 - _mm_cvtsi128_si32(count);
    }
    

    Notice that we have two subtractions which cancel out. We are subtracting the result of the comparison, and then we subtract the total from 10000. The two signs cancel out, and we can use addition for both. This saves an instruction in the return because subtraction is not commutative, but addition is.

    #define GETCOUNT(n) count = _mm_add_epi32(count, test##n)
    
    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary1 = _mm_set1_epi32(boundary - 1);
     __m128i count = _mm_setzero_si128();
     for (int i = 0; i < 10000 / 4; i += 5) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3); GETVALUE(4);
       GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);  GETTEST(4);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3); GETCOUNT(4);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return 10000 + _mm_cvtsi128_si32(count);
    }
    

    You can look at the transformation this way: The old code considered the glass half empty. It started with zero and added 1 each time it found an entry that passed the test. The new code considers the glass half full. It assumes each entry passes the test, and it subtracts one each time it finds an element that fails the test.

    This version runs in 453 time units, an improvement of 13% over the fourfold unrolled version and an improvement of 6.5× overall.
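
    If you want to play with the glass-half-full idea outside the test harness, here is a sketch that takes a pointer and a length instead of the global array and the hard-coded 10000. The name count_below and the scalar cleanup loop for leftover elements are my additions for illustration; the sketch is not unrolled, and it assumes that boundary is greater than INT_MIN and that the length fits in an int.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; also brings in _MM_SHUFFLE
#include <cstddef>

// Hypothetical generalization: counts the elements of data[0..n) that are
// less than boundary. Assumes boundary > INT_MIN and n fits in an int.
int count_below(const int* data, std::size_t n, int boundary)
{
    const __m128i xboundary1 = _mm_set1_epi32(boundary - 1);
    __m128i count = _mm_setzero_si128();

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i value = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(data + i));
        // -1 in each lane whose value fails the test (value >= boundary).
        __m128i fail = _mm_cmpgt_epi32(value, xboundary1);
        count = _mm_add_epi32(count, fail);
    }

    // Add the four lanes together.
    count = _mm_add_epi32(count,
        _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2)));
    count = _mm_add_epi32(count,
        _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1)));

    // Assume all i vectorized elements passed, then take back the failures.
    int result = static_cast<int>(i) + _mm_cvtsi128_si32(count);

    // Scalar cleanup for the 0 to 3 elements that did not fill a block.
    for (; i < n; i++) {
        if (data[i] < boundary) result++;
    }
    return result;
}
```

    With the ten values 0 through 9 and a boundary of 5, the function returns 5: eight elements go through the vector loop and the last two through the cleanup loop.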

    Okay, let's unroll sixfold, just for fun.

    int countthem(int boundary)
    {
     __m128i *xarray = (__m128i*)array;
     __m128i xboundary1 = _mm_set1_epi32(boundary - 1);
     __m128i count = _mm_setzero_si128();
     int i = 0;
     {
        GETVALUE(0); GETVALUE(1); GETVALUE(2); GETVALUE(3);
         GETTEST(0);  GETTEST(1);  GETTEST(2);  GETTEST(3);
        GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);
     }
     i += 4;
     for (; i < 10000 / 4; i += 6) {
      GETVALUE(0); GETVALUE(1); GETVALUE(2);
      GETVALUE(3); GETVALUE(4); GETVALUE(5);
       GETTEST(0);  GETTEST(1);  GETTEST(2);
       GETTEST(3);  GETTEST(4);  GETTEST(5);
      GETCOUNT(0); GETCOUNT(1); GETCOUNT(2);
      GETCOUNT(3); GETCOUNT(4); GETCOUNT(5);
     }
     __m128i shuffle1 = _mm_shuffle_epi32(count, _MM_SHUFFLE(1, 0, 3, 2));
     count = _mm_add_epi32(count, shuffle1);
     __m128i shuffle2 = _mm_shuffle_epi32(count, _MM_SHUFFLE(2, 3, 0, 1));
     count = _mm_add_epi32(count, shuffle2);
     return 10000 + _mm_cvtsi128_si32(count);
    }
    

    Since 10000 / 4 % 6 = 4, we have four blocks that don't fit in the loop. We deal with those blocks up front, and then enter the loop to get the rest.

    This version runs in 467 time units, which is 3% slower than the previous version. So I guess it's time to stop unrolling. Let's go back to the previous version which ran faster.

    The total improvement we got after all this tweaking is a speedup of 6.5× over the original jumpless version. Most of that improvement (5.7×) was already in hand once we had vectorized the loop and unrolled it fourfold.

    Anyway, no real moral of the story today. I just felt like tinkering.

    Notes

    ¹ The _mm_loadu_si128 intrinsic is kind of weird. Its formal argument is a __m128i*, but since it is for loading unaligned data, the formal argument really should be __m128i __unaligned*. The problem is that the __unaligned keyword doesn't exist on x86 because prior to the introduction of MMX and SSE, x86 allowed arbitrary misaligned data. Therefore, you are in this weird situation where you have to use an aligned pointer to access unaligned data.

    Bonus chatter: Clang at optimization level 3 does autovectorization. It doesn't know some of the other tricks, like converting x + 1 to x - (-1), thereby saving an instruction and a register.

  • The Old New Thing

    A user's SID can change, so make sure to check the SID history

    It doesn't happen often, but a user's SID can change. For example, when I started at Microsoft, my account was in the SYS-WIN4 domain, which is where all the people on the Windows 95 team were placed. At some point, that domain was retired, and my account moved to the REDMOND domain. We saw some time ago that the format of a user SID is

    S-1-       version number (SID_REVISION)
    -5-        SECURITY_NT_AUTHORITY
    -21-       SECURITY_NT_NON_UNIQUE
    -w-x-y-    the entity (machine or domain) that issued the SID
    -z         the unique user ID for that entity

    The issuing entity for a local account on a machine is the machine to which the account belongs. The issuing entity for a domain account is the domain.

    If an account moves between domains, the issuing entity changes, which means that the old SID is not valid. A new SID must be issued.
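
    To make the "issuing entity plus user ID" split concrete, here is a small sketch that works purely on the string form of a SID. The helper names SplitSidString and SameIssuer are hypothetical, and this is string manipulation for illustration only, not how you would inspect a binary SID in production code.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits "S-1-5-21-w-x-y-z" into the issuing entity
// ("S-1-5-21-w-x-y") and the per-entity user ID ("z").
std::pair<std::string, std::string> SplitSidString(const std::string& sid)
{
    std::string::size_type lastDash = sid.rfind('-');
    if (lastDash == std::string::npos) {
        return {sid, std::string()};   // no dash at all: nothing to split
    }
    return {sid.substr(0, lastDash), sid.substr(lastDash + 1)};
}

// Hypothetical helper: true if both SID strings were issued by the same entity.
bool SameIssuer(const std::string& a, const std::string& b)
{
    return SplitSidString(a).first == SplitSidString(b).first;
}
```

    Given the made-up SIDs S-1-5-21-1111-2222-3333-31415 and S-1-5-21-4444-5555-6666-271828, SameIssuer returns false: the account has a different issuer, and therefore a different SID, after the move.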

    Wait, does this mean that if my account moves between domains, then I lose access to all my old stuff? All my old stuff grants access to my old SID, not my new SID.

    Fortunately, this doesn't happen, thanks to the SID history. When your account moves to the new domain, the new domain controller remembers all the previous SIDs you used to have. When you authenticate against the domain controller, it populates your token with your SID history. In my example, it means that my token not only says "This is user number 271828 on the REDMOND domain", it also says "This user used to be known as number 31415 on the SYS-WIN4 domain." That way, when the system sees an object whose ACL says, "Grant access to user 31415 on the SYS-WIN4 domain," then it should grant me access to that object.

    The existence of SID history means that recognizing users when they return is more complicated than a simple Equal­Sid, because Equal­Sid will say that "No, S-1-5-21-REDMOND-271828 is not equal to S-1-5-21-SYS-WIN4-31415," even though both SIDs refer to the same person.

    If you are going to remember a SID and then try to recognize a user when they return, you need to search the SID history for a match, in case the user changed domains between the two visits. The easiest way to do this is with the Access­Check function. For example, suppose I visited your site while I belong to the SYS-WIN4 domain, and you remembered my SID. When I return, you create a security descriptor that grants access to the SID you remembered, and then you ask Access­Check, "If I had an object that granted access only to this SID, would you let this guy access it?"
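
    Here is a toy model of why a plain equality test falls short. ToyToken and both functions are invented for illustration and use strings in place of real SIDs; in real code the token is opaque and you let the access check consult the SID history for you.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy stand-in for a token: the current SID plus the SIDs the user used to have.
struct ToyToken {
    std::string currentSid;
    std::vector<std::string> sidHistory;
};

// Toy equivalent of a plain equality test: looks at the current SID only.
bool ToyEqualSid(const ToyToken& token, const std::string& rememberedSid)
{
    return token.currentSid == rememberedSid;
}

// Toy equivalent of the history-aware check: current SID or any former SID.
bool ToyMatchesSidOrHistory(const ToyToken& token, const std::string& rememberedSid)
{
    return token.currentSid == rememberedSid ||
           std::find(token.sidHistory.begin(), token.sidHistory.end(),
                     rememberedSid) != token.sidHistory.end();
}
```

    With a token whose current SID is S-1-5-21-REDMOND-271828 and whose history contains S-1-5-21-SYS-WIN4-31415, the equality test fails for the remembered SYS-WIN4 SID, but the history-aware check succeeds.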

    (So far, this is just recapping stuff I discussed a few months ago. Now comes the new stuff.)

    There are a few ways of building up the security descriptor. In all the cases, we will create a security descriptor that grants the specified SID some arbitrary access, and then we will ask the operating system whether the current user has that access.

    My arbitrary access shall be

    #define FROB_ACCESS     1 // any single bit less than 65536
    

    One way to build the security descriptor is to let SDDL do the heavy lifting: Generate the string D:(A;;1;;;⟨SID⟩) and then pass it to ConvertStringSecurityDescriptorToSecurityDescriptor.
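
    The SDDL route boils down to string concatenation. This is a minimal sketch, assuming the SID is already available in string form (for example, from ConvertSidToStringSid); BuildAllowOneSddl is a hypothetical helper name, and the access mask 1 is the FROB_ACCESS value defined above.

```cpp
#include <string>

// Hypothetical helper: builds the SDDL string for a DACL with a single
// access-allowed ACE that grants access mask 1 to the given SID.
std::wstring BuildAllowOneSddl(const std::wstring& sidString)
{
    return L"D:(A;;1;;;" + sidString + L")";
}
```

    For the well-known Authenticated Users SID, BuildAllowOneSddl(L"S-1-5-11") produces D:(A;;1;;;S-1-5-11), which is the string you would hand to the SDDL conversion function.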

    Another is to build it up with security descriptor functions. I defer to the sample code in MSDN for an illustration.

    The hard-core way is just to build the security descriptor by hand. For a security descriptor this simple, the direct approach involves the least amount of code. Go figure.

    The format of the security descriptor we want to build is

    struct ACCESS_ALLOWED_ACE_MAX_SIZE
    {
        ACCESS_ALLOWED_ACE Ace;
        BYTE SidExtra[SECURITY_MAX_SID_SIZE - sizeof(DWORD)];
    };
    

    The ACCESS_ALLOWED_ACE_MAX_SIZE structure represents the maximum possible size of an ACCESS_ALLOWED_ACE. The ACCESS_ALLOWED_ACE leaves a DWORD for the SID (Sid­Start), so we add additional bytes afterward to accommodate the largest valid SID. If you wanted to be more C++-like, you could make ACCESS_ALLOWED_ACE_MAX_SIZE derive from ACCESS_ALLOWED_ACE.

    struct ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR
    {
        SECURITY_DESCRIPTOR_RELATIVE Header;
        ACL Acl;
        ACCESS_ALLOWED_ACE_MAX_SIZE Ace;
    };
    
    const ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR c_sdTemplate = {
      // SECURITY_DESCRIPTOR_RELATIVE
      {
        SECURITY_DESCRIPTOR_REVISION,           // Revision
        0,                                      // Reserved
        SE_DACL_PRESENT | SE_SELF_RELATIVE,     // Control
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Ace.Ace.SidStart),
                                                // Offset to owner
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Ace.Ace.SidStart),
                                                // Offset to group
        0,                                      // No SACL
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Acl),
                                                // Offset to DACL
      },
      // ACL
      {
        ACL_REVISION,                           // Revision
        0,                                      // Reserved
        sizeof(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR) -
        FIELD_OFFSET(ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR, Acl),
                                                // ACL size
        1,                                      // ACE count
        0,                                      // Reserved
      },
      // ACCESS_ALLOWED_ACE_MAX_SIZE
      {
        // ACCESS_ALLOWED_ACE
        {
          // ACE_HEADER
          {
            ACCESS_ALLOWED_ACE_TYPE,            // AceType
            0,                                  // flags
            sizeof(ACCESS_ALLOWED_ACE_MAX_SIZE),// ACE size
          },
          FROB_ACCESS,                          // Access mask
        },
      },
    };
    

    Our template security descriptor says that it is a self-relative security descriptor with an owner, group and DACL, but no SACL. The DACL consists of a single ACE. We set up everything in the ACE except for the SID. We point the owner and group to that same SID. Therefore, this security descriptor is all ready for action once you fill in the SID.

    BOOL IsInSidHistory(HANDLE Token, PSID Sid)
    {
      DWORD SidLength = GetLengthSid(Sid);
    
      if (SidLength > SECURITY_MAX_SID_SIZE) {
        // Invalid SID. That's not good.
        // Somebody is playing with corrupted data.
        // Stop before anything bad happens.
        RaiseFailFastException(nullptr, nullptr, 0);
      }
    
      ALLOW_ONLY_ONE_SECURITY_DESCRIPTOR Sd = c_sdTemplate;
      CopyMemory(&Sd.Ace.Ace.SidStart, Sid, SidLength);
    

    As you can see, generating the security descriptor is a simple matter of copying our template and then replacing the SID. The next step is performing an access check of the token against that SID.

      const static GENERIC_MAPPING c_GenericMappingFrob = {
        FROB_ACCESS,
        FROB_ACCESS,
        FROB_ACCESS,
        FROB_ACCESS,
      };
      PRIVILEGE_SET PrivilegeSet;
      DWORD PrivilegeSetSize = sizeof(PrivilegeSet);
      DWORD GrantedAccess = 0;
      BOOL AccessStatus = 0;
      return AccessCheck(&Sd, Token, FROB_ACCESS,
        const_cast<PGENERIC_MAPPING>(&c_GenericMappingFrob),
        &PrivilegeSet, &PrivilegeSetSize,
        &GrantedAccess, &AccessStatus) &&
        AccessStatus;
    }
    

    So let's take this guy out for a spin. Since I don't know what is in your SID history, I'm going to pick something that should be in your token already (Authenticated Users) and something that shouldn't (Local System).

    // Note: Error checking elided for expository purposes.
    
    void CheckWellKnownSid(HANDLE Token, WELL_KNOWN_SID_TYPE type)
    {
      BYTE rgbSid[SECURITY_MAX_SID_SIZE];
      DWORD cbSid = sizeof(rgbSid);
      CreateWellKnownSid(type, NULL, rgbSid, &cbSid);
      printf("Is %d in SID history? %d\n", type,
             IsInSidHistory(Token, rgbSid));
    }
    
    int __cdecl wmain(int argc, wchar_t **argv)
    {
      HANDLE Token;
      // In real life you had better error-check these calls,
      // to avoid a security hole.
      ImpersonateSelf(SecurityImpersonation);
      OpenThreadToken(GetCurrentThread(), TOKEN_QUERY, TRUE, &Token);
      RevertToSelf();
    
      CheckWellKnownSid(Token, WinAuthenticatedUserSid);
      CheckWellKnownSid(Token, WinLocalSystemSid);
      CloseHandle(Token);
    
      return 0;
    }
    

    Related reading: Hey there token, long time no see! (Did you do something with your hair?)

  • The Old New Thing

    Some light reading on lock-free programming

    Today is a holiday in the United States, so I'm going to celebrate by referring you to other things to read.

    I'm going to start with a presentation by Bruce Dawson at GDC 2009, which is basically multiple instances of the question "Is this code correct?", and the answer is always "No!" Although the title of the talk is Lockless Programming in Games, the information is relevant to pretty much everybody. I can't find a recording of the presentation, but you can download the PowerPoint slides or view them in your browser. But I recommend downloading the PowerPoint slides and reading the notes, because the notes explain the slides. [Update: Ah, you can see the notes in the browser by clicking the Notes button at the bottom. So download whichever you prefer. Just make sure you read the notes.]

    A more game-focused presentation by Bruce Dawson has the more general title Coding for Multiple Cores. Download the PowerPoint slides or view them in your browser.

    Then there is the MSDN white paper that he authored, Lockless Programming Considerations for Xbox 360 and Microsoft Windows.

    Finally, there's Herb Sutter's two-part talk atomic<> Weapons, part 1 and part 2.

    That should keep you busy for a while.

  • The Old New Thing

    If 16-bit Windows had a single input queue, how did you debug applications on it?

    After learning about the bad things that happened if you synchronized your application's input queue with its debugger, commenter kme wonders how debugging worked in 16-bit Windows, since 16-bit Windows didn't have asynchronous input. In 16-bit Windows, all applications shared the same input queue, which means you were permanently in the situation described in the original article, where the application and its debugger (and everything else) shared an input queue and therefore would constantly deadlock.

    The solution to UI deadlocks is to make sure the debugger doesn't have any UI.

    At the most basic level, the debugger communicated with the developer through the serial port. You connected a dumb terminal to the other end of the serial port. Mine was a Wyse 50 serial console terminal. All your debugging happened on the terminal. You could disassemble code, inspect and modify registers and memory, and even patch new code on the fly. If you wanted to consult source code, you needed to have a copy of it available somewhere else (like on your other computer). It was similar to using the cdb debugger, where the only commands available were r, db, eb, u, and a. Oh, and bp to set breakpoints.

    Now, if you were clever, you could use a terminal emulator program so you didn't need a dedicated physical terminal to do your debugging. You could connect the target computer to your development machine and view the disassembly and the source code on the same screen. But you weren't completely out of the woods, because what did you use to debug your development machine if it crashed? The dumb terminal, of course.¹

    [Diagram: Target machine (Debugger), connected to Development machine (Debugger), connected to Wyse 50 dumb terminal]

    I did pretty much all my Windows 95 debugging this way.

    If you didn't have two computers, another solution was to use a debugger like CodeView. CodeView avoided the UI deadlock problem by not using the GUI to present its UI. When you hit a breakpoint or otherwise halted execution of your application, CodeView talked directly to the video driver to save the first 4KB of video memory, then switched into text mode to tell you what happened. When you resumed execution, it restored the video memory, then switched the video card back into graphics mode, restored all the pixels it captured, then resumed execution as if nothing had happened. (If you were debugging a graphics problem, you could hit F3 to switch temporarily to graphics mode, so you could see what was on the screen.)

    If you were really fancy, you could spring for a monochrome adapter, either the original IBM one or the Hercules version, and tell CodeView to use that adapter for its debugging UI. That way, when you broke into the debugger, you could still see what was on the screen! We had multiple monitors before it was cool.

    ¹ Some people were crazy and cross-connected their target and development machines.

    [Diagram: Target machine (Debugger) and Development machine (Debugger), cross-connected to each other]

    This allowed them to use their target machine to debug their development machine and vice versa. But if your development machine crashed while it was debugging the target machine, then you were screwed.

  • The Old New Thing

    What is the difference between Full Windows Touch Support and Limited Touch Support?

    In the System control panel and in the PC Info section of the PC & Devices section of PC Settings, your device's pen and touch support can be reported in a variety of ways. Here is the matrix:

                          No pen                                       Pen
    No touch              No Pen or Touch Input                        Pen Support
    Single touch          Single Touch Support                         Pen and Single Touch Support
    Limited multi-touch   Limited Touch Support with N Touch Points    Pen and Limited Touch Support with N Touch Points
    Full multi-touch      Full Touch Support with N Touch Points       Pen and Full Touch Support with N Touch Points
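
    The matrix can be expressed as a small function. This is only a sketch of the table above; the TouchLevel enumeration and the DescribePenAndTouch function are hypothetical names, and this is not how the control panel actually obtains the information.

```cpp
#include <string>

// Hypothetical classification matching the rows of the table.
enum class TouchLevel { None, Single, LimitedMulti, FullMulti };

// Hypothetical helper: produces the text for one cell of the table.
std::string DescribePenAndTouch(bool hasPen, TouchLevel touch, int touchPoints)
{
    std::string touchText;
    switch (touch) {
    case TouchLevel::None:
        // The "No touch" row does not follow the "Pen and ..." pattern.
        return hasPen ? "Pen Support" : "No Pen or Touch Input";
    case TouchLevel::Single:
        touchText = "Single Touch Support";
        break;
    case TouchLevel::LimitedMulti:
        touchText = "Limited Touch Support with " +
                    std::to_string(touchPoints) + " Touch Points";
        break;
    case TouchLevel::FullMulti:
        touchText = "Full Touch Support with " +
                    std::to_string(touchPoints) + " Touch Points";
        break;
    }
    return hasPen ? "Pen and " + touchText : touchText;
}
```

    For example, a pen-enabled device with full support for ten touch points is described as "Pen and Full Touch Support with 10 Touch Points".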

    The meanings of No touch and Single touch are clear, but if a device supports multiple touch points, what makes the system report it as having Limited versus Full touch support?

    A device with Full touch support is one that has passed Touch Hardware Quality Assurance (THQA). You can read about the Windows Touch Test Lab (WTTL) to see some of the requirements for full touch support.

    If you have a touch device without full touch support, then Windows will lower its expectations from the device. For example, it will not use the timestamps on the device packets, and it will increase the tolerances for edge gestures.

    Note that if test signing is enabled, then all multitouch drivers are treated as having full touch support. (This lets you test your driver in Full mode before submitting it to THQA.)

  • The Old New Thing

    The crazy world of stripping diacritics

    Today's Little Program strips diacritics from a Unicode string. Why? Hey, I said that Little Programs require little to no motivation. It might come in handy in a spam filter, since it was popular, at least for a time, to put random accent marks on spam subject lines in order to sneak past keyword filters. (It doesn't seem to be popular any more.)

    This is basically a C-ization of the C# code originally written by Michael Kaplan. Don't forget to read the follow-up discussion that notes that this can produce strange results.

    First, let's create our dialog box. Note that I intentionally give it a huge font so that the diacritics are easier to see.

    // scratch.h
    
    #define IDD_SCRATCH 1
    #define IDC_SOURCE 100
    #define IDC_SOURCEPOINTS 101
    #define IDC_DEST 102
    #define IDC_DESTPOINTS 103
    
    // scratch.rc
    
    #include <windows.h>
    #include "scratch.h"
    
    IDD_SCRATCH DIALOGEX 0, 0, 320, 88
    STYLE DS_MODALFRAME | WS_POPUP | WS_CAPTION | WS_SYSMENU
    Caption "Stripping diacritics"
    FONT 20, "MS Shell Dlg"
    BEGIN
        LTEXT "Original:", -1, 4, 8, 38, 10
        EDITTEXT IDC_SOURCE, 46, 6, 270, 12, ES_AUTOHSCROLL
        LTEXT "", IDC_SOURCEPOINTS, 46, 22, 270, 12
        LTEXT "Modified:", -1, 4, 40, 38, 10
        EDITTEXT IDC_DEST, 46, 38, 270, 12, ES_AUTOHSCROLL
        LTEXT "", IDC_DESTPOINTS, 46, 54, 270, 12
        DEFPUSHBUTTON "OK", IDOK, 266, 70, 50, 14
    END
    

    Now the program that uses the dialog box.

    // scratch.cpp
    
    #define STRICT
    #define UNICODE
    #define _UNICODE
    #include <windows.h>
    #include <windowsx.h>
    #include <strsafe.h>
    #include "scratch.h"
    
    #define MAXSOURCE 64
    
    void SetDlgItemCodePoints(HWND hwnd, int idc, PCWSTR psz)
    {
      wchar_t szResult[MAXSOURCE * 4 * 5];
      szResult[0] = 0;
      PWSTR pszResult = szResult;
      size_t cchResult = ARRAYSIZE(szResult);
      HRESULT hr = S_OK;
      for (; SUCCEEDED(hr) && *psz; psz++) {
        wchar_t szPoint[6];
        hr = StringCchPrintf(szPoint, ARRAYSIZE(szPoint), L"%04x ", *psz);
        if (SUCCEEDED(hr)) {
          hr = StringCchCatEx(pszResult, cchResult, szPoint, &pszResult, &cchResult, 0);
        }
      }
      SetDlgItemText(hwnd, idc, szResult);
    }
    

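    The SetDlgItemCodePoints function takes a UTF-16 string and prints all of its code units in hexadecimal. (For characters in the Basic Multilingual Plane, a code unit is the same as the code point.) This is just to help visualize the result; it's not part of the actual diacritic-removal algorithm.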
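
    If you want to try the formatting logic without a dialog box, here is a portable sketch of the same idea using the standard library instead of the strsafe functions. FormatCodeUnits is a hypothetical name, and std::u16string stands in for the Windows wide string.

```cpp
#include <cstdio>
#include <string>

// Hypothetical portable counterpart: formats each UTF-16 code unit as
// four lowercase hex digits followed by a space.
std::string FormatCodeUnits(const std::u16string& text)
{
    std::string result;
    for (char16_t unit : text) {
        char point[6];  // 4 hex digits + space + terminator
        std::snprintf(point, sizeof(point), "%04x ",
                      static_cast<unsigned>(unit));
        result += point;
    }
    return result;
}
```

    A decomposed é (the letter e followed by U+0301) formats as "0065 0301 ", and a character outside the Basic Multilingual Plane shows up as its two surrogates.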

    void OnUpdate(HWND hwnd)
    {
      wchar_t szSource[MAXSOURCE];
      GetDlgItemText(hwnd, IDC_SOURCE, szSource, ARRAYSIZE(szSource));
      wchar_t szDest[MAXSOURCE * 4];
    
      int cchActual = NormalizeString(NormalizationKD,
                                      szSource, -1,
                                      szDest, ARRAYSIZE(szDest));
      if (cchActual <= 0) szDest[0] = 0;
    
      WORD rgType[ARRAYSIZE(szDest)];
      GetStringTypeW(CT_CTYPE3, szDest, -1, rgType);
    
      PWSTR pszWrite = szDest;
      for (int i = 0; szDest[i]; i++) {
        if (!(rgType[i] & C3_NONSPACING)) {
          *pszWrite++ = szDest[i];
        }
      }
      *pszWrite = 0;
    
      SetDlgItemText(hwnd, IDC_DEST, szDest);
      SetDlgItemCodePoints(hwnd, IDC_SOURCEPOINTS, szSource);
      SetDlgItemCodePoints(hwnd, IDC_DESTPOINTS, szDest);
    }
    

    Okay, here's where the actual work happens. We put the source string into Normalization Form KD. This decomposes the diacritics so that we can identify them with Get­String­TypeW and then strip them out.

    Of course, in real life, you wouldn't hard-code the array sizes like I did here, but this is just a Little Program, and Little Programs are allowed to take shortcuts.

    The rest of the program is just a framework to get into that function.

    INT_PTR CALLBACK DlgProc(HWND hwnd, UINT wm,
                             WPARAM wParam, LPARAM lParam)
    {
      switch (wm)
      {
      case WM_INITDIALOG:
        return TRUE;
    
      case WM_COMMAND:
        switch (GET_WM_COMMAND_ID(wParam, lParam)) {
        case IDC_SOURCE:
          switch (GET_WM_COMMAND_CMD(wParam, lParam)) {
          case EN_UPDATE:
            OnUpdate(hwnd);
            break;
          }
          break;
        case IDOK:
          EndDialog(hwnd, 0);
          return TRUE;
        }
        break;
    
      case WM_CLOSE:
        EndDialog(hwnd, 0);
        return TRUE;
      }
    
      return FALSE;
    }
    
    int WINAPI wWinMain(HINSTANCE hinst, HINSTANCE hinstPrev,
                       LPWSTR lpCmdLine, int nShowCmd)
    {
      DialogBox(hinst, MAKEINTRESOURCE(IDD_SCRATCH), nullptr, DlgProc);
      return 0;
    }
    

    Okay, let's take this program for a spin. Here are some interesting characters to try:

    Original character | Resulting character
    ª 00AA Feminine ordinal indicator | a 0061 Latin small letter a
    ¹ 00B9 Superscript one | 1 0031 Digit one
    ½ 00BD Vulgar fraction one half | 1⁄2 0031 2044 0032 Digit one + Fraction slash + Digit two
    ı 0131 Latin small letter dotless i | ı 0131 Latin small letter dotless i
    Ø 00D8 Latin capital letter O with stroke | Disappears!
    ł 0142 Latin small letter l with stroke | ł 0142 Latin small letter l with stroke
    ŀ 0140 Latin small letter l with middle dot | l· 006C 00B7 Latin small letter l + middle dot
    æ 00E6 Latin small letter ae | æ 00E6 Latin small letter ae
    Ή 0389 Greek capital letter Eta with tonos | Η 0397 Greek capital letter Eta
    А 0410 Cyrillic capital letter А | А 0410 Cyrillic capital letter А
    Å 00C5 Latin capital letter A with ring above | A 0041 Latin capital letter A
    Ａ FF21 Fullwidth Latin capital letter A | A 0041 Latin capital letter A
    ① 2460 Circled digit one | 1 0031 Digit one
    ➀ 2780 Dingbat circled sans-serif digit one | ➀ 2780 Dingbat circled sans-serif digit one
    ® 00AE Registered sign | ® 00AE Registered sign
    Ⓡ 24C7 Circled Latin capital letter R | R 0052 Latin capital letter R
    𝖕 D835 DD95 Mathematical bold Fraktur small p | p 0070 Latin small letter p
    ｬ FF6C Halfwidth Katakana letter small Ya | ャ 30E3 Katakana letter small Ya
    ャ 30E3 Katakana letter small Ya | ャ 30E3 Katakana letter small Ya
    ゴ 30B4 Katakana letter Go | コ 30B3 Katakana letter Ko
    “ 201C Left double quotation mark | “ 201C Left double quotation mark
    ” 201D Right double quotation mark | ” 201D Right double quotation mark
    „ 201E Double low-9 quotation mark | „ 201E Double low-9 quotation mark
    ‟ 201F Double high-reversed-9 quotation mark | ‟ 201F Double high-reversed-9 quotation mark
    ″ 2033 Double prime | ′′ 2032 2032 Prime + Prime
    ‵ 2035 Reverse prime | ‵ 2035 Reverse prime
    ‹ 2039 Single left-pointing angle quotation mark | ‹ 2039 Single left-pointing angle quotation mark
    « 00AB Left-pointing double angle quotation mark | « 00AB Left-pointing double angle quotation mark
    — 2014 Em-dash | — 2014 Em-dash
    ‼ 203C Double exclamation mark | !! 0021 0021 Exclamation mark + Exclamation mark

    There are some interesting quirks here. Mind you, this is what the Unicode Consortium says, so if you think they are wrong, you can take it up with them.

    The superscript-like characters are converted to their plain versions. Enclosed alphabetics are also converted, but not the ® symbol. Fullwidth forms of Latin letters are converted to their halfwidth equivalents. On the other hand, halfwidth Katakana characters are expanded to their fullwidth equivalents. But small Katakana characters do not convert to their large equivalents.

    The Ø disappears completely! What's up with that? The character type for Ø is reported as C3_ALPHA | C3_NONSPACING | C3_DIACRITIC, and since we are removing nonspacing characters, this causes it to be removed. (Why is Ø nonspacing? It occupies space!) For whatever reason, it does not decompose into O + Combining Solidus Overlay. On the other hand, the Polish ł remains intact because it is reported as C3_ALPHA | C3_DIACRITIC. Poland wins and Norway loses?

    The diacritic removal ignores linguistic rules. The Swedish Å decomposes into a capital A and a combining ring above, even though in Swedish, the character is considered nondecomposable. (Just like the capital letter Q in English does not decompose into an O and a tail.) Katakana Go suffers a similar ignoble fate, converting to Katakana Ko, which is linguistically nonsensical. But then again, removing diacritics is already linguistically nonsensical. Nonsensical operation is nonsensical.
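
    The filtering half of the algorithm can be sketched without any Windows functions, assuming the input has already been decomposed. Everything here is an assumption of mine for illustration: the names IsCombiningMark and StripCombiningMarks are made up, and the hard-coded ranges are only a rough, incomplete stand-in for what Get­String­TypeW reports as C3_NONSPACING.

```cpp
#include <string>

// Rough stand-in for the C3_NONSPACING test: a few well-known blocks of
// combining marks. This list is an illustrative assumption, not complete.
bool IsCombiningMark(char32_t ch)
{
  return (ch >= 0x0300 && ch <= 0x036F) ||  // Combining Diacritical Marks
         (ch >= 0x1AB0 && ch <= 0x1AFF) ||  // ... Extended
         (ch >= 0x1DC0 && ch <= 0x1DFF) ||  // ... Supplement
         (ch >= 0x20D0 && ch <= 0x20FF) ||  // ... for Symbols
         (ch >= 0x3099 && ch <= 0x309A) ||  // combining kana voiced sound marks
         (ch >= 0xFE20 && ch <= 0xFE2F);    // Combining Half Marks
}

// The input must already be decomposed (for example, by NormalizeString).
// Copies every code point that is not a combining mark.
std::u32string StripCombiningMarks(const std::u32string& decomposed)
{
  std::u32string result;
  for (char32_t ch : decomposed) {
    if (!IsCombiningMark(ch)) result += ch;
  }
  return result;
}
```

    Given A followed by a combining ring above, this returns a plain A, and Katakana Ko followed by the combining voiced sound mark (the decomposed form of Go) comes back as plain Ko. Because it looks only at code point ranges, this sketch does not reproduce the quirk where Ø disappears.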

    There is no attempt to unify look-alike characters from different scripts. Look-alike characters in the Greek and Cyrillic alphabets are not mapped to their Latin doppelgängers.

    The infamous Turkish dotless i does not turn into a dotted i. (And the lowercase Latin i does not decompose into a combining dot and a dotless i.)

    Finally, I tried a selection of punctuation marks. Most of them pass through unchanged, with the exception of the double prime and double exclamation mark which each decompose into a pair of singles. (But double quotation marks do not decompose into a pair of singles.)

    Okay, but the goal of this exercise was spam detection, so we are actually interested in mapping as far as possible all the way down to plain ASCII. We'd like to convert, for example, the look-alike characters in the Cyrillic and Greek alphabets to the Latin characters they resemble.

    So let's try something else. If we want to convert to ASCII, then just convert to ASCII!

    #define CP_ASCII 20127
    void OnUpdate(HWND hwnd)
    {
      wchar_t szSource[MAXSOURCE];
      GetDlgItemText(hwnd, IDC_SOURCE, szSource, ARRAYSIZE(szSource));
      char szDest[MAXSOURCE * 2];
      int cchActual = WideCharToMultiByte(CP_ASCII, 0, szSource, -1,
                                  szDest, ARRAYSIZE(szDest), 0, 0);
      if (cchActual <= 0) szDest[0] = 0;
    
      SetDlgItemTextA(hwnd, IDC_DEST, szDest);
      SetDlgItemCodePoints(hwnd, IDC_SOURCEPOINTS, szSource);
    }
    

    We can extend the table above with a new column.

    Original character | KD character | ASCII character
    ª 00AA Feminine ordinal indicator | a 0061 Latin small letter a | a 0061 Latin small letter a
    ¹ 00B9 Superscript one | 1 0031 Digit one | 1 0031 Digit one
    ½ 00BD Vulgar fraction one half | 1⁄2 0031 2044 0032 Digit one + Fraction slash + Digit two | ? No conversion
    ı 0131 Latin small letter dotless i | ı 0131 Latin small letter dotless i | i 0069 Latin small letter i
    Ø 00D8 Latin capital letter O with stroke | Disappears! | O 004F Latin capital letter O
    ł 0142 Latin small letter l with stroke | ł 0142 Latin small letter l with stroke | l 006C Latin small letter l
    ŀ 0140 Latin small letter l with middle dot | l· 006C 00B7 Latin small letter l + middle dot | ? No conversion
    æ 00E6 Latin small letter ae | æ 00E6 Latin small letter ae | a 0061 Latin small letter a
    Ή 0389 Greek capital letter Eta with tonos | Η 0397 Greek capital letter Eta | ? No conversion
    А 0410 Cyrillic capital letter А | А 0410 Cyrillic capital letter А | ? No conversion
    Å 00C5 Latin capital letter A with ring above | A 0041 Latin capital letter A | A 0041 Latin capital letter A
    Ａ FF21 Fullwidth Latin capital letter A | A 0041 Latin capital letter A | A 0041 Latin capital letter A
    ① 2460 Circled digit one | 1 0031 Digit one | ? No conversion
    ➀ 2780 Dingbat circled sans-serif digit one | ➀ 2780 Dingbat circled sans-serif digit one | ? No conversion
    ® 00AE Registered sign | ® 00AE Registered sign | R 0052 Latin capital letter R
    Ⓡ 24C7 Circled Latin capital letter R | R 0052 Latin capital letter R | ? No conversion
    𝖕 D835 DD95 Mathematical bold Fraktur small p | p 0070 Latin small letter p | ?? No conversion
    ｬ FF6C Halfwidth Katakana letter small Ya | ャ 30E3 Katakana letter small Ya | ? No conversion
    ャ 30E3 Katakana letter small Ya | ャ 30E3 Katakana letter small Ya | ? No conversion
    ゴ 30B4 Katakana letter Go | コ 30B3 Katakana letter Ko | ? No conversion
    “ 201C Left double quotation mark | “ 201C Left double quotation mark | " 0022 Quotation mark
    ” 201D Right double quotation mark | ” 201D Right double quotation mark | " 0022 Quotation mark
    „ 201E Double low-9 quotation mark | „ 201E Double low-9 quotation mark | " 0022 Quotation mark
    ‟ 201F Double high-reversed-9 quotation mark | ‟ 201F Double high-reversed-9 quotation mark | ? No conversion
    ″ 2033 Double prime | ′′ 2032 2032 Prime + Prime | ? No conversion
    ′ 2032 Prime | ′ 2032 Prime | ' 0027 Apostrophe
    ‵ 2035 Reverse prime | ‵ 2035 Reverse prime | ` 0060 Grave accent
    ‹ 2039 Single left-pointing angle quotation mark | ‹ 2039 Single left-pointing angle quotation mark | < 003C Less-than sign
    « 00AB Left-pointing double angle quotation mark | « 00AB Left-pointing double angle quotation mark | < 003C Less-than sign
    — 2014 Em-dash | — 2014 Em-dash | - 002D Hyphen-minus
    ‼ 203C Double exclamation mark | !! 0021 0021 Exclamation mark + Exclamation mark | ? No conversion

    There are some interesting differences here.

    Some characters fail to convert to ASCII outright. This is not unexpected for the Japanese characters, is mildly unexpected for the look-alikes in the Cyrillic and Greek alphabets, and is surprising for some characters like double prime, double exclamation point, enclosed alphanumerics, and vulgar fractions because they had ASCII decompositions in Normalization Form KD, but converting directly into ASCII refused to use them.

    But the dotless i gets its dot back.

    Another weird thing you might notice is that the æ converts to just the a. This goes contrary to the expectations of American English, because words which historically use the æ and œ are largely respelled in American English to use just the e. (Encyclopædia → encyclopedia, fœtus → fetus.) Mysteries abound.

    If your real goal is to map every character to its nearest ASCII look-alike, then all these code page games are just beating around the bush. The way to go is to use the Unicode Confusables database. There is a huge data file and instructions on how to use it. There's also a nice Web site that lets you explore the confusables database interactively.
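
    To give a flavor of what a confusables-style lookup does, here is a minimal sketch. The function name MapLookAlikes is hypothetical, and the table is a tiny hand-picked set of look-alikes chosen for illustration, not the actual data file. The real database maps each character to a sequence of characters and comes with its own instructions, so treat this one-to-one version as a simplification.

```cpp
#include <map>
#include <string>

// Replaces characters found in a small look-alike table with the Latin
// letter they resemble. Characters not in the table pass through unchanged.
std::u32string MapLookAlikes(const std::u32string& text)
{
  // Illustrative entries only; a real implementation would load the
  // full confusables data instead of hard-coding a handful of pairs.
  static const std::map<char32_t, char32_t> lookAlikes = {
    { U'\u0410', U'A' }, // Cyrillic capital letter A
    { U'\u0391', U'A' }, // Greek capital letter Alpha
    { U'\u0397', U'H' }, // Greek capital letter Eta
    { U'\u0435', U'e' }, // Cyrillic small letter ie
    { U'\u043E', U'o' }, // Cyrillic small letter o
  };

  std::u32string result;
  for (char32_t ch : text) {
    auto it = lookAlikes.find(ch);
    result += (it != lookAlikes.end()) ? it->second : ch;
  }
  return result;
}
```

    With this table, the Cyrillic А and Greek Η that refused to convert in both of the earlier experiments come out as plain A and H.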

    Or you could just take the sledgehammer approach: If there are a significant number of characters outside the Latin alphabet and punctuation and you are expecting English text, then just reject it as likely spam.
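
    A rough sketch of the sledgehammer, under assumptions I made up: the name LooksLikeSpam, the 30% threshold, and the choice of which ranges count as "Latin alphabet and punctuation" are all illustrative, and you would tune them to your own mail.

```cpp
#include <cstddef>
#include <string>

// Returns true if the fraction of code points outside the expected ranges
// exceeds the threshold. The threshold default is an arbitrary assumption.
bool LooksLikeSpam(const std::u32string& text, double threshold = 0.3)
{
  if (text.empty()) return false;

  std::size_t outside = 0;
  for (char32_t ch : text) {
    // Expected: Basic Latin through Latin Extended-B (U+0000..U+024F)
    // and General Punctuation (U+2000..U+206F).
    bool expected = ch <= 0x024F || (ch >= 0x2000 && ch <= 0x206F);
    if (!expected) outside++;
  }
  return static_cast<double>(outside) / text.size() > threshold;
}
```

    Ordinary English text, accented letters and curly quotes included, stays under the threshold, whereas a message that leans on Cyrillic or Greek look-alikes trips it.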

    ಠ_ಠ

  • The Old New Thing

    Is it wrong to call SHFileOperation from a service? Revised

    My initial reaction to this question was to say, "I don't know if I'd call it wrong, but I'd call it highly inadvisable."

    I'd like to revise my guidance.

    It's flat-out wrong, at least in the case where you call it while impersonating.

    The registry key HKEY_CURRENT_USER is bound to the current user at the time the key is first accessed by a process:

    The mapping between HKEY_CURRENT_USER and HKEY_USERS is per process and is established the first time the process references HKEY_CURRENT_USER. The mapping is based on the security context of the first thread to reference HKEY_CURRENT_USER. If this security context does not have a registry hive loaded in HKEY_USERS, the mapping is established with HKEY_USERS\.Default. After this mapping is established it persists, even if the security context of the thread changes.

    Emphasis mine.

    This means that if you impersonate a user, and then access HKEY_CURRENT_USER, then that binds HKEY_CURRENT_USER to the impersonated user. Even if you stop impersonating, future references to HKEY_CURRENT_USER will still refer to that user.
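
    If it helps to see the sequence of events, here is a toy model of the binding rule. None of these names are Windows APIs; ProcessModel and its methods are invented purely to illustrate "bound on first reference, then sticky", and the model makes no attempt to mimic the real registry.

```cpp
#include <string>

// Toy model of the per-process HKEY_CURRENT_USER binding described above.
// None of these names are Windows APIs; they exist only for illustration.
class ProcessModel
{
public:
  explicit ProcessModel(const std::string& serviceUser)
    : serviceUser_(serviceUser), threadUser_(serviceUser), bound_(false) {}

  void Impersonate(const std::string& user) { threadUser_ = user; }
  void StopImpersonating() { threadUser_ = serviceUser_; }
  const std::string& ThreadUser() const { return threadUser_; }

  // The first reference binds the key to whoever the thread is at that
  // moment. Every later reference reuses that binding.
  const std::string& CurrentUserKeyOwner()
  {
    if (!bound_) {
      boundUser_ = threadUser_;
      bound_ = true;
    }
    return boundUser_;
  }

private:
  std::string serviceUser_;
  std::string threadUser_;
  std::string boundUser_;
  bool bound_;
};
```

    In this model, if the process impersonates a user, references the key, and then stops impersonating, ThreadUser() goes back to the service account but CurrentUserKeyOwner() keeps returning the impersonated user.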

    This is probably not what you expected.

    The shell takes a lot of settings from the current user. If you impersonate a user and then call into the shell, your service is now using that user's settings, which is effectively an elevation of privilege: An unprivileged user is now modifying settings for a service. For example, if the user has customized the Print verb for text files, and you use Shell­Execute to invoke the print verb on a text document, you are at the mercy of whatever the user's print verb is bound to. Maybe it runs Notepad, but maybe it runs pwnz0rd.exe. You don't know.

    Similarly, the user might have a per-user registered copy hook or namespace extension, and now you just loaded a user-controlled COM object into your service.

    In both cases, this is known to insiders as hitting the jackpot.

    Okay, so what about if you call Shell­Execute or some other shell function while not impersonating? You might say, "That's okay, because the current user's registry is the service user, not the untrusted attacker user." But look at that sentence I highlighted up there. Once HKEY_CURRENT_USER gets bound to a particular user, it remains bound to that user even after impersonation ends. If somebody else inadvisedly called a shell function while impersonating, and that shell function happens to be the first one to access HKEY_CURRENT_USER, then your call to a shell function while not impersonating will still use that impersonated user's registry. Congratulations, you are now running untrusted code, and you're not even impersonating any more!

    So my recommendation is don't do it. Don't call shell functions while impersonating unless the function is explicitly documented as supporting impersonation. (The only ones I'm aware of that fall into this category are functions like SHGet­Folder­Path which accept an explicit token handle.) Otherwise, you may have created (or in the case of copy hooks, definitely created) a code injection security vulnerability in your service.

  • The Old New Thing

    A library loaded via LOAD_LIBRARY_AS_DATAFILE (or similar flags) doesn't get to play in any reindeer module games

    If you load a library with the LOAD_LIBRARY_AS_DATA­FILE flag, then it isn't really loaded in any normal sense. In fact, it's kept completely off the books.

    If you load a library with the LOAD_LIBRARY_AS_DATA­FILE, LOAD_LIBRARY_AS_DATA­FILE_EXCLUSIVE, or LOAD_LIBRARY_AS_IMAGE_RESOURCE flag (or any similar flag added in the future), then the library gets mapped into the process address space, but it is not a true module. Functions like Get­Module­Handle, Get­Module­File­Name, Enum­Process­Modules and Create­Toolhelp32­Snapshot will not see the library, because it was never entered into the database of loaded modules.

    These "load library as..." flags don't actually load the library in any meaningful sense. They just take the file and map it into memory manually without updating any module tracking databases. This functionality was overloaded into the Load­Library­Ex function, which in retrospect was probably not a good idea, because people expect Load­Library­Ex to create true modules, but these flags create pseudo-modules, a term I made up just now.

    It would have been less confusing in retrospect if the "load library as..." functionality were split into another function like Load­File­As­Pseudo­Module. Okay, that's a pretty awful name, but that's not the point. The point is to put the functionality in some function that doesn't have the word library in its name.

    Okay, so now we see that these pseudo-modules aren't true modules, and they don't participate in any reindeer module games. So what use are they?

    Basically, the only thing you can do with a pseudo-module is access its resources with functions like Find­Resource, Load­Resource, and Enum­Resource­Types. Note that this indirectly includes functions like Load­String and Format­Message, which access resources behind the scenes.

    So maybe a better name for the function would have been Load­File­For­Resources, since that's all the pseudo-module is good for.

  • The Old New Thing

    Distinguishing between normative and positive statements to help people answer your question

    Often, we get questions from a customer that use the word should in an ambiguous way:

    Our program creates a widget whose flux capacitor should have reverse polarity. Attached is a sample program that shows how we create the widget with Create­Widget. However, the resulting widget still has a flux capacitor with standard polarity. Can you help us?

    The phrase should have reverse polarity is ambiguous. The question could be

    We would like to create a widget whose flux capacitor has reverse polarity. Attached is a sample program that shows how to create a widget whose flux capacitor has standard polarity. How should we modify it in order to get reverse polarity?

    Or the question might be

    We would like to create a widget whose flux capacitor has reverse polarity. Attached is a sample program that attempts to do so, but the resulting widget has a flux capacitor with standard polarity. The polarity flag appears to be ignored. Are we doing something wrong, or is this a bug in Windows?

    The first is a normative statement: "This is what we would like to happen." The second is a positive statement: "This is what is happening."

    The distinction is important because the two types of statements require very different types of responses. If you have a program that does X, and you want to change it to do Y, then you're asking for help working through the Y feature, clarifying the documentation, informing you which flags you need to pass, and so on. But if you have a program that tries to do Y and fails, then you're asking for help debugging your code and possibly identifying a bug in the operating system.

    Being clear with your request means that you can avoid wasting a lot of time when the wrong set of people are called in to help you out.

    Here's another example of vague use of the word should:

    We're trying to do XYZ. We've been told that it is blocked for security reasons, but there should be a way to do this.

    In this case, it is not clear what the customer means by the phrase should be a way to do this. It could be

    We're trying to do XYZ. We've been told that it is blocked for security reasons, but we think that Windows should be changed to allow our scenario. How can we file a change request with the Windows security team to make an exception for us?

    Or the customer might be trying to say

    We're trying to do XYZ. We've been told that it is blocked for security reasons, but we think that there is a way to get the effect of XYZ without triggering the security issue. Can you help us find it?

    Note that in both cases, the customer either failed to ask a question or made some statements and asked for nonspecific advice, which is effectively the same as not asking a question. If they had remembered to ask a question, then that question would have clarified what they intended by the word should.

    Bonus chatter: A physicist classmate of mine got a chuckle out of the phrase flux capacitor because it combines two physics terms in an impressive-sounding but mostly nonsensical way.

    A capacitor is a device which stores electric potential. In the hydraulic analogy of electricity, a capacitor is a rubber diaphragm that separates two parts of a pipe, but which "stores" water flow by stretching and "discharges" the water flow by returning to its rest position.

    Flux is cross-sectional flow per unit time. Water flux is volumetric flow rate (liters per second per square meter): it measures how vigorously the water flows across a boundary. Magnetic flux measures the strength of a magnetic field.

    The combination is nonsensical because the units don't match. A capacitor stores potential, whereas flux is measured in current or magnetic field strength. But if you generalize the term capacitor to mean "a thing that stores stuff", then a flux capacitor is a device which stores a magnetic field.

    Such devices already exist today. They are called magnets.

  • The Old New Thing

    File version information does not appear in the property sheet for some files

    A customer reported that file version information does not appear on the Details page of the property sheet which appears when you right-click the file and select Properties. They reported that the problem began in Windows 7.

    The reason that the file version information was not appearing is that the file's extension was .xyz. Older versions of Windows attempted to extract file version information for all files regardless of type. I believe it was Windows Vista that changed this behavior and extracted version information only for known file types for Win32 modules, specifically .cpl, .dll, .exe, .ocx, .rll, and .sys. If the file's extension is not on the list above, then the shell will not sniff for version information.

    If you want to register a file type as eligible for file version extraction, you can add the following registry key:

    HKEY_LOCAL_MACHINE
     \Software
      \Microsoft
        \Windows
          \CurrentVersion
            \PropertySystem
              \PropertyHandlers
                \.XYZ
                 (Default) = REG_SZ:"{66742402-F9B9-11D1-A202-0000F81FEDEE}"
    

    (Thanks in advance for complaining about this change in behavior. This always happens whenever I post in the Tips/Support category about how to deal with a bad situation. Maybe I should stop trying to explain how to deal with bad situations.)
