September, 2004

Larry Osterman's WebLog

Confessions of an Old Fogey
  • What is this thing called, SID?

    • 24 Comments

    One of the core data structures in the NT security infrastructure is the security identifier, or SID.

    NT uses two data types to represent the SID, a PSID, which is just an alias for VOID *, and a SID, which is a more complicated structure (declared in winnt.h).

    The contents of a SID can actually be rather fascinating.  Here’s the basic SID structure:

    typedef struct _SID {
       BYTE  Revision;
       BYTE  SubAuthorityCount;
       SID_IDENTIFIER_AUTHORITY IdentifierAuthority;
       DWORD SubAuthority[ANYSIZE_ARRAY];
    } SID, *PISID;

Not a lot there, but some fascinating stuff nonetheless.  First, let’s consider the Revision.  That’s always set to 1 for existing versions of NT.  There may be a future version of NT that defines other values, but not yet.

The next interesting field in a version 1 SID is the IdentifierAuthority.  The IdentifierAuthority is an array of 6 bytes which describes which system “owns” the SID.  Essentially, the IdentifierAuthority defines the meaning of the fields in the SubAuthority array, which is an array of DWORDs that is SubAuthorityCount in length (SubAuthorityCount can be any number between 1 and SID_MAX_SUB_AUTHORITIES, currently 15).  NT’s access check logic and SID validation logic treat the sub-authority array as an opaque data structure, which allows a resource manager to define its own semantics for the contents of the SubAuthority array (this is strongly NOT recommended, btw).

    The “good stuff” in the SID (the stuff that makes a SID unique) lives in the SubAuthority array in the SID.  Each entry in the SubAuthority array is known as a RID (for Relative ID), more on this later. 

NT defines a string representation of the SID by constructing a string of the form S-<Revision>-<IdentifierAuthority>-<SubAuthority0>-<SubAuthority1>-…-<SubAuthority<SubAuthorityCount>>.  For the purposes of constructing a string SID, the IdentifierAuthority is treated as a 48-bit number.  You can convert between the binary and string forms of a SID with the ConvertSidToStringSid and ConvertStringSidToSid APIs.
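For illustration, here’s a minimal sketch (mine, and untested) of round-tripping a SID through its string form.  Both APIs allocate their output with LocalAlloc, so the caller frees it with LocalFree:

    #include <windows.h>
    #include <sddl.h>
    #include <stdio.h>

    void SidStringRoundTrip()
    {
        PSID everyoneSid = NULL;
        LPSTR stringSid = NULL;
        // "S-1-1-0" is the well known "Everyone" SID described below.
        if (ConvertStringSidToSidA("S-1-1-0", &everyoneSid))
        {
            if (ConvertSidToStringSidA(everyoneSid, &stringSid))
            {
                printf("%s\n", stringSid);   // prints S-1-1-0
                LocalFree(stringSid);        // the API allocated this string
            }
            LocalFree(everyoneSid);
        }
    }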

NT defines seven IdentifierAuthorities; they are:

    #define SECURITY_NULL_SID_AUTHORITY         {0,0,0,0,0,0}
    #define SECURITY_WORLD_SID_AUTHORITY        {0,0,0,0,0,1}
    #define SECURITY_LOCAL_SID_AUTHORITY        {0,0,0,0,0,2}
    #define SECURITY_CREATOR_SID_AUTHORITY      {0,0,0,0,0,3}
    #define SECURITY_NON_UNIQUE_AUTHORITY       {0,0,0,0,0,4}
    #define SECURITY_NT_AUTHORITY               {0,0,0,0,0,5}
    #define SECURITY_RESOURCE_MANAGER_AUTHORITY {0,0,0,0,0,9}

    Taken in turn, they are:

• SECURITY_NULL_SID_AUTHORITY: The “NULL” SID authority is used to hold the “null” account SID, or S-1-0-0.

• SECURITY_WORLD_SID_AUTHORITY: The “World” SID authority is used for the “Everyone” group; there’s only one SID in that group, S-1-1-0.

• SECURITY_LOCAL_SID_AUTHORITY: The “Local” SID authority is used for the “Local” group; again, there’s only one SID in that group, S-1-2-0.

• SECURITY_CREATOR_SID_AUTHORITY: This SID authority is responsible for the CREATOR_OWNER, CREATOR_GROUP, CREATOR_OWNER_SERVER and CREATOR_GROUP_SERVER well known SIDs, S-1-3-0, S-1-3-1, S-1-3-2 and S-1-3-3.
The SIDs under the SECURITY_CREATOR_SID_AUTHORITY are sort-of “meta-SIDs”.  Basically, when ACL inheritance is run, any ACEs that are owned by the SECURITY_CREATOR_SID_AUTHORITY are replaced (duplicated if the ACEs are inheritable) by ACEs that reflect the relevant principal that is performing the inheritance.  So a CREATOR_OWNER ACE will be replaced by the owner SID from the token of the user that’s performing the inheritance.

• SECURITY_NON_UNIQUE_AUTHORITY: Not used by NT.

• SECURITY_RESOURCE_MANAGER_AUTHORITY: The “resource manager” authority is a catch-all that’s used for 3rd party resource managers.

• SECURITY_NT_AUTHORITY: The big kahuna.  This describes accounts that are managed by the NT security subsystem.

There are literally dozens of well known SIDs under the SECURITY_NT_AUTHORITY identifier authority.  They range from NETWORK (S-1-5-2), a group added to the token of all users connected to the machine via a network, to S-1-5-5-X-Y, the logon session SID (X and Y are replaced by values specific to each logon session on the machine).

Each domain controller allocates RIDs for that domain; each principal created gets its own RID.  In general, for NT principals, the SID for each user in a domain will be identical except for the last RID (that’s why it’s a “relative” ID – the final SubAuthority value is relative to the domain described by the preceding SubAuthority values).  In Windows NT (before Win2000), RID allocation was trivial – user accounts could only be created at the primary domain controller (there was only one PDC, with multiple backup domain controllers), so the PDC could easily manage the list of RIDs that had been allocated.  For Windows 2000 and later, user accounts can be created on any domain controller, so the RID allocation algorithm is somewhat more complicated.

Clearly a great deal of effort is made to ensure the uniqueness of SIDs; if SIDs did not uniquely identify a user, then “bad things” would happen.

If you look in WINNT.H, you can find definitions for many of the RIDs for the builtin NT accounts.  To form a SID for one of those accounts, you’d initialize a SID with SECURITY_NT_AUTHORITY and set the first SubAuthority to the RID of the desired account.  The good news is that because this is an extremely tedious process, the NT security guys defined an API (in Windows XP and later) named CreateWellKnownSid, which can be used to create any of the “standard” SIDs.
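For example, here’s a sketch (untested, assuming <windows.h>) of building the LocalSystem SID (S-1-5-18) both ways:

    // The tedious way: SECURITY_NT_AUTHORITY plus the account's RID.
    SID_IDENTIFIER_AUTHORITY ntAuthority = SECURITY_NT_AUTHORITY;
    PSID systemSid = NULL;
    if (AllocateAndInitializeSid(&ntAuthority, 1, SECURITY_LOCAL_SYSTEM_RID,
                                 0, 0, 0, 0, 0, 0, 0, &systemSid))
    {
        // ... use systemSid ...
        FreeSid(systemSid);
    }

    // The easy way, on XP and later.
    BYTE buffer[SECURITY_MAX_SID_SIZE];
    DWORD sidSize = sizeof(buffer);
    if (CreateWellKnownSid(WinLocalSystemSid, NULL, buffer, &sidSize))
    {
        // buffer now contains S-1-5-18; no separate free is needed.
    }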

    Tomorrow: Some fun things you can do with a SID.

     

  • Structured Exception Handling Considered Harmful

    • 33 Comments

    I could have sworn that I wrote this up before, but apparently I’ve never posted it, even though it’s been one of my favorite rants for years.

In my “What’s wrong with this code, Part 6” post, several of the commenters indicated that I should be using structured exception handling to prevent the function from crashing.  I couldn’t disagree more.  In my opinion, SEH, if used for this purpose, takes simple, reproducible and easy-to-diagnose failures and turns them into hard-to-debug subtle corruptions.

By the way, I’m far from being alone on this.  Joel Spolsky has a rather famous piece, “Joel on Exceptions”, where he describes his take on exceptions (C++ exceptions).  Raymond has also written about exception handling (CLR exceptions).

    Structured exception handling is in many ways far worse than C++ exceptions.  There are multiple ways that structured exception handling can truly mess up an application.  I’ve already mentioned the guard page exception issue.  But the problem goes further than that.  Consider what happens if you’re using SEH to ensure that your application doesn’t crash.  What happens when you have a double free?  If you don’t wrap the function in SEH, then it’s highly likely that your application will crash in the heap manager.  If, on the other hand, you’ve wrapped your functions with try/except, then the crash will be handled.  But the problem is that the exception caused the heap code to blow past the release of the heap critical section – the thread that raised the exception still holds the heap critical section. The next attempt to allocate memory on another thread will deadlock your application, and you have no way of knowing what caused it.

The example above is NOT hypothetical.  I once spent several days trying to track down a hang in Exchange that was caused by exactly this problem – because a component in the store didn’t want to crash the store, they installed a high level exception handler.  That handler caught the exception in the heap code and swallowed it.  And the next time we came in to do an allocation, we hung.  In this case, the offending thread had exited, so the heap critical section was marked as being owned by a thread that no longer existed.
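In code, the anti-pattern looks something like this sketch (DoWork is a hypothetical routine standing in for the store component):

    __try
    {
        DoWork();    // hypothetical routine; double-frees and corrupts the heap
    }
    __except (EXCEPTION_EXECUTE_HANDLER)
    {
        // The exception is "handled", but if it was raised inside the heap
        // code, the heap critical section may still be held - the next
        // allocation on another thread deadlocks.
    }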

Structured exception handling also has performance implications.  Structured exceptions are considered “asynchronous” by the compiler – any instruction might cause an exception.  As a result of this, the compiler can’t perform flow analysis in code protected by SEH.  So the compiler disables many of its optimizations in routines protected by try/except (or try/finally).  This does not happen with C++ exceptions, by the way, since C++ exceptions are “synchronous” – the compiler knows if a method can throw (or rather, it can know if a method will not throw).

    One other issue with SEH was discussed by Dave LeBlanc in Writing Secure Code, and reposted in this article on the web.  SEH can be used as a vector for security bugs – don’t assume that because you wrapped your function in SEH that your code will not suffer from security holes.  Googling for “structured exception handling security hole” leads to some interesting hits.

    The bottom line is that once you’ve caught an exception, you can make NO assumptions about the state of your process.  Your exception handler really should just pop up a fatal error and terminate the process, because you have no idea what’s been corrupted during the execution of the code.

At this point, people start screaming: “But wait!  My application runs 3rd party code whose quality I don’t control.  How can I ensure 5 9’s reliability if the 3rd party code can crash?”  Well, the simple answer is to run that untrusted code out-of-proc.  That way, if the 3rd party code does crash, it doesn’t kill YOUR process.  If the 3rd party code that is processing a request crashes, then the individual request fails, but at least your service didn’t go down in the process.  Remember – if you catch the exception, you can’t guarantee ANYTHING about the state of your application – it might take days for your application to crash, thus giving you a false sense of robustness, but…

     

    PS: To make things clear: I’m not completely opposed to structured exception handling.  Structured exception handling has its uses, and it CAN be used effectively.  For example, all NT system calls (as opposed to Win32 APIs) capture their arguments in a try/except handler.  This is to guarantee that the version of the arguments to the system call that is referenced in the kernel is always valid – there’s no way for an application to free the memory on another thread, for example.

RPC also uses exceptions to differentiate between RPC-initiated errors and the remoted function’s own return values – the exception is essentially used as a back-channel to provide additional error information that could not be provided by the remoted function.

Historically (I don’t know if they do this currently) the NT file-systems have also used structured exception handling extensively.  Every function in the file-systems is protected by a try/finally wrapper, and errors are propagated by raising exceptions.  This way, if any code DOES throw an exception, every routine in the call stack has an opportunity to clean up its critical sections and release allocated resources.  And IMHO, this is the ONLY way to use SEH effectively – if you want to catch exceptions, you need to ensure that every function in your call stack also uses try/finally to guarantee that cleanup occurs.
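A sketch of that discipline (g_Lock and MutateState are hypothetical):

    void UpdateSharedState()
    {
        EnterCriticalSection(&g_Lock);
        __try
        {
            MutateState();    // may raise an exception somewhere deep below
        }
        __finally
        {
            // Runs during unwind as well as on the normal path, so the
            // critical section can't dangle if an exception is thrown.
            LeaveCriticalSection(&g_Lock);
        }
    }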

    Also, to make it COMPLETELY clear.  This post is a criticism of using C/C++ structured exception handling as a way of adding robustness to applications.  It is NOT intended as a criticism of exception handling in general.  In particular, the exception handling primitives in the CLR are quite nice, and mitigate most (if not all) of the architectural criticisms that I’ve mentioned above – exceptions in the CLR are synchronous (so code wrapped in try/catch/finally can be optimized), the CLR synchronization primitives build exception unwinding into the semantics of the exception handler (so critical sections can’t dangle, and memory can’t be leaked), etc.  I do have the same issues with using exceptions as a mechanism for error propagation as Raymond and Joel do, but that’s unrelated to the affirmative harm that SEH can cause if misused.

  • Running Non Admin

    • 38 Comments

There’s been a fascinating process going on over here behind the curtains.  With the advent of XP SP2, more and more people are running as non administrative users.  Well, it’s my turn to practice what I preach: I’ve taken the plunge on my laptop and my home machine, and I’m now running as a non admin user (I can’t do it on my development machine at work for the next few weeks for a variety of reasons).

    The process so far has been remarkably pain free, but there have been some “interesting” idiosyncrasies.  First off, I’ve been quite surprised at the number of games that have worked flawlessly.  I was expecting to have major issues, but none so far, with the exception of Asheron’s Call.  Annoyingly, the problem with AC isn’t the game itself, it’s with Microsoft’s Gaming Zone software, which insists on modifying files in the C:\Program Files directory. 

    Aaron Margosis’ blog posts about running as a limited user have been extremely helpful as well.

Having said that, there are some oddities I’ve noticed.  First off: There seem to be a lot of applications that “assume” that they know what the user’s going to do.  For instance, if you double click on the time in the system tray, it pops up with “You don’t have the proper privilege level to change the System Time”.  This is a completely accurate statement, since modifying the time requires the SeSystemTime privilege, which isn’t granted to limited users.  But it assumes that the reason I was clicking on the time was to change the time.  Maybe I wanted to use the date&time control panel as a shortcut to the calendar?  I know of a bunch of users who refer to double clicking on the time in the taskbar as invoking the “cool windows calendar”; they don’t realize that they’re just bringing up the standard date&time applet.  If I don’t have the SeSystemTime privilege, then why not just grey out the “OK” button?  Let me navigate the control panel, but just prevent me from changing things.

    Similarly, the users control panel applet prompts you with a request to enter your credentials.  Why?  There are lots of things a limited user can do with the users control panel applet (enumerating groups, enumerating users, enumerating group membership, setting user information).  But the control panel applet ASSUMES that the user wants to manipulate the state of the other users.  It’s certainly true that most of the useful functionality of that control panel applet requires admin access.  But it should have waited until the user attempted to perform an action that was denied before it prompted the user for admin credentials.

From my standpoint, these examples violate two of the principles of designing interfaces that involve security: 

    1)      Don’t tell the user they can’t do something until you’ve proven that they can’t do it.

    2)      Don’t assume what access rights the user needs to perform an operation. 

The date&time control panel violates the first principle.  The user might be interacting with the control panel for reasons other than changing the time.  It turns out that the reason for this is that the date&time applet violates the principle of least privilege by enabling the SeSystemTime privilege, running the control panel applet, and then disabling the privilege.  If the control panel applet had waited until the user clicked on the “Apply” button before it enabled the privilege (and then failed when enabling the privilege failed), it would have better served the user IMHO.
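For the record, “enable the privilege only around the operation” looks roughly like this sketch (not the applet’s actual code; newTime is a hypothetical SYSTEMTIME, and <windows.h> is assumed):

    HANDLE token;
    TOKEN_PRIVILEGES tp = {0};
    if (OpenProcessToken(GetCurrentProcess(),
                         TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &token))
    {
        LookupPrivilegeValue(NULL, SE_SYSTEMTIME_NAME, &tp.Privileges[0].Luid);
        tp.PrivilegeCount = 1;
        tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
        // AdjustTokenPrivileges "succeeds" even on partial failure, so check
        // GetLastError() - a limited user gets ERROR_NOT_ALL_ASSIGNED here.
        if (AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL) &&
            GetLastError() == ERROR_SUCCESS)
        {
            SetSystemTime(&newTime);           // the actual operation
            tp.Privileges[0].Attributes = 0;   // disable the privilege again
            AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL);
        }
        CloseHandle(token);
    }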

The users control panel applet violates the second principle.  In the case of the users control panel, it assumed that I was going to do something that required admin access.  This may in fact be a reasonable assumption given the work that the users control panel applet does (its primary task is to manage local group membership).  But the applet assumes up front that the user has to be an administrator to perform the action.  There may in fact be other classes of users that can access the information in the users control panel – as an example, members of the domain’s “account operators” group may very well be able to perform some or all of the actions that the users control panel applet performs.  But the control panel applet doesn’t check for that – it assumes that the user has to be a member of the local administrators group to use the control panel applet.  Interestingly enough, this behavior only happens on XP Pro when joined to a domain.  If you’re not joined to a domain, the users control panel applet allows you to change your user information without prompting you – even as a limited user.  Peter Torr also pointed out that the computer management MMC snap-in (compmgmt.msc) does the “right” thing – you can interact with the UI and perform actions (adding users to groups, etc), and it’s only when you click the “Apply” button that it fails.  The snap-in doesn’t know what’s allowed or not; it just tries the operation and reports the failure to the user.

    This is a really tough problem to solve from a UI perspective – you want to allow the user to do their work, but it’s also highly desirable that you not elevate the privilege of the user beyond the minimum that’s required for them to do their job.  The good news is that with more and more developers (both at Microsoft and outside Microsoft) running as non administrative users, more and more of these quirks in the system will be fixed.

     

    Edit: Thanks Mike :)
  • Hey, why am I leaking all my BSTR's?

    • 12 Comments

IMHO, every developer should have a recent copy of the Debugging Tools for Windows package installed on their machine (it's updated regularly, so check to see if there's a newer version).

One of the most useful leak tracking tools around is a wonderfully cool tool that's included in this package: UMDH.  UMDH allows you to take a snapshot of the heaps in a process and perform a diff of the heap over time - you run it once to take a snapshot, then run it a second time after running a particular test, and it lets you compare the differences between the two snapshots.

This tool can be unbelievably useful when debugging services, especially shared services.  The nice thing about it is that it provides a snapshot of the heap usage; there are often times when that's the only way to determine the cause of a memory leak.
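From memory (check the documentation that comes with the package for the exact syntax), the workflow looks something like this - you need user-mode stack traces enabled for the target image first:

    gflags /i myapp.exe +ust                  (enable user-mode stack traces)
    umdh -p:1234 -f:before.txt                (snapshot the heaps of process 1234)
    ... run the scenario you suspect of leaking ...
    umdh -p:1234 -f:after.txt                 (take a second snapshot)
    umdh before.txt after.txt > diff.txt      (diff the two snapshots)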

As a simple example of this: the Exchange 5.5 IMAP server cached user logons.  It did this for performance reasons; it could take up to five seconds for a call to LogonUser to complete, and that affected our ability to service large numbers of clients - all of the server threads ended up being blocked waiting on the domain controllers to respond.  So we put in a logon cache.  The cache took the user's credentials, performed a LogonUser with those credentials, and put the results into a heap.  On subsequent logons, the cache took the user's credentials, looked them up in the heap, and if they were found, it just reused the token from the cache (and no, it didn't do the lookup in clear text, I'm not that stupid).  Unfortunately, when I first wrote the cache implementation, I had an uninitialized variable in the hash function used to look up the user in the cache, and as a result, every user logon occupied a different slot in the hash table.  As a result, when run over time, I had a hideous memory leak (hundreds of megabytes of VM).  But, since the cache was purged on exit, the built-in leak tracking logic in the Exchange store didn't detect any memory leaks. 

    We didn't have UMDH at the time, but UMDH would have been a perfect solution to the problem.

    I recently went on a tear trying to find memory leaks in some of the new functionality we've added to the Windows Audio Service, and used UMDH to try to catch them.

    I found a bunch of the leaks, and fixed them, but one of the leaks I just couldn't figure out showed up every time we allocated a BSTR object.

It drove me up the wall trying to figure out how we were leaking BSTR objects; nothing I did found the silly things.  A bunch of the leaks were in objects allocated with CComBSTR, which really surprised me, since I couldn't see how on earth they would leak memory.

And then someone pointed me to this KB article (KB139071).  KB139071 describes the OLE caching of BSTR objects.  It also turns out that this behavior is described right on the MSDN page for the string manipulation functions, proving once again that I should have looked at the documentation :).

Basically, OLE caches all BSTR objects allocated in a process to allow it to pool together strings.  As a result, these strings are effectively leaked "on purpose".  The KB article indicates that the cache is cleared when OLEAUT32.DLL's DLL_PROCESS_DETACH logic is run, which is good to know, but it didn't help me to debug my BSTR leak - I could still be leaking BSTRs.

    Fortunately, there's a way of disabling the BSTR caching, simply set the OANOCACHE environment variable to 1 before launching your application.  If your application is a service, then you need to set OANOCACHE as a system environment variable (the bottom set of environment variables) and reboot.

    I did this and all of my memory leaks mysteriously vanished.  And there was much rejoicing.

     

  • What's wrong with this code, part 6

    • 45 Comments

Today, let’s look at a trace log writer.  It’s the kind of thing that you’d find in many applications; it simply does a printf and writes its output to a log file.  In order to have maximum flexibility, the code re-opens the file every time the application writes to the log.  But there’s still something wrong with this code.

    This “what’s wrong with this code” is a little different.  The code in question isn’t incorrect as far as I know, but it still has a problem.  The challenge is to understand the circumstances in which it doesn’t work.

    /*++
     * LogMessage
     *      Trace output messages to log file.
     *
     * Inputs:
     *      FormatString - printf format string for logging.
     *
     * Returns:
     *      Nothing
     *     
     *--*/
    void LogMessage(LPCSTR FormatString, ...)
    #define LAST_NAMED_ARGUMENT FormatString
    {
        CHAR outputBuffer[4096];
        LPSTR outputString = outputBuffer;
        size_t bytesRemaining = sizeof(outputBuffer);
        ULONG bytesWritten;
        bool traceLockHeld = false;
        HANDLE traceLogHandle = NULL;
        va_list parmPtr;                    // Pointer to stack parms.
        EnterCriticalSection(&g_TraceLock);
        traceLockHeld = TRUE;
        //
        // Open the trace log file.
        //
        traceLogHandle = CreateFile(TRACELOG_FILE_NAME, FILE_APPEND_DATA, FILE_SHARE_READ, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (traceLogHandle == INVALID_HANDLE_VALUE)
        {
            goto Exit;

        }
        //
        // printf the information requested by the caller onto the buffer
        //
        va_start(parmPtr, FormatString);
        StringCbVPrintfEx(outputString, bytesRemaining, &outputString, &bytesRemaining, 0, FormatString, parmPtr);
        va_end(parmPtr);
        //

        // Actually write the bytes.
        //
        DWORD lengthToWrite = static_cast<DWORD>(sizeof(outputBuffer) - bytesRemaining);
        if (!WriteFile(traceLogHandle, outputBuffer, lengthToWrite, &bytesWritten, NULL))
        {
            goto Exit;
        }
        if (bytesWritten != lengthToWrite)
        {
            goto Exit;
        }
    Exit:
        if (traceLogHandle)
        {
            CloseHandle(traceLogHandle);
        }
        if (traceLockHeld)
        {
            LeaveCriticalSection(&g_TraceLock);
            traceLockHeld = FALSE;
        }
    }

    One hint: The circumstance I’m thinking of has absolutely nothing to do with out of disk space issues. 

    As always, answers and kudos tomorrow.

     

  • Fun things to do with SIDs

    • 6 Comments

    As I mentioned in my previous article, SIDs are fascinating beasts.

Consider domain SIDs, which have the form S-1-5-21-X-Y.  But where do X and Y come from?

S-1-5-21-X is the “domain” SID (X is the domain RID).  It turns out that in current versions of NT, X is always a 96-bit random number (3 RIDs).  It’s calculated when the domain is created.  And when you establish trust relationships between domains, SIDs from the foreign domain are filtered, to protect against a foreign domain that has the same value of X (a highly unlikely occurrence, but possible).

    Now let’s consider Y.  Before Windows 2000, the allocation of Y was easy.  There was a strict domain hierarchy consisting of a primary domain controller that handled all account modifications, and backup domain controllers that handled authentication requests.  Since all account creation occurred on the primary domain controller, all the PDC did was to look in the SAM database for the highest previously allocated user RID, increment that value, assign the new value to the new user account, then write the incremented value back into the database.

For Windows 2000, the system went to a multi-master replication scheme – every domain controller could create user accounts.  To handle that case, the RID FSMO (Flexible Single Master Operations) DC (essentially a broker for RID allocations) divided the user RID space into a set of ranges.  As each DC needed more RIDs, it asked the RID FSMO DC for a new allocation, and the RID FSMO DC would assign a range to the DC.

    Now for the fun stuff.

If you are a user on a domain, it’s relatively trivial to determine the well known groups in the domain.  All you need is your current process’s user SID, which you get by calling OpenProcessToken() and then calling GetTokenInformation() asking for the TokenUser information.
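A sketch (untested, assuming <windows.h>) of that retrieval:

    HANDLE token;
    BYTE buffer[sizeof(TOKEN_USER) + SECURITY_MAX_SID_SIZE];
    DWORD returnedLength;
    if (OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &token))
    {
        if (GetTokenInformation(token, TokenUser, buffer, sizeof(buffer),
                                &returnedLength))
        {
            // The current user's SID; valid for as long as buffer is in scope.
            PSID userSid = ((PTOKEN_USER)buffer)->User.Sid;
        }
        CloseHandle(token);
    }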

    To get the domain administrators group for a given SID, you can do the following:

PSID GetWellKnownDomainSid(PSID UserSid, WELL_KNOWN_SID_TYPE DomainSidType)
{
    SID_IDENTIFIER_AUTHORITY sia = SECURITY_NT_AUTHORITY;
    PSID domainSid;
    PSID sidToReturn;
    DWORD sidSize = SECURITY_MAX_SID_SIZE;
    //
    // Check to ensure that the calling SID is an NT SID.
    //
    if (memcmp(GetSidIdentifierAuthority(UserSid), &sia, sizeof(SID_IDENTIFIER_AUTHORITY)) != 0)
    {
        return NULL;
    }
    //
    // Now build the domain SID for the specified user SID.  A domain user's
    // SID is S-1-5-21-X1-X2-X3-<user RID>, so the domain SID is the first
    // four sub-authorities.  Note that GetSidSubAuthority returns a pointer,
    // so it must be dereferenced here.
    //
    if (!AllocateAndInitializeSid(&sia, 4,
                                  *GetSidSubAuthority(UserSid, 0),
                                  *GetSidSubAuthority(UserSid, 1),
                                  *GetSidSubAuthority(UserSid, 2),
                                  *GetSidSubAuthority(UserSid, 3),
                                  0, 0, 0, 0, &domainSid))
    {
        return NULL;
    }
    sidToReturn = (PSID)LocalAlloc(LMEM_FIXED, sidSize);
    if (sidToReturn == NULL)
    {
        FreeSid(domainSid);
        return NULL;
    }
    if (!CreateWellKnownSid(DomainSidType, domainSid, sidToReturn, &sidSize))
    {
        FreeSid(domainSid);
        LocalFree(sidToReturn);
        return NULL;
    }
    FreeSid(domainSid);
    return sidToReturn;
}

So to get the Domain Admins SID for a user SID, you simply call GetWellKnownDomainSid(psid, WinAccountDomainAdminsSid);

Caveat Lector: I’ve not tested this code; it’s likely it may have bugs in it, but as a proof of concept, it’s accurate.

Comment: If you’re willing to modify the caller’s SID, the call to AllocateAndInitializeSid() can be replaced with *GetSidSubAuthorityCount(UserSid) = 4, which truncates the user’s SID to its domain SID in place.

    Edit: domain SID->domain RID

  • Cleaning up shared resources when a process is abnormally terminated

    • 18 Comments

    This post came into my suggestion box yesterday from Darren Cherneski:

    We have a system that has an in-memory SQL database running in shared memory that is created with CreateFileMapping(). Processes start up, attach to it via a DLL, do some queries, and shut down. The problem we keep running into during development is when a developer starts up a process in the debugger, performs a query (which gets a lock on a database table), and then the developer hits Shift-F5 to stop debugging, the lock on the database table doesn't get released. We've put code in the DllMain() function of the DLL to perform proper cleanup when a process crashes but DllMain() doesn't seem to get called when a developer stops a processes in the debugger.

    Windows has hundreds of system DLLs where a process can get a handle to a resource (Mutex, file, socket, GDI, etc). How do these DLLs know to cleanup when a developer hits Stop in the debugger?

    It's a great question which comes up on our internal Win32 programming alias once a month or so, and it illustrates one of the key issues with resource ownership.

The interesting thing is that this issue only occurs with named synchronization objects.  Unnamed synchronization objects are always private, so the effects of a process abnormally terminating are restricted to that process.  The other resources mentioned above (files, sockets, GDI, etc) don't have this problem, because when the process is terminated, the handle to the resource is closed, and closing that handle causes all the per-process state (locks on the file, etc) to be flushed.  The problem with synchronization objects is that, with the exception of mutexes, they have state (the signaled state) that's not tied to a process or thread.  The system has no way of knowing what to do when a handle is closed with an event set to the signaled state, because there is no way of knowing what the user intended.

    Having said that, a mutex DOES have the concept of an owning thread, and if the thread that owns a mutex terminates, then one of the threads blocked waiting on the mutex will be awoken with a return code of WAIT_ABANDONED.  That allows the caller to realize that the owning thread was terminated, and perform whatever cleanup is necessary.

Putting code in the DllMain doesn't work because, as Darren observed, the DllMain won't be called when the process is terminated abruptly (like when exiting the debugger).

    To me, the right solution is to use a mutex to protect the shared memory region, and if any of the people waiting on the mutex get woken up with WAIT_ABANDONED, they need to recognize that the owner of the mutex terminated without releasing the resource and clean up.
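The waiting side of that protocol looks something like this sketch (g_DatabaseMutex and the Repair/Use routines are hypothetical):

    DWORD waitResult = WaitForSingleObject(g_DatabaseMutex, INFINITE);
    if (waitResult == WAIT_ABANDONED)
    {
        // The previous owner terminated while holding the mutex; we now own
        // it, but the shared region may be inconsistent - repair it first.
        RepairSharedDatabaseState();
    }
    if (waitResult == WAIT_ABANDONED || waitResult == WAIT_OBJECT_0)
    {
        UseSharedDatabase();
        ReleaseMutex(g_DatabaseMutex);
    }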

Oh, and I looked: Windows doesn't have "hundreds of system DLLs where a process can get a handle to a resource".  There are actually relatively few cases in the Windows code base where a named shared synchronization object is created (for just this reason).  And all of the cases I looked at either use a mutex and handle the WAIT_ABANDONED error, or they're using a manual reset event (which doesn't have this problem), or they have implemented some form of alternative logic to manage this issue (waiting with a timeout, registering the owner in a shared memory region, and, if the timeout occurs, checking whether the owner process still exists).

The reason that manual reset events aren't vulnerable to this issue, btw, is that they don't have the concept of "ownership"; instead, manual reset events are typically used to notify multiple listeners that some event has occurred (or that some state has changed).  In fact, internally in the kernel, manual reset events are known as NotificationEvents for just this reason (auto-reset events are known as SynchronizationEvents).  Oh, and mutexes are known as Mutants internally (you can see this if you double click on a mutex object using the WinObj tool).  Why are they called mutants?  Well, it's sort-of an in joke.  As Helen Custer put it in "Inside Windows NT":

    The name mutant has a colorful history.  Early in Windows NT's development, Dave Cutler created a kernel mutex object that implemented low-level mutual exclusion.  Later he discovered that OS/2 required a version of the mutual exclusion semaphore with additional semantics, which Dave considered "brain-damaged" and which was incompatible with the original object. (Specifically, a thread could abandon the object and leave it inaccessible.)  So he created an OS/2 version of the mutex and gave it the name mutant.  Later Dave modified the mutant object to remove the OS/2 semantics, allowing the Win32 subsystem to use the object.  The Win32 API calls the modified object mutex, but the native services retain the name mutant.

    Edit: Cleaned up newsgator mess.

     

  • Access Checks, part 2

    • 4 Comments

Yesterday I discussed the format of an ACL.  For today’s post, I want to talk about how the system uses ACLs to perform access checks.  Once again, the post on security terms is likely to be helpful.

There are two forms of access check accessible from the same API – an access check can either be for a specific set of access rights, or it can be for the well known right MAXIMUM_ALLOWED – a MAXIMUM_ALLOWED access check basically grants the user as many rights as they can have, and no more.  In general, asking for MAXIMUM_ALLOWED is not advised; instead you should ask for the rights you need and no more.

    As I mentioned yesterday, Access Check takes three inputs: The user’s token, a desired access mask, and a security descriptor, and performs bitwise manipulation to determine a Boolean result: granted or not (the actual check is more complicated than that, but…).

    In a nutshell, the AccessCheck logic is as follows:

Iterate through the ACEs in the ACL.
      If the SID in the ACE is active in the user’s token, then:
            If the ACE is a grant ACE, turn off the bits in the desired access mask that correspond to the bits in the AccessMask field in the ACE.  If the resulting desired access mask is 0, grant access.
            If the ACE is a deny ACE, then if any of the bits in the ACE’s AccessMask are on in the desired access mask, deny access.

    One feature of this algorithm that merits calling out: The user is denied access by default – if you aren’t granted access to the resource by virtue of the SIDs active in your token, then you don’t have access to the resource.  Period, end of discussion.

The addition of restricted SIDs makes the access check process a smidge more complicated.  There are actually two checks performed: the first against the ACL using the normal SIDs in the user’s token; if that succeeds, a second check is done using the restricted SIDs in the user’s token.  If either fails, access is denied.  In addition, there are two types of SIDs that can appear in the token – “normal” SIDs and “deny-only” SIDs – the deny-only SIDs will never grant access, but WILL deny access (in other words, the deny-only SIDs only apply to deny ACEs, not grant ACEs).  You can find the list of SIDs used in the AccessCheck process by calling GetTokenInformation asking for the TokenGroupsAndPrivileges information level.  The TOKEN_GROUPS_AND_PRIVILEGES structure contains both lists of SIDs checked in AccessCheck: the list of SIDs used for the first check is in the Sids field, and the list of restricted SIDs used for the second check is in the RestrictedSids field.
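A quick sketch (untested, assuming <windows.h>) of retrieving those two lists:

    HANDLE token;
    DWORD length = 0;
    if (OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &token))
    {
        // First call gets the required buffer size.
        GetTokenInformation(token, TokenGroupsAndPrivileges, NULL, 0, &length);
        PTOKEN_GROUPS_AND_PRIVILEGES info =
            (PTOKEN_GROUPS_AND_PRIVILEGES)LocalAlloc(LMEM_FIXED, length);
        if (info != NULL &&
            GetTokenInformation(token, TokenGroupsAndPrivileges, info, length,
                                &length))
        {
            // info->Sids / info->SidCount: SIDs used for the first check.
            // info->RestrictedSids / info->RestrictedSidCount: the second check.
        }
        if (info != NULL)
        {
            LocalFree(info);
        }
        CloseHandle(token);
    }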

The following is a rough pseudo-code version of the AccessCheck API:

AccessCheck(desiredAccess, Token, SecurityDescriptor)
{
      //
      // Handle the implicit WRITE_DAC and READ_CONTROL access rights.
      //
      if (<SecurityDescriptor->Owner is active in Token>)
      {
            desiredAccess &= ~(WRITE_DAC | READ_CONTROL);
            grantedAccess |= (WRITE_DAC | READ_CONTROL);
      }

      //
      // Handle the ACCESS_SYSTEM_SECURITY meta access right.
      //
      if (desiredAccess & ACCESS_SYSTEM_SECURITY)
      {
            if (<SE_SYSTEM_SECURITY_PRIVILEGE enabled in Token>)
            {
                  desiredAccess &= ~ACCESS_SYSTEM_SECURITY;
                  grantedAccess |= ACCESS_SYSTEM_SECURITY;
            }
            else if (desiredAccess != MAXIMUM_ALLOWED)
            {
                  return failure;
            }
      }

      //
      // Handle the NULL DACL case.
      //
      if ((SecurityDescriptor->Control & SE_DACL_PRESENT) == 0)
      {
            return success, desiredAccess;
      }

      //
      // An empty DACL means no access.
      //
      if (SecurityDescriptor->Dacl->AceCount == 0)
      {
            return failure, grantedAccess;
      }

      //
      // If we’ve granted all the desired accesses, access is allowed.
      //
      if (desiredAccess == 0)
      {
            return success, grantedAccess;
      }

      //
      // Handle the MAXIMUM_ALLOWED meta-right.
      //
      if (desiredAccess == MAXIMUM_ALLOWED)
      {
            result = <MAXIMUM_ALLOWED access check, with normal token SIDs>
            if (result == success && <Token is restricted>)
            {
                  result = <MAXIMUM_ALLOWED access check, with restricted token SIDs>
            }
      }
      else  // Handle “normal” access rights.
      {
            result = <simple access check with normal token SIDs>
            if (result == success && <Token is restricted>)
            {
                  result = <simple access check with restricted token SIDs>
            }
      }
      return result;
}

     The MAXIMUM_ALLOWED access check is (roughly):

for (i = 0; i < SecurityDescriptor->Dacl->AceCount; i += 1)
{
      Ace = SecurityDescriptor->Dacl->Ace[i];
      if (<Ace->Sid is active in Token>)
      {
            if (Ace->AceType == ACCESS_ALLOWED_ACE)
            {
                  grantedAccess |= Ace->AccessMask;
            }
            else if (Ace->AceType == ACCESS_DENIED_ACE ||
                     Ace->AceType == ACCESS_DENIED_OBJECT_ACE)
            {
                  deniedAccess |= Ace->AccessMask;
            }
      }
}
returnedAccess = grantedAccess & ~deniedAccess;
if (returnedAccess != 0)
      return success, returnedAccess;
else
      return failure, returnedAccess;

    The “Normal” access check is (roughly):

for (i = 0; i < SecurityDescriptor->Dacl->AceCount; i += 1)
{
      Ace = SecurityDescriptor->Dacl->Ace[i];
      if (<Ace->Sid is active in Token>)
      {
            if (Ace->AceType == ACCESS_ALLOWED_ACE)
            {
                  desiredAccess &= ~Ace->AccessMask;
                  grantedAccess |= (Ace->AccessMask & ~deniedAccess);
            }
            else if (Ace->AceType == ACCESS_DENIED_ACE ||
                     Ace->AceType == ACCESS_DENIED_OBJECT_ACE)
            {
                  deniedAccess |= (Ace->AccessMask & ~grantedAccess);
                  if (desiredAccess & Ace->AccessMask)
                  {
                        return failure, desiredAccess;
                  }
            }
      }
      if (desiredAccess == 0)
      {
            return success, grantedAccess & ~deniedAccess;
      }
}
return failure, grantedAccess & ~deniedAccess;

    The big difference between the “normal” and the “maximum allowed” access check is that the normal access check has an early-out when all the desired accesses are granted, while the maximum allowed access check needs to iterate over all the ACEs to determine the full set of rights granted to the user.

    Edit: Added recommendation against using MAXIMUM_ALLOWED.

     

  • When I moved my code into a library, what happened to my ATL COM objects?

    • 6 Comments

A caveat: This post discusses details of how ATL7 works.  For other versions of ATL, YMMV.  The general principles apply for all versions, but the details are likely to be different.

My group’s recently been working on reducing the number of DLLs that make up the feature we’re working on (going from somewhere around 8 to 4).  As a part of this, I’ve spent the past couple of weeks consolidating a bunch of ATL COM DLLs.

    To do this, I first changed the DLLs to build libraries, and then linked the libraries together with a dummy DllInit routine (which basically just called CComDllModule::DllInit()) to make the DLL.

    So far so good.  Everything linked, and I got ready to test the new DLL.

For some reason, when I attempted to register the DLL, the registration didn’t actually register the COM objects.  At that point, I started kicking myself for forgetting one of the fundamental differences between linking objects together to make an executable and linking libraries together to make an executable.

    To explain, I’ve got to go into a bit of how the linker works.  When you link an executable (of any kind), the linker loads all the sections in the object files that make up the executable.  For each extdef symbol in the object files, it starts looking for a public symbol that matches the symbol.

    Once all of the symbols are matched, the linker then makes a second pass combining all the .code sections that have identical contents (this has the effect of collapsing template methods that expand into the same code (this happens a lot with CComPtr)).

    Then a third pass is run. The third pass discards all of the sections that have not yet been referenced.  Since the sections aren’t referenced, they’re not going to be used in the resulting executable, so to include them would just bloat the executable.

    Ok, so why didn’t my ATL based COM objects get registered?  Well, it’s time to play detective.

    Well, it turns out that you’ve got to dig a bit into the ATL code to figure it out.

The ATL COM registration logic gets picked up in the CComModule object.  Within that object, there’s a method RegisterClassObjects, which redirects to AtlComModuleRegisterClassObjects.  This function walks a list of _ATL_OBJMAP_ENTRY structures and calls RegisterClassObject on each structure.  The list is retrieved from the m_ppAutoObjMapFirst member of the CComModule (ok, it’s really a member of the _ATL_COM_MODULE70, which is a base class for the CComModule).  So where did that field come from?

It’s initialized in the constructor of the CAtlComModule, which gets it from the __pobjMapEntryFirst global variable.  So where does the __pobjMapEntryFirst field come from?

    Well, there are actually two fields of relevance, __pobjMapEntryFirst and __pobjMapEntryLast.

    Here’s the definition for the __pobjMapEntryFirst:

    __declspec(selectany) __declspec(allocate("ATL$__a")) _ATL_OBJMAP_ENTRY* __pobjMapEntryFirst = NULL;

    And here’s the definition for __pobjMapEntryLast:

    __declspec(selectany) __declspec(allocate("ATL$__z")) _ATL_OBJMAP_ENTRY* __pobjMapEntryLast = NULL;

    Let’s break this one down:

            __declspec(selectany): __declspec(selectany) is a directive to the linker to pick any of the similarly named items from the section – in other words, if a __declspec(selectany) item is found in multiple object files, just pick one, don’t complain about it being multiply defined.

        __declspec(allocate(“ATL$__a”)) – This one’s the one that makes the magic work.  This is a declaration to the compiler; it tells the compiler to put the variable in a section named “ATL$__a” (or “ATL$__z”).

    Ok, that’s nice, but how does it work?

    Well, to get my ATL based COM object declared, I included the following line in my header file:

            OBJECT_ENTRY_AUTO(<my classid>, <my class>)

    OBJECT_ENTRY_AUTO expands into:

    #define OBJECT_ENTRY_AUTO(clsid, class) \

            __declspec(selectany) ATL::_ATL_OBJMAP_ENTRY __objMap_##class = {&clsid, class::UpdateRegistry, class::_ClassFactoryCreatorClass::CreateInstance, class::_CreatorClass::CreateInstance, NULL, 0, class::GetObjectDescription, class::GetCategoryMap, class::ObjectMain }; \

            extern "C" __declspec(allocate("ATL$__m")) __declspec(selectany) ATL::_ATL_OBJMAP_ENTRY* const __pobjMap_##class = &__objMap_##class; \

            OBJECT_ENTRY_PRAGMA(class)

    Notice the declaration of __pobjMap_##class above – there’s that “declspec(allocate(“ATL$__m”))” thingy again.  And that’s where the magic lies.  When the linker’s laying out the code, it sorts these sections alphabetically – so variables in the ATL$__a section will occur before the variables in the ATL$__z section.  So what’s happening under the covers is that ATL’s asking the linker to place all the __pobjMap_<class name> variables in the executable between __pobjMapEntryFirst and __pobjMapEntryLast.
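Here’s a tiny free-standing sketch of the same bracketing trick (MSVC-specific, untested, with a made-up section name “mylist”):

    #pragma section("mylist$a", read)
    #pragma section("mylist$m", read)
    #pragma section("mylist$z", read)

    typedef void (*LIST_ENTRY_FN)();

    // The bookends; the linker sorts sections so $a < $m < $z.
    __declspec(allocate("mylist$a")) LIST_ENTRY_FN const listStart = NULL;
    __declspec(allocate("mylist$z")) LIST_ENTRY_FN const listEnd = NULL;

    // Any object file can contribute entries; they land between the bookends.
    void MyCallback();
    __declspec(allocate("mylist$m")) LIST_ENTRY_FN const entry1 = MyCallback;

    void WalkList()
    {
        // Skip NULL slots - the linker may pad between contributions.
        for (LIST_ENTRY_FN const *p = &listStart + 1; p < &listEnd; p += 1)
        {
            if (*p != NULL) (*p)();
        }
    }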

And that’s the crux of the problem.  Remember my comment above about how the linker works when resolving symbols?  It first loads all the items (code and data) from the OBJ files passed in, and resolves all the external definitions for them.  But none of the files in the wrapper directory (which are the ones that are explicitly linked) reference any of the code in the DLL (remember, the wrapper doesn’t do much more than simply call into ATL’s wrapper functions – it doesn’t reference any of the code in the other files).

So how did I fix the problem?  Simple.  I knew that as soon as the linker pulled in the module that contained my COM class definition, it’d start resolving all the items in that module – including the __objMap_<class>, which would then be added in the right location so that ATL would be able to pick it up.  So I put a dummy function called “ForceLoad<MyClass>” inside the module in the library, and then added a function called “CallForceLoad<MyClass>” to my DLL entry point file (note: I just added the function – I didn’t call it from any code).

    And voila, the code was loaded, and the class factories for my COM objects were now auto-registered.

    What was even cooler about this was that since no live code called the two dummy functions that were used to pull in the library, pass three of the linker discarded the code!

     

  • Types of testers and types of testing

    • 15 Comments

    In yesterday’s “non admin” post, Mat Hall made the following comment:

    "Isn't testing the whole purpose of developing as non-admin?"

    Remember, Larry is lucky enough that the REAL testing of his work is done by someone else. The last time I did any development in a team with dedicated testers, my testing was of the "it compiles, runs, doesn't break the build, and seems to do what I intended it to". I then handed it over to someone else who hammered it to death in completely unexpected ways and handed it back to me...
     

    Mat’s right and it served as a reminder to me that not everyone lives in the ivory tower with the resources of a dedicated test team.  Mea culpa.

    Having said that, I figured that a quick discussion about the kinds of testers and the types of tests I work with might be interesting.  Some of this is software test engineering 101, some of it isn’t.

    In general, there are basically four different kinds of testing done of our products.

    The first type of testing is static analysis tools.  These are tools like FxCop and PREfast that the developers run on our code daily and help to catch errors before they leave the developers machines.  Gunnar Kudrjavets has written a useful post about the tools we use that can be found here.

    The second is the responsibility of the developer – before a feature can be deployed, we need to develop a set of unit tests for that feature.  For some components, this test can be quite simple.  For example, the waveOutGetNumDevs() unit test is relatively simple, because it doesn’t have any parameters, and thus has a relatively limited set of scenarios.  Some components have quite involved unit tests.  The unit tests in Exchange server that test email delivery can be quite complicated.

    In general, a unit test functions as a “sniff test” – it’s the responsibility of the developer to ensure that the basic functionality continues to work.

    The next type of testing done is component tests.  These are typically suites of tests designed to thoroughly exercise a component.  Continuing the waveOutGetNumDevs() example above, the component test might include tests that involve plugging in and removing USB audio devices to verify that waveOutGetNumDevs() handles device arrival and removal correctly.   Typically a component covers more than a single API – all of the waveOutXxx APIs might be considered a single component, for example.

    And the last type of testing done is system tests.  The system tests are the ones that test the entire process.  So there won’t be a waveOutXxx() system test, but the waveOutGetNumDevs() API would be tested as a part of the system test.  A system test typically involves cross-component tests, so they’d test the interaction between the mixerXxx APIs and the waveOutXxx APIs. 

System tests include stress tests and non-stress tests; both are critical to the process.

    Now for types of testers.  There are typically three kinds of testers in a given organization. 

    The first type of tester is the developer herself.  She’s responsible for knowing what needs to be tested in her component, and it’s her job to ensure that her component can be tested.  It’s surprising how easy it is to have components that are essentially untestable, and those are usually the areas that have horrid bugs.

The second type of tester is the test developer.  A test developer is responsible for coding the component and system tests mentioned above.  A good test developer is a rare beast; it takes a special kind of mindset to be able to look at an API and noodle out how to break it.  Test developers also design and implement the test harnesses that are used to support the tests.  For whatever reason, each team at Microsoft has its own pet favorite test harness; nobody has yet been able to come up with a single test harness that makes everyone happy, so teams tend to pick their own and run with it.  There are continuing efforts going on to at least rationalize the output of the test harnesses, but that’s an ongoing process.

    The third type of tester is the test runner.  This sounds like a button-presser job, but it’s not.  Many of the best testers that I know of do nothing but run tests.  But their worth is in their ability to realize that something’s wrong and to function as the first line of defense in tracking down a bug.  Since the test runner is the first person to encounter a problem, they need to have a thorough understanding of how the system fits together so that (at a minimum) they can determine who to call in to look at a bug.

One of the things to keep in mind is that the skill sets for each of those jobs are different, and they are ALL necessary.  I’ve worked with test developers who don’t have the patience to sit there, install new builds, and run tests.  Similarly, most of the developers I’ve known don’t have the patience to design thorough component tests (some have, and the ability to write tests for your component is one of the hallmarks of a good developer IMHO).

     

  • Why is there a GENERIC_MAPPING parameter to the AccessCheck API?

• 1 Comment

    Ok, back to techie stuff.  I recently received the following piece of mail sent to an internal mailing list:

    How is GenericMapping used by AccessCheck function?

    I thought it would be used to map GENERIC_XXX rights in the ACEs contained by the security descriptor passed, but it seems that I'm wrong.

    It seems that the ACEs in the SD need to have the specific rights, otherwise the validation would fail.

    If the SD contains specific rights and DesiredAccess parameter also contains specific rights what's the purpose of GenericMapping parameter?

    It’s a good question, and makes a good follow up to my AccessCheck posts last week.

    The answer is subtle (really subtle). 

    The contract for the AccessCheck API specifies that if you’re requesting the MAXIMUM_ALLOWED DesiredAccess to the resource, then on return, the DWORD pointed to by the GrantedAccess parameter will contain the access rights granted.

So what happens if the pSecurityDescriptor parameter passed into the AccessCheck API specifies an object with a NULL DACL?  Remember, a NULL DACL grants full access to the object, which means that the caller has every right that they could possibly be granted.  Ordinarily, in the presence of a DACL, the MAXIMUM_ALLOWED access rights can be easily calculated – you just enable all the bits in the ACEs that apply to the user and you’re done.  But in this case, there’s no DACL on the object.  Since there’s no ACL to collect the requestor’s rights, those have to come from somewhere else.  And that somewhere else happens to be the GenericMapping parameter passed into the API.

If you call AccessCheck requesting MAXIMUM_ALLOWED access rights, and your security descriptor has a NULL DACL, then in addition to the rights you may have been previously granted (WRITE_DAC, for example, if you’re the object’s owner), all the access rights in the GENERIC_MAPPING->GenericAll field are also granted to the user.
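To make the mapping concrete, here’s a sketch of a generic mapping for a made-up resource type (the MYOBJ_* rights are hypothetical), and the MapGenericMask call a resource manager would use to translate generic bits into the specific rights that the access check works with:

    // Specific rights for a hypothetical object type.
    #define MYOBJ_READ   0x0001
    #define MYOBJ_WRITE  0x0002
    #define MYOBJ_EXEC   0x0004

    GENERIC_MAPPING myMapping =
    {
        STANDARD_RIGHTS_READ    | MYOBJ_READ,      // GenericRead
        STANDARD_RIGHTS_WRITE   | MYOBJ_WRITE,     // GenericWrite
        STANDARD_RIGHTS_EXECUTE | MYOBJ_EXEC,      // GenericExecute
        STANDARD_RIGHTS_REQUIRED | MYOBJ_READ | MYOBJ_WRITE | MYOBJ_EXEC  // GenericAll
    };

    DWORD desiredAccess = GENERIC_READ;
    MapGenericMask(&desiredAccess, &myMapping);
    // desiredAccess now contains STANDARD_RIGHTS_READ | MYOBJ_READ.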

     

  • Access Checks

    • 6 Comments

    Before I begin today’s post, a caveat: In this discussion, when I use the term “security”, I don’t mean “security defect free”, instead I mean “using the NT security subsystem”.  The two often become confused, and I just want to make sure that people know which aspect of security I’m talking about.  Also, people should refer to my glossary on security terms if there is any confusion about terms I’m using inside this post.

    Whenever discussions of NT security come up, the first comment that people make is “Oooh, security is hard”.  And I always respond “No it’s not, security in NT is just bit manipulation”.

Well, if you look at the fine-grained details (the myriad types of groups, the internal data structure of a SID, how to look up user information from the SAM database, etc) then security can seem daunting.

    But the reality is that the core of NT’s security architecture is a single routine: AccessCheck.  And the AccessCheck API is all about bit twiddling. Over the next couple of posts, I’ll describe how NT’s access check functionality works.  There are about a half a dozen functions that perform access checks, but they all share the basic structure of the AccessCheck API (in fact, internally all the NT access check functions are performed by a single routine).

    At its core, the AccessCheck API takes three inputs: The user’s token, a desired access mask, and a security descriptor.  The access mask and the user’s token are applied to the SecurityDescriptor and the API determines if the user should have access or not.

    Before we can discuss AccessCheck, we first need to discuss security descriptors (or, to be more specific, Access Control Lists (ACLs)).  In general, in NT, a security descriptor has four components:

    1)      The security descriptor owner (a SID).

    2)      The security descriptor group (a SID).

    3)      The security descriptor discretionary ACL (DACL).

    4)      The security descriptor System ACL (SACL).

First, the easiest of the four: the security descriptor group exists in NT for POSIX compliance; it’s not used as a part of the access check process…

The security descriptor owner is usually the creator of the object (if the creator is a member of the administrators group, then the owner of the object is the administrators group; this is because all administrators are considered to be interchangeable for the purposes of object ownership).  The owner of a security descriptor is automatically granted WRITE_DAC access to the object – in other words, the owner of an object can ALWAYS modify the security descriptor of that object.

    The SACL describes the auditing semantics of the object.  If the application calls an access check function that does auditing (AccessCheckAndAuditAlarm, AccessCheckByTypeAndAuditAlarm, AccessCheckByTypeResultListAndAuditAlarm, and AccessCheckByTypeResultListAndAuditAlarmByHandle), then after the access check logic is executed, the system applies the result to the SACL, and if the SACL indicates that the result should be audited, an audit entry is generated.

    The DACL’s the big kahuna.  It’s the structure that completely describes the access rights that the user has to the object being protected. 

    The DACL and SACL are both access control lists, and thus share the same structure.

    Essentially, an access control list is an ordered list of Access Control Entries (ACEs).  There are about 15 different types of access control entries; in general, they fall into four different categories:

    1)      Access Allowed ACEs – these grant access to an object for a particular SID.

    2)      Access Denied ACEs – these deny access to an object for a particular SID.

    3)      Audit ACEs – these indicate if an audit should be generated when a particular user is either granted or denied a particular access right.

    4)      Alarm ACEs – currently unused, but similar to audit ACEs – an alarm ACE would trigger a system alarm, instead of generating an audit record, when its criteria were met.

    Regardless of the type of the ACE, each ACE shares a common header, the ACE_HEADER.  There are two important fields in the ACE header – the first is the ACE type, which describes the semantics of the ACE; the second is the AceFlags – this field describes the ACL inheritance rules for this ACE.  It also contains the FAILED_ACCESS_ACE_FLAG and SUCCESSFUL_ACCESS_ACE_FLAG, which are used for the audit ACEs.
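
    As an illustration, here’s a sketch (mine, not production code) of walking an ACL ACE-by-ACE through that common header, using the real GetAclInformation and GetAce APIs:

        #include <windows.h>

        void WalkAcl(PACL acl)
        {
            ACL_SIZE_INFORMATION sizeInfo;
            DWORD i;

            if (!GetAclInformation(acl, &sizeInfo, sizeof(sizeInfo), AclSizeInformation))
            {
                return;
            }

            for (i = 0; i < sizeInfo.AceCount; i++)
            {
                PACE_HEADER aceHeader;

                if (GetAce(acl, i, (LPVOID *)&aceHeader))
                {
                    switch (aceHeader->AceType)
                    {
                    case ACCESS_ALLOWED_ACE_TYPE:   // grants rights to the ACE's SID
                        break;
                    case ACCESS_DENIED_ACE_TYPE:    // denies rights to the ACE's SID
                        break;
                    case SYSTEM_AUDIT_ACE_TYPE:     // SUCCESSFUL_/FAILED_ACCESS_ACE_FLAG
                        break;                      // in AceFlags control when it fires
                    }
                }
            }
        }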

    Tomorrow: How are ACL’s used in the access check algorithm.

     

  • Larry Osterman's WebLog

    Microsoft Exchange 5.0's credential cache security problems

    • 5 Comments

    The other day, I wrote about the Exchange 5.0 NNTP (and POP3, and Exchange 5.5 IMAP) server’s credential cache.

    Well, the credentials cache was my first experience with a customer-reported security vulnerability.  As described in Windows IT Pro magazine, the password cache resulted in a “vulnerability” being reported to Microsoft.

    The problem occurs when a user logs on to the Exchange 5.0 POP3 server with a username and password.  Then that user logs onto the domain and changes their domain password.

    Unfortunately, because of the credentials cache, the user could still use their old password!  Why was this?  Well, when they logged onto their account, we cached their token and their credentials (the username, domain, mailbox name, user password, and some extra salt, like their IP address).  On their next logon, if the credentials were still in the cache, we’d get a hit on the credentials, and we’d re-use the token.  Now, their NEW password also worked; in that case, we wouldn’t find the credentials in the cache, so we’d call LogonUser and create a new token for the user. 

    But the security researcher in this case still complained, because they were still able to use the OLD password, even though it was no longer their domain password.  And they’d be able to continue using the old password until it aged out of the cache.

    I still have the opinion that this behavior wasn’t a security vulnerability.  The credentials cache would only allow a password to be cached for a maximum of 2 hours, and if a user account was inactive for more than 15 minutes, the credentials would also be discarded.  So the longest potential period of the old credentials being considered valid was 2 hours after the user changed their password.  On the other hand, most POP3 clients polled the server every 10 minutes, which meant that if a user logged into their POP3 client and left it logged in, we’d never have to delay their “check new mail” for a domain authentication – which could take 15 or more seconds in pathological circumstances (if one of the domain controllers went down and we needed to re-discover a domain controller for the domain, for example).  So at the cost of having a user’s token in memory for 2 hours, we were able to significantly improve their experience using the Exchange POP3 server.

    We also provided registry configuration parameters to allow the user to tune any and all of these behaviors; the Windows IT Pro article indicates how to turn the cache off.

    For Exchange 2000, the NNTP, POP3, and IMAP servers were moved out of the Exchange store, and the credentials cache went away; it was replaced with a different implementation, but unfortunately I don’t know how that one worked.

    For the person who asked how we implemented the credentials cache, it was relatively simple – we took the user’s clear text username, domain, and password (and the salt mentioned above, which included their IP address), calculated the MD5 hash of the credentials, and used that as the key to a hash table.  This relied on the fact that the various clients sent the user’s credentials in plain text.  If they used NTLM or other more secure authentication protocols, then we simply used the built-in NT authentication mechanisms to log the user on.
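
    For the curious, here’s roughly what that looks like with the Win32 CryptoAPI.  This is my reconstruction of the idea, not Exchange’s actual code, and the function name is made up:

        #include <windows.h>
        #include <wincrypt.h>

        // Hash the cleartext credentials (username, domain, password, plus salt
        // such as the client's IP address) into a 16-byte MD5 digest that serves
        // as the hash table key.
        BOOL ComputeCredentialCacheKey(const BYTE *credentialBlob, DWORD blobLength,
                                       BYTE key[16])
        {
            HCRYPTPROV provider = 0;
            HCRYPTHASH hash = 0;
            DWORD keyLength = 16;
            BOOL ok = FALSE;

            if (CryptAcquireContext(&provider, NULL, NULL, PROV_RSA_FULL,
                                    CRYPT_VERIFYCONTEXT) &&
                CryptCreateHash(provider, CALG_MD5, 0, 0, &hash) &&
                CryptHashData(hash, credentialBlob, blobLength, 0) &&
                CryptGetHashParam(hash, HP_HASHVAL, key, &keyLength, 0))
            {
                ok = TRUE;
            }

            if (hash) CryptDestroyHash(hash);
            if (provider) CryptReleaseContext(provider, 0);
            return ok;
        }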

     

     

  • Larry Osterman's WebLog

    What's wrong with this code, part 6 - the answers

    • 6 Comments

    In yesterday’s post I presented a trace log writer that had a subtle bug.

    As I mentioned, the problem had nothing to do with the code itself; instead, it had to do with the directory in which the trace log file was written.

    My second hint was that the only symptom the person who wrote the code saw was that the trace logger would only ever write one line to the file, and then only after the file was deleted.

    And it turns out that that’s key to the problem.  Consider a folder with an ACL that allows the user to create files, but not write to them.  This isn’t as strange as you might imagine; it’s a common pattern to grant users the ability to create files in a directory but not to modify them once they’ve been created.  As a simple example, consider your review – you should be allowed to post your review to a shared folder, but once the review’s been posted, it’s a bad idea for you to be able to modify the document.  On the other hand, you do want to be able to read that file after it’s been posted.

    When you have a directory that is set up in this fashion, the user can create files (they have the FILE_ADD_FILE access right to the directory), but they can’t modify the files once they’re created (the OBJECT_INHERIT_ACE | INHERIT_ONLY_ACE ACE on the directory only grants GENERIC_READ access to the file).

    The thing that’s confusing about this is that the user was allowed to write to the file when it was created, but not when they opened the file subsequently. 

    The reason for this is that in NT, a user is granted full control over an object that they create, but once the object has been created, the ACLs on that object apply.  So when the LogMessage API first created the file, it was granted full access to the file, and the user was able to write their data to the file.  Once they closed the file, however, access reverted to whatever the ACL on the file granted, so for the second write, the call to CreateFile() failed.

    It turns out that this behavior has the potential to hit a large set of applications.  There is a very common design pattern intended to protect an application’s data files from corruption when saving: first create a temporary file, write the contents of the document to the temporary file, and if the temporary file was written successfully, delete the old file and rename the new file over the old one.  The application then sets the security descriptor on the renamed file to match the original security descriptor on the file.  The problem is that before the rename happens, the application closes the temporary file.  As a result, the ACL on the newly renamed file is the default ACL for the directory, which may not allow the user to modify the security descriptor on the file.

    So what WAS wrong with the code?  Well, the problem is in the call to CreateFile:

        traceLogHandle = CreateFile(TRACELOG_FILE_NAME, FILE_APPEND_DATA, FILE_SHARE_READ, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

    The 4th parameter to the CreateFile API is the SECURITY_ATTRIBUTES structure that is to be applied to the new file.  That structure contains handle inheritance information, but it also contains the security descriptor to apply to the new file.  In this case, either TRACELOG_FILE_NAME should point to a directory that the developer knows grants the user access to the file, or the developer should specify an ACL for the newly created file that grants the user access.
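
    Here’s a sketch of that second fix.  The SDDL string (granting GENERIC_ALL to Authenticated Users) and the placeholder path are purely illustrative – a real logger would pick a DACL and location appropriate to its scenario:

        #include <windows.h>
        #include <sddl.h>

        #define TRACELOG_FILE_NAME TEXT("C:\\Logs\\Trace.log")   // placeholder path

        HANDLE OpenTraceLog(void)
        {
            SECURITY_ATTRIBUTES securityAttributes = { sizeof(securityAttributes) };
            HANDLE traceLogHandle = INVALID_HANDLE_VALUE;

            // Build an explicit DACL so the new file doesn't just inherit the
            // directory's read-only ACE.
            if (ConvertStringSecurityDescriptorToSecurityDescriptor(
                    TEXT("D:(A;;GA;;;AU)"), SDDL_REVISION_1,
                    &securityAttributes.lpSecurityDescriptor, NULL))
            {
                traceLogHandle = CreateFile(TRACELOG_FILE_NAME, FILE_APPEND_DATA,
                                            FILE_SHARE_READ, &securityAttributes,
                                            OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
                LocalFree(securityAttributes.lpSecurityDescriptor);
            }
            return traceLogHandle;
        }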

    Oh, and before anyone asks why NT did such a silly thing as granting the user full control over a newly created resource, consider the “save your review to the shared folder” example above.  If the ACL on the folder applied, then you’d never be able to write the contents of the file in the first place.

    This is a general principle of all objects in NT, in case it wasn’t obvious.  It applies to all kinds of handles, from file handles to event handles to registry handles.  I’ve debugged dozens of problems in the past related to this – someone creates an object, sets some things on that object, closes the object, and later attempts to modify the object and the modification fails.  Just about every time I’ve seen it, the problem has been related to the ACL on the object.

     So Kudo’s and Mea Culpa’s:  Once again, Mike Dimmick was the first person to catch on to what I was trying to demonstrate.  He framed it in terms of a limited access service account, but the principal is the same – you can create an object but can’t re-open it for the same access after it’s inherited its ACL from the parent.

    And as usual, there were unintentional bugs in the code.  The biggest one was caught by Skywing; he correctly caught the fact that the EnterCriticalSection should be protected with a try/except since it can throw (EDIT: THIS IS NOT TRUE ON WINDOWS XP AND BEYOND, SEE BELOW).  Several people (Keith Moore, Sayavanam, Dave M) caught issues with my use of StringCbVPrintfEx (not checking return codes, not explicitly using the A version of the API).  And Niclas Lindgren had a fascinating series of posts about exception safety in applications (which I’ll address in the near future in a separate post).

    Norman Diamond also caught the fact that the cleanup code checked for the handle not being NULL, but CreateFile returns INVALID_HANDLE_VALUE (-1) on failure, not NULL. 

    Edit: Pavel Lebedinsky points out in the comments below that on XP and beyond (W2K3, Longhorn) EnterCriticalSection will NOT raise exceptions.  So Skywing's point is actually incorrect.

  • Larry Osterman's WebLog

    Putting Pants on an Elephant.

    • 2 Comments

    Valorie's enrolled in City University to get her Master's in Teaching.  Yesterday was her orientation class, and one of the lectures was entitled "How to put pants on an Elephant".

    My first reaction on hearing the title of the lecture was "Huh?"  That makes no sense whatsoever.

    But Valorie went on to explain:

    When her professor was 5 or 6 years old, his father took him to see the circus parade.  As his father explained it, he should pay close attention to the guys who walked behind the elephants, since they had the most important job in the circus.

    Well, the elephants came on by, and there were a couple of guys with brooms and shovels trotting dutifully behind them.

    "Why are those guys the most important guys in the circus?" the professor-to-be asked.

    His father replied, "Well, if they weren't there, what would happen to the elephants' poop?"

    "Oh, you're right".

    From the other side of the kid's father came a young voice: "I don't know, I'd just put pants on the elephant".

    Valorie's professor then went on to explain that the lecture was intended to discuss all of the things that could go wrong during the class and how they should be mitigated.  The class broke up into small groups, and the groups discussed things that would disrupt the education process - things like people not doing their work, people who do the work of others, people coming to the lecture with garlic on their breath, etc.  Each group came up with what they thought was the full set of things that could go wrong, and devised a plan to mitigate them.

    The more that I've thought about that lecture, the more that I've come to realize that they essentially went through the process of writing a threat model for their class - to them, the threats were the things that prevented them from having a quality learning experience.

    So when I start writing the threat model for the component my team is designing next week, I'll be thinking about putting pants on an elephant – and about what other things could make my project go wrong.

    Edit: Valorie's not "hoping" to get her degree, she's GOING to get her degree.

     

  • Larry Osterman's WebLog

    Designing an authentication system

    • 9 Comments

    I ran into this a while ago, and thought it was a wonderful discussion of how to go about designing a high quality authentication system.

    As I’ve mentioned in the past, authentication is one of the hardest problems in security – authorization (AccessCheck) is relatively simple, but authentication is a nightmare.

    This dialog, from MIT, discusses the issues that need to be considered while designing an authentication system, and the ramifications of not considering them.  All-in-all, an excellent read.

     

    On a personal note: Work’s getting very hectic, so the blog’s likely to go dark until sometime next week, sorry about that :(.

     

  • Larry Osterman's WebLog

    How Exchange's role SIDs work (aka. NT's security on Psychotropic Drugs)

    • 7 Comments

    Now that we've seen some of the things you can do with SIDs when you use them the way they were intended, let's see what you can do with SIDs when you’re willing to work outside the box.

    For Exchange 2000, we had a product requirement that we implement a feature called “Roles”.  A simple example of a “Role” in the NT base security system is the CREATOR_OWNER SID (S-1-3-0).  If you put the creator-owner SID in an ACL, it has no effect on the access check for that object, but when the ACL is inherited by a new object, the creator-owner ACE is replaced with the owner SID of the user who created the new object.

    So a role is essentially a macro that’s expanded at some point in the lifetime of the object.  For CREATOR_OWNER, it’s expanded at ACL inheritance time; for an Exchange role SID, it’s expanded at access check time.  A role allows you to define a single ACE that acts as a per-object group – essentially, when Exchange performs its access checks, the role SID is expanded to the set of people in a property on the message that’s described by the role SID.

    We couldn’t use NT security groups for roles, because of the requirement that roles be expanded on a per-object basis – you couldn’t define a separate security group for each message in a folder, for example, since that wouldn’t scale (or be manageable).  In addition, you need to have some form of administrative rights to modify the active directory, and we knew that most IT departments wouldn’t allow random users to be creating groups in the active directory.

    One of the other criteria that we had for role ACEs was that ACLs containing role ACEs had to be valid NT ACLs.  So we needed to find a way of representing roles in an ACE without breaking NT’s IsValidSecurityDescriptor() API.

    So how did we do this?

    The first thing that needs to be explained is the concept of a “resource manager”. What’s a “resource manager”?  The term “resource manager” is sort-of like the term “policy” – each person uses it in a different fashion, which means that it can be quite confusing.  To the NT security infrastructure, a resource manager is a component that is using NT security to protect some asset.  The NT filesystem is a resource manager, as is the NT registry, the NT Active Directory, etc.  Well, Exchange is also a resource manager.  I’ve written about the Exchange resource manager before here, here, and here on the Microsoft Exchange blog.  The key thing to keep in mind about a resource manager is that it completely defines the security descriptors that are contained in its store.  The ACL for a registry key will never appear in the active directory as an ACL.  Exchange’s ACLs will never appear in the NT registry, or on the NT filesystem (ok, I lied on that last one). 

    Well, if we’re writing our own resource manager, we can take advantage of that hierarchy in ways that the designers of the NT security infrastructure never imagined.  And that’s where the psychotropic drugs come into play.

    In my first SID post, I talked about the fact that SIDs are essentially hierarchical.  The Revision determines the structure of what follows the Revision in the SID, the IdentifierAuthority determines the structure of what follows the IdentifierAuthority in the SID. 

    So if we could define our own identifier authority, then we could define whatever we wanted in the portion of the SID after the identifier authority.  This is EXACTLY what SECURITY_RESOURCE_MANAGER_AUTHORITY exists for.  Essentially, if you define a SID in the SECURITY_RESOURCE_MANAGER_AUTHORITY SID authority, you get SECURITY_MAX_SID_SIZE – FIELD_OFFSET(SID, SubAuthority) (about 30-ish) bytes of data that you can set to be anything you want them to be.  The NT security infrastructure will completely ignore the data after the resource manager identifier authority.  All of the APIs that check for validity of SIDs will stop looking as soon as they see that the SID is owned by the resource manager authority.
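
    To make this concrete, here’s a sketch of minting a SID under the resource manager authority.  The payload values are made up; this illustrates the mechanism, not Exchange’s actual role encoding:

        #include <windows.h>
        #include <sddl.h>
        #include <tchar.h>
        #include <stdio.h>

        int main(void)
        {
            // {0,0,0,0,0,9}, per winnt.h
            SID_IDENTIFIER_AUTHORITY rmAuthority = SECURITY_RESOURCE_MANAGER_AUTHORITY;
            PSID roleSid = NULL;

            // Pack four DWORDs of resource-manager-private data into the
            // sub-authority array; NT validates the SID's structure but
            // assigns no meaning to these values.
            if (AllocateAndInitializeSid(&rmAuthority, 4,
                                         0x1234, 0x5678, 0x9ABC, 0xDEF0,
                                         0, 0, 0, 0, &roleSid))
            {
                LPTSTR stringSid = NULL;
                if (ConvertSidToStringSid(roleSid, &stringSid))
                {
                    // prints S-1-9-4660-22136-39612-57072
                    _tprintf(TEXT("role SID: %s\n"), stringSid);
                    LocalFree(stringSid);
                }
                FreeSid(roleSid);
            }
            return 0;
        }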

    Since security descriptors don’t bleed from one resource manager to another, you only need a single resource manager identifier authority.

    So for Exchange Roles, we encoded the information needed to expand the role (the role scope and the role property) into the role SID. When Exchange performed an access check, it took the role membership, retrieved the role property from the folder (or object) specified by the role scope, and replaced the ACE containing the role SID in the ACL with the contents of that property (of course we cached the result of this expansion, so it was performed at most once per message).

     

  • Larry Osterman's WebLog

    Dare on cost cutting at Microsoft

    • 9 Comments

    Dare Obasanjo has an insightful post here about Microsoft’s cost cutting strategies.

    Dare and I’ve had some rather vocal disagreements in the past (mostly about XML, and mostly in private ) but IMHO, he’s 100% right on here.  I fully support Microsoft initiatives to cut costs in-house, they make sense.  I griped about cutting the towel service, but it really did make sense if looked at objectively – saving $250,000 a year that was benefiting maybe a thousand people in total does make sense.  On the other hand, I’ve not YET come to a decent explanation of why not stocking office supplies is a good idea. 

    If it’s to cut costs, then it’s likely to backfire – the reaction for most groups will be to have the administrative assistant for the group buy the supplies for the group and stock them in his/her office.  This means that instead of having a single location on each floor of the building to go for supplies, each admin will maintain their own stock.  On average (at least in my building) there are two admins per floor, so now, instead of having the supplies stocked in one location, they’re stocked in two.  I can’t see how having admins waste their valuable time stocking office supplies saves us money – it may reduce the cost of stocking the supply rooms, but all it does is to move the costs around – instead of it being a single facilities expense, the expenses get moved to the individual departments.

    The only other reason I’ve come up with is to avoid pilfering – but how much of a problem is that realistically?  I know that at some companies, the pilfering problem is significant, but think of it as a cost/benefit trade-off – what is the cost of pilfering vs. the benefit of having office supplies convenient?  For example, the conference rooms (and there are 8 of them on each floor of my building) are constantly running out of either white-board markers or erasers.  If I’m having a meeting in a conference room and there’s no markers (not an unusual occurrence), someone’s got to run out and get new ones.  But if they’re not stocked on the floor, they’ve got to run all around the building trying to find markers.

    So in order to save the time of a minimum wage stocker, this policy causes a meeting attended by four or five highly paid developers to be held up for several minutes while someone searches for the supplies needed to hold the meeting.

    Sigh.  Penny wise and pound foolish is exactly right.

     

  • Larry Osterman's WebLog

    Proudly Serving our corporate masters...

    • 7 Comments

    Adam Barr, a friend and former co-worker of mine (and currently an employee over in another part of Longhorn), is the author of a wonderful book on the development process at Microsoft called “Proudly Serving My Corporate Masters”.  The story on page 327 happened to me.

    Well, Adam’s got a new book coming out, called “Find the Bug”.  His publisher’s also running a “find the bug, win an iPod” contest, everyone’s welcome to participate.  Oh, and he’s blogging too, at http://www.proudlyserving.com.

    Oh, and to save you from looking it up, here’s Adam’s version of my story (reprinted with permission):

    One person I know planned a six-week vacation a year in advance, in careful consultation with his manager.  As you might expect, the six weeks wound up coming right in the middle of a ship crunch.  He ended up leaving his family for part of the vacation and coming back to work.  Since he was technically on vacation, he was paid as a consultant.  And since he had rented out his house for the six weeks, Microsoft put him up in a hotel.

    The bit about being paid as a consultant didn’t happen, but the rest of the story (including being put up in a hotel) is true.  Basically, Valorie and I had planned this long trip to show Daniel (who was 6 months old at the time) to all the grandparents.  We were also going to Valorie’s fifth high school reunion, and to attend Worldcon in Orlando.  While we were at Mercersburg attending the reunion, we had a catastrophe happen with the NT browser – the NT browser needed to be compatible with the Windows for Workgroups browser protocol, and the Windows for Workgroups guys had changed their protocol just before they shipped.  Since I was the owner of the NT browser, and the changes were significant, I had to come back to Redmond to make them (we were weeks away from the first beta release of NT, IIRC); it couldn’t wait until I was done with my vacation (since we would have shipped by then).

    So I left Valorie and Daniel in Washington DC with her mother, came back to Redmond, worked 18-20 hour days for a week to get the changes integrated, and flew back to meet the family in New York City.  It was not the best time in my career.  But it had to be done; unfortunately, there wasn’t anyone else on the NT team who could have made those modifications (my backup was an amazing developer, but he’d never worked on the browser).

    One day I’ll tell the story of the NT browser and the browser checker. But not today.

    Edit: Cleaned up CITE usage. 

  • Larry Osterman's WebLog

    I figure this is redundant, but...

    • 2 Comments

    If there's anyone reading my blog who's not reading Raymond's, Raymond has finally proven, once and for all, that he is THE uber geek.

    One of today's blog posts is this gem - it's a visual analysis of the spam that he's received over the past ten years.

     

    I am once again officially humbled.

     

  • Larry Osterman's WebLog

    On the "Day of Caring"

    • 7 Comments

    Well, I spent Friday morning (and part of the afternoon) at the Center for Career Alternatives, down in the Rainier Valley. 

    I wasn’t going to write about it, but this comment from Mat Hall pushed me over the top (profanity removed)

    A lesser man may point out that it's a bit worrying that MSFT only take one day out a year to give a <darn>. :)

    I’ve got to say that I take serious offense at that comment.  The reality is that Microsoft people take a heck of a lot more than a day to help others.  Every year I’ve worked at Microsoft, Microsoft’s sponsored a United Way affiliated giving campaign.  Annually Microsoft employees give millions of dollars.  In 2003, Microsoft employees gave over $16 million, which was matched by the company for a total of over $32.7 million in cash donations.  That’s just the gifts of cash through the Giving Campaign; Microsoft employees give far more than that as individuals.

    The “Day of Caring” is a United Way program that gets employees of various companies out in the community helping with various projects.  The projects can range from helping to build a house with Habitat for Humanity, to talking to disadvantaged (what a horrible buzzword) youths, to weeding/beach cleanup, to painting, to sewing, etc.  There are literally hundreds of projects performed by dozens of companies as a part of the DoC – last year, over 7,000 volunteers from 190 organizations worked on 300 different projects.

    All of this doesn’t include the hundreds of hours that are spent on private contributions – individuals donating their time and money to various organizations as private individuals, and not as Microsoft employees.

    Microsoft’s dedication to corporate philanthropy goes back to Bill Gates mother, Mary Gates, who was (among other things) the chairman of United Way International.  Bill’s been heavily involved in philanthropic ventures for years, and his dedication has trickled down through the rest of the company.  Far from being a pack of self-centered millionaires, most of the people here that I know give a significant percentage of their income to charities (10% or more).

    So it is a base canard to say that Microsoft people only take one day out of a year to give a darn – most of us spend a lot more than that helping out.

     

  • Larry Osterman's WebLog

    Going Dark Tomorrow.

    • 3 Comments

    Sorry about not posting anything significant today; I've been swamped.  And tomorrow's the "Day of Caring", so I'll be talking at a local career center (along with a bunch of co-workers) about working at Microsoft, which means I'm not likely to have anything then, either.

    Sorry about that.

     

     

  • Larry Osterman's WebLog

    I thought I could avoid writing this today but...

    • 4 Comments

    For 9/11, Joel's gone dark.  It won't be there after 9/11, but today, his blog has been replaced with a Vietnam Veterans Memorial-style listing of all of the 9/11 victims (edit: Updated to permalink of memorial page).

    It's been 3 years, and my images of that day are still raw.  I woke up at 6:30 to the person on the radio (Alice Porter, at KLSY) telling us that a plane had just hit the first tower.  Valorie and I ran downstairs and watched in horror just as the second plane hit.

    We then stared, watching the screen as one tower, then the next crumbled into the ground.

    I went to work that day, but couldn't function.  Eventually an email was sent out suggesting that people might consider going home to be with their families; I took that opportunity.

    I find it difficult to even think about that time; the horror of what I saw on TV still gets to me.  I listen to the 9/11 memorial stories on the radio and have to pull over to the side of the road.

    9/11 is also my mother's birthday.  Shortly afterwards, my sister commented to her that my mom now has the suckiest birthday of anyone.  And she was right.  We talked just this morning, and it's hard not to revisit that day.  She lives on the Upper West Side, and only one of the firefighters at her neighborhood fire station came back that day - and that's counting all the firefighters on all three shifts.

    A co-worker of mine was in NYC on 9/11, and sent the following dispatches back.  I still have them in my inbox (I've stripped the names from the posts):

    -----Original Message-----
    From: <CoWorker2>
    Sent: Wednesday, September 12, 2001 10:31 AM
    Subject: FW: Another morning in Manhattan

    I am sitting on a park bench in central park. Will go back to red cross this afternoon to try to give up my blood. Seems like water is in shorter supply though. I am hoping since I can’t leave Manhattan that I can volunteer at a shelter tonight.  

    -----Original Message-----
    From: <CoWorker2>
    Sent: Wednesday, September 12, 2001 6:53 AM
    Subject: FW: Another morning in Manhattan

    This morning Manhattan is quiet. Not much traffic and not the usual din of a NYC day. Maybe it is too early for folks. Still a lot of sirens though, and the occasional military aircraft. Many businesses (including Starbucks are still closed). All of the tunnels to and from Manhattan remain closed. I can’t figure that out. Why would they not let us out? For <CoWorker1> and me though, there would be no where to go anyway since the airports are closed. I have this theory that the government thinks there may have been terrorist observers that they might still be able to identify. <CoWorker1> and I are toying with the idea of buying fresh clothes. I put a set into the hotel’s laundry system and they didn’t com back. The Red Cross has asked for socks and sweatshirts for volunteers working around the clock. <CoWorker1> and I have decided to find out where we can donate blood today. There are no news papers today in Manhattan. I may end up swimming across the Hudson for a paper and Starbucks coffee. A friend of ours from OpenCola offered to put us up at his Manhattan apartment, but the hotel will let us stay another night. I am guessing they don’t have anyone new to replace us with anyway. I heard on the news that the American Express Building is open on one side (from damage) and is being used as a morgue. On Monday, I was standing in front of it impressed by its size and the fact they are one of Microsoft’s .NET Early Adopters. The hotel phones are out so we can only use cell phones. I can’t explain that one. The hotel still has only a single door open too, and are checking for room keys. <CoWorker1's> wife's birthday was last night. My wife says we will celebrate on Saturday when we are all back in Seattle.

    -----Original Message-----
    From: <CoWorker2>
    Sent: Tuesday, September 11, 2001 9:21 AM
    Subject: Manhattan frenzy

    I think all access to and from Manhattan is blocked. North-South roads are completely gridlocked as the folks in the South are trying to get North. East-west roads are blocked by the police and other city officials, I suspect to give emergency vehicles a path South. Waves of sirens of every sort and horns – yes, an order of magnitude more than is normal for this very loud city. All businesses were shut down so the walks were filled with folks – many crying; others attempting cell phone calls. Each pay phone has lines of folks over a block long.

    I lost <CoWorker1> because I was impatient and bolted to Starbucks before he showed up at our meeting place. With no phones, I am not sure how we will link up today.

    With all the phones out, I am leaning against the glass pane of a closed Starbucks to access my email.

    Our hotel on Central Park closed all but a side door. They are asking for identification to get in. I think this is overkill but it has advantages as many folks that I am guessing don’t live in Manhattan have no where to go.

    -----Original Message-----
    From: <CoWorker2>
    Sent: Tuesday, September 11, 2001 7:50 AM
    Subject: Manhattan is a mess - <CoWorker1> and <CoWorker2> are safe in Central Park

    <CoWorker1> and I are stranded in Manhattan. Thought I would include a map to show you how close to all this we are. We were at NASDAQ on Wall Street 4 blocks away yesterday with <Sam> and <Joe>, and were 40 blocks away at the time of this tragedy meeting with HorizonLive. <Sam> and <Joe> flew to Chicago last night.

    We are safely removed to our hotel at Central Park, but it seems few here are unaffected. Many have friends and family among the 50k folks that work in those buildings. There is hardly a dry eye in this hotel and I imagine in Manhattan and maybe even most of the world.

    Phone circuits are busy so little chance for phone calls. <CoWorker1> and I were fortunate enough to contact our families shortly after the first attack.

    I expect <CoWorker1> and I will be here for awhile.

     

    I can still clearly remember being at a Wellington Elementary school PTSA fundraiser when the first planes started to fly - I remember looking up from the playground and seeing the silver arrow flying across the sky and thinking that finally the world was starting to return to normal. 

    Edit: Removed a couple of names accidentally left in the text.

     
