JeremyK's [MSFT] WebLog

  • Bridging the gaps in Software Engineering

    After reading Steve’s analogy on software development vs. building bridges, I figured it would be easier to convey some of my own thoughts in a post vs. comments.

     

    Having been raised around engineering and architectural practices and now working in the software industry, I see some very clear differences between the disciplines. I also get picked on by my father, a licensed Engineer for whom I have a lot of respect. I think one of the most important differences between software engineers and Engineers is captured by this question:

     

    When was the last time a Software Engineer had to carry Errors and Omissions insurance?

     

    I think this conveys the liability that state-certified Professional Engineers (P.E.) carry. This insurance covers the P.E. in the event a design fails, resulting in a costly mistake or, worse, the loss of life. This is the critical factor differentiating these two disciplines. To date, I do not recall a situation where we have allowed commercial software to make a decision about the well-being of a human without the possibility of human intervention. So, using the bridge analogy, is there always someone standing by to intervene and put a halt to it if a bridge is in the process of collapsing?

     

    Well, someone might say, what about an AED (Automated External Defibrillator), for example? Yes, it is true that it makes decisions through software about the treatment to administer to the heart, but consider the range of inputs and the decision tree the device has to handle. Also, humans are standing by to remove the device should it malfunction. Overall, the programming of such a device would be considered trivial in comparison to a commercial software application. I may be wrong…

     

    Enough on that point; I think we get it. I want to dig in a bit on the differences Steve points out. He is right: software itself has very little tolerance for mistakes, although in some cases it can still run with those mistakes. One wrong choice can bring down the entire application, but is that really the fault of the software development process? Consider the framework within which an application has to work… the hardware platform, the operating system, etc. Any of these elements can become a single point of failure in software. If we could build smarter systems and reduce their complexity, this would not be an issue. Engineering is by its nature the application of known technology, materials, and knowledge to solve a problem. Very rarely is a totally new material or technique developed. Researchers develop and define the expected behavior of a widget, but it’s the Engineer who determines how the widget can be applied every day.

     

    The most compelling aspect of this debate is the sheer number of decisions that must be made in software and design specifications. Sometimes software is asked to do more than it was designed to do. The random nature of data exposes design deficiencies in software. The analogy is attempting to place an aircraft carrier on a two-lane bridge. First off, the Engineer will tell you, “I didn’t design the bridge for that load.” The Engineer designed the bridge with given criteria in mind (max GVW of 10,000 lb. x 5 vehicles). But that doesn’t prevent someone from driving a few tanks over it. The same is true in software. Yes, you can attach that 5 GB file to your email message, but that doesn’t mean the software was designed to handle a file that large. As in bridge design, software is designed with certain criteria in mind; the difference is that software is expected to handle all possible criteria.
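
    To make the design-criteria point concrete, here is a minimal sketch of software that states its design load explicitly and refuses anything beyond it, much as a posted weight limit does for a bridge. The 25 MB limit and the names `MAX_ATTACHMENT_BYTES`, `DesignLimitExceeded`, and `check_attachment` are assumptions made up for illustration; they do not come from any real product.

```python
# Illustrative only: the limit and all names here are assumptions.
MAX_ATTACHMENT_BYTES = 25 * 1024 * 1024  # hypothetical design limit: 25 MB


class DesignLimitExceeded(ValueError):
    """Raised when input exceeds the load the component was designed for."""


def check_attachment(size_bytes: int) -> int:
    """Return the size if it is within the design limit, otherwise raise."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes > MAX_ATTACHMENT_BYTES:
        raise DesignLimitExceeded(
            f"{size_bytes} bytes exceeds the design limit of "
            f"{MAX_ATTACHMENT_BYTES} bytes"
        )
    return size_bytes
```

    Rejecting the 5 GB file up front is the software equivalent of the Engineer saying the bridge was never rated for that load, rather than letting the tanks roll and finding out.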

     

    I am really at odds with the comment about reinventing the wheel every time. If I recall correctly, only 30% of the code changes from one product revision to another. Even then, the fundamental elements (algorithms, data types, etc.) are still reused in code; it’s how they are organized that changes. The same goes for building a building: Engineers simply reorganize beams, rivets, etc. Software development bridges the gap between what is considered “pure research” and the application of that research (engineering). In many cases the architect or developer has to fill both shoes. But once a model of a widget is in place (research phase), engineering principles come into play in how we approach the refinement, testing, and documenting of the widget.

     

    In summary, we should honor the roots of engineering and apply tried-and-true principles where and when they are applicable in our development and support processes.

  • A critical junction in support issues: Root Cause vs. Relief

    In the lifetime of a case, particularly one of high impact, a point is reached where a decision is necessary regarding the direction of the case. This decision impacts how the situation is approached going forward. The decision that must be made is:

     

    “Do I want Root Cause or Relief?”

     

    This decision is important because there are trade-offs that have to be made.

     

    You might ask, “What is the difference?” In the case of relief, the goal is to restore service as quickly as possible: determine what is failing and prevent the failure from occurring. In the case of root cause, effort is put forth to understand the full sequence of events and conditions causing the failure in order to pinpoint the specific action that resulted in the failure. Then action is taken to address the real reason for the failure, not just its symptoms.

     

    So here is a cliché analogy (BTW, this is a fictitious story):

     

    Every 2-3 months my car’s voltage regulator goes bad. An average mechanic will simply locate the failed part and replace it. Every 2-3 months the process is repeated. A better mechanic will realize after the second time that something must be causing the voltage regulator to go bad. After diagnosing the system, she discovers the alternator is not producing enough voltage to satisfy the needs of the vehicle. The alternator is not performing within spec because the coils are starting to corrode; therefore she replaces the alternator.

     

    When it comes to computers, there is an additional challenge to Root Cause Analysis (RCA) vs. relief. The process of providing relief (replacing the voltage regulator) in most cases destroys the information (the alternator) necessary to perform root cause analysis. Relief may be to reboot the computer, but once the system is available again there may not be enough information in the various logs to determine what happened. Why not? Well, that is the never-ending dilemma of performance versus supportability. There are strong arguments on both sides of that debate that I don’t want to cover here. RCA is a labor- and time-intensive process; collecting information and examining the system in a failed state requires additional time, which isn’t acceptable to companies operating under Service Level Agreements (SLAs). In the majority of cases it takes multiple occurrences of the problem in the customer’s environment to allow for RCA to be performed.

     

    It is a common practice of mine to be very clear with customers when we arrive at that critical junction. This typically comes when state information could be lost as a result of the actions taken to provide relief (rebooting, killing a process, restoring, etc.). So can you have both Root Cause and Relief? I have to acknowledge the possibility; however, in my experience, providing both in the same steps occurs less than roughly 15% of the time.
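
    One way to improve the odds of getting both is to make "capture state first, then relieve" a single step. The sketch below is only an illustration of that ordering: `relieve_with_snapshot`, the collector names, and the JSON report format are all assumptions, and real collectors (dumps, logs, perfmon captures) would take far longer than a function call.

```python
import json
import time
from typing import Callable, Dict


def relieve_with_snapshot(
    collectors: Dict[str, Callable[[], object]],
    relief: Callable[[], None],
) -> str:
    """Run every collector, serialize the results, then apply relief.

    A collector that fails is recorded rather than allowed to block relief.
    """
    snapshot = {"captured_at": time.time(), "state": {}, "errors": {}}
    for name, collect in collectors.items():
        try:
            snapshot["state"][name] = collect()
        except Exception as exc:  # keep going: partial evidence beats none
            snapshot["errors"][name] = repr(exc)
    report = json.dumps(snapshot, default=str)
    relief()  # only now destroy the failed state (reboot, kill, restore)
    return report
```

    The trade-off from the text is still there: every collector added to the dictionary delays relief, so the list has to be agreed with the customer before the critical junction is reached.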

     

    I am curious to hear back how Root Cause vs. Relief is perceived and valued by your company when dealing with issues.

  • The stages of conversion for Internet Mail

    As Naresh pointed out in his blog post on the history of content conversion, Exchange 2000 switched our primary transport from X.400 to SMTP. As such, we also more closely integrated native MIME-formatted messages into the Exchange store by introducing the .STM file.

     

    We are unfortunately bound by the fact that our richest client (Outlook) still requires messages to be presented in a MAPI format. Up until Exchange 2000, message conversion took place immediately upon receipt by the Information Store (Exch5.5), or when received by the IMC (Exch4.0 and Exch5.0). With Exchange 2000 we introduced a delayed conversion model whereby full conversion of the message would only take place when it was absolutely needed. This introduced several phases of conversion.

     

    The first phase is when a message is first received in native SMTP/MIME format and routed into the Information Store; the entire message is then referenced by a pointer to the message in the STM file. At this point, the message lives in the SMTP temp tables while the delivery of the message is being handled.

     

    In the next phase we parse enough details from the SMTP message header, such as Sender, Date, Receiver, etc., to render the presence of a message in a folder (think about the information you see when looking at a list of messages in a folder in Outlook). At this stage we also look over the MIME data to determine whether we have any attachments and whether any of the attachments are of type ms-tnef. If a TNEF type is encountered, then all aspects of the message are immediately converted into MAPI format; this happens, for example, when a message is sent between two Outlook clients within the same organization. Otherwise, we determine whether any attachments exist in order to accurately display the “paper clip” icon beside the message in Outlook. At this point the message is delivered into the correct Inbox folder based upon the recipient.

     

    When an Outlook client attempts to open the message, we simply convert the appropriate body part of the MIME message on the fly from MIME to RTF and present it to the client. In addition, we reparse the MIME data to retrieve the details of the attachment from the MIME body part header and promote those into the EDB file. When the client closes the message, the message body still remains in the STM file and the attachment details remain in the EDB. If the client attempts to forward the message, all aspects of the message are converted from MIME to MAPI format and the representation of the message in the STM file is removed if no one else is using it.

     

    If the client attempts to open an attachment, all aspects of the attachment are promoted to the EDB file before being presented to the client. Note that during this time the original MIME message still resides in the STM file, including the attachment information to preserve the original MIME until complete conversion has been performed. The promotion is necessary to maintain the integrity of the original MIME data.

     

    Finally, if the client modifies an attachment or message body, the entire message is promoted to the EDB and references to the original MIME stream are removed from the STM file. A forward operation also forces the complete conversion of the message.
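
    The phases above can be summarized as a small state machine. This is only a reading aid built from the description in this post, not the store's actual implementation; the stage names, event names, and the `next_stage` helper are all invented for the sketch.

```python
from enum import Enum, auto


class Stage(Enum):
    RECEIVED = auto()             # MIME stream in the STM, referenced by pointer
    LISTED = auto()               # header fields parsed, delivered to a folder
    DETAILS_PROMOTED = auto()     # message opened: attachment details in the EDB
    ATTACHMENT_PROMOTED = auto()  # attachment opened: attachment copied to the EDB
    FULLY_CONVERTED = auto()      # whole message in MAPI form in the EDB


# (current stage, event) -> next stage; unlisted pairs leave the stage unchanged.
TRANSITIONS = {
    (Stage.RECEIVED, "deliver"): Stage.LISTED,
    (Stage.RECEIVED, "deliver_tnef"): Stage.FULLY_CONVERTED,
    (Stage.LISTED, "open_message"): Stage.DETAILS_PROMOTED,
    (Stage.DETAILS_PROMOTED, "open_attachment"): Stage.ATTACHMENT_PROMOTED,
}

# Events that force complete conversion regardless of the current stage.
FORCE_FULL = {"forward", "modify"}


def next_stage(stage: Stage, event: str) -> Stage:
    """Return the stage a message moves to when the given event occurs."""
    if event in FORCE_FULL:
        return Stage.FULLY_CONVERTED
    return TRANSITIONS.get((stage, event), stage)
```

    Walking a message through `deliver`, `open_message`, and `open_attachment` leaves it at `ATTACHMENT_PROMOTED` with the original MIME still in the STM; only `forward`, `modify`, or a TNEF delivery takes it to `FULLY_CONVERTED`.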

     

    Such is the life of an inbound message in Exchange…

     

  • Watch out for that 16 GB limit!

    Part of my role is to understand trends and determine how we can technically solve our supportability issues. While performing some reviews of cases related to store startup, I was shocked at the number of cases where the 16 GB limit of the Standard edition of Exchange was reached. The Standard edition of Exchange limits you to a single mailbox database and a single public folder database, each with a maximum size of 16 GB. One can argue that it’s an archaic limit, but I’ll leave the final decision up to the marketing guys on how we treat this in future products.

     

    For those of you running the Standard edition, you have an additional task: monitoring the size of the priv and pub EDB files. Once you reach this limit the store will be dismounted. The KB article below explains how you can extend the limit by 1 GB, but remember this isn’t a permanent solution. Offline defrag and other steps should be taken immediately to prevent the database from continuing to dismount unexpectedly.
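
    Monitoring can be as simple as a scheduled script that compares file sizes against the limit. The following is a minimal sketch, not a supported tool: the 90% warning threshold and the function names are assumptions, and you would pass in whatever database paths your installation actually uses. It accepts several paths so that, if the KB article indicates a companion file counts toward the limit, the sizes can be summed.

```python
import os
from typing import Iterable

LIMIT_BYTES = 16 * 1024**3   # the 16 GB Standard edition limit discussed above
WARN_FRACTION = 0.90         # assumed alert threshold; tune for your growth rate


def classify_size(size_bytes: int, limit: int = LIMIT_BYTES,
                  warn_fraction: float = WARN_FRACTION) -> str:
    """Classify a database size as 'ok', 'warning', or 'over-limit'."""
    if size_bytes >= limit:
        return "over-limit"
    if size_bytes >= warn_fraction * limit:
        return "warning"
    return "ok"


def check_database(paths: Iterable[str]) -> str:
    """Sum the sizes of the given database files and classify the total."""
    total = sum(os.path.getsize(p) for p in paths)
    return classify_size(total)
```

    A "warning" result is the cue to plan cleanup or an offline defrag on your own schedule instead of after an unexpected dismount.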

     

    Refer to this article for more details on the 16 GB limit scenario: http://support.microsoft.com/?id=828070

     

    Along these same lines, the other major issue observed was the disk-full scenario. This is true for both Standard and Enterprise editions. In many of the cases the drive(s) would simply fill up with transaction logs, database growth, or large amounts of data being stored on the same drive as the databases/logs. When this occurs, the database goes down hard, typically requiring drastic measures to get the database operational again.
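
    The disk-full case can be watched the same way. This sketch uses Python's standard `shutil.disk_usage`; the 15% free-space threshold and the function names are assumptions chosen for illustration.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction of the volume holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path: str, min_free_fraction: float = 0.15) -> bool:
    """True when free space on the volume falls below the chosen threshold."""
    return free_fraction(path) < min_free_fraction
```

    Run it against each volume that holds databases or transaction logs, since either one filling up takes the store down.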

     

    I thought this was worth mentioning, considering these two issues accounted for the largest share of cases where the stores would not mount. Both are easily managed, yet they require the greatest amount of time for recovery.

  • Still Unable to Edit User Properties in ADUC?

    This is an issue that I ended up spending 6 hours debugging, only because the answer could not be found in any documentation. I know as soon as I post this someone will post a comment saying "duh, it's documented right here <link>".

    The scenario: a customer was attempting to give a subset of administrators permission to modify certain Active Directory attributes of Exchange users (e.g. Hidden from Address Book). The permissions were correctly set on the attributes and they could be modified using other tools; however, in Active Directory Users and Computers, the check boxes corresponding to those attributes were greyed out.

    Now, let me preface this and say that I am not an expert when it comes to migration, a guru of what permissions to use in certain cases, etc. I am just stating the facts...

    The extension DLL (maildsmx.dll) used to manage Exchange users employs logic to only muck with users that are truly "Exchange" users. This includes users on Exchange 2000/2003 and Exchange 5.5. Exchange 2000/2003 users are easily identified, but Exchange 5.5 users require an extra step to validate whether they should be administered as Exchange users. This is where the problem occurs: to determine whether they are valid 5.5 recipients, a lookup is performed against the Active Directory Connector's Connection Agreements. The configuration of the connection agreements is stored under CN=Active Directory Connections,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=Domain,DC=com, which non-administrator accounts do not have access to by default.

    In order to make this work, the group in question had to be given read-only rights to this portion of the AD so that the extension could determine whether the user being administered is handled by one of the ADC connection agreements.

  • Out of Control Transaction Log files?

    An issue came up this week that I feel is worth making Exchange admins aware of. The symptom of the problem is transaction logs growing uncontrollably on the Exchange server (as many as 1000 logs per hour), with no corresponding growth in the EDB or STM files. Obviously this resulted in log file drives filling up and larger-than-expected backups.

     

    The problem is with the Entourage client and its interaction with the Exchange server. Entourage uses HTTP-DAV to interact with the Exchange server. When a user attempts to send a message larger than the maximum message size limit, the message gets stuck in the user’s Entourage Outbox because Entourage is unable to handle the error returned from the Exchange server regarding the message size. Since the user had a Send/Receive schedule set, the client would continuously attempt to submit the same message to the server. The server would accept, process, and reject the message each time, resulting in the increase in the number of transaction logs.

     

    Immediate Steps to take in this situation:

    • Disable the WWW service on the server experiencing the problem, thus preventing Entourage clients from connecting.
    • Determine the client(s) responsible by reviewing the message tracking logs, looking for messages with the same MessageID being submitted repeatedly. These messages will likely contain large attachments, so the message sizes will be large.
    • Alternatively, if you are proficient with netmon, look at a network sniff of incoming HTTP traffic to the server and determine which clients are attempting to submit messages.
    • You may also disable HTTP protocol access for the client, which likewise prevents the client from making further connections to the server.
    • Remove the messages from that client’s Outbox.
    • Adjust the schedule on the client so it is not continuously attempting to submit the message.
    • Re-enable the WWW service, HTTP protocol access, etc.
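
    For the message tracking step, the search boils down to counting how often the same sender submits the same MessageID. The sketch below assumes the tracking log entries have already been exported and reduced to (sender, MessageID) pairs; the actual log layout is not described here, and `repeated_submissions` and its threshold of 5 are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def repeated_submissions(
    records: Iterable[Tuple[str, str]],
    threshold: int = 5,
) -> List[Tuple[str, str, int]]:
    """Return (sender, message_id, count) for pairs seen at least `threshold` times.

    `records` is an iterable of (sender, message_id) pairs; the result is
    sorted with the most frequently resubmitted message first.
    """
    counts = Counter(records)
    hits = [(sender, msg_id, n) for (sender, msg_id), n in counts.items()
            if n >= threshold]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

    A client stuck in the resubmit loop described above should stand out immediately, because its count grows with every scheduled Send/Receive.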

     

    [Edit 01/04/05]: Adjusted immediate steps so they are clearer.

  • Learning Essentials in Exchange PSS

    Learning Essentials

     

    Ever wonder how we internally develop our people? How our top people continue to develop? As one might expect, we do REQUIRE the technical staff of Product Support Services (PSS) to maintain their MCSE/MCSD certification. However, this is largely accomplished on the individual's own time. The program new hires are put through depends largely on the major product they will be supporting. In Exchange, new hires receive approximately 3 weeks of compressed training covering the Operating System, Active Directory, and Exchange. In addition, further training is provided by the “specialty team” the individual will be a member of, such as Administration, Client, etc. In the majority of cases, this training stems directly from the content and whitepapers that have been published and are freely consumable by customers.

     

    Readiness and Sustained Education is the internal program that provides training related to new products and features, such as Service Pack releases, feature additions, in-depth knowledge of a product, etc. Various teams work together to determine where knowledge is lacking based upon analysis of the cases that are coming in. One example is Storage Area Network concepts: they are not directly related to Exchange, but engineers are required to have some conceptual understanding of the devices. Further training is also provided in the Exchange group in a triage format, where engineers take turns covering topics they are strong in, such as performance monitor analysis, cluster recovery, mail flow troubleshooting, etc.

     

    So what about Escalation Engineers? It isn’t what you would typically think… years of technical training? Not even close. It is all about architecture and fundamentals. In a recent internal meeting, a panel of EEs was asked how they stay abreast of technology. The answer was simple: we don’t. We rely upon a fundamental understanding of how things work. What I am referring to is how the Operating System platform works, specifically how a process interacts with the operating system it runs on.

     

    Regardless of whether you are running a basic command line application or a sophisticated server-based application such as Exchange, the rules are the same. If you want to synchronize threads within the process, you use one of a half-dozen synchronization mechanisms (e.g. critical section, mutex, timed event, etc.). If you want to transmit something over TCP/IP from a user-mode application, you have to use Winsock or some wrapper for it. If you are in kernel mode, then you have to submit requests via IRPs (I/O request packets) to the tcpip driver. Allocating memory and interfacing with the network both require adherence to the rules set forth by the operating system.
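
    The same rule shows up in every runtime, whatever the wrapper is called. As a minimal sketch in Python (used here only for brevity; `LockedCounter` and `run` are made-up names), a `threading.Lock` plays the role that a critical section or mutex would play in native code: every thread has to go through it before touching the shared value.

```python
import threading


class LockedCounter:
    """A shared counter whose updates are serialized by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.value = 0

    def increment(self) -> None:
        # Only one thread at a time may perform the read-modify-write.
        with self._lock:
            self.value += 1


def run(workers: int = 4, iterations: int = 10_000) -> int:
    """Increment one counter from several threads and return the final value."""
    counter = LockedCounter()

    def work() -> None:
        for _ in range(iterations):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

    With the lock in place the result is always `workers * iterations`; knowing that such a primitive must exist somewhere is what lets you reason about an unfamiliar application.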

     

    So, for example, say we introduce a new log file replication service. Here are the questions I would immediately attempt to answer:

     

    Is it a push or a pull scenario?

    Who initiates the replication?

    How frequent is the replication?

    What is the rule used to govern when something should be replicated?

    What communication mechanism does it use? RPC? Straight Sockets?

    How does it know about its target replication partner?

     

    These questions are applicable to any replication scenario; these are the rules of replication, so to speak… So where is the new information here, you might ask? Well, this may not be new, but when it comes down to applying these concepts, people get caught up in the details and forget to understand the big picture.

     

    Lastly, the key attribute in EEs is their attitude. There is an inherent drive to learn, to develop continuously, and, most importantly, to accept responsibility for resolving a customer issue.

     

    So what do I recommend as essential reading? It would be a course on the platform, programming, and debugging. Regardless of the application, it will adhere to using the facilities provided by the operating system. Understanding how the platform works provides the essentials to be able to look at and reverse engineer any application on the fly.

     

    • Inside Windows 2000, Third Ed. (ISBN 0735610215)
    • Programming Server-Side Applications for Microsoft Windows 2000 (ISBN 0735607532)
    • Programming Windows, Fifth Edition (ISBN 157231995X)
    • Windows NT Device Driver Development (ISBN 1578700582)
    • The C++ Programming Language (Special 3rd Edition) (ISBN 0201700735)
    • Algorithms in C++ (ISBN 0201510596)
    • Debugging Applications (ISBN 0735608865)
    • The help file with Windows Debuggers download (http://www.microsoft.com/whdc/devtools/debugging/default.mspx)
    • Transaction Processing : Concepts and Techniques (Morgan Kaufmann Series in Data Management Systems) (ISBN 1558601902)

     

     

  • VSAPI Postings just started going live on Exchange Team Blog!

    I've been putting together several postings regarding the VSAPI including background, architecture, performance, and troubleshooting analysis for the Exchange Team blog.

    The first one just went live...
    http://blogs.msdn.com/exchange/archive/2004/10/20/245157.aspx

    Keep an eye out for future posts regarding the VSAPI over the coming days...

     

    - Jeremy

  • Do you know your critical code paths?

    In my role I interact with a lot of developers, and it is continuously amazing how naive an individual can be regarding the interactions their own code has with other components. I'll shorten this example to the critical points... In one instance I recall conversing with a developer regarding a performance issue in their code that was causing delays in responding to requests. In this code path we were looking up configuration information to make a decision on how to handle a particular request. The lookup of the configuration information resulted in a call to the Active Directory. In this customer's case they had a stale DNS record that was periodically impacting the lookups to GC/DCs. Because the developer who wrote the client code wasn't aware of how the lookup worked (it was written by another team), they didn't realize a dependency had been added to their code path on outbound RPC, network connectivity, and inevitably DNS.

    This situation took over two weeks to nail down. Why did it take two weeks? Several factors: issue timing, customer scheduling, data transfer...

    1. It wasn't clear from the customer's explanation of the problem whether the performance issue was server or client side, so we needed to take a network monitor trace between the client and server. Looking at the trace, the requests were clearly making it to the server and the server was taking a long time to respond.
    2. Hmm, so the problem is server side; let's take a look at perfmon - yep, requests are taking a while to get a response back to the client. Guess we need to take some user mode dumps of the process.
    3. Found the client request thread... hmm... it's waiting on a DS lookup. Looking backwards in the stack we find where the lookup is made. When I simply discussed it with the dev, they were very reluctant to believe it was occurring until I showed them the call stack. The dev had just discovered the dependency... oops. They were never aware of it because it always just "worked".
    4. Looking at a second netmon trace taken server side, we could see the server attempting to communicate with the directory and timing out, which eventually led us to the stale DNS record.

    This is just one example of how important it is to be aware of who is calling your code, what components you call, and the interaction of each. You may have the most benign piece of code that does a quota calculation involving a user lookup and isn't itself critical; however, someone could easily pick it up and use it in a critical code path. Now that benign piece of code all of a sudden becomes critical, since it is called anytime someone saves data.

    This might sound silly and most people may respond with a big "Duh, I know that", but it still happens more often than not.

    I would highly recommend that when looking at design elements, you look at them from END to END, not just your piece of the puzzle. Document the critical code paths and know the components used at every stage in those critical code paths. Customers don't call stating, "Hey, that configuration caching thingy is behaving poorly".

    You need to know your Ins and Outs... What are the paths where I accept data and process it, and what are the paths where I communicate externally? How can I know how far a request has gotten down a critical code path and where the request had problems?

    This leads to my next post... supportability in your code.
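
    One lightweight way to answer those last two questions is to record each stage of a request as it runs. The following is a sketch under my own assumptions, not code from any product: `RequestTrace`, its `stage` context manager, and the stage names are all hypothetical.

```python
import time
from contextlib import contextmanager
from typing import List, Optional, Tuple


class RequestTrace:
    """Records how far a request got and how long each stage took."""

    def __init__(self, request_id: str) -> None:
        self.request_id = request_id
        self.stages: List[Tuple[str, float, str]] = []  # (name, seconds, outcome)

    @contextmanager
    def stage(self, name: str):
        """Time one stage; record 'ok' or the failure, then re-raise on error."""
        start = time.perf_counter()
        try:
            yield
        except Exception as exc:
            elapsed = time.perf_counter() - start
            self.stages.append((name, elapsed, f"failed: {exc!r}"))
            raise
        else:
            self.stages.append((name, time.perf_counter() - start, "ok"))

    def last_stage(self) -> Optional[str]:
        """Name of the last stage entered, i.e. how far the request got."""
        return self.stages[-1][0] if self.stages else None

    def slowest(self) -> Optional[str]:
        """Name of the stage that took the longest."""
        return max(self.stages, key=lambda s: s[1])[0] if self.stages else None
```

    Wrapping the configuration lookup from the example above in something like `with trace.stage("directory_lookup"):` would have pointed at the directory dependency in the first slow request, instead of after dumps and two network traces.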

  • Don't believe everything the debugger is telling you!!! (aka Rootkit)

    <sigh>

     

    So this issue was incredibly interesting, but you know at the end it just makes you mad. This past week a customer was having a very clear-cut, reproducible issue: when attempting to attach an item to a message via OWA, the Information Store would crash with what appeared to be stack corruption. Boy, I didn’t know where this would take me and everyone else that got drawn into this issue. Below is our debug journey to understand what was happening. The conclusion was a collective effort of several Escalation Engineers, so I am not taking full credit for all of this investigation… The information below may make it look easy, but it had us going in an infinite loop for a couple of hours <grin>

     

    First the stack looked like this after the crash (yes I had correct symbols in place):

    0:058> kb
    ChildEBP RetAddr  Args to Child
    WARNING: Frame IP not in any known module. Following frames may be wrong.
    7c576b4e f8458d51 0875ff50 ffffb7e8 0fc085ff 0x77f87cac

     

    So based upon my past experience with stack corruption, I looked at a raw dump of the stack, which showed that over and over an attempt was being made to locate an exception handler, which chewed up the remnants of the original stack. Fortunately the customer allowed me to perform a remote debug, so I didn’t have to decipher the mess left on the stack or attempt to find the real .cxr (context record); besides, there were errors when this userdump was created related to not being able to read memory… so we really couldn’t trust everything in the userdump. It turns out there was a valid reason for this error...


    So after manually walking through the functions that are called, I eventually found the crash was occurring in this code path:

     

    ChildEBP RetAddr 
    25b0cf10 001f3cae ESE!ErrOSSLVFileOpen+0x61
    25b0d68c 001edb85 ESE!ErrSLVSetColumnFromFile+0x5b
    25b0dc84 001c4bc8 ESE!ErrSLVSetColumn+0x4d9
    25b0dcc0 001c49a6 ESE!ErrIsamSetColumn+0x1ec
    25b0dcd8 001c48f1 ESE!ErrESESetColumn+0x6f
    25b0dd10 001c4885 ESE!JetSetColumnEx+0x4d
    25b0dd54 004041d1 ESE!JetSetColumn+0x4c
    25b0dd84 0040f710 store!JetSetColumnFn+0x4a
    25b0ddc8 004a78ba store!JTAB_BASE::EcSetColumn+0x1aa
    25b0de08 004a50ea store!ATTACH::EcFlushFileHandleProps+0x94
    25b0de9c 004a45c9 store!ATTACH::EcFlushProps+0xac7
    25b0df18 004a42d6 store!ATTACH::EcSaveChanges2+0x4a1
    25b0df30 004a429a store!EcSaveOneAttach+0x19
    25b0df58 213bb966 store!EcSaveChangesAttachOp2+0x81
    25b0df6c 213bb7c2 EXOLEDB!CStoreLogon::ScSaveChangesAttach+0x42
    25b0e030 213bb43d EXOLEDB!CWMMultiPartFormStream::ScCommitAttachment+0x1c6
    25b0e06c 213c0233 EXOLEDB!CWMMultiPartFormStream::FWrite+0x9b
    25b0efe0 2138ddb0 EXOLEDB!CRequestStream::ScProcessContinuation+0x8e
    25b0f014 2138d9c9 EXOLEDB!ScDispatchRequest+0x1e1
    25b0fec8 2138d88b EXOLEDB!CDavServer::ProcessItem+0x94

     

    So in this code path we are taking in the information about the file and creating a file handle by calling NtCreateFile(). This is the call that causes the crash. Simply stepping into this call instruction crashes the store. This means the problem is somewhere in NtCreateFile.

     

    0:014>r
    eax=20b9cb90 ebx=80000000 ecx=20b9cbc0 edx=20b9cba8 esi=00000000 edi=0000004b
    eip=001f429f esp=20b9cb54 ebp=7c576b4e iopl=0         nv up ei pl zr na po cy
    cs=001b  ss=0023  ds=0023  es=0023  fs=0038  gs=0000             efl=00000247
    ESE!ErrOSSLVFileOpen+0x3b7:
    001f429f ff159c253200 call dword ptr [ESE!pfnNtCreateFile (0032259c)]{ntdll!ZwCreateFile (77f87cac)} ds:0023:0032259c=77

     

    OK, so I dumped the parameters about to be passed to NtCreateFile (not shown) and nothing looked out of the ordinary. Next was to unassemble NtCreateFile and see what was about to happen:

     

    0:014> u eip
    ntdll!ZwCreateFile:
    77f87cac b820000000       mov     eax,0x20
    77f87cb1 8d542404         lea     edx,[esp+0x4]
    77f87cb5 cd2e             int     2e
    77f87cb7 c22c00           ret     0x2c

     

    Still nothing out of the ordinary in comparison to a working system, so stepping through NtCreateFile actually gets me into ZwCreateFile (the backend function that is really called). Here are the register states as we sit poised at the first instruction of ZwCreateFile.

     

    0:014> t
    eax=20b9cb90 ebx=80000000 ecx=20b9cbc0 edx=20b9cba8 esi=00000000 edi=0000004b
    eip=77f87cac esp=20b9cb50 ebp=7c576b4e iopl=0         nv up ei pl zr na po cy
    cs=001b  ss=0023  ds=0023  es=0023  fs=0038  gs=0000             efl=00000247
    ntdll!ZwCreateFile:
    77f87cac b820000000       mov     eax,0x20

     

     

    So I executed the first assembly instruction of ZwCreateFile, but $#%* what happened? EIP didn’t advance to the next instruction as expected; there is a weird value for EIP now, and the values of all the other registers were exactly the same. We didn’t skip ahead and execute the int 2e instruction to cause a kernel mode transition, so what the heck happened?

     

    0:014> p
    eax=20b9cb90 ebx=80000000 ecx=20b9cbc0 edx=20b9cba8 esi=00000000 edi=0000004b
    eip=7ffa488d esp=20b9cb50 ebp=7c576b4e iopl=0         nv up ei pl zr na po cy
    cs=001b  ss=0023  ds=0023  es=0023  fs=0038  gs=0000             efl=00000247
    7ffa488d ??               ???
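
    As a sanity check, the expected result of that single step can be worked out from the disassembly shown earlier: `mov eax,0x20` is encoded in five bytes (b8 20 00 00 00), so EIP should have moved from 77f87cac to 77f87cb1 and EAX should have become 0x20. A small Python sketch of that arithmetic, with variable names that are purely illustrative:

```python
# Values copied from the debugger output above.
start_eip = 0x77F87CAC
mov_encoding = bytes.fromhex("b820000000")  # opcode b8 + 32-bit immediate
observed_eip = 0x7FFA488D
observed_eax = 0x20B9CB90

# After one step EIP should point just past the instruction, and EAX
# should hold the little-endian immediate that follows the b8 opcode.
expected_eip = start_eip + len(mov_encoding)
expected_eax = int.from_bytes(mov_encoding[1:], "little")

print(f"eip expected {expected_eip:08x}, observed {observed_eip:08x}")
print(f"eax expected {expected_eax:08x}, observed {observed_eax:08x}")
```

    Neither register matches, which is what made this so baffling: the instruction shown in the disassembly is not what the processor actually ran.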

     

    So what instruction(s) are these?

     

    0:014> u eip
    7ffa488d ??               ???
               ^ Memory access error in 'u eip'

     

    Why couldn’t we read this memory? This is a live debug; it should be readable regardless of whether it is mapped in the process or not. It turns out this is the reason why the userdump could not be created successfully. So I reset the repro and stepped through ESE!ErrOSSLVFileOpen again to ensure we didn’t miss anything… and of course we found nothing. At this point the impossible was happening... a benign instruction was blowing up. So it’s time to get a different perspective on this, so I asked a few people to give me their thoughts... well, a few hours later there were 8 people huddled over my desk, all scratching their heads.

     

    Hmmm… time to break out the kernel debugger :)

     

    OK, so once again we walked to the point in assembly of the instruction that failed and had the customer generate a kernel dump.

     

    So now looking at the kernel dump, let's first find our process (note output is truncated for brevity):

    0: kd> !process 0 0

    **** NT ACTIVE PROCESS DUMP ****

     

    PROCESS ff629400  SessionId: 0  Cid: 0f54    Peb: 7ffdf000  ParentCid: 0130

        DirBase: 041eb000  ObjectTable: 81432428  TableSize: 3056.

        Image: store.exe

     

     

    PROCESS fd43e1e0  SessionId: 0  Cid: 0f5c    Peb: 7ffdf000  ParentCid: 0df0

        DirBase: 046c7000  ObjectTable: 811a54c8  TableSize: 265.

        Image: cdb.exe

     

    Set the context to the store.exe process:

    0: kd> !process ff629400 1

    PROCESS ff629400  SessionId: 0  Cid: 0f54    Peb: 7ffdf000  ParentCid: 0130

        DirBase: 041eb000  ObjectTable: 81432428  TableSize: 3056.

        Image: store.exe

        VadRoot fd434fe8 Clone 0 Private 6435. Modified 445462. Locked 67.

        DeviceMap 8208c128

        Token                             e2c4a470

        ElapsedTime                        0:14:19.0750

        UserTime                          0:00:07.0809

        KernelTime                        0:00:04.0998

        QuotaPoolUsage[PagedPool]         325628

        QuotaPoolUsage[NonPagedPool]      207428

        Working Set Sizes (now,min,max)  (58595, 50, 345) (234380KB, 200KB, 1380KB)

        PeakWorkingSetSize                61146

        VirtualSize                       560 Mb

        PeakVirtualSize                   561 Mb

        PageFaultCount                    340776

        MemoryPriority                    BACKGROUND

        BasePriority                      8

        CommitCharge                      9763

        DebugPort                         e2602040

     

    Now let's go and see the state of the memory range that could not be read from usermode (truncated output):

    0: kd> !vad fd434fe8

    VAD     level      start      end    commit

    <snip>

    fb8882e8 ( 7)      7ffa0    7ffa4         5 Private      EXECUTE_READWRITE

     

    So can we dump this memory? You bet…

     

    0: kd> dd 7ffa0000

    7ffa0000  000000e8 b62d5800 c300405d 3d2d2e5f

    7ffa0010  6361485b 2072656b 65666544 7265646e

    7ffa0020  2e2d3d5d 0000005f 00000000 00000800

    7ffa0030  72656b00 336c656e 6c642e32 6553006c

    7ffa0040  73614c74 72724574 4300726f 74616572

    7ffa0050  69614d65 6f6c736c 47004174 614d7465

    7ffa0060  6c736c69 6e49746f 57006f66 65746972

    7ffa0070  656c6946 61655200 6c694664 6c430065

     

    What is the state of this entry in the Page Tables?

    0: kd> !pte 7ffa0000

    7FFA0000  - PDE at C03007FC        PTE at C01FFE80

              contains 0D634867      contains 035BC8C6

            pfn d634 --DA--UWV       not valid

                                   Transition:   35bc

                                   Protect:  6

     

    0: kd> !pte 7ffd4000

    7FFD4000  - PDE at C03007FC        PTE at C01FFF50

              contains 0D634867      contains 028070C2

            pfn d634 --DA--UWV       not valid

                                   PageFile    1

                                   Offset 2807

                                   Protect:  6
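
    The two !pte outputs above can be reproduced by hand. Here is a minimal Python sketch, assuming the non-PAE x86 layout for a not-valid PTE (bit 0 valid, bits 5-9 protection, bit 10 prototype, bit 11 transition, top 20 bits either a page frame number or a pagefile offset) and the classic self-map bases 0xC0000000 / 0xC0300000. The function names are mine and not part of any debugger API.

```python
def decode_invalid_pte(pte):
    """Split a not-valid x86 (non-PAE) PTE into its software fields (assumed layout)."""
    fields = {
        "valid": pte & 1,
        "protect": (pte >> 5) & 0x1F,
        "prototype": (pte >> 10) & 1,
        "transition": (pte >> 11) & 1,
    }
    high20 = pte >> 12
    if fields["transition"]:
        # Page is still in physical memory: the top 20 bits are the PFN.
        fields["pfn"] = high20
    else:
        # Page lives in a pagefile: bits 1-4 pick the file, top 20 bits the offset.
        fields["pagefile"] = (pte >> 1) & 0xF
        fields["offset"] = high20
    return fields


def pte_address(va):
    """Virtual address of the PTE that maps va (self-map base assumed to be 0xC0000000)."""
    return 0xC0000000 + (va >> 12) * 4


def pde_address(va):
    """Virtual address of the PDE that maps va (base assumed to be 0xC0300000)."""
    return 0xC0300000 + (va >> 22) * 4


for va, pte in ((0x7FFA0000, 0x035BC8C6), (0x7FFD4000, 0x028070C2)):
    print(hex(va), hex(pde_address(va)), hex(pte_address(va)), decode_invalid_pte(pte))
```

    Both entries decode to Protect 6 under this layout; the first is a transition page with frame 35bc and the second sits in pagefile 1 at offset 2807, matching what the debugger printed.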

     

    OK, so the value of ‘Protect’ explains why we can access this memory from usermode, but why? This condition isn’t normal.  Hmmm… let's look at the memory again using the dc command:

     

    0: kd> dc 7ffa0000

    7ffa0000  000000e8 b62d5800 c300405d 3d2d2e5f  .....X-.]@.._.-=

    7ffa0010  6361485b 2072656b 65666544 7265646e  [Hacker Defender

    7ffa0020  2e2d3d5d 0000005f 00000000 00000800  ]=-._...........

    7ffa0030  72656b00 336c656e 6c642e32 6553006c  .kernel32.dll.Se

    7ffa0040  73614c74 72724574 4300726f 74616572  tLastError.Creat

    7ffa0050  69614d65 6f6c736c 47004174 614d7465  eMailslotA.GetMa

    7ffa0060  6c736c69 6e49746f 57006f66 65746972  ilslotInfo.Write

    7ffa0070  656c6946 61655200 6c694664 6c430065  File.ReadFile.Cl

    0: kd> dc

    7ffa0080  4865736f 6c646e61 65470065 766e4574  oseHandle.GetEnv

    7ffa0090  6e6f7269 746e656d 69726156 656c6261  ironmentVariable

    7ffa00a0  65470057 646f4d74 46656c75 4e656c69  W.GetModuleFileN

    7ffa00b0  41656d61 70754400 6163696c 61486574  ameA.DuplicateHa

    7ffa00c0  656c646e 65724300 50657461 65636f72  ndle.CreateProce

    7ffa00d0  00417373 74697845 65726854 43006461  ssA.ExitThread.C

    7ffa00e0  74616572 72685465 00646165 61657243  reateThread.Crea

    7ffa00f0  69506574 50006570 4e6b6565 64656d61  tePipe.PeekNamed

     

    <lights and sirens go off>

    Hacker Defender??? What the heck is that? (I’ll let you go dig around the web for that thing and remember to read the disclaimer on my blog before you do.)
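
    If you are wondering how the dd output turns into the text that dc shows, here is a small Python sketch of the conversion. It assumes nothing beyond x86 being little-endian, so each DWORD has to be unpacked low byte first. The helper name is made up for illustration.

```python
import struct


def dwords_to_ascii(dwords):
    """Render DWORDs the way the dc command's right-hand column does."""
    # "<I" packs each value as a little-endian 32-bit integer, which restores
    # the byte order the data actually has in memory.
    raw = struct.pack("<%dI" % len(dwords), *dwords)
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# The row at 7ffa0010 from the dump above.
print(dwords_to_ascii([0x6361485B, 0x2072656B, 0x65666544, 0x7265646E]))
```

    Running it on the row at 7ffa0010 gives "[Hacker Defender", which is why the plain dd view did not jump out at us the first time.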

     

    So this is the thread waiting in the usermode debugger at the time the kernel dump was taken:

     

     

           THREAD 8477f020  Cid f54.fd8  Teb: 7fee1000  Win32Thread: a217aea8 WAIT: (WrLpcReply) KernelMode Non-Alertable
    SuspendCount 1
                8477f208  Semaphore Limit 0x1
            Waiting for reply to LPC MessageId 00034466:
            Pending LPC Reply Message:
                e2602041: [40000000,01000000]
            Not impersonating
            Owning Process ff629400
            Wait Start TickCount    244909        Elapsed Ticks: 1690
            Context Switch Count    1200                   LargeStack
            UserTime                  0:00:00.0004
            KernelTime                0:00:00.0006
            Start Address KERNEL32!BaseThreadStartThunk (0x7c574333)
            Win32 Start Address MDBTASK!MdbTaskPoolThread (0x61bd13e2)
            Stack Init bd412000 Current bd41169c Base bd412000 Limit bd40f000 Call 0
            Priority 10 BasePriority 8 PriorityDecrement 0 DecrementCount 0

     

            ChildEBP RetAddr  Args to Child
            bd4116b4 8042c413 8477f020 8477f1d8 8477f208 nt!KiSwapThread+0x1b1
            bd4116dc 80434085 8477f208 00000011 00000000 nt!KeWaitForSingleObject+0x1a3
            bd41170c 804c8113 e2602040 e2602040 bd411740 nt!LpcpRequestWaitReplyPort+0x52f
            bd411720 80526c43 e2602040 bd4118c0 bd411740 nt!LpcRequestWaitReplyPort+0x13
            bd4118a0 80526cf0 bd4118c0 e2602040 00000001 nt!DbgkpSendApiMessage+0x43
            bd411938 8042f24a bd411d10 00000001 00000000 nt!DbgkForwardException+0x78
            bd411cf4 80467735 bd411d10 00000000 bd411d64 nt!KiDispatchException+0x172
            bd411d5c 804676d0 00000000 00000000 00000000 nt!CommonDispatchException+0x4d
            bd411d64 00000000 00000000 00000000 00000000 nt!KiUnexpectedInterruptTail+0x1e1

     

    bd411cf4 80467735 nt!KiDispatchException(
                            struct _EXCEPTION_RECORD * ExceptionRecord = 0xbd411d10,
                            struct _KTRAP_FRAME * ExceptionFrame = 0x00000000,
                            struct _KTRAP_FRAME * TrapFrame = 0xbd411d64,
                            char PreviousMode = 1 '',
                            unsigned char FirstChance = 0x01 '')+0x172
     

     

    We can see this thread is hooked:

     

    0: kd> .trap bd411d64
    ErrCode = 00000000
    eax=0e84cb90 ebx=80000000 ecx=0e84cbc0 edx=0e84cba8 esi=00000000 edi=00000716
    eip=77f87cac esp=0e84cb50 ebp=7c576b4e iopl=0         nv up ei pl zr na po cy
    cs=001b  ss=0023  ds=0023  es=0023  fs=0038  gs=0000             efl=00000247
    ntdll!ZwCreateFile:
    001b:77f87cac e9dccb0108  jmp     7ffa488d

     

    Well, this isn’t the same code that usermode was showing us… recall this from the usermode debug at exactly the same address:

    0:014> t
    eax=20b9cb90 ebx=80000000 ecx=20b9cbc0 edx=20b9cba8 esi=00000000 edi=0000004b
    eip=77f87cac esp=20b9cb50 ebp=7c576b4e iopl=0         nv up ei pl zr na po cy
    cs=001b  ss=0023  ds=0023  es=0023  fs=0038  gs=0000             efl=00000247
    ntdll!ZwCreateFile:
    77f87cac b820000000       mov     eax,0x20

     

    So, this means Hacker Defender was also intercepting calls made by the debugger during the live debug to read process memory and showing what would be expected at that address, not what was really happening… In kernel mode we can clearly see the jmp to 7ffa488d that the thread was attempting to execute. This address is in the range of 7ffa0000-7ffa4000 that we determined was unreadable from usermode.
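
    As a sanity check, the hook bytes can be decoded by hand. The Python sketch below assumes only the standard x86 encoding of a near jump (opcode E9 followed by a signed 32-bit displacement relative to the next instruction); the function name is mine.

```python
import struct


def decode_jmp_rel32(address, code):
    """Return the target of a 5-byte E9 near jump located at address."""
    if len(code) != 5 or code[0] != 0xE9:
        raise ValueError("not a 5-byte E9 near jump")
    (rel,) = struct.unpack("<i", code[1:5])  # signed little-endian displacement
    # The displacement is relative to the address of the *next* instruction.
    return (address + 5 + rel) & 0xFFFFFFFF


# Bytes the kernel debugger showed at ntdll!ZwCreateFile.
target = decode_jmp_rel32(0x77F87CAC, bytes.fromhex("e9dccb0108"))
print(hex(target))

# The VAD entry covered pages 7ffa0 through 7ffa4.
print(0x7FFA0 <= (target >> 12) <= 0x7FFA4)
```

    The target works out to 7ffa488d, the same "weird" EIP value the usermode debugger showed after the single step, and it falls on the last page of the private EXECUTE_READWRITE region.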

     

    The jmp that actually occurred placed the thread’s instruction pointer (EIP) pointing into the region of memory that could not be accessed from usermode prior to transitioning to kernel mode and this is why the information store was crashing.  So it appears this rootkit has a bug …

     

    TOO FUNNY!!!!

  • The Effects of Message Looping

    I have seen this problem several times in the last few weeks and figured it was worth a quick blog entry.

     

    Here is what happens:

    1. A Unix host with a running daemon sends some sort of email message to a user reporting on the status of processing a request.
    2. The recipient of the message is on an Exchange Server.
    3. The recipient is removed from the Exchange organization.
    4. The automated mailing hits the Exchange Server and is NDR’ed back to the sender (e.g. root@host).
    5. The Unix host receives the NDR, but root is aliased or .forwards to a group or individual on Exchange that contains the original recipient from Step 1.
    6. Exchange NDRs the message back to the Unix host, which in turn sends the NDR back to the individual or group, and Exchange NDRs the message back to Unix again.
    7. For each NDR, the original message is encapsulated in the NDR.

     

    The reason this loop persists is that each message receives a new Internet Message ID, so it is considered a new message. With changes introduced in Exchange 2000 Post Service Pack 3 ( >= 6.0.6521) and in Exchange 2003 we set a maximum nested limit of 250 attachments. This limit can be adjusted using the registry entry outlined below. Eventually the nested depth of NDRs reaches 250 and Exchange begins NDR’ing the message because the attachment limit has been exceeded. At this point the message is ~3 MB in size. This loop continues indefinitely as each NDR received by the Unix host is forwarded to the recipient, which in turn is NDRed back to Unix.

     

    For each message generated by this daemon running on Unix, a separate looping message situation is created. So assuming 1 message is generated every 15 minutes over a 24 hour period, 96 messages are floating around the system, each 3 MB in size. By the end of the day roughly 300 MB of messages are continuously trying to be processed. This eventually leads to very poor performance from the server and growing STM files.
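
    The arithmetic is easy to rerun for your own daemon. This is a rough Python sketch using the numbers from the paragraph above as defaults; the function and parameter names are just for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def looping_backlog(interval_minutes=15, message_mb=3):
    """Return (looping messages after one day, total MB they occupy)."""
    messages = MINUTES_PER_DAY // interval_minutes
    return messages, messages * message_mb


messages, total_mb = looping_backlog()
print(messages, "looping messages,", total_mb, "MB")
```

    With the defaults that is 96 messages and 288 MB, which is where the "roughly 300 MB" figure comes from. A daemon that reports every 5 minutes would triple it.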

     

    This problem can be prevented by aliasing/.forwarding the Unix contacts to a Distribution List on Exchange that is configured not to NDR messages. This way individuals can come and go from the DL and any delivery failures do not go back to the host.

     

    To adjust the limits on number of attachments per message:

     

    1. Start Registry Editor.

     

    2. Locate and then click the following registry key:

     

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS\ParametersSystem\InternetContent

     

    3. On the "Edit" menu, click "Add Value".

     

    4. In the "Value Name" box, type "LimitBodyParts" (without the

       quotation marks).

     

    5. In the "Data Type" list, click "REG_DWORD", and then click "OK".

     

    6. In the "DWORD Editor" dialog box, click "Decimal". In the "Data"

       box, type the number of attachments that you want to set the limit to,

       and then click "OK".

    Note:

    When LimitBodyParts is set to a value of less than 250 or not present, a default value of 250 is used. The value can be set as high as 4294967295 (0xFFFFFFFF).
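
    To make the rule in the note concrete, here is a tiny Python sketch of how the effective limit falls out of the registry value. It only models the behavior described above; it does not read the registry, and the function name is hypothetical.

```python
DEFAULT_LIMIT = 250
MAX_LIMIT = 0xFFFFFFFF  # largest value a REG_DWORD can hold


def effective_body_part_limit(registry_value=None):
    """Return the limit the store would use for a given LimitBodyParts value.

    registry_value is None when the value is not present.
    """
    if registry_value is None or registry_value < DEFAULT_LIMIT:
        return DEFAULT_LIMIT
    return min(registry_value, MAX_LIMIT)


for value in (None, 100, 250, 1000):
    print(value, "->", effective_body_part_limit(value))
```

    In other words, the key can only raise the limit; setting it to something like 100 leaves you at 250.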

     

    To prevent the original message from being attached to a NDR, see the following article:

     

    http://support.microsoft.com/default.aspx?scid=kb;en-us;308303

     

     

     

    [EDIT: 9/21/04 - Added Note regarding acceptable values of LimitBodyParts regkey] 

  • Teaching the Skill of Reading Code

    One would think my last post related to this topic was an Oxidizing Agent in a Redox Reaction (yep, I started out as a pure science major) given the response by Larry and Eric. Hey, but that's a good thing... it gets people thinking.

    So a few sessions into this endeavor, here is what I think is working. People need to immediately tie back a concept they learn to everyday applications. Being able to immediately apply something learned helps retention of that which is presented. So instead of going weeks teaching abstract concepts and hoping participants will be able to put 2 and 2 together in the end, I am ensuring they can immediately put to use a concept that is discussed in class when they leave.

    Towards the end of every session, I randomly pick pieces of our code containing the concepts just presented and have students verbally break it down. Using the knowledge just obtained, they have begun to see how simple queries in the source where particular event log entries are made can help reinforce the traditional troubleshooting techniques in problem isolation. These are baby steps for those not from a coding background, but hopefully this is a catalyst so that those individuals can use source code to begin to research how some aspect of a large, complex product such as Exchange works and determine the appropriate next steps in resolving customer issues.

  • Using your Mailbox to Store Work-in-Progress documents

    Like most people, I work on a collection of work-in-progress documents. Over time I started losing track of where the latest version of a particular document was left... my desktop, tabletpc, laptop, or on my home machine.

    Then I remembered a relatively unused feature of Exchange+Outlook which is “Free Docs”. If you didn't know already, you can directly drag a document into a folder in your mailbox. Here you can leave the document while you edit and develop it over time. Coupling this with Outlook Local Cache Mode, you can keep multiple clients in sync with the latest version of a document regardless of where you are or which client you decide to use.

     

     

  • Chicken and the Egg... (aka. Read vs. Writing code)

    So I have been asked to deliver training for a group of people on how to “read” source code. I guess I should frame this request a bit. With a very large product such as Exchange there are millions of lines of code and no one knows everything about what is happening in source. Most developers are very focused on the pieces of the puzzle they own. As part of the Escalation Team for Exchange we have to “reverse engineer on the fly” so to speak to understand and develop steps towards resolving customer issues. This typically involves jumping quickly from one code base to another depending upon where the investigation takes us... A large portion of our time is spent simply reading source code, not writing it.

    So how do you teach people this “art” of digging deep very quickly into unfamiliar code that you had no hand in writing? I myself come from a very traditional process of learning how to code: by sitting down and writing it. I am struggling with how to tailor a delivery to focus on reading vs. writing source code. To me the only way you can be truly efficient in this process is by having written code yourself.

    Thoughts?

     

    ==== Update ====

    Great comments...
    So boy do I agree about good comments, but to me comments are really geared towards explaining a particular block of code at the implementation detail level. But how does one know where to begin looking for that particular block of code? I think this stems from great engineering documentation about the object model itself and how things relate at a high level.

    I guess to elaborate more on my intention, it's really to help individuals without a lot of understanding of C/C++/C# begin to understand how things fit together and how they can begin using source code to determine what to look for that is wrong in the customer environment. To me if you are attempting to read source with sparse comments, then you need to have some practical understanding of the language itself.

     

  • Store Background Processes Part I - IS Maintenance

    Overview

    IS Maintenance is the term applied to a series of operations performed by the Information Store to ensure logical consistency in the databases. Generally speaking the operations ensure items no longer needed by the system are removed from the database. If a custom schedule has not been defined for the database in Exchange System Manager, a default scheduled period of 12-5AM local time will be used.

     

    Overall, there are 10 tasks performed by the Information Store during this period:

    1. Purge Indices (public and mailbox stores)
    2. Tombstone Maintenance (public and mailbox stores)
    3. Dumpster Cleanup (public and mailbox stores)
    4. Public Folder Expiry (public stores)
    5. Age Folder Tombstone (public stores)
    6. Folder Conflict Aging (public stores)
    7. Update Server Versions (public stores)
    8. Cleanup Secure Folders (public stores)
    9. Cleanup Deleted Mailboxes (mailbox stores)
    10. Check Messages Table (public and mailbox stores)

     

    If the maintenance window ends before one of the ten tasks above is completed, the last one running is recorded. The last task will be allowed to run until completion, which could exceed the maintenance window. During the next maintenance window the information store determines the last task in progress and restarts the process, essentially picking up where it left off on the next task.
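
    The resume behavior is easier to see with a toy model. The Python sketch below is not how the store implements it; it only mimics the description above under two assumptions of mine: a task is started only while the window is still open, and a task that has started always runs to completion. The durations are made up.

```python
TASKS = [
    "Purge Indices", "Tombstone Maintenance", "Dumpster Cleanup",
    "Public Folder Expiry", "Age Folder Tombstone", "Folder Conflict Aging",
    "Update Server Versions", "Cleanup Secure Folders",
    "Cleanup Deleted Mailboxes", "Check Messages Table",
]


def run_window(start_index, durations, window_minutes):
    """Run tasks from start_index until the window closes.

    Returns (names of tasks run, index to start from in the next window).
    """
    elapsed = 0
    ran = []
    index = start_index
    while index < len(TASKS) and elapsed < window_minutes:
        ran.append(TASKS[index])
        elapsed += durations[TASKS[index]]  # may overrun the window
        index += 1
    # Wrap to the first task once the whole list has been processed.
    return ran, index % len(TASKS)


durations = {name: 70 for name in TASKS}          # hypothetical: 70 minutes each
first, resume_at = run_window(0, durations, 300)  # 5 hour window
print(first, "-> next window starts at", TASKS[resume_at])
```

    With 70 minute tasks and a 5 hour window, the fifth task starts at minute 280 and finishes at minute 350, past the end of the window, and the next window picks up at the sixth task.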

     

    The progress of the tasks above can be monitored by adjusting the Event Logging Level to minimum for the Exchange Information Store Mailbox/Public. When set to minimum, the following events are logged:

     

    Event Type:   Informational

    Event Source: MSExchangeIS Public / Private

    Event ID:     1208

    User:         N/A

    Description:

    Starting the IS Maintenance tasks.

    %n%nFor more information, click http://www.microsoft.com/contentredirect.asp.

     

    Event Type:   Informational

    Event Source: MSExchangeIS Public / Private

    Event ID:     1210

    User:         N/A

    Description:

    The IS Maintenance task <subtask name> completed.

    %n%nFor more information, click http://www.microsoft.com/contentredirect.asp.

     

    Event Type:   Informational

    Event Source: MSExchangeIS Public / Private

    Event ID:     1209

    User:         N/A

    Description:

    The IS Maintenance tasks completed.

    %n%nFor more information, click http://www.microsoft.com/contentredirect.asp.

     

     

     

    I want to delineate the 11th task performed, Online Defragmentation, which is not performed by the Information Store but by the underlying database engine (ESE).  At the completion of the above list of tasks, the Store process initiates Online Defragmentation by calling into ESE to tell it to start the defragmentation process. This process starts only after at least 1 of the 10 tasks above has completed successfully.

     

    The progress of online defragmentation can be monitored by reviewing the application log without adjusting the Event Logging levels.

     

    When the Information Store initiates ESE online defragmentation for a database, the following is logged:

    Event Type:   Informational

    Event Source: ESE

    Event ID:     700

    User:         N/A

    Description:

    %1 (%2) %3Online defragmentation is beginning a full pass on database '%4'.

    %n%nFor more information, click http://www.microsoft.com/contentredirect.asp.

     

    If a full pass of online defragmentation completes within the maintenance window the following is logged indicating it has completed.

    Event Type:   Informational

    Event Source: ESE

    Event ID:     701

    User:         N/A

    Description:

    %1 (%2) %3Online defragmentation has completed a full pass on database '%4'.

    %n%nFor more information, click http://www.microsoft.com/contentredirect.asp.

     

    If a pass of the online defragmentation was interrupted because the window expired or a backup was initiated, the following event is logged:

    Event Type:   Informational

    Event Source: ESE

    Event ID:     704

    User:         N/A

    Description:

    %1 (%2) %3Online defragmentation of database '%4' was interrupted and terminated. The next time online defragmentation is started on this database, it will resume from the point of interruption.

    %n%nFor more information, click http://www.microsoft.com/contentredirect.asp.

     

    If the previous online defragmentation cycle was interrupted before it completed a full pass of the database, the following is logged instead of event 700 when defragmentation starts again:

    Event Type:   Informational

    Event Source: ESE

    Event ID:     702

    User:         N/A

    Description:

    %1 (%2) %3Online defragmentation is resuming its pass on database '%4'.

    %n%nFor more information, click http://www.microsoft.com/contentredirect.asp.

     

    If a resumed pass of the database completes within the maintenance window the following is logged:

    Event Type:   Informational

    Event Source: ESE

    Event ID:     703

    User:         N/A

    Description:

    %1 (%2) %3Online defragmentation has completed the resumed pass on database '%4'.

    %n%nFor more information, click http://www.microsoft.com/contentredirect.asp.
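
    If you are scanning an application log for these, the five ESE event IDs above boil down to a small lookup. Here is a Python sketch; the summary strings are my own paraphrases of the event descriptions, and the list of IDs is assumed to be already pulled from the log in chronological order.

```python
DEFRAG_EVENTS = {
    700: "full pass started",
    701: "full pass completed",
    702: "interrupted pass resumed",
    703: "resumed pass completed",
    704: "pass interrupted",
}


def last_defrag_state(event_ids):
    """Return the meaning of the most recent defrag event, or None if there is none."""
    for event_id in reversed(event_ids):
        if event_id in DEFRAG_EVENTS:
            return DEFRAG_EVENTS[event_id]
    return None


# Hypothetical night: maintenance starts, defrag begins, a backup interrupts it.
print(last_defrag_state([1208, 1209, 700, 704]))
```

    A database that keeps ending the night on 704 and never reaches 701 or 703 is a hint that the maintenance window is too short or overlaps with backup.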

     

    Things to keep in mind…

    An important note to make about IS Maintenance and online defragmentation is the large number of changes made to the database.  To maintain recoverability of the database, each of the changes made by these tasks is also transacted in the transaction logs. Depending upon the number of changes made in the database during these tasks, a large number of log files can be generated during the maintenance window.

     

    Backup schedules should also be monitored so they do not overlap with the IS Maintenance schedule. Backup will not preempt the 10 initial tasks performed by the Information Store, therefore they will run simultaneously. However, when initiating a backup of a database, the online defragmentation process is suspended until the backup is completed.

     

    Below is a list of every action during Online Maintenance and the possible configuration operations to control their behavior.

     

    Purge Indices (public and mailbox stores)

    One of the benefits of the Extensible Storage Engine (ESE, also known as JET) is its ability to dynamically create indexes when information is resorted or requested in a different sort order. Unchecked, the Information Store database could end up with thousands of indexes on a table.

     

    The Information Store handles this situation by keeping track of indexes by assigning them an expiry time. Information about the index is added to an internal table referred to as the indexing aging table. During online maintenance the index aging table is walked and each entry’s value of ptagIndexDeleteTime is compared to the current time. If the index has expired, the index is removed. This process runs every 24 hours during the Information Store maintenance cycle and removes entries older than 8 days for Exchange 5.5 and 40 days for Exchange 2000/2003 by default.
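
    Conceptually the walk looks something like the Python sketch below. This is a simplified model, not store code: the aging table is a plain dict of index name to delete time, and I am assuming the delete time is stamped as "last use plus keep time", which the description above does not spell out.

```python
from datetime import datetime, timedelta

KEEP_TIME = timedelta(days=40)  # Exchange 2000/2003 default from the text


def index_delete_time(last_used):
    """Assumed rule: an index expires KEEP_TIME after it was last used."""
    return last_used + KEEP_TIME


def purge_expired(aging_table, now):
    """Remove expired entries from the aging table and return their names."""
    expired = [name for name, delete_time in aging_table.items() if delete_time <= now]
    for name in expired:
        del aging_table[name]
    return expired


table = {
    "by_subject": index_delete_time(datetime(2004, 7, 1)),   # expires Aug 10
    "by_received": index_delete_time(datetime(2004, 8, 15)), # expires Sep 24
}
print(purge_expired(table, now=datetime(2004, 9, 1)), sorted(table))
```

    A view that a client keeps using never reaches its delete time under this model, so only sort orders nobody has asked for recently get cleaned up.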

     

    Default Interval:               Every 24 hours

    Default Criteria:                 Remove indexes older than 8 days on Exchange 5.5 and older than 40 days on Exchange 2000/2003.

    Configurable Settings:

     

    DC=<Domain>

           CN=Configuration

                  CN=Services

                         CN=Microsoft Exchange

                               CN=<ExchOrgName>

                                      CN=Administrative Groups

                                             CN=<site>

                                                    CN=Servers

                                                           CN=<servername>

                                                                  CN=InformationStore

                                                                         CN=<Storage Group>

                                                                               CN=<Database>

     

    Object Class:              Database Object

    Attribute Name:            msExchAgingKeepTime

    Description:               The amount of time in milliseconds seconds to keep unused indexes and views in the database. This value applies to Exchange 2000/2003 and supersedes the registry values below.

     

     

     

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS\ParametersSystem

     

    Value Name:                Aging Keep Time

    Radix:                     REG_DWORD

    Description:               The amount of time in milliseconds seconds to keep unused indexes and views in the database. This value is honored by both Exchange 5.5 and 2000/2003.

     

    Value Name:                Aging Clean Interval

    Radix:                     REG_DWORD

    Description:               The interval (in milliseconds) at which the store attempts to purge unused indices. This value is honored by both Exchange 5.5 and 2000/2003.

     

    For Exchange 5.5:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \ParametersPublic

    - or –

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \ParametersPrivate

     

    For Exchange 2000/2003:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \<ServerName>\Public-<Public Folder GUID>

    -or-

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \<ServerName>\Private-<Public Folder GUID>

     

    Value Name:                Reset Views

    Radix:                     REG_DWORD

    Description:               Setting this value to 1 will cause the Store to remove all indexes regardless of age on the next cycle of IS maintenance for the corresponding database. The value is then set back to 0. By setting this value, clients may experience initial delays rebuilding previously cached views.

     

    Tombstone Maintenance (public and mailbox stores)

    Each folder maintains a list of deleted messages for a folder in ptagMidsetDeleted on the folder object itself. When a message is deleted from a folder, an entry is made in ptagMidsetTombstones. If this folder is replicated, this gives an indication of the messages delete operations that need to be propagated. When the propagation occurs, the entries are moved from ptagMidsetTombstones and merged into ptagMidsetDeleted. If a folder is not replicated, it is necessary to clean up this situation through the Information Store maintenance cycle. This process runs every 24 hours and is not configurable.

     

    Dumpster Cleanup (public and mailbox stores)

    When a message is deleted from a client a simple flag is set to indicate (ptagMsgDeleted) the message is deleted. By default a client will not display messages that have this flag set. In order to view these messages you use the dumpster functionality in the client. If the dumpster is not turned on, messages that have been deleted will be removed from the folder when this process runs during the maintenance cycle. If dumpster functionality has been enabled, only messages that have been deleted and have exceeded the retention period will be removed during this maintenance routine.

     

    This process also checks for folders that have exceeded the retention period and will remove them as well.
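
    The decision described above fits in a few lines. Here is a Python sketch of the rule as I have stated it; the function and parameter names are illustrative, and the ages are in days purely for readability.

```python
def should_purge(is_deleted, dumpster_enabled, days_since_delete, retention_days):
    """Decide whether dumpster cleanup removes an item during maintenance."""
    if not is_deleted:
        return False  # only items flagged as deleted are candidates
    if not dumpster_enabled:
        return True   # no dumpster: deleted items go on the next pass
    return days_since_delete > retention_days


print(should_purge(True, False, 0, 7))   # dumpster off: removed
print(should_purge(True, True, 3, 7))    # still inside retention: kept
print(should_purge(True, True, 10, 7))   # past retention: removed
```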

     

    Configurable Settings:

     

    For Exchange 5.5:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \ParametersPublic

    - or –

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \ParametersPrivate

     

    For Exchange 2000/2003:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \<ServerName>\Public-<Public Folder GUID>

    -or-

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \<ServerName>\Private-<Public Folder GUID>

     

    Value Name:                Deletion Thread Period

    Radix:                     REG_DWORD

    Description:               The interval (in seconds) at which the store attempts to purge unused indices. This value is honored by both Exchange 5.5 and 2000/2003.

     

    Public Folder Expiry (public stores)

    This is a maintenance process of traversing through all the messages in Public Folders and removing the ones that have exceeded the expiry time.

     

    Configurable Settings:

     

    For Exchange 5.5:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \ParametersPublic

     

    For Exchange 2000/2003:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \<ServerName>\Public-<Public Folder GUID>

     

    Value Name:                Replication Expiry

    Radix:                     REG_DWORD

    Description:               The interval (in milliseconds) at which the store removes items that have exceeded the retention period.

     

    Age Folder Tombstone (public stores)

    This maintenance process removes deleted public folders that have exceeded the tombstone lifetime (180 days by default). When a Public Folder is deleted, it is tombstoned for replication purposes. This allows for items that are deleted to be tracked during the replication process. This removes a maximum of 500 folders per 24 hour period.
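
    One consequence of the 500 folder cap is that a large backlog of tombstones drains over several nights. A quick Python sketch of the selection, using the defaults quoted above; the data structure (a dict of folder name to tombstone age in days) and the function name are my own.

```python
def folders_to_purge(tombstone_ages, age_limit_days=180, max_per_pass=500):
    """Pick the tombstoned folders one maintenance pass would remove."""
    expired = [name for name, age in tombstone_ages.items() if age > age_limit_days]
    return expired[:max_per_pass]  # anything beyond the cap waits for the next pass


# Hypothetical backlog: 1200 folders deleted 200 days ago, 10 deleted last week.
backlog = {"old-%d" % i: 200 for i in range(1200)}
backlog.update({"recent-%d" % i: 7 for i in range(10)})
print(len(folders_to_purge(backlog)))
```

    With 1200 expired tombstones it takes three passes to clear them all.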

     

    Configurable Settings:

     

    For Exchange 5.5:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \ParametersPublic

     

    For Exchange 2000/2003:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \<ServerName>\Public-<Public Folder GUID>

     

    Value Name:                Replication Folder Tombstone Age Limit

    Radix:                     REG_DWORD

    Description:               The interval (in days) to keep tombstones.

     

    Folder Conflict Aging (public stores)

    When two users have edited and modified the same message in a public folder, the resulting saved messages are defined as being in conflict.  This can happen when a simultaneous save occurs on the same public folder server, or if the message is edited on two servers that contain replicas of the folder.  If no action is taken by the owners of the Public Folder within the Conflict Age Limit, this process takes the appropriate action based upon the property PR_RESOLVE_METHOD on the folder itself. The resolution method is defined in the Exchange System Manager on the Public Folder object.

     

    Configurable Settings:

     

    For Exchange 5.5:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \ParametersPublic

     

    For Exchange 2000/2003:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \<ServerName>\Public-<Public Folder GUID>

     

    Value Name:                Replication Folder Conflict Age Limit

    Radix:                     REG_DWORD

    Description:               The interval (in days) items in conflict are allowed to exist. 

     

     

    Update Server Versions (public stores)

    This process simply updates the version information as necessary for any public folder databases containing a replica of the system configuration folder. No configurable settings are available.

     

    Cleanup Secure Folders (public stores)

    This process is only valid for Exchange 5.5. It attempts to age out folders that were secured in other sites when those sites are no longer present in the directory. Each folder is examined, and for each secure folder whose home site is not visible in the directory, a timer is set to remove the folder. The default is 7 days.

     

    Configurable Settings:

     

    For Exchange 5.5:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \ParametersPublic

     

    Value Name:                Secure Folder Aging Task Frequency

    Radix:                     REG_DWORD

    Description:               The interval (in days)

     

    Site Folder Check (public stores)

    Every hour, a public database checks for duplicate site folders within an Administrative Group and removes any it finds. No configurable settings are available for this process.

     

    Cleanup Deleted Mailboxes (mailbox stores)

    This process cycles through mailboxes that currently do not have Directory Service objects. After a default of 30 days, the mailboxes are deleted. This process runs every 24 hours and no interval adjustments are available.

     

    Configurable Settings:

     

    DC=<Domain>

           CN=Configuration

                  CN=Services

                         CN=Microsoft Exchange

                               CN=<ExchOrgName>

                                      CN=Administrative Groups

                                             CN=<site>

                                                    CN=Servers

                                                           CN=<servername>

                                                                  CN=InformationStore

                                                                         CN=<Storage Group>

                                                                               CN=<Database>

    Object Class:              Database Object

    Attribute Name:            msExchMailboxRetentionPeriod

    Description:               The amount of time (in seconds) to retain mailboxes without a corresponding Directory Object.
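
    Because the retention attribute is stored in seconds while the default is quoted in days, a quick conversion helps when checking or setting the value. The helper names below are hypothetical and exist only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def days_to_retention_seconds(days):
    """Convert a retention period in days to the seconds value the attribute stores."""
    return days * SECONDS_PER_DAY


def retention_seconds_to_days(seconds):
    """Convert a stored seconds value back to days (may be fractional)."""
    return seconds / SECONDS_PER_DAY
```

    For the 30-day default, days_to_retention_seconds(30) gives 2,592,000 seconds.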

     

     

     

    Check Messages Table

    This process traverses the message table for a given database and looks for messages with a reference count of 0, which indicates that no folder currently references the message. Any such messages are deleted. A maximum of 50,000 messages will be deleted in a given cycle. This process runs every 24 hours and no interval adjustment is available.
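
    As a rough model of this sweep, assume the message table is a plain Python dict that maps a message ID to its reference count. The dict layout and the sweep_unreferenced name are assumptions made for this sketch; the real table lives inside the store and is not exposed this way.

```python
MAX_DELETED_PER_CYCLE = 50_000  # per-cycle cap from the text


def sweep_unreferenced(ref_counts, max_deleted=MAX_DELETED_PER_CYCLE):
    """Delete entries with a reference count of 0, up to the cap.

    Mutates `ref_counts` in place and returns the list of deleted IDs.
    """
    deleted = []
    # Iterate over a snapshot so entries can be removed during the walk.
    for msg_id, count in list(ref_counts.items()):
        if len(deleted) >= max_deleted:
            break  # cap reached; remaining orphans wait for the next cycle
        if count == 0:
            del ref_counts[msg_id]
            deleted.append(msg_id)
    return deleted
```

    Messages that are still orphaned when the cap is hit remain in the table and are candidates again in the next 24-hour cycle.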

     

    Online Defragmentation

    If at least one of the maintenance tasks was able to complete within the maintenance schedule, the Information Store requests that ESE begin an online defragmentation cycle and complete a single full pass of the database. This process will run for at least 15 minutes at the end of the maintenance window.

     

    The intention of online defragmentation is to free up pages in the database by compacting records onto the fewest number of pages possible, thus reducing the amount of I/O necessary. The ESE database engine does this by walking the database metadata (information in the database that describes its tables) and, for each table, visiting each page and attempting to move records onto logically ordered pages. If you have an understanding of B-trees, we start at the page furthest to the right and compress records toward the leftmost page. This does not necessarily mean the pages are in order, but the movement is from a logical perspective.
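
    To make the goal of compaction concrete, here is a toy model in Python. It treats a table as a list of pages in logical order, each page a list of records, and repacks the records onto as few pages as possible while keeping their logical order. The compact_pages name and the fixed per-page capacity are assumptions for this sketch, and it does not attempt to reproduce the traversal direction or page handling ESE actually uses.

```python
def compact_pages(pages, capacity):
    """Repack records onto the fewest pages, preserving logical order.

    `pages` is a list of lists in logical order; `capacity` is the
    maximum number of records a page can hold in this toy model.
    """
    compacted = []
    current = []
    for page in pages:
        for record in page:
            if len(current) == capacity:
                # Current page is full; start filling a new one.
                compacted.append(current)
                current = []
            current.append(record)
    if current:
        compacted.append(current)
    return compacted
```

    With a capacity of 4, the sparse layout [[1, 2], [5], [], [8, 9]] is packed into [[1, 2, 5, 8], [9]], so a scan of the table touches two pages instead of four.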

     

    It is possible that this process does not complete before the time indicated by the Information Store. In that case, the ESE engine stores the last table successfully processed. At the next Online Defragmentation request, ESE picks up at the next table in the metadata and continues the work until all tables have been processed or the maintenance window ends. If Online Defragmentation is able to process all tables before the maintenance window ends, it will restart at the first table listed in the metadata and continue processing until all tables are processed or the maintenance window ends.
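
    The resume behavior can be sketched as a small loop. In this model the maintenance window is approximated by a budget, meaning the number of tables that can be processed before time runs out. The run_online_defrag name, the budget parameter, and the use of a list index as the stored position are all assumptions for illustration.

```python
def run_online_defrag(tables, resume_index, budget):
    """Process tables starting at `resume_index` until the budget runs out.

    Returns (processed, next_resume_index). If the end of the table list
    is reached with budget left, processing restarts at the first table
    and stops once that fresh pass covers every table.
    """
    if not tables:
        return [], 0
    processed = []
    index = resume_index
    # Starting at index 0 already counts as a fresh pass.
    restarted = resume_index == 0
    while budget > 0:
        if index == len(tables):
            if restarted:
                break  # full pass from the first table is complete
            index, restarted = 0, True
            continue
        processed.append(tables[index])
        index += 1
        budget -= 1
    # Wrap the stored position so the next request starts at a valid table.
    return processed, index % len(tables)
```

    For example, with tables A through D, a stored position at C, and a budget of three tables, the model processes C, D, and A, and the next request resumes at B.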

     

    It is important to point out, however, that if IS maintenance is set to always and one of the above tasks completes (note that some only run once every 24 hours), Online Defragmentation could be invoked.

     

    Configurable Settings:

     

    For Exchange 5.5:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \ParametersPublic

     

    For Exchange 2000/2003:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \<ServerName>\Public-<Public Folder GUID>

    -or-

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS

    \<ServerName>\Private-<Mailbox Store GUID>

     

    Value Name:                OLD Minimum RunTime

    Radix:                     REG_DWORD

    Description:               Amount of time (in minutes) to run after completing at least one of the IS maintenance tasks.

                               Default value of 15 minutes

     

    Value Name:                OLD Completion Time

    Radix:                     REG_DWORD

    Description:               The amount of time (in seconds) beyond the maintenance window Online Defragmentation will be allowed to run.

                               Default value of 1 hour (3600 seconds)
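
    Note that the two values use different units: minutes for OLD Minimum RunTime and seconds for OLD Completion Time. The sketch below shows one reading of how they bound a defragmentation run, based only on the descriptions above. The function names are hypothetical, and how the store combines the two values internally is not covered here.

```python
from datetime import datetime, timedelta

OLD_MINIMUM_RUNTIME_MINUTES = 15     # default from the text (minutes)
OLD_COMPLETION_TIME_SECONDS = 3600   # default from the text (seconds)


def earliest_stop(defrag_start,
                  minimum_runtime_minutes=OLD_MINIMUM_RUNTIME_MINUTES):
    """Earliest time defragmentation would stop, given its minimum run time."""
    return defrag_start + timedelta(minutes=minimum_runtime_minutes)


def latest_stop(window_end,
                completion_time_seconds=OLD_COMPLETION_TIME_SECONDS):
    """Latest time defragmentation may run past the maintenance window."""
    return window_end + timedelta(seconds=completion_time_seconds)
```

    With the defaults, a window that ends at 5:00 AM allows defragmentation to continue until 6:00 AM at the latest.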

     

    EDITS:
    9/9/04 - Adjusted units for Aging Keep Time from milliseconds to seconds.

    9/28/04 - Clarified Online Defragmentation process

    9/30/05 - Changed "Deletion Thread Period" milliseconds to seconds

     

