So, there were two major groups of comments on the last post, and I'll try to address them.

    The first was a question about managed support for ETW.  I talked to the ETW team, and the current state is that there is no official managed interface for ETW.  Being a standard Win32 API, it is posisble to PInvoke the functions involved, and several internal teams have written their own managed wrappers around ETW.  This isn't expected to change up through Whidbey (Visual Studio 2005); for Orcas (the version of VS after that!) an official managed interface is on the table.  The second is one of backwards compatibility -- ETW is only available in Win2K and later OSes.  Users will expect that software works similarly on all OSes; thus, if you want to support 9x-era OSes, you have to write your own logging code anyways.  So, if you are already putting in old-style logging, why use ETW?  I'll try to answer that with this entry.

    One of the gains of ETW is that it's fast; you can spit out thousands of events per second while using relatively little CPU, far faster than you can fprintf() a string to disk.  The biggest gain, though, is combining multiple sources -- including information outside your own process.  And the most notable external source is the kernel.

    The XP and Codename Longhorn kernels are extremely extensive providers, and can be enabled to log any or all of these to a log, and we publish decoding information for:
  • Hardware configuration events -- notes on the system's CPUs, hard drives, NICs, video card, and ACPI power states
  • Disk-level I/O -- every I/O on the system, including IRP flags, operation time in microseconds, number of bytes, and target disk
  • File-level I/O -- every access to every file on the system (including information to tie it to the disk I/Os above)
  • Image layouts -- filenames, locations in memory, and PIDs for every image in the system
  • Page faults -- pointers to instructions and pages whenever a fault occurs (including COWs, demand-zero faults, hard page faults, transition faults, and guard pages)
  • Network I/O -- all TCP and UDP actions, including connects/accepts, transmits (and retransmits!), recieves, etc.
  • Registry I/O -- all Registry key/value  creation/deletions/changes, registry flushes, etc.
  • Process and thread info -- all creations/deletions of processes and threads
    And Codename Longhorn adds even more events -- most notably, the ability to trace extremely fine-grained high-frequency events such as individual context switches, interrupts (ISRs and DPCs), etc.  

    (FYI, since all of the above information is exceedingly detailed, you can only enable the kernel provider if you have Administrator privileges, are part of the Performance Log Users group, or a service running as LocalSystem, LocalService, or NetworkService.)

    Thus, ETW can be used as an effective debugging tool.  By allowing ETW to pull from, sort, and combine events from multiple providers, you can get a powerful log of everything the system was doing, probably the most accurate log available (save for running the entire OS in a debugger).  It's an incredible tool for noticing "hey, things act strangely when X, Y, and Z, but not W, are happening" at a system level, as well as a code level, and it takes far less time than getting symbols and attaching a debugger/profiler to the system.  And it's all available to devs -- and even to users, given a generic ETW tool such as tracelog in Server 2003!

    Even if you don't specifically use the kernel as an information source, ETW's ability to combine providers is useful for mixing and matching information from multiple DLLs, EXEs, etc. in a system.  ETW events are timestamped by the kernel to extremely high resolution (RDTSC on stable machines, converted to microsecond intervals; MM timers on others) and are automatically sorted at process time, so you don't have to write or parse plaintext date/time formats.

    The goal I'm driving at is that when you have more than one DLL or EXE providing information, individually implementing logging for each component means that you usually need a third app to read in the logs from each component and combine them into a single log with coherent event ordering, and this can be difficult -- especially if you have to tie it to some event.  ETW allows you to automate all that, and it's exceedingly efficient at it as well.  Even if you are only personally maintaining one component, ETW can log very quickly, and it can be shipped in retail builds -- and if you publish the structures of some or all of the events you provide, you can give valuable information to your consumers and to future devs without ever needing to work with them.

    Next entry, I'll start discussing how providers are written, starting with thread structures and common ways of publishing event structs.