June, 2006

Larry Osterman's WebLog

Confessions of an Old Fogey

    Critical Driver or Cargo Cult Programming?


    I've been self-hosting Vista on my laptop since sometime in January.  Every Monday morning, without fail, I've installed the latest build available from the "main" Windows branch and tried it.

    There have been good builds and bad builds - the first few were pretty painful, but everything since sometime in March has been wonderfully smooth.

    But sometime late in May, things changed for the worse.  Weekly builds installed just fine on my main development machine, but my laptop would get about three quarters of the way through the install and then stop after a reboot, complaining about a problem with the critical system driver <driver>.sys.

    Of course, I filed a bug on the problem and moved on - every week I'd update my laptop and it'd fail.  While I was away on vacation, the guys looking into the bug finally figured out what was happening. 

    The first part of the problem was easy - something was causing <driver>.sys to fail to load (we don't know what).  But that didn't explain the unbootable system.

    Well, <driver>.sys is the modem driver for my laptop.  Eventually one of the setup devs figured out the root cause.  For some totally unknown reason, the driver's INF has the following lines:

    [DDInstall.Services]
    AddService=<driver>_Service_Inst

    [<driver>_Service_Inst]
    StartType=0        ; 0 = SERVICE_BOOT_START: load at boot, before anything else

    If you go to MSDN and look up DDInstall.Services, you get the reference page for that section.

    If you follow the documentation a bit further, you find the documentation for the service-install section, which describes the StartType key - it's the same as the start type for Windows services.

    In particular, you find:

    StartType=start-code
    Specifies when to start the driver as one of the following numerical values, expressed either in decimal or, as shown here, in hexadecimal notation.
    0x0 (SERVICE_BOOT_START)
    Indicates a driver started by the operating system loader.

    This value must be used for drivers of devices required for loading the operating system.

    0x1 (SERVICE_SYSTEM_START)
    Indicates a driver started during operating system initialization.

    This value should be used by PnP drivers that do device detection during initialization but are not required to load the system.

    For example, a PnP driver that also can detect a legacy device should specify this value in its INF so that its DriverEntry routine will be called to find the legacy device, even if that device cannot be enumerated by the PnP manager.

    0x2 (SERVICE_AUTO_START)
    Indicates a driver started by the service control manager during system startup.

    This value should never be used in the INF files for WDM or PnP device drivers.

    0x3 (SERVICE_DEMAND_START)
    Indicates a driver started on demand, either by the PnP manager when the corresponding device is enumerated or possibly by the service control manager in response to an explicit user demand for a non-PnP device.

    This value should be used in the INF files for all WDM drivers of devices that are not required to load the system and for all PnP device drivers that are neither required to load the system nor engaged in device detection.

    0x4 (SERVICE_DISABLED)
    Indicates a driver that cannot be started.

    This value can be used to temporarily disable the driver services for a device, but a device/driver cannot be installed if this value is specified in the service-install section of its INF file.

    So in this case, the authors of the modem driver decided that their driver was a boot-time critical driver - which, as the documentation clearly states, is only intended for drivers required to load the operating system.
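
    Going by the documentation above, a modem is exactly the kind of device that is neither required to load the system nor engaged in device detection, so the service-install section should presumably have read something like this (my sketch, reusing the placeholder section name from the snippet above):

    [<driver>_Service_Inst]
    StartType=3        ; 3 = SERVICE_DEMAND_START: the PnP manager loads the driver when the modem is enumerated

    With StartType=3, a driver that fails to load costs you a modem.  With StartType=0, it costs you the boot.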

    So I'll leave it up to you to decide - is this an example of cargo cult programming, or did the authors of this modem driver REALLY think that their driver was a critical system driver?

    What makes things worse is that this is a 3rd party driver - we believe that their INF is in error, but we can't fix it because it's owned by the 3rd party.  Our only choice is to "baddriver" it and prevent Vista from loading that particular driver.  The modem chip in question hasn't been made for many, many years, and the vendor for that chip has absolutely no interest in supporting it on Vista, so we can't get it fixed (the laptop is old enough that it's out of OEM support, so there's no joy from that corner either - nobody wants to support this hardware anymore).

    Please note: This is NOT an invitation for a "If only the drivers were open source, then you could just fix it" discussion in the comments thread.  The vendor for the modem driver owns the rights to their driver, they get to choose whether or not they want to support it, not Microsoft.



    Best Practices for DllMain


    DllMain keeps popping up as a pain point for developers.  Heck, just yesterday we found an old bug where a component was calling CoFreeUnusedLibraries in its DllMain function.


    So it was really cool that this morning I came in and discovered an email (from the PM who owns the DLL loader) about a newly posted article with best practices for authoring DLLs (in particular DllMain).


    You can find the article here.
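
    To make the pain point concrete, here's a minimal sketch of the kind of DllMain the guidance pushes you toward - do as little as possible under the loader lock, and never call things like CoFreeUnusedLibraries that can load, unload, or wait on other DLLs.  (The code is my illustration, not taken from the article.)

    #include <windows.h>

    BOOL WINAPI DllMain(HINSTANCE hinstDll, DWORD dwReason, LPVOID lpvReserved)
    {
        switch (dwReason)
        {
        case DLL_PROCESS_ATTACH:
            // We don't need per-thread notifications; skipping them means
            // fewer trips through DllMain while the loader lock is held.
            DisableThreadLibraryCalls(hinstDll);
            break;

        case DLL_PROCESS_DETACH:
            // A non-NULL lpvReserved means the process is terminating -
            // don't bother cleaning up, the OS is about to reclaim everything.
            if (lpvReserved == NULL)
            {
                // Free only what this DLL allocated itself.  No LoadLibrary,
                // no FreeLibrary, no COM, no waiting on other threads.
            }
            break;
        }
        return TRUE;
    }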


    Outta Here Again


    I'm off to Europe for a couple of weeks of vacation with the extended family; I'll be back in early July.


    Have fun, y'all.


    Making things go fast on a network, part 3


    Now I want to pop the stack up a bit and talk about messages.  At their heart, connection-oriented protocols are about establishing pipes between senders and receivers - they don't NEED to have semantics beyond that.

    But applications are different.  They typically fall into some stereotypical use patterns, the most common of which is the client/server usage pattern.  A client sends data to the server, and the server responds.  The roles of sender and receiver alternate between the client and the server.

    In the client/server pattern, the client sends a "message" to the server, and the server responds with its own "message".  A "message" may be formed from multiple packets (multiple send() calls), but it doesn't have to be.

    Every "message" is self describing - there needs to be a mechanism that allows the server to know it's received all the data that the client sent.  That mechanism may be a length prepended to each message, it might be a magic "terminating sequence" (CR/LF is a very common magic terminator).  Often the semantics of a message are defined by the API being used to send the data - for example, the NetBIOS networking API includes a length of the data being sent, the receiver of the message is guaranteed to receive a block of the same length that was sent, regardless of fragmentation (RFC1001/RFC1002 define how NetBIOS API semantics are implemented over TCP/IP if anyone cares).  In other cases, the semantics of the message are defined by the protocol being implemented.  For example, POP3, IMAP4, NNTP and SMTP define their messages as CR/LF delimited strings, while LDAP uses ASN.1's semantics for defining messages.

    But however a message is defined, there is still a request/response semantic associated with client/server protocols.

    Here's what happens on the wire when you have a client/server interaction at the application level:

    Client: Send message request "A"
    Server: Send message with response "A"

    But as we discussed earlier, that's not REALLY what happens.  Each of those messages is a packet, and that means there has to be an acknowledgment between the two.

    Client: Send message request "A"
    Server: Acknowledge receipt of request "A"
    Server: Send message with response "A"
    Client: Acknowledge receipt of response "A"

    Now it gets REALLY interesting when you string lots of client/server request/response sequences together:

    Client: Send message request "A"
    Server: Acknowledge receipt of request "A"
    Server: Send message with response "A"
    Client: Acknowledge receipt of response "A"
    Client: Send message request "B"
    Server: Acknowledge receipt of request "B"
    Server: Send message with response "B"
    Client: Acknowledge receipt of response "B"
    Client: Send message request "C"
    Server: Acknowledge receipt of request "C"
    Server: Send message with response "C"
    Client: Acknowledge receipt of response "C"
    Etc.

    Remember that on local area networks, the time to send a given packet is the same, regardless of the payload size.  It would be unbelievably cool if there were a way of combining the acknowledgement with the response to an operation - that would double your bandwidth, since you'd be sending half the number of packets.

    And it turns out that several people independently came to the same solution.  In the NetBEUI protocol, the feature's called "piggy-back acks"; in TCP, the same trick falls out of the delayed acknowledgement mechanism working together with the Nagle algorithm (named after John Nagle, who invented it).  When you turn on piggy-back acks (or nagling), the sequence above becomes:

    Client: Send message request "A"
    Server: Acknowledge receipt of request "A" and send message with response "A"
    Client: Acknowledge receipt of response "A" and send message request "B"
    Server: Acknowledge receipt of request "B" and send message with response "B"
    Client: Acknowledge receipt of response "B" and send message request "C"
    Server: Acknowledge receipt of request "C" and send message with response "C"
    Client: Acknowledge receipt of response "C"
    Etc.

    It halves the number of frames.  But there's a tricky bit here - it only works if the application is going to send a response to the client; if not, the transport still has to send the acknowledgement.  And that's where things get "interesting".  In order to give the server time to send the response to the client, the transport holds off on sending the ack for a short amount of time (somewhere around 200 ms, typically).  If the server responds to the client within 200 milliseconds, everything's copacetic.  If the server doesn't respond in time, the receiver sends a bare acknowledgement, and all you've lost is a little latency.

    But what happens if you DON'T use the request/response pattern?  What happens if your protocol involves multiple messages from the client to the server? 

    This isn't as silly an idea as you might think - for example, in CIFS, the client and server negotiate a common message size: the client is prohibited from sending a block larger than the server's buffer size, and the server is prohibited from sending a block larger than the client's buffer size.  If a CIFS client needs to send tons of data to the server, it makes sense to break the request up into server-buffer-sized chunks and shotgun them to the server - the client issues async sends for all the requests and waits on the transport to deliver them as best it can.

    Client: Send message "A.1"
    Client: Send message "A.2"
    Client: Send message "A.3"
    Server: Respond to A.1..A.3

    On the surface, it seems like a great idea, except for the fact that (as I mentioned in my first article) the sender can't start sending message A.2 before it's gotten the acknowledgment for A.1.

    Client: Send message request "A.1"
    Server: Acknowledge receipt of request "A.1"
    Client: Send message request "A.2"
    Server: Acknowledge receipt of request "A.2"
    Client: Send message request "A.3"
    Server: Acknowledge receipt of request "A.3"
    Server: Send response to A.1..A.3
    Client: Acknowledge receipt of response to A.1..A.3
    Etc.

    But if nagling is involved, things get REALLY ugly:

    Client: Send message request "A.1"
    Server: Wait up to 200 ms for the application to respond
    Server: Acknowledge receipt of request "A.1"
    Client: Send message request "A.2"
    Server: Wait up to 200 ms for the application to respond
    Server: Acknowledge receipt of request "A.2"
    Client: Send message request "A.3"
    Server: Acknowledge receipt of request "A.3" and send response to A.1..A.3
    Client: Acknowledge receipt of response to A.1..A.3
    Etc.

    All of a sudden, the performance "optimization" of nagling has totally killed your performance.

    This leads to Larry's "Making things go fast on the network" rule number 2:

    You can't design your application protocol in a vacuum.  You need to understand how the layers below your application work before you deploy it.

    In this case, the designers of TCP/IP realized that the Nagle algorithm could cause more harm than good in certain circumstances, so they built in an opt-out.  The opt-out has two forms: first, nagling only delays "small" sends - writes big enough to fill full-sized segments go out immediately.  Second, you can disable nagling on an individual socket basis by calling setsockopt(socket, IPPROTO_TCP, TCP_NODELAY, ...).  But your better choice is to understand how the underlying network works and design your application protocol to match.  Yes, that's a leaky abstraction, but this article is about making things faster - when you're trying to make things fast, you have to understand all the layers.
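
    For reference, here's what that per-socket opt-out looks like in Winsock - a minimal sketch, with a helper name of my own:

    #include <winsock2.h>    // link with ws2_32.lib

    // Turn off the Nagle algorithm for one socket.  Small sends go out
    // immediately instead of being held while acks are outstanding.
    // Returns TRUE on success.
    BOOL DisableNagling(SOCKET s)
    {
        BOOL fNoDelay = TRUE;
        return setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
                          (const char *)&fNoDelay, sizeof(fNoDelay)) == 0;
    }

    As rule number 2 says, though, reach for this only once you understand why your protocol and the transport are fighting each other.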

    The CIFS protocol has a couple of protocol elements that use the "send multiple messages" pattern mentioned above.  When we were doing NT 3.1, it became clear that the performance hit associated with nagling/piggyback acks was killing us - I wrote about that two years ago here (I thought I'd written this stuff out before).


    He's Back (for a little bit)


    He's back (sort-of).

    I've been back from vacation for two weeks, but I've been a bad blogger: Larry->slaps_wrist(Larry).  What's worse is that I'm going to be gone for two more weeks starting next Monday (part of the reason I've not gotten back into the saddle).

    On the other hand, the trip to DC/New York was awesome - I haven't had that much fun in a really long time.

    We started in Williamsburg/Jamestown, and everyone was bright and chipper.  By the time the group had pulled into New York, things were somewhat different - but not too tired to have a bit of fun.

    I've still got somewhere around 1500 photos to go through, and I've got to finish them before I leave on my next vacation next week :).


    At the end of the trip, Daniel and I met James Blunt's band on the train from New York to DC (they were playing at the WBOS Earthfest in Boston).  We didn't meet Mr. Blunt himself, but we did meet several members of his band - they were really nice guys, especially given that they'd been on the road for five months at that point.  Very cool, in a sort-of Scobleizer kinda way (I don't usually meet interesting people on trips).


    Anyway, time to get back into the blogging saddle again.  At least for a little while.  I've not forgotten the going faster articles :)
