Understanding Storage Timeouts and Event 129 Errors


Greetings fellow debuggers, today I will be blogging about Event ID 129 messages.  These warning events are logged to the system event log with the storage adapter (HBA) driver’s name as the source.  Windows’ STORPORT.SYS driver logs this message when it detects that a request has timed out.  The HBA driver’s name is used in the event because it is the miniport associated with STORPORT.

 

Below is an example 129 event:

 

Event Type:       Warning

Event Source:     <HBA_Name>

Event Category:   None

Event ID:         129

Date:             4/9/2009

Time:             1:15:45 AM

User:             N/A

Computer:         <Computer_Name>

Description:

Reset to device, \Device\RaidPort1, was issued.

 

So what does this mean?  Let’s discuss the Windows I/O stack architecture to answer this.

 

Windows I/O uses a layered architecture where device drivers are on a “device stack.”  In a basic model, the top of the stack is the file system.  Next is the volume manager, followed by the disk driver.  At the bottom of the device stack are the port and miniport drivers.  When an I/O request reaches the file system, the file system takes the block number of the file and translates it to an offset in the volume.  The volume manager then translates the volume offset to a block number on the disk and passes the request to the disk driver.  When the request reaches the disk driver, it creates a Command Descriptor Block (CDB) that will be sent to the SCSI device.  The disk driver embeds the CDB into a structure called the SCSI_REQUEST_BLOCK (SRB).  This SRB is sent to the port driver as part of the I/O request packet (IRP).
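
To make the translation steps above more concrete, here is a minimal Python sketch.  It is only an illustration, not how the Windows drivers are implemented.  The function names, the 4096-byte file system block, the 512-byte disk block, the contiguous file layout, and the volume starting at disk block 2048 are all assumptions made for the example.  The READ(10) CDB layout (opcode 0x28, 4-byte big-endian block address, 2-byte big-endian transfer length) follows the standard SCSI format.

```python
import struct

SECTOR_SIZE = 512     # assumed bytes per disk block
CLUSTER_SIZE = 4096   # assumed bytes per file system block

def file_block_to_volume_offset(file_start_cluster, file_block):
    """File system step: block within the file -> byte offset in the volume.
    Assumes the file is stored contiguously (hypothetical simplification)."""
    return (file_start_cluster + file_block) * CLUSTER_SIZE

def volume_offset_to_disk_block(volume_offset, volume_start_block):
    """Volume manager step: volume byte offset -> block number on the disk."""
    return volume_start_block + volume_offset // SECTOR_SIZE

def build_read10_cdb(disk_block, block_count):
    """Disk driver step: build a 10-byte SCSI READ(10) CDB.
    Fields: opcode, flags, block address, group number, transfer length, control."""
    return struct.pack(">BBIBHB", 0x28, 0, disk_block, 0, block_count, 0)

# Read block 2 of a file that starts at file system block 100,
# on a volume that begins at disk block 2048.
volume_offset = file_block_to_volume_offset(file_start_cluster=100, file_block=2)
disk_block = volume_offset_to_disk_block(volume_offset, volume_start_block=2048)
cdb = build_read10_cdb(disk_block, CLUSTER_SIZE // SECTOR_SIZE)

print(volume_offset, disk_block, cdb.hex())
# prints: 417792 2864 280000000b3000000800
```

In the real stack the CDB would then be placed inside an SRB and handed to the port driver with the IRP; the sketch stops at the CDB.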

 

The port driver does most of the request processing.  There are different port drivers depending on the architecture.  For example, ATAPORT.SYS is the port driver for ATA devices, and STORPORT.SYS is the port driver for SCSI devices.  Some of the responsibilities for a port driver are:

  • Providing timing services for requests.
  • Enforcing queue depth (making sure that a device does not get more requests than it can handle).
  • Building scatter gather arrays for data buffers.
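
As a rough illustration of the queue depth responsibility, the sketch below holds requests back once a device has as many outstanding requests as it can handle.  The class name LunQueue and its methods are hypothetical and are not STORPORT internals; the sketch only shows the general idea.

```python
from collections import deque

class LunQueue:
    """Hypothetical per-device queue that enforces a maximum queue depth."""

    def __init__(self, queue_depth):
        self.queue_depth = queue_depth
        self.outstanding = []    # requests already handed to the device
        self.waiting = deque()   # requests held back until a slot frees up

    def submit(self, request):
        # Send the request only if the device has room; otherwise hold it.
        if len(self.outstanding) < self.queue_depth:
            self.outstanding.append(request)
        else:
            self.waiting.append(request)

    def complete(self, request):
        # A completion frees a slot, so the oldest waiting request is sent.
        self.outstanding.remove(request)
        if self.waiting:
            self.outstanding.append(self.waiting.popleft())

lun = LunQueue(queue_depth=2)
for name in ("A", "B", "C"):
    lun.submit(name)
print(lun.outstanding, list(lun.waiting))   # prints: ['A', 'B'] ['C']

lun.complete("A")
print(lun.outstanding, list(lun.waiting))   # prints: ['B', 'C'] []
```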

The port driver interfaces with a driver called the “miniport”.  The miniport driver is designed by the hardware manufacturer to work with a specific adapter and is responsible for taking requests from the port driver and sending them to the target LUN.  The port driver calls the HwStorStartIo() function to send requests to the miniport, and the miniport will send the requests to the HBA so they can be sent over the physical medium (fibre, ethernet, etc) to the LUN.  When the request is complete, the miniport will call StorPortNotification() with the NotificationType parameter value set to RequestComplete, along with a pointer to the completed SRB.

 

When a request is sent to the miniport, STORPORT will put the request in a pending queue.  When the request is completed, it is removed from this queue.  While requests are in the pending queue they are timed. 

 

The timing mechanism is simple.  There is one timer per logical unit and it is initialized to -1.  When the first request is sent to the miniport, the timer is set to the timeout value in the SRB.  The disk timeout value is a tunable parameter in the registry at: HKLM\System\CurrentControlSet\Services\Disk\TimeOutValue.  Some vendors will tune this value to best match their hardware.  We do not advise changing this value without guidance from your storage vendor.

 

The timer is decremented once per second.  When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue.  So, as long as requests complete the timer will never go to zero.  If the timer does go to zero, it means the device has stopped responding.  That is when the STORPORT driver logs the Event ID 129 error.  STORPORT then has to take corrective action by trying to reset the unit.  When the unit is reset, all outstanding requests are completed with an error and they are retried.  When the pending queue empties, the timer is set to -1 which is its initial value.


Each SRB has a timer value set.  As requests are completed the queue timer is refreshed with the timeout value of the SRB at the head of the list.
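
The timer behavior described above can be modeled in a few lines of Python.  This is a simplified model based only on the description in this article, not STORPORT's actual code.  The class name LunTimer, its methods, and the 3-second timeout used in the example are assumptions chosen to keep the run short.

```python
from collections import deque

TIMER_IDLE = -1   # initial value, and the value when nothing is pending

class LunTimer:
    """Hypothetical model of the per-logical-unit timer and pending queue."""

    def __init__(self):
        self.pending = deque()   # (request name, SRB timeout in seconds)
        self.timer = TIMER_IDLE
        self.events = []         # stands in for the system event log

    def send(self, name, timeout):
        # Request goes to the miniport and is placed in the pending queue.
        self.pending.append((name, timeout))
        if self.timer == TIMER_IDLE:
            self.timer = timeout   # first request starts the timer

    def complete(self, name):
        # Remove the completed request, then refresh the timer from the
        # request now at the head of the pending queue.
        self.pending = deque(r for r in self.pending if r[0] != name)
        self.timer = self.pending[0][1] if self.pending else TIMER_IDLE

    def tick(self):
        """Called once per second. Returns the requests to retry after a reset."""
        if self.timer == TIMER_IDLE:
            return []
        self.timer -= 1
        if self.timer > 0:
            return []
        # Timer reached zero: log the event and reset the unit. All
        # outstanding requests are completed with an error for retry.
        self.events.append("Event ID 129: reset to device was issued")
        failed = [name for name, _ in self.pending]
        self.pending.clear()
        self.timer = TIMER_IDLE
        return failed

lun = LunTimer()
lun.send("A", 3)
lun.send("B", 3)
lun.tick()
lun.tick()                 # timer is now 1
lun.complete("A")          # a completion refreshes the timer to 3
retry = []
for _ in range(3):         # "B" never completes, so the timer runs out
    retry += lun.tick()
print(retry, lun.events, lun.timer)
```

In this run request "A" completes in time and refreshes the timer, while "B" never completes, so the third tick after the refresh logs the event and returns "B" for retry.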

 

The most common causes of the Event ID 129 errors are unresponsive LUNs or a dropped request.  Dropped requests can be caused by faulty routers or other hardware problems on the SAN.

 

I have never seen software cause an Event ID 129 error.  If you are seeing Event ID 129 errors in your event logs, then you should start investigating the storage and fibre network.

  • We are troubleshooting a "non-responsive" type issue on one of our Exchange clusters currently.  The active node in the cluster is posting Event ID 129 with Event Source ql2300. We have supplied logs to our SAN fabric vendor to which they have uncovered no issues. We have engaged our storage array vendor who has thoroughly reviewed the storage frame to which this environment is zoned from both a hardware and performance perspective.  No issues found.  We have collected logs and have been working the hardware support angle for 2 days.  I have seen a few articles referring to an issue with the Microsoft StorPort driver that failed to register the Iologmsg.dll file properly.  We are investigating that now.  

    I have also suggested that we open a case with Qlogic and provide them details.  Maybe this is a hardware issue on the HBA??  

  • the LOL OMG msg .dll haha will have to steal that

    [Good catch in the prior comment.]

  • The VHD Miniport Driver (vhdmp.sys, logged with a Source of vhdmp in the Event Log) will throw 129 errors every 30 seconds during a backup if the Backup (Volume Snapshot) Integration Service is enabled for a Guest VM but not supported by the Guest VM OS. I get this all the time when I forget to disable the Backup Integration Service for my FreeBSD VMs.

  • If you are seeing this in a guest virtual machine, it is worth looking at this social.technet.microsoft.com/.../97186e59-2f4e-4f58-b56b-c88f49487211

    Some have reported that Event ID 129 is logged under the following conditions: 1)Hyper-V, 2)Windows Server 2012 in a guest virtual machine, 3)One or more drives connected via a virtual SCSI controller.

    The simple fix is to use the virtual IDE controller instead - although this limits the virtual machine to using 4 drives (because you can't use the SCSI controller).

    [Hi Mike.  As I understand your description, you are attaching .vhdx files to a virtual SCSI controller and then the VM generates event 129 errors.  The forum post you point to was for a VSS problem on Server 2008 and does not involve Server 2012.  We have not received reports of the problem you describe.  Unfortunately we are not able to provide in-depth 1:1 troubleshooting on this blog.  I would encourage you to open a case with Microsoft Support to further investigate the problem you experienced.]

  • I believe Mike is referring to the problem described here: social.technet.microsoft.com/.../e95631c6-c6b0-4dc8-a003-af4adbf113e9 .

    This is the common setup so far between all of us that are encountering this issue:

    1.)  Guest VM is Server 2012

    2.)  One or more VHDX (maybe VHD as well) are connected via a virtual SCSI controller.

    3.)  Once event 129 is logged in the problematic VM, other VMs become unresponsive as well.

    4.)  The problematic VM can only be turned off (normal OS shut down hangs)

    5.)  Detaching the Virtual Disk from the SCSI controller and attaching it to the IDE controller prevents the problem from occurring, but then you are facing the 4 VD IDE limit.

    There are numerous posts about this popping up all over but so far no real resolution besides using IDE instead of SCSI (which isn't a permanent solution).

    [Thank you for the additional information.  Unfortunately we are not able to provide in depth diagnosis through this blog.  Please open a case with Microsoft Support to investigate this issue.]

  • Hey guys, this article is awesome, very well laid out, structured and easy to follow.  I wish all were like this.  Can you add one or two items to this article?  Can you add some diagrams / flow charts related to the driver / IO flow described above including storage stack, etc?  I know some are probably within the MS books but would love to see them here for quick read & reference.  Thanks!

    [Thank you for your feedback.  We don't have any articles showing the flow of I/O through the storage stack, but we will consider this for a future article.  In the meantime these two articles somewhat describe how a request gets to storport and what storport does with it:

    http://blogs.msdn.com/b/ntdebugging/archive/2011/11/23/where-did-my-disk-i-o-go.aspx

    http://blogs.msdn.com/b/ntdebugging/archive/2012/06/21/what-did-storport-do-with-my-i-o.aspx ]

     
