• Ntdebugging Blog

    Understanding Storage Timeouts and Event 129 Errors


Greetings fellow debuggers, today I will be blogging about Event ID 129 messages.  These warning events are logged to the system event log with the storage adapter (HBA) driver’s name as the source.  Windows’ STORPORT.SYS driver logs this message when it detects that a request has timed out; the HBA driver’s name appears as the source because it is the miniport driver associated with storport.

     

    Below is an example 129 event:

     

    Event Type:       Warning

    Event Source:     <HBA_Name>

    Event Category:   None

    Event ID:         129

    Date:             4/9/2009

    Time:             1:15:45 AM

    User:             N/A

    Computer:         <Computer_Name>

    Description:

    Reset to device, \Device\RaidPort1, was issued.

     

    So what does this mean?  Let’s discuss the Windows I/O stack architecture to answer this.

     

Windows I/O uses a layered architecture where device drivers are on a “device stack.”  In a basic model, the top of the stack is the file system.  Next is the volume manager, followed by the disk driver.  At the bottom of the device stack are the port and miniport drivers.  When an I/O request reaches the file system, it takes the block number of the file and translates it to an offset in the volume.  The volume manager then translates the volume offset to a block number on the disk and passes the request to the disk driver.  When the request reaches the disk driver, it creates a Command Descriptor Block (CDB) that will be sent to the SCSI device.  The disk driver embeds the CDB into a structure called the SCSI_REQUEST_BLOCK (SRB).  This SRB is sent to the port driver as part of the I/O request packet (IRP).
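
To make the layering concrete, below is a minimal sketch of a CDB embedded in an SRB.  The structure names and fields are simplified stand-ins for illustration only; the real CDB, SCSI_REQUEST_BLOCK, and IRP definitions live in the WDK headers (scsi.h, srb.h, wdm.h) and contain many more fields.

/* Simplified illustration only - not the WDK definitions. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SCSIOP_READ10 0x28                   /* SCSI READ(10) opcode */

typedef struct _FAKE_CDB10 {                 /* 10-byte CDB, simplified */
    uint8_t OperationCode;                   /* e.g. SCSIOP_READ10 */
    uint8_t Reserved1;
    uint8_t LogicalBlock[4];                 /* big-endian LBA on the wire */
    uint8_t Reserved2;
    uint8_t TransferBlocks[2];               /* big-endian block count */
    uint8_t Control;
} FAKE_CDB10;

typedef struct _FAKE_SRB {                   /* stand-in for SCSI_REQUEST_BLOCK */
    uint8_t  CdbLength;                      /* 10 for a READ(10) */
    uint32_t TimeOutValue;                   /* seconds; seeds the storport timer */
    void    *DataBuffer;
    uint32_t DataTransferLength;
    uint8_t  Cdb[16];                        /* the CDB is embedded in the SRB */
} FAKE_SRB;

/* Build a READ(10) request for 'blocks' blocks starting at 'lba'. */
static void build_read10(FAKE_SRB *srb, uint32_t lba, uint16_t blocks, uint32_t timeout)
{
    FAKE_CDB10 cdb = {0};
    cdb.OperationCode     = SCSIOP_READ10;
    cdb.LogicalBlock[0]   = (uint8_t)(lba >> 24);    /* SCSI fields are big-endian */
    cdb.LogicalBlock[1]   = (uint8_t)(lba >> 16);
    cdb.LogicalBlock[2]   = (uint8_t)(lba >> 8);
    cdb.LogicalBlock[3]   = (uint8_t)(lba);
    cdb.TransferBlocks[0] = (uint8_t)(blocks >> 8);
    cdb.TransferBlocks[1] = (uint8_t)(blocks);

    memset(srb, 0, sizeof(*srb));
    srb->CdbLength    = sizeof(cdb);
    srb->TimeOutValue = timeout;
    memcpy(srb->Cdb, &cdb, sizeof(cdb));             /* embed the CDB in the SRB */
}

int main(void)
{
    FAKE_SRB srb;
    build_read10(&srb, 0x1000, 8, 10);
    printf("CDB opcode 0x%02x, timeout %u seconds\n", srb.Cdb[0], (unsigned)srb.TimeOutValue);
    return 0;
}

The detail that matters for the rest of this post is that each SRB carries its own TimeOutValue, which is what storport uses for the timing described below.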

     

The port driver does most of the request processing.  There are different port drivers depending on the architecture.  For example, ATAPORT.SYS is the port driver for ATA devices, and STORPORT.SYS is the port driver for SCSI devices.  Some of the responsibilities of a port driver are:

    • Providing timing services for requests.
• Enforcing queue depth (making sure that a device does not get more requests than it can handle).
    • Building scatter gather arrays for data buffers.

    The port driver interfaces with a driver called the “miniport”.  The miniport driver is designed by the hardware manufacturer to work with a specific adapter and is responsible for taking requests from the port driver and sending them to the target LUN.  The port driver calls the HwStorStartIo() function to send requests to the miniport, and the miniport will send the requests to the HBA so they can be sent over the physical medium (fibre, ethernet, etc) to the LUN.  When the request is complete, the miniport will call StorPortNotification() with the NotificationType parameter value set to RequestComplete, along with a pointer to the completed SRB.
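
To sketch the shape of that handoff: the real dispatch routine is the miniport’s HW_STARTIO callback (commonly named HwStorStartIo) and completion is reported with StorPortNotification(RequestComplete, ...), as described above.  The stand-in names and types below are purely illustrative.

/* Simplified user-mode illustration of the port/miniport handoff. */
#include <stdio.h>
#include <stdbool.h>

typedef struct _FAKE_SRB { unsigned TimeOutValue; int Status; } FAKE_SRB;

/* Port driver side: the "miniport" calls this when the hardware finishes.
 * Real storport removes the SRB from its pending queue here and refreshes
 * the logical-unit timer from the new head request. */
static void RequestCompleteNotification(FAKE_SRB *srb)
{
    printf("request complete, status %d\n", srb->Status);
}

/* Miniport side: queue the request to the HBA.  In a real miniport the
 * completion happens later, from the interrupt/DPC path. */
static bool MiniportStartIo(FAKE_SRB *srb)
{
    /* ... program the adapter, set up DMA, etc. ... */
    srb->Status = 0;                      /* pretend the I/O succeeded */
    RequestCompleteNotification(srb);
    return true;
}

int main(void)
{
    FAKE_SRB srb = { .TimeOutValue = 10, .Status = -1 };
    MiniportStartIo(&srb);
    return 0;
}

Because completion is asynchronous on real hardware, storport has to time the outstanding requests, which is the subject of the rest of this post.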

     

    When a request is sent to the miniport, STORPORT will put the request in a pending queue.  When the request is completed, it is removed from this queue.  While requests are in the pending queue they are timed. 

     

The timing mechanism is simple.  There is one timer per logical unit and it is initialized to -1.  When the first request is sent to the miniport the timer is set to the timeout value in the SRB.  The disk timeout value is a tunable parameter in the registry at: HKLM\System\CurrentControlSet\Services\Disk\TimeOutValue.  Some vendors will tune this value to best match their hardware; we do not advise changing this value without guidance from your storage vendor.
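
If you want to check whether a vendor has tuned TimeOutValue on a given system, you can read it with reg query or a small user-mode program.  Below is a minimal C sketch using RegGetValueA; it assumes the value, if present, is a REG_DWORD, and it only reads the value.

/* Read the disk class driver's TimeOutValue (in seconds).
 * Build with: cl readtimeout.c advapi32.lib */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD timeout = 0;
    DWORD size = sizeof(timeout);
    LSTATUS rc = RegGetValueA(HKEY_LOCAL_MACHINE,
                              "SYSTEM\\CurrentControlSet\\Services\\Disk",
                              "TimeOutValue",
                              RRF_RT_REG_DWORD, NULL, &timeout, &size);
    if (rc == ERROR_SUCCESS)
        printf("Disk TimeOutValue = %lu seconds\n", timeout);
    else
        printf("TimeOutValue not set; the driver default applies (rc=%ld)\n", rc);
    return 0;
}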

     

The timer is decremented once per second.  When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue.  So, as long as requests complete, the timer will never go to zero.  If the timer does go to zero, it means the device has stopped responding, and that is when the STORPORT driver logs the Event ID 129 error.  STORPORT then takes corrective action by trying to reset the unit.  When the unit is reset, all outstanding requests are completed with an error and are retried.  When the pending queue empties, the timer is set back to its initial value of -1.

[Figure: SRBs in the pending queue with their timeout values and the per-logical-unit timer]

    Each SRB has a timer value set.  As requests are completed the queue timer is refreshed with the timeout value of the SRB at the head of the list.
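
The timing behavior described above can be condensed into a small, self-contained simulation.  This is an illustration of the documented logic, not the actual STORPORT code:

/* One timer per logical unit: -1 when idle, seeded from the SRB timeout,
 * decremented once per second, refreshed from the head of the pending
 * queue on completion, and a reset/Event 129 when it reaches zero. */
#include <stdio.h>

#define QUEUE_MAX 32

typedef struct { int timeout_seconds; } PendingSrb;

typedef struct {
    PendingSrb queue[QUEUE_MAX];     /* outstanding requests, head at index 0 */
    int count;
    int timer;                       /* -1 means no outstanding requests */
} LogicalUnit;

static void lun_init(LogicalUnit *lu) { lu->count = 0; lu->timer = -1; }

static void lun_send(LogicalUnit *lu, int srb_timeout)
{
    lu->queue[lu->count++] = (PendingSrb){ srb_timeout };
    if (lu->timer == -1)             /* first outstanding request seeds the timer */
        lu->timer = srb_timeout;
}

static void lun_complete_head(LogicalUnit *lu)
{
    for (int i = 1; i < lu->count; i++)   /* pop the head request */
        lu->queue[i - 1] = lu->queue[i];
    lu->count--;
    /* refresh from the new head, or go back to -1 when the queue drains */
    lu->timer = (lu->count > 0) ? lu->queue[0].timeout_seconds : -1;
}

static void lun_one_second_tick(LogicalUnit *lu)
{
    if (lu->timer > 0 && --lu->timer == 0) {
        printf("Event 129: reset to device issued\n");
        /* storport completes all outstanding requests with an error and
         * retries them; the timer returns to -1 once the queue empties */
        lun_init(lu);
    }
}

int main(void)
{
    LogicalUnit lu;
    lun_init(&lu);

    lun_send(&lu, 10);               /* timer seeded to 10 */
    lun_send(&lu, 60);               /* queued behind the first request */

    for (int s = 0; s < 3; s++)      /* three seconds pass, timer is now 7 */
        lun_one_second_tick(&lu);

    lun_complete_head(&lu);          /* head completes; timer refreshed to 60 */

    for (int s = 0; s < 60; s++)     /* the device stops responding... */
        lun_one_second_tick(&lu);    /* ...Event 129 fires when the timer hits 0 */

    return 0;
}

As long as requests keep completing, the refresh in lun_complete_head() keeps the timer from ever reaching zero, which matches the behavior described above.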

     

The most common causes of Event ID 129 errors are an unresponsive LUN or a dropped request.  Dropped requests can be caused by faulty routers or other hardware problems on the SAN.

     

    I have never seen software cause an Event ID 129 error.  If you are seeing Event ID 129 errors in your event logs, then you should start investigating the storage and fibre network.

  • Ntdebugging Blog

    What Is In A RHS Dump File Created By Windows Error Reporting


Hello all, East here.  I wanted to give you a hint on how to use an RHS dump to find which thread was part of a Windows Server 2008 R2 cluster RHS recovery deadlock.

     

First, let me start by noting that Windows Server 2008 R2 creates two types of user-mode dumps:

    1 - A heap dump (.hdmp) - which is an extended mini-dump that contains additional data such as the process memory.

2 - A mini dump (.mdmp) - which contains mainly stack information, usually not enough to take the investigation much further.

     

    When a Windows Server 2008 R2 cluster has a resource that does not respond to one of the main cluster queries in a timely fashion, RHS will try to recover that resource by restarting it in an isolated process as well as creating a WER report and a user-mode dump of the deadlock.

     

    NOTE: More info about RHS and WER reports can be seen in the following blog - http://blogs.msdn.com/b/clustering/archive/2009/06/27/9806160.aspx

     

    When a deadlock is detected the cluster log will show:

    "00000dd4.00000aac::2011/05/13-16:12:39.559 ERR   [RHS] WER report is submitted. Result : WerReportQueued."

     

    You should find the WER report file and dumps located at the default location:

    <SystemDrive>\Users\All Users\Microsoft\Windows\WER\ReportQueue

     

    Here is what a cluster created Windows Error Report looks like:

    Version=1

    EventType=WSFC Resource Deadlock

    EventTime=129497767591366327

    ReportType=1

    Consent=1

    ReportIdentifier=d2277837-7d7b-11e0-b5d0-00155d06402c

    Response.type=4

    Sig[0].Name=Resource Name

    Sig[0].Value=New flexdisk

    Sig[1].Name=Resource Type

    Sig[1].Value=flexdisk

    Sig[2].Name=Call Type

    Sig[2].Value=ISALIVE

    DynamicSig[1].Name=OS Version

    DynamicSig[1].Value=6.1.7601.2.1.0.274.10

    DynamicSig[2].Name=Locale ID

    DynamicSig[2].Value=1033

    File[0].CabName=memory.hdmp

    File[0].Path=WERC1F2.tmp.hdmp <----------------------------- Name of the dump file   

    File[0].Flags=2097152

    File[0].Type=3

    File[0].Original.Path=C:\Windows\Temp\WERC1F2.tmp.hdmp

    File[1].CabName=minidump.mdmp

    File[1].Path=WERC2CE.tmp.mdmp

    File[1].Flags=2162690

    File[1].Type=2

    File[1].Original.Path=C:\Windows\Temp\WERC2CE.tmp.mdmp

    FriendlyEventName=WSFC Resource Deadlock

    ConsentKey=WSFC Resource Host Monitor

    AppName=Failover Cluster Resource Host Subsystem <--------- The application that had an issue - RHS

    AppPath=C:\Windows\Cluster\rhs.exe

    ReportDescription=WSFC Resource Deadlock

     

    Now you have the area to look for in the cluster log, and an idea of which resource caused RHS to deadlock, and what it was doing at the time:

     

    The Resource Name = New flexdisk

    The Resource Type = flexdisk

    The type of call the resource was engaged in = ISALIVE

     

    You have all this information in the cluster log as well:

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'New flexdisk'.  RHS deadlock during an Isalive check for the New flexdisk resource.

    00000dd4.00000aac::2011/05/13-16:12:39.136 INFO  [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] Resource New flexdisk handling deadlock. Cleaning current operation and terminating RHS process.

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] About to send WER report.

    000004dc.00000fb4::2011/05/13-16:12:39.136 WARN  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'New flexdisk', gen(3) result 4.

    000004dc.00000fb4::2011/05/13-16:12:39.136 INFO  [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'New flexdisk' consecutive failure count 1.

    00000dd4.00000aac::2011/05/13-16:12:39.559 ERR   [RHS] WER report is submitted. Result : WerReportQueued.  WER report and user-mode dump created

    000004dc.00000fb4::2011/05/13-16:12:39.559 ERR   [RCM] rcm::RcmMonitor::RecoverProcess: Recovering monitor process 3540 / 0xdd4

    000004dc.00000fb4::2011/05/13-16:12:39.575 INFO  [RCM] Created monitor process 3280 / 0xcd0

    00000cd0.00000200::2011/05/13-16:12:39.575 INFO  [RHS] Initializing.

    000004dc.00000fb4::2011/05/13-16:12:39.591 INFO  [RCM] rcm::RcmResource::ReattachToMonitorProcess: (New flexdisk, OfflineCallIssued)

    000004dc.00000fb4::2011/05/13-16:12:39.591 WARN  [RCM] Canceling pending control GET_RO_PRIVATE_PROPERTIES for resource 'New flexdisk' due to monitor crash.

     

    Now if you want to get adventurous and review the dump file to see the call stacks of RHS and what the resource was doing, you will look at the WER dump with the extension “.hdmp”.  Since we are looking at a process with multiple threads, we need to find the one that is showing the WER reporting calls and the one showing the RHS deadlock.

     

After opening the dump in your debugger, run the “~*k” command.  In the example below, thread 3 is the one we are interested in.  You can tell this because the process and thread IDs match the cluster log above.  If you don’t have a cluster log, this is the thread with wer.dll on the stack.

    0:000> ~*k

     

    .  0  Id: dd4.9d4 Suspend: 0 Teb: 000007ff`fffdd000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`001cf128 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`001cf130 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`001cf230 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0

    00000000`001cf2c0 00000000`ff74802c rhs!cxl::WaitHandle::WaitAny+0x3f

    00000000`001cf3b0 00000000`ff7458a1 rhs!Rhs::WaitForTermination+0x74

    00000000`001cf450 00000000`ff769099 rhs!wmain+0x4b1

    00000000`001cfbd0 00000000`772c652d rhs!__wmainCRTStartup+0x13d

    00000000`001cfc10 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`001cfc40 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       1  Id: dd4.d0c Suspend: 0 Teb: 000007ff`fffdb000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00c3fa98 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00c3faa0 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`00c3fba0 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0

    00000000`00c3fc30 00000000`ff75785c rhs!cxl::WaitHandle::WaitAny+0x3f

    00000000`00c3fd20 00000000`ff766988 rhs!PingAliveCallScheduler::Run+0x184

    00000000`00c3fda0 00000000`772c652d rhs!cxl::Thread::ThreadStartRoutine+0x18

    00000000`00c3fdf0 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00c3fe20 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       2  Id: dd4.e78 Suspend: 0 Teb: 000007ff`fffd9000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00baf998 00000000`774eb007 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00baf9a0 00000000`772c652d ntdll!TppWaiterpThread+0x14d

    00000000`00bafc40 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00bafc70 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       3  Id: dd4.aac Suspend: 0 Teb: 000007ff`fffd7000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00cbe5e8 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3

    00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a

    00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46

    00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20

    00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54

    00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4

    00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb

    00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386

    00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a

    00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f

    00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187

    00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa

    00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab

    00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9

    00000000`00e2f1a8 00000000`00000000 0x326aa0

     

    <snipped to save space>

     

We will switch context to thread 3 using the “~3s” command, and follow that with “kn” to show the stack with frame numbers.  Debug Tip: since we already know its thread id is “aac”, we can also switch directly to thread 3 by using its thread id with the following command syntax:  “~~[aac]s”

    0:003> kn

    # Child-SP          RetAddr           Call Site

    00 00000000`00cbe5e8 000007fe`fd611420 ntdll!NtWaitForMultipleObjects+0xa

    01 00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    02 00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3

    03 00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a

    04 00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46

    05 00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20

    06 00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54

    07 00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4

    08 00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb

    09 00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386

    0a 00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a

    0b 00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f

    0c 00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187

    0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    0e 00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    0f 00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

The frame of interest is frame 0d.  You can either switch to that frame’s context ( /c ) or just display the registers ( /r ) for it:

    0:003> .frame /r d

    0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    rax=000007fffffd7000 rbx=0000000000346920 rcx=00000000007316c0

    rdx=0000000000000000 rsi=000000000035f8b0 rdi=000000000035f8b0

    rip=00000000ff74b270 rsp=0000000000cbfb60 rbp=0000000000000000

    r8=00000000ffffffff  r9=00000000000004ff r10=0000000000000000

    r11=0000000000cbe810 r12=0000000000000000 r13=0000000000000000

    r14=0000000000000000 r15=000000004dcd5877

    iopl=0         nv up ei pl zr na po nc

    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    rhs!RhsCall::DeadlockMonitor+0x324:

    00000000`ff74b270 448b442430      mov     r8d,dword ptr [rsp+30h] ss:00000000`00cbfb90=ffffffff

     

Here is where we get into tricky territory about where to look.  I know from experience that the RDI register above will end up containing the information for this resource.  One way to verify you are in the right area is that dumping RDI will show the CallType.

     

Dump that address out to find the thread id of the hung resource call:

    0:003> dqs 000000000035f8b0

    00000000`0035f8b0  00000000`ff704eb8 rhs!IsAliveCall::`vftable' == CallType

    00000000`0035f8b8  00000000`00000004

    00000000`0035f8c0  00000000`001e7da0

    00000000`0035f8c8  00000000`00000e48 <-- Thread id of concerning resource

    00000000`0035f8d0  00000000`00346920

    00000000`0035f8d8  0000000c`00000001

    00000000`0035f8e0  00000000`00000003

    00000000`0035f8e8  00000000`4dcd5876

    00000000`0035f8f0  00000000`048d2dd3

    00000000`0035f8f8  00000000`00000100

    00000000`0035f900  00000001`00000100

    00000000`0035f908  00000000`00000000

    00000000`0035f910  00430000`00000031

    00000000`0035f918  8a0000a4`52e48988

    00000000`0035f920  004c0041`00530049

    00000000`0035f928  00450045`00560049

     

If we scan our previous “~*k” output, we find that the thread id above matches the thread running the problem resource, FlexRes:

       4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa

    00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab

    00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9

    00000000`00e2f1a8 00000000`00000000 0x326aa0

     

In this case I was using an in-house test resource that simply went to sleep instead of responding to the LooksAlive() check, which caused IsAlive() to fail.
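
For context, a cluster resource DLL exports LooksAlive() and IsAlive() entry points that RHS calls to health-check the resource, and a hang in that path is exactly what produces the deadlock report above.  Below is a minimal sketch of such a hang; the function names are hypothetical, and the RESID typedef is a stand-in for the real Resource API types and function-table registration in ResApi.h, which are omitted here.

/* Sketch of a resource DLL's health-check entry points (simplified). */
#include <windows.h>

typedef void *RESID;          /* stand-in for the Resource API handle type */

/* Quick, cheap check - RHS calls this frequently. */
BOOL WINAPI FlexLooksAlive(RESID resource)
{
    UNREFERENCED_PARAMETER(resource);
    return TRUE;
}

/* Thorough check - RHS expects a prompt answer here too.  Blocking in this
 * call (as the test resource in this post deliberately did) is what trips
 * RhsCall::DeadlockMonitor and produces the WER report and dump. */
BOOL WINAPI FlexIsAlive(RESID resource)
{
    UNREFERENCED_PARAMETER(resource);
    Sleep(INFINITE);          /* never returns - simulates the hang on thread dd4.e48 */
    return TRUE;
}

A well-behaved resource DLL returns from these calls quickly and pushes any long-running health checks off to a worker thread.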

     

Also, if you check the cluster log 5 minutes prior to the RhsCall::DeadlockMonitor error, you will see the resource and thread id as well, showing what was going on prior to the deadlock:

    00000dd4.00000e48::2011/05/13-16:07:37.270 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:1 Error = 0.

    000004dc.00000fd4::2011/05/13-16:07:37.286 WARN  [RCM] ResourceControl(SET_PRIVATE_PROPERTIES) to New flexdisk returned 5024.

    00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:0 Error = 0.

    00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Going to Hang in IsAlive.

     

In most cases the cluster log provides enough information to help you narrow down possible cluster resource issues.  If not, the user-mode dump may help you verify additional data, especially if you are debugging a third-party resource DLL that is deadlocking RHS.
