CSS SQL Server Engineers

This is the official team Web Log for Microsoft Customer Service and Support (CSS) SQL Support. Posts are provided by the CSS SQL Escalation Services

How It Works: Debugging SQL Server Stalled or Stuck I/O Problems - Root Cause

How It Works: Debugging SQL Server Stalled or Stuck I/O Problems - Root Cause

Rate This
  • Comments 2

Previously I have covered stuck and stalled I/O issues in other posts and articles.  However, these only tell you that there is a problem outside the SQL Server engine.   This post attempts to extend your root cause debugging capabilities beyond the SQL Server process.

NOTE: Consult with your Windows Administrator before considering these techniques because this is an advanced technique.  If you have further questions about using these techniques and need the assistance of Microsoft you should consult with Windows Support.

SQL Server considers a stuck or stalled I/O one that has not achieved completion status from the operating system level onward.  The problem could be an OS issue, 3rd party driver, hardware or other I/O path related problem.

When SQL Server makes the ReadFile, WriteFile or other I/O call the Windows I/O manager (kernel mode) creates the IRP and begins processing the request.  The IRP has a stack concept and when exposed shows the owning stack location.  To see the IRP stack you have to look into the kernel address space.

Microsoft provides a set of debugging tools that can be used to enumerate the active IRPs.  

WARNING
  • For Vista and Windows 2008 these requires elevated administrator privileges.
  • These techniques may not always work due to the dynamic nature of the system.    You may experience problems related to 'get-context' errors and extended output like '<Data Not Accessible>' that cause the scripts to fail.   When you encounter such issues, retry the LiveKd or Windbg step.   This may require several attempts.
  • Vista 64 bit and Windows 2008 64 bit do not support the LiveKd approach.  A reboot troubleshooting method must be used and with the Windows Debugger.
  • The 'Operating System Remains Active' techniques result in negible impact. They are snapshots of memory while the system remains active.  The in-motion data can be misleading so be sure to study the information carefully before drawing conclusions.  If the in-motion data capture is not sufficient you will need to consider the 'Operating System Frozen' technique.

No Reboot Required - Operating System Remains Active  (Windows 2003 32 and 64 bit, Windows XP 32 and 64 bit, Vista 32 bit)

This method does not require reboot of the computer.   It works off the live server and as such the data could be in-flight and may require more than one capture attempt to determine the proper IRP(s) to troubleshoot.

  1. Install the Debugging Tools For Windows for the native computer architecture - http://www.microsoft.com/whdc/devtools/debugging/default.mspx
  2. Install LiveKD in the same directory you installed the debugging tools. http://technet.microsoft.com/en-us/sysinternals/bb897415.aspx   (LiveKD is not currently supported on 64 bit Vista and Windows 2008 unless you disable signed driver compliance.)
  3. At the end of this post is the FILES TO SAVE section.   Save the files as directed.
  4. Open a command prompt and change (cd) to the debugger install directory.  Then issue the following commands.   This will capture the information in the CollectSQLIRPs.log file.  Use the evaluation section of this post to evaluate the output.

    set _NT_SYMBOL_PATH=srv*c:\temp\symbols*http://msdl.microsoft.com/download/symbols  
    livekd.exe

Reboot Required - Operating System Remains Active

This method does require reboot of the computer.   It works off the live server and as such the data could be in-flight and may require more than one capture attempt to determine the proper IRP(s) to troubleshoot.  This technique uses the live debugging capabilities of the Windows Debuggers to obtain the information.

  1. Install the Debugging Tools For Windows for the native computer architecture - http://www.microsoft.com/whdc/devtools/debugging/default.mspx
  2. Add /DEBUG to the boot.ini or use bcdedit.exe, Debug On to enable kernel debug capabilities.
  3. Restart the computer
  4. At the end of this post is the FILES TO SAVE section.   Save the files as directed.
  5. Open a command prompt and change (cd) to the debugger install directory.  Then issue the following commands.  

    set _NT_SYMBOL_PATH=srv*c:\temp\symbols*http://msdl.microsoft.com/download/symbols  
    windbg.exe -kl
  6. In the command window of the debugger issue the following command. This will capture the information in the CollectSQLIRPs.log file. Use the evaluation section of this post to evaluate the output.

    $<collectsqlirps.script

Reboot Required - Operating System Frozen

This method does require reboot of the computer and a second debugging client computer.

  1. Install the Debugging Tools For Windows for the native computer architecture - http://www.microsoft.com/whdc/devtools/debugging/default.mspx
  2. Follow documented instructions to enabled kernel debugging from a client machine.  You commonly need a debug cable and other settings to establish the debugging session.
  3. At the end of this post is the FILES TO SAVE section.   Save the files as directed.
  4. Open a command prompt and change (cd) to the debugger install directory.  Then issue the following commands.  

    set _NT_SYMBOL_PATH=srv*c:\temp\symbols*http://msdl.microsoft.com/download/symbols  
    windbg.exe 
  5. Establish the kernel debugging session.   File | Kernel Debug
  6. In the command window of the debugger issue the following command. This will capture the information in the CollectSQLIRPs.log file. Use the evaluation section of this post to evaluate the output.

    $<collectsqlirps.script

Evaluation

The output from the debugging script is placed in the CollectSQLIRPs.log file.   The script will take a snapshot of the IRPs, wait 10 seconds and take another snapshot.    Using the snapshots you can find the IRPs that appear to be stuck or stalled and further investigate.

Not all IRPs are expected to change state.  

The following shows the an IRP waiting for a network connection and it is expected to wait.  You have to look at each IRPs type it understand the expected behavior.   SQL Server File I/O based IRPs should complete quickly.

THREAD ADDRESS 0xfffffadf9d396840
.echo 0xfffffadf9c9a1200
.echo 0xfffffadfa7896010
IRP ADDRESS 0xfffffadf9c9a1200
   +0x000 Type             : 5
   +0x002 Size             : 184
   +0x008 DeviceObject     : 0xfffffadf`9d6ecbc0 _DEVICE_OBJECT
   +...

   +0x058 FileName         : _UNICODE_STRING "\SQLLocal\SQL2005"
...

  +0x038 DriverName       : _UNICODE_STRING "\FileSystem\Npfs"

Irp is active with 1 stacks 1 is current (= 0xfffffadf9c9a12d0)
No Mdl: No System Buffer: Thread fffffadf9d396840:  Irp stack trace. 
Flags = 00000800
ThreadListEntry.Flink = fffffadfa7896030
ThreadListEntry.Blink = fffffadf9d396bd8
IoStatus.Status = 00000000
IoStatus.Information = 00000000
RequestorMode = 00000001
Cancel = 00
CancelIrql = 0
ApcEnvironment = 00
UserIosb = 00edfab0
UserEvent = 00000000
Overlay.AsynchronousParameters.UserApcRoutine = 00000000
Overlay.AsynchronousParameters.UserApcContext = 00edfab0
Overlay.AllocationSize = 00000000 - 00000000
CancelRoutine = fffffadf90ce3320   Npfs!NpCancelChangeNotifyIrp
UserBuffer = 00000000
&Tail.Overlay.DeviceQueueEntry = fffffadf9c9a1278
Tail.Overlay.Thread = fffffadf9d396840
Tail.Overlay.AuxiliaryBuffer = 00000000
Tail.Overlay.ListEntry.Flink = fffffa80047460b8
Tail.Overlay.ListEntry.Blink = fffffa80047460b8
Tail.Overlay.CurrentStackLocation = fffffadf9c9a12d0
Tail.Overlay.OriginalFileObject = fffffadf9dad85e0
Tail.Apc = 00000000
Tail.CompletionKey = 00000000
     cmd  flg cl Device   File     Completion-Context
>[  d, 0]   5  1 fffffadf9d6ecbc0 fffffadf9dad85e0 00000000-00000000    pending
        \FileSystem\Npfs
   Args: 00000000 00000000 00110008 00000000

The following is an example of a stuck IRP.

This IRP stayed in the same state for over an hour.  Looking closer at the stack I could tell it was waiting for the client to send a TDS packet.  This should be waiting at the network driver (TPC for example) level and got stuck on its way in MYDRIVER.

Note: This is a stuck network IRP but the technique applies to file I/O as well.

!irp fffffadf9d599c60

Irp is active with 4 stacks 2 is current (= 0xfffffadf9d599d78)

Mdl=fffffadf9d83c310: No System Buffer: Thread fffffadf9c839770:  Irp stack trace. 

     cmd  flg cl Device   File     Completion-Context

[  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                                                Args: 00000000 00000000 00000000 00000000

>[  f, 8]   0 e1 fffffadf9dd86af0 fffffadf9c9cda20 fffffadf8d468e20-00000000 Success Error Cancel pending

                       \Driver\Tcpip             mydriver

                                                Args: 1002000001f69 00000000 00000000 00000000

Using the !thread command and public symbols I can see stack.

!thread fffffadf9c839770

  THREAD fffffadf9c839770  Cid 0600.1360  Teb: 000007fffff80000 Win32Thread: 0000000000000000 WAIT: (UserRequest)
     ...

        Child-SP          RetAddr           Call Site

        fffffadf`688c0970 fffff800`01027682 nt!KiSwapContext+0x85 [d:\nt\base\ntos\ke\amd64\ctxswap.asm @ 80]

        fffffadf`688c0af0 fffff800`0102828e nt!KiSwapThread+0x3c9 [d:\nt\base\ntos\ke\thredsup.c @ 2018]

        fffffadf`688c0b50 fffff800`011129fe nt!KeWaitForSingleObject+0x5a6 [d:\nt\base\ntos\ke\wait.c @ 1256]

        fffffadf`688c0bd0 fffff800`0102e33d nt!NtSignalAndWaitForSingleObject+0x18e [d:\nt\base\ntos\ob\obwait.c @ 278]

        fffffadf`688c0c70 00000000`77ef1b5a nt!KiSystemServiceCopyEnd+0x3 (TrapFrame @ fffffadf`688c0c70)

        00000000`0af6f028 00000000`77d44ebb ntdll!ZwSignalAndWaitForSingleObject+0xa

        00000000`0af6f030 00000000`01003ff0 kernel32!SignalObjectAndWait+0x133 [d:\nt\base\win32\client\synch.c @ 1309]

        00000000`0af6f0e0 00000000`01003693 sqlservr!SOS_Scheduler::SwitchContext+0x378

        00000000`0af6f390 00000000`0100353e sqlservr!SOS_Scheduler::Suspend+0xaf

        00000000`0af6f400 00000000`0135957f sqlservr!SOS_Event::Wait+0x151

        00000000`0af6f460 00000000`01359c19 sqlservr!CNetConnection::FWaitForNewPacket+0x7e

        00000000`0af6f570 00000000`0103e7f3 sqlservr!CNetConnection::FReadTdsPacket+0x1f5

        00000000`0af6f6a0 00000000`01035a5b sqlservr!CNetByteStream::ReadTDSPacketFromNetwork+0xd4

        00000000`0af6f740 00000000`0103584a sqlservr!CNetByteStream::EsetBeginNewStream+0x8b

        00000000`0af6f840 00000000`0100da8e sqlservr!process_commands+0x14a

        00000000`0af6faf0 00000000`0100dc11 sqlservr!SOS_Task::Param::Execute+0xee

        00000000`0af6fc00 00000000`0100e971 sqlservr!SOS_Scheduler::RunTask+0xc9

At the end of the capture the lmkv was issued to list the details about the loaded kernel modules.   Using this information I can find out more details about the driver such as file location.

    Loaded symbol image file: MYDRIVER.SYS
    Image path: \??\C:\WINDOWS\system32\Drivers\MYDRIVER.SYS
    Image name: MYDRIVER.SYS
    Timestamp:        Wed Jul 20 15:20:21 2005 (42DEB205)
    CheckSum:         000045C7
    ImageSize:        00008000
    Translations:     0000.04b0 0000.04e0 0409.04b0 0409.04e0

------ FILES TO SAVE ------

NTSD.INI

Save the following as file NTSD.INI in the debugger install directory.

$<collectsqlirps.script

collectsqlirps.script

Save the following as file collectsqlirps.script in the debugger install directory.  Note, the debugger does not let command span more than one line.  The .foreach commands in this script have to remain on a single line without CR\LR.

$$===========================================================================
$$=== The following dumps all Thread IRPs for the SQLServr.exe processes
$$=== !process 0 0 sqlservr.exe
$$===========================================================================
.logopen CollectSQLIRPs.log

$$ Turn off a bunch of output that would cause .foreach to receive unwanted tokens
!sym quiet
!sym prompts off

.symopt-0x01
.symopt-0x04
.symopt-0x02
.symopt-0x10
.symopt-0x20
.symopt-0x10000
.symopt-0x20000

.echo Gathering offsets
r $t1 = @@C++(#FIELD_OFFSET(nt!_ETHREAD, ThreadListEntry))
r $t2 = @@C++(#FIELD_OFFSET(nt!_ETHREAD, IrpList))
r $t3 = @@C++(#FIELD_OFFSET(nt!_IRP, ThreadListEntry))
r $t4 = @@C++(#FIELD_OFFSET(nt!_EPROCESS, ThreadListHead))
r $t5 = @@C++(#FIELD_OFFSET(nt!_LIST_ENTRY, Blink))
r $t6 = @@C++(#FIELD_OFFSET(nt!_IRP, Tail.Overlay.OriginalFileObject))
r $t7 = @@C++(#FIELD_OFFSET(nt!_FILE_OBJECT, DeviceObject))
r $t8 = @@C++(#FIELD_OFFSET(nt!_DEVICE_OBJECT, DriverObject))

.echo =========================================================
.echo Thread IRPS for SQL Server
.echo =========================================================
.time
.foreach /pS 1 /ps 11 (process {!process 0 0 sqlservr.exe}) { .echo EPROCESS ADDRESS ${process}; dt ${process} nt!_EPROCESS ImageFileName; .foreach (threadaddress {  !list -t nt!_ETHREAD.ThreadListEntry.Flink -x ".echo"  poi(${process}+$t4)-$t1 }  ) { .echo THREAD ADDRESS ${threadaddress}; .foreach(irpaddress {  !list -t nt!_IRP.ThreadListEntry.Flink -x ".echo"  poi(${threadaddress}+$t2)-$t3 } ) {.echo IRP ADDRESS ${irpaddress}; j( poi(${irpaddress}+$t3) = poi(poi(${irpaddress}+$t3)+$t5)) '.echo No active IRP on Thread' ; 'dt poi(${irpaddress}+$t6) nt!_FILE_OBJECT; dt poi(poi(${irpaddress}+$t6)+$t7) nt!_DEVICE_OBJECT; dt poi(poi(poi(${irpaddress}+$t6)+$t7)+$t8) nt!_DRIVER_OBJECT; !irp ${irpaddress} 1' } } }

.echo =========================================================
.echo Sleeping for 10 seconds
.echo =========================================================
.sleep 10000

.echo =========================================================
.echo Thread IRPS for SQL Server
.echo =========================================================
.time
.foreach /pS 1 /ps 11 (process {!process 0 0 sqlservr.exe}) { .echo EPROCESS ADDRESS ${process}; dt ${process} nt!_EPROCESS ImageFileName; .foreach (threadaddress {  !list -t nt!_ETHREAD.ThreadListEntry.Flink -x ".echo"  poi(${process}+$t4)-$t1 }  ) { .echo THREAD ADDRESS ${threadaddress}; .foreach(irpaddress {  !list -t nt!_IRP.ThreadListEntry.Flink -x ".echo"  poi(${threadaddress}+$t2)-$t3 } ) {.echo IRP ADDRESS ${irpaddress}; j( poi(${irpaddress}+$t3) = poi(poi(${irpaddress}+$t3)+$t5)) '.echo No active IRP on Thread' ; 'dt poi(${irpaddress}+$t6) nt!_FILE_OBJECT; dt poi(poi(${irpaddress}+$t6)+$t7) nt!_DEVICE_OBJECT; dt poi(poi(poi(${irpaddress}+$t6)+$t7)+$t8) nt!_DRIVER_OBJECT; !irp ${irpaddress} 1' } } }

.echo =========================================================
.echo Kernel Module List
.echo =========================================================
lmkv

.echo =========================================================
.echo Process list
.echo =========================================================
!process 0 0

$$  Close the log file
.logclose
q

Bob Dorr
SQL Server Senior Escalation Engineer

Leave a Comment
  • Please add 3 and 1 and type the answer here:
  • Post
  • You've been kicked (a good thing) - Trackback from DotNetKicks.com

  • Having non-yielding scheduler issue on a Windows 2003 Version 3790 (SP2)MP(16 procs) Free x64 serve with SQL 2005 SP3 running on it. Every day we see IO warning msgs after 6 am and cluster becomes un-responsive by 8:38 am, cluster fail over to other node. Until 8:37 system is accessible and IO response on the drive is very good. I have analysed sql and cluster hang dump, but still unable find the root cause. Could you please suggest, let me know if you need the dump or windbg output.

Page 1 of 1 (2 items)