CSS SQL Server Engineers

This is the official team Web Log for Microsoft Customer Service and Support (CSS) SQL Support. Posts are provided by the CSS SQL Escalation Services

SQLIOSim - Outstanding I/O Warnings

SQLIOSim - Outstanding I/O Warnings

  • Comments 8

I have encountered some questions pertaining to the SQLIOSim outstanding IO warnings and I am posting some details.

 

The Display monitor thread is responsible for updating the display (progress outputs) and checking for long IOs  (CheckForLongIO/ReportLongIo).   Each time it wakes (from 1 to 10 second range depending on state) it fires the check.

 

Each logical CPU (scheduler) in the simulator has an IO list.   The CheckForLongIO loops over each logical CPU and checks the pending IO list of each logical scheduler.

 

When the IO is posted (async) we grab the current time and add the IO to the logical scheduler specific list.  When the check is performed each IO in the list is checked with HasOverlappedIoCompleted.   This is a WIN32 macro that checks the status stored in the OVERLAPPED structure of the IO to see if the system has reported the IO as complete.  When FALSE is returned and the duration of the IO is > 15 seconds the condition is flagged as a ‘stalled’ IO as the OS and sub-system have not completed the specific IO.

 

The number of these outstanding IOs is accumulated and reported after all logical CPUs have been checked.

 

The following output shows 2 logical CPUs being used by the simulator.

 

03/10/07

15:35:58

3264

System

CPUCount = 2, Affinity = 0, IOAffinity = 0, MaxMemory = 630 MB, StopOnError = Yes, TestCycles = 1, TestCycleDuration = 300 s, CacheHitRatio = 100.00%

 

At 15:36:46 you have 5 IO requests that exceeded 15 seconds.  One second later another 72 requests exceeded the 15 second mark.  (The same IO is not reported again, once flagged as long it won’t be counted on secondary passes.)

 

What this does not show is how many IOs have completed.  SQLIOSim could have completed 1000s of IOs during this 15 second window but some of the requests are being stalled. 

 

03/10/07

15:36:45

3764

Display Monitor

Error: 0x00000000
Error Text:
Description: 5 IO requests are outstanding for more than 15 sec.

03/10/07

15:36:46

3764

Display Monitor

Error: 0x00000000
Error Text:
Description: 72 IO requests are outstanding for more than 15 sec.

 

The files are open with NO buffering so they should not involve system cache components.  You should review filter drivers and the like.  Likely something like elevator seek or something of this nature is at play when you see the outstanding IO warnings.

 

 

03/10/07

15:35:58

3264

System

NoBuffering = Yes, WriteThrough = Yes, UseScatterGather = Yes, ForceReadAhead = No, MaxOutstandingIO = 0, TargetIODuration = 100 ms

LOGICALCONFIG::PrintConfigurationToLog

e:\yukon\sosbranch\sql\ntdbms\storeng\util\sqliosim\logicalconfig.cpp

228

 

03/10/07

15:35:58

3264

System

AllowIOBursts = Yes, DeleteFilesAtStartup = Yes, DeleteFilesAtShutdown = No, StampFiles = No

     

 

The following error text is reported for each, individual file no more than every 30 seconds, indicating the number of total IOs identified as taking longer than 15 seconds.

 

The following indicates that 826 IO requests for the file sqliosim.mdx exceeded 15 seconds to complete in the last 30 second processing window.   The last IO that is currently STATUS_PENDING (256) is at the specified offset on the specified handle.

 

 

03/10/07

15:37:16

3764

Display Monitor

Error: 0x00000000
Error Text:
Description: 826 long IO requests are outstanding for file h:\sqliosim.mdx. Handle 0x2bc. Offset of latest long IO 0x3cf80000, duration 16000 ms

 

Support normally has to use the kernel debugger and !irpfind to locate the IRPs associated with the threads in SQLIOSim and see their IO stack location.

 

The way the output is sequenced can be puzzling until you understand the sequence.   You will see the series of CheckForLongIO (# IO requests are outstanding for more than 15 sec.) and then the ReportLongIO message (# long IO requests are outstanding for file …).  

 

The CheckForLongIO calls ReportLongIO but ReportLongIO has the 30 second reporting throttle.  When the 30 seconds is exceeded the message is logged.  This means that the next CheckForLongIO message (shown below with 887 IO requests) is included in the count of the ReportLongIO as the CheckForLongIO adds up all stalled IOs before reporting the message.

 

 

03/10/07

15:37:26

3764

Display Monitor

Error: 0x00000000
Error Text:
Description: 29 IO requests are outstanding for more than 15 sec.

CLogicalCPUs::CheckForLongIO

e:\yukon\sosbranch\sql\ntdbms\storeng\util\sqliosim\logicalcpu.cpp

1230

 

03/10/07

15:37:27

3764

Display Monitor

Error: 0x00000000
Error Text:
Description: 3 IO requests are outstanding for more than 15 sec.

CLogicalCPUs::CheckForLongIO

e:\yukon\sosbranch\sql\ntdbms\storeng\util\sqliosim\logicalcpu.cpp

1230

 

03/10/07

15:37:59

3764

Display Monitor

Error: 0x00000000
Error Text:
Description: 1138 long IO requests are outstanding for file h:\sqliosim.mdx. Handle 0x2bc. Offset of latest long IO 0x85640000, duration 15375 ms

CLogicalFile::ReportLongIO

e:\yukon\sosbranch\sql\ntdbms\storeng\util\sqliosim\fileio.cpp

811

 

03/10/07

15:37:59

3764

Display Monitor

Error: 0x00000000
Error Text:
Description: 887 IO requests are outstanding for more than 15 sec.

     

 

JAN 2008 UPDATE - SQLIOSim Stalled I/O Warnings During Checkpoint Phase of Test

Allow bursts is used to disable I/O target throttling.   This is in direct response to a stress scenario that has uncovered drivers that can blue screen when under large I/O depths and memory pressure.   

 

DEFAULT: TRUE   (Current decision was that SQLIOSim was designed to be a data consistency checker not performance test so it should allow burst activities.)

 

The current design is as follows.

 

·         During initial file build and checkpoint, when TRUE, I/O throttling is disabled.

·         During the actual test phase I/O throttling is always enabled, regardless of the allow burst setting.

·         During checkpoint phase of the test, when TRUE, I/O throttling is disabled.

·         Before the audit phase I/O throttling is enabled.

 

The design is to change the target I/O time for the entire set of logical files.   The normal behavior is to use the configuration setting for the I/O target response time.   Basically, we use it to determine how many OVERLAPPED structures are allowed to be used.  When we are not meeting the target response we throttle back the number of I/O requests that can be used.  When we are meeting the response rate we add I/O requests so we can post more I/O.

 

When Allow Bursts disables throttling is sets the target to MAX_DWORD.   When throttling is enabled again it returns to the configured target.

 

This allows the checkpoint phase to do what the older SQLIOStress did and post 1000s of I/Os in rapid succession to push the I/O depth.  This can result in the I/O warnings.  What you are looking for is that machine to cleanly recover from the situation and continue testing.

 

The one side effect this behavior may have that affects common testing scenarios, during a cycle is that when disabled the number of I/O request structures could grow large.   When throttling is re-enabled it may take a bit of time of the normal activities to come back into compliance with the target.  

 

If I was able to get a depth of 1000 during the (allow bursts checkpoint activity) these request structures are present when I start the standard cycle activities.   The target has returned to the configured value but nothing is preventing the initial I/O activity from getting a request.   It won’t be until the first set of I/Os are completed during a cycle that the number of request structures is trimmed.   So at the beginning of a cycle you might get more I/O depth than the end of the cycle.

 

I think the way the code it written you won’t see this very often.   The normal cycle does a checkpoint and then an audit.  The audit is limited by the configured I/O target so it should bring the number of requests back inline before the cycle starts all other user simulations.

 

Bob Dorr
SQL Server Senior Escalation Engineer

Leave a Comment
  • Please add 8 and 1 and type the answer here:
  • Post
  • One of the things that’s great about my job is the fact that I get to meet people who are a lot smarter

  • I had run the test for HP MSA drive wiht one local drive. But for MSA drvie it is printing following errors:-

    19/07 12:43:56 11688 CreateFileStream Error: 0x80070032

    Error Text: The request is not supported.

    Description: Unable to get disk cache info for Z:\

  • The utility attempts to check the caching state.   When it can't it logs an error.  This should be a warning and will be updated in future versions.

    [RDORR Jan 24, 2008] This is correct and is already updated in the SQL 2008 CTP builds.

  • "[RDORR Jan 24, 2008] This is correct and is already updated in the SQL 2008 CTP builds."

    In SQL2008 on Windows 2008, I still get a similar error message from "CreateFileStream":

       Error: 0x80070001

       Error Text: Incorrect function.

       Description: Unable to get disk cache info for C:\

    Version: 2007.100.1600.22

    Is it safe to say that it is just a warning? not an error?

  • According to this post, "The CheckForLongIO calls ReportLongIO but ReportLongIO has the 30 second reporting throttle." In my case it looks like it is in the reverse order. Can you please help me to understand is this is a potential problem.

    10:47:09 Description: 1 long IO requests are outstanding for file E:\sqliosim.mdx. Handle 0x598. Offset of latest long IO 0xab5f40000, duration 15366 ms CLogicalFile::ReportLongIO

    10:47:09 Description: 298 IO requests are outstanding for more than 15 sec. CLogicalCPUs::CheckForLongIO

    10:47:11 Description: 385 IO requests are outstanding for more than 15 sec. CLogicalCPUs::CheckForLongIO

    10:47:13 Description: 282 IO requests are outstanding for more than 15 sec. CLogicalCPUs::CheckForLongIO

    10:47:14 Description: 915 IO requests are outstanding for more than 15 sec. CLogicalCPUs::CheckForLongIO

    10:47:15 Description: 830 IO requests are outstanding for more than 15 sec. CLogicalCPUs::CheckForLongIO

    10:47:16 Description: 723 IO requests are outstanding for more than 15 sec. CLogicalCPUs::CheckForLongIO

    10:47:17 Description: 1446 IO requests are outstanding for more than 15 sec. CLogicalCPUs::CheckForLongIO

    10:47:18 Description: 101 IO requests are outstanding for more than 15 sec. CLogicalCPUs::CheckForLongIO

  • wonderful blog thanks for sharing such useful post.

    http://www.billergeeks.com

  • Thanks for sharing this information, its quite helpful.

    www.crystal-limousine.com

  • It's a world of SQL pain out there sometimes trying ot interpret these results; I've recently experienced issues with these long I/O timeouts in a client's SharePoint environment, here's a summary of my findings and thoughts:

    www.boxesandarrows.co.uk/testing-sql-io-for-sharepoint

Page 1 of 1 (8 items)