SQLIOSim - Outstanding I/O Warnings
I have encountered some questions pertaining to the SQLIOSim outstanding IO warnings and I am posting some details.
The Display monitor thread is responsible for updating the display (progress outputs) and checking for long IOs (CheckForLongIO/ReportLongIo). Each time it wakes (from 1 to 10 second range depending on state) it fires the check.
Each logical CPU (scheduler) in the simulator has an IO list. The CheckForLongIO loops over each logical CPU and checks the pending IO list of each logical scheduler.
When the IO is posted (async) we grab the current time and add the IO to the logical scheduler specific list. When the check is performed each IO in the list is checked with HasOverlappedIoCompleted. This is a WIN32 macro that checks the status stored in the OVERLAPPED structure of the IO to see if the system has reported the IO as complete. When FALSE is returned and the duration of the IO is > 15 seconds the condition is flagged as a ‘stalled’ IO as the OS and sub-system have not completed the specific IO.
The number of these outstanding IOs is accumulated and reported after all logical CPUs have been checked.
The following output shows 2 logical CPUs being used by the simulator.
| 03/10/07 | 15:35:58 | 3264 | System | CPUCount = 2, Affinity = 0, IOAffinity = 0, MaxMemory = 630 MB, StopOnError = Yes, TestCycles = 1, TestCycleDuration = 300 s, CacheHitRatio = 100.00% |
At 15:36:46 you have 5 IO requests that exceeded 15 seconds. One second later another 72 requests exceeded the 15 second mark. (The same IO is not reported again, once flagged as long it won’t be counted on secondary passes.)
What this does not show is how many IOs have completed. SQLIOSim could have completed 1000s of IOs during this 15 second window but some of the requests are being stalled.
| 03/10/07 | 15:36:45 | 3764 | Display Monitor | Error: 0x00000000 Error Text: Description: 5 IO requests are outstanding for more than 15 sec. |
| 03/10/07 | 15:36:46 | 3764 | Display Monitor | Error: 0x00000000 Error Text: Description: 72 IO requests are outstanding for more than 15 sec. |
The files are open with NO buffering so they should not involve system cache components. You should review filter drivers and the like. Likely something like elevator seek or something of this nature is at play when you see the outstanding IO warnings.
| | 03/10/07 | 15:35:58 | 3264 | System | NoBuffering = Yes, WriteThrough = Yes, UseScatterGather = Yes, ForceReadAhead = No, MaxOutstandingIO = 0, TargetIODuration = 100 ms | LOGICALCONFIG::PrintConfigurationToLog | e:\yukon\sosbranch\sql\ntdbms\storeng\util\sqliosim\logicalconfig.cpp | 228 |
| | 03/10/07 | 15:35:58 | 3264 | System | AllowIOBursts = Yes, DeleteFilesAtStartup = Yes, DeleteFilesAtShutdown = No, StampFiles = No | | | |
The following error text is reported for each, individual file no more than every 30 seconds, indicating the number of total IOs identified as taking longer than 15 seconds.
The following indicates that 826 IO requests for the file sqliosim.mdx exceeded 15 seconds to complete in the last 30 second processing window. The last IO that is currently STATUS_PENDING (256) is at the specified offset on the specified handle.
| | 03/10/07 | 15:37:16 | 3764 | Display Monitor | Error: 0x00000000 Error Text: Description: 826 long IO requests are outstanding for file h:\sqliosim.mdx. Handle 0x2bc. Offset of latest long IO 0x3cf80000, duration 16000 ms |
Support normally has to use the kernel debugger and !irpfind to locate the IRPs associated with the threads in SQLIOSim and see their IO stack location.
The way the output is sequenced can be puzzling until you understand the sequence. You will see the series of CheckForLongIO (# IO requests are outstanding for more than 15 sec.) and then the ReportLongIO message (# long IO requests are outstanding for file …).
The CheckForLongIO calls ReportLongIO but ReportLongIO has the 30 second reporting throttle. When the 30 seconds is exceeded the message is logged. This means that the next CheckForLongIO message (shown below with 887 IO requests) is included in the count of the ReportLongIO as the CheckForLongIO adds up all stalled IOs before reporting the message.
| | 03/10/07 | 15:37:26 | 3764 | Display Monitor | Error: 0x00000000 Error Text: Description: 29 IO requests are outstanding for more than 15 sec. | CLogicalCPUs::CheckForLongIO | e:\yukon\sosbranch\sql\ntdbms\storeng\util\sqliosim\logicalcpu.cpp | 1230 |
| | 03/10/07 | 15:37:27 | 3764 | Display Monitor | Error: 0x00000000 Error Text: Description: 3 IO requests are outstanding for more than 15 sec. | CLogicalCPUs::CheckForLongIO | e:\yukon\sosbranch\sql\ntdbms\storeng\util\sqliosim\logicalcpu.cpp | 1230 |
| | 03/10/07 | 15:37:59 | 3764 | Display Monitor | Error: 0x00000000 Error Text: Description: 1138 long IO requests are outstanding for file h:\sqliosim.mdx. Handle 0x2bc. Offset of latest long IO 0x85640000, duration 15375 ms | CLogicalFile::ReportLongIO | e:\yukon\sosbranch\sql\ntdbms\storeng\util\sqliosim\fileio.cpp | 811 |
| | 03/10/07 | 15:37:59 | 3764 | Display Monitor | Error: 0x00000000 Error Text: Description: 887 IO requests are outstanding for more than 15 sec. | | | |
JAN 2008 UPDATE - SQLIOSim Stalled I/O Warnings During Checkpoint Phase of Test
Allow bursts is used to disable I/O target throttling. This is in direct response to a stress scenario that has uncovered drivers that can blue screen when under large I/O depths and memory pressure.
DEFAULT: TRUE (Current decision was that SQLIOSim was designed to be a data consistency checker not performance test so it should allow burst activities.)
The current design is as follows.
· During initial file build and checkpoint, when TRUE, I/O throttling is disabled.
· During the actual test phase I/O throttling is always enabled, regardless of the allow burst setting.
· During checkpoint phase of the test, when TRUE, I/O throttling is disabled.
· Before the audit phase I/O throttling is enabled.
The design is to change the target I/O time for the entire set of logical files. The normal behavior is to use the configuration setting for the I/O target response time. Basically, we use it to determine how many OVERLAPPED structures are allowed to be used. When we are not meeting the target response we throttle back the number of I/O requests that can be used. When we are meeting the response rate we add I/O requests so we can post more I/O.
When Allow Bursts disables throttling is sets the target to MAX_DWORD. When throttling is enabled again it returns to the configured target.
This allows the checkpoint phase to do what the older SQLIOStress did and post 1000s of I/Os in rapid succession to push the I/O depth. This can result in the I/O warnings. What you are looking for is that machine to cleanly recover from the situation and continue testing.
The one side effect this behavior may have that affects common testing scenarios, during a cycle is that when disabled the number of I/O request structures could grow large. When throttling is re-enabled it may take a bit of time of the normal activities to come back into compliance with the target.
If I was able to get a depth of 1000 during the (allow bursts checkpoint activity) these request structures are present when I start the standard cycle activities. The target has returned to the configured value but nothing is preventing the initial I/O activity from getting a request. It won’t be until the first set of I/Os are completed during a cycle that the number of request structures is trimmed. So at the beginning of a cycle you might get more I/O depth than the end of the cycle.
I think the way the code it written you won’t see this very often. The normal cycle does a checkpoint and then an audit. The audit is limited by the configured I/O target so it should bring the number of requests back inline before the cycle starts all other user simulations.
Bob Dorr
SQL Server Senior Escalation Engineer