Hello my name is Bob Golding and I would like to share with you a new event that you may see in the system event log. Event ID 153 is an error associated with the storage subsystem. This event was new in Windows 8 and Windows Server 2012 and was added to Windows 7 and Windows Server 2008 R2 starting with hot fix KB2819485.
An event 153 is similar to an event 129. An event 129 is logged when the storport driver times out a request to the disk; I described event 129 messages in a previous article. The difference between a 153 and a 129 is that a 129 is logged when storport times out a request, a 153 is logged when the storport miniport driver times out a request. The miniport driver may also be referred to as an adapter driver or HBA driver, this driver is typically written the hardware vendor.
Because the miniport driver has a better knowledge of the request execution environment, some miniport drivers time the request themselves instead of letting storport handle request timing. This is because the miniport driver can abort the individual request and return an error rather than storport resetting the drive after a timeout. Resetting the drive is disruptive to the I/O subsystem and may not be necessary if only one request has timed out. The error returned from the miniport driver is bubbled up to the class driver who can log an event 153 and retry the request.
Below is an example event 153:
This error means that a request failed and was retried by the class driver. In the past no message would be logged in this situation because storport did not timeout the request. The lack of messages resulted in confusion when troubleshooting disk errors because timeouts would occur but there would be no evidence of the error.
The details section of the event the log record will present what error caused the retry and whether the request was a read or write. Below is the details output:
In the example above at byte offset 29 is the SCSI status, at offset 30 is the SRB status that caused the retry, and at offset 31 is the SCSI command that is being retried. In this case the SCSI status was 00 (SCSISTAT_GOOD), the SRB status was 09 (SRB_STATUS_TIMEOUT), and the command was 28 (SCSIOP_READ).
The most common SCSI commands are:
SCSIOP_READ - 0x28
SCSIOP_WRITE - 0x2A
The most common SRB statuses are below:
SRB_STATUS_TIMEOUT - 0x09
SRB_STATUS_BUS_RESET - 0x0E
SRB_STATUS_COMMAND_TIMEOUT - 0x0B
A complete list of SCSI operations and statuses can be found in scsi.h in the WDK. A list of SRB statuses can be found in srb.h.
The timeout errors (SRB_STATUS_TIMEOUT and SRB_STATUS_COMMAND_TIMEOUT) indicate a request timed out in the adapter. In other words a request was sent to the drive and there was no response within the timeout period. The bus reset error (SRB_STATUS_BUS_RESET) indicates that the device was reset and that the request is being retried due to the reset since all outstanding requests are aborted when a drive receives a reset.
A system administrator who encounters event 153 errors should investigate the health of the computer’s disk subsystem. Although an occasional timeout may be part of the normal operation of a system, the frequent need to retry requests indicates a performance issue with the storage that should be corrected.
I was working on a debug the other day when I ran the “!vm” command and saw that the system had some 48,000 commit requests that failed. This was strange as the system was not out of memory and the page file was not full. I was left scratching my head and thinking “I wish I knew where !vm got that information from.” So I went on a quest to find out, here is what I found.
13: kd> !vm 1
*** Virtual Memory Usage ***
Physical Memory: 12580300 ( 50321200 Kb)
Page File: \??\C:\pagefile.sys
Current: 50331648 Kb Free Space: 50306732 Kb
Minimum: 50331648 Kb Maximum: 50331648 Kb
Available Pages: 4606721 ( 18426884 Kb)
ResAvail Pages: 12189247 ( 48756988 Kb)
Locked IO Pages: 0 ( 0 Kb)
Free System PTEs: 33460257 ( 133841028 Kb)
Modified Pages: 20299 ( 81196 Kb)
Modified PF Pages: 6154 ( 24616 Kb)
NonPagedPool 0 Used: 19544 ( 78176 Kb)
NonPagedPool 1 Used: 22308 ( 89232 Kb)
NonPagedPool Usage: 53108 ( 212432 Kb)
NonPagedPool Max: 9408956 ( 37635824 Kb)
PagedPool 0 Usage: 168921 ( 675684 Kb)
PagedPool 1 Usage: 4149241 ( 16596964 Kb)
PagedPool 2 Usage: 17908 ( 71632 Kb)
PagedPool Usage: 4336070 ( 17344280 Kb)
PagedPool Maximum: 33554432 ( 134217728 Kb)
Session Commit: 3438 ( 13752 Kb)
Shared Commit: 6522 ( 26088 Kb)
Special Pool: 0 ( 0 Kb)
Shared Process: 53597 ( 214388 Kb)
PagedPool Commit: 4336140 ( 17344560 Kb)
Driver Commit: 5691 ( 22764 Kb)
Committed pages: 5565215 ( 22260860 Kb)
Commit limit: 25162749 ( 100650996 Kb)
********** 48440 commit requests have failed **********
It turns out that this calculation is from a global ULONG array named “nt!MiChargeCommitmentFailures”. The array has 4 members and they are used to trace the types of commit failures that have taken place. This is done by first calculating the new commit size NewCommitValue = CurrentCommitValue + SystemReservedMemory. Based on this calculation commit errors are tracked in a few different ways, which are listed below with the corresponding member in the array that is incremented.
MiChargeCommitmentFailures - If the system failed a commit request and an expansion of the pagefile has failed.
MiChargeCommitmentFailures - If the system failed a commit and we have already reached the maximum pagefile size.
MiChargeCommitmentFailures - If the system failed a commit while the pagefile lock is held.
MiChargeCommitmentFailures - If the system failed a commit and the NewCommitValue is less than or equal to CurrentCommitValue.
In order to calculate the count of failures, "!vm" adds up the values stored in each array member of the array. Members 0 and 1 are always counted, member 2 is counted if the OS version is Windows 2003/XP and member 3 is counted if the build version is newer than Windows 2003/XP.
Let's look at the array in the dump I was debugging:
13: kd> dc nt!MiChargeCommitmentFailures L4
fffff800`01e45ce0 00000000 0000bd38 00000000 00000000 ....8...........
Converting this to decimal we find the 48000+ commit failures I was seeing the in output of !VM.
13: kd> ?0000bd38
Evaluate expression: 48440 = 00000000`0000bd38
Since I now had my answer, “where does the number come from?”, I was left wanting to know a bit more about the overall flow of why a VirtualAlloc fails to commit.
When memory is allocated by VirtualAlloc the newly allocated memory is not committed to physical memory. Only when the memory is accessed by a read or write is it backed by physical memory.
When this newly allocated memory is accessed for the first time it will need to be backed by commit space. Under normal conditions this is a smooth process, however when the system hits what’s called the commit limit and can’t expand this limit we see commit failures.
So how is the commit limit calculated? Let’s say we have a system with 4GB of physical memory and a pagefile that is 6GB in size. To determine the commit limit we add physical memory and the pagefile size together - in this example the commit limit would be 10GB. Since memory manger will not let any user mode allocation consume every last morsel of commit space it keeps a small amount of the commit space for the system to avoid hangs. When the limit is reached the system tries to grow the page file. If there is no more room to grow the pagefile or the pagefile has reached its configured maximum size, the system will try and free some committed memory to make room for more requests. If expansion of the page file or the attempt to free memory do not allow the allocation to complete, the allocation fails and MiChargeCommitmentFailures is incremented.
To sum it all up, commit limit is RAM + pagefile, commit failures happen when we hit the commit limit and the system is unable to grow the pagefile because it is already at its max. It’s that simple, well almost.
For those that will want to know more about how memory manger works please see the post from Somak: The Memory Shell Game.