I was looking at a problem this morning about long blocking chains caused by the wait type DBMIRROR_DBM_EVENT, and it makes an interesting case study about database mirroring, blocking, undocumented wait types and general troubleshooting techniques for database mirroring. The information following relates ONLY to SQL 2005, SQL 2008 and SQL 2008 R2. However since the undocumented wait type is not supported, it is not guaranteed to be relevant to future versions higher than SQL 2008 R2 and considering the enhancements coming in this area of functionality, it may well be completely different.
First off some thoughts about looking at root cause analysis of database mirroring problems. We see quite a few of these and often people do not have the appropriate data with which is make a full analysis. Typical examples being unexpected failovers, slow performance and sometimes blocking. Obviously the more data you have the easier, but some of key parts for looking at DBM problems historically are this:
The perfmon counters in the SQLServer:Database Mirroring and SQLServer:Service Broker objects.
Default profiler trace files (people often forget these but they have very useful DBM state change audit information in them)
SQL Server Error Logs
windows event logs
backup of msdb database (this contains useful meta data such as the size of every log backup and the DBM monitor meta data if you've enabled it)
memory dumps (if any were produced)
Obviously if you've also proactively collected MPSReports or SQLDIAG as well, then this will be no bad thing. Back to my specific scenario.....
In this example the symptom was that a huge blocking chain built on a high volume OLTP database where all the head blockers were waiting on DBMIRROR_DBM_EVENT. Here is the sequence of events I went through:
1. Review the blocking chain itself - ho help here as all we can see is that we're waiting on DBMIRROR_DBM_EVENT
2. Review the source for the undocumented wait type. Obviously you can't do this outside of MS, but I can say that at the time of writing this wait type represents the wait used when the principal is waiting for the mirror to harden an LSN, meaning that the transaction it's part of cannot commit. This immediately points quite specifically to the problem that the principal cannot commit transactions as it's waiting on the mirror. Now we need to investigate why the mirror is not committing transactions or why the principal doesn't know whether it is.
3. Review the msdb system tables
(a) Look at the [backupset] table to see if the size of the logs produced at the time of the problem are significantly higher then normal. If they were exceptionally large it may be that the mirror was flooded with transactions and could simply not keep up with the volume. This is why books online will tell you sometimes to disable mirroring if you need to do an exceptionally large logged operation such as an index rebuild. (reference for why this is at http://technet.microsoft.com/en-us/library/cc917681.aspx). Here i used the following TSQL
SELECT backup_set_id,backup_start_date,database_name,has_bulk_logged_data,backup_size / 1000 FROM [backupset] where backup_start_date between '2011-01-05 14:00:00' and '2011-01-05 19:30:00' go
select round((AVG(backup_size)/1000),0) FROM [backupset] where database_name = 'mydatabase'
(b) secondly I looked at the data in the tables [dbm_monitor_data]. The key here is to locate the timeframe in which we had a problem and then see if we were significant experiencing changes in any of the following:
log_flush_rate send_queue_size send_rate redo_queue_size redo_rate
These are all indicators similar to part (a) in that they might show a component or piece of architecture that wasn't responding. For example if the send_queue suddenly starts to grow but the re_do queue doesn't grow, then it would imply that the the principal cannot send the log records to the mirror so you'd want to look at connectivity maybe, or the service broker queues dealing with the actual transmissions.
In this particular scenario we noted that all the counters appeared to have strange values, in that there were log backups going on of normal sizes, but there were no status changes, 0 send queue, 0 redo queue, a flat send rate and a flat redo rate. This is very strange as it implies that the DBM Monitor could not record any values from anywhere over the problem period.
4. Review the SQL Server error logs. In this case there were no errors or information messages whatsoever, but in other scenarios such as this, it’s very common for errors in the 1400 range to be reported, examples of which you can find in other places in my other mirroring blogs, such as this Error 1413 example
5. Review the default trace files – in this scenario I was not provided the default traces, however they are fantastic sources of DBM problem information, as they record state change events on all the partners.This is documented here:
Database Mirroring State Change Event Class
This often gives you a great picture of scenarios such as when network connectivity failed between one or all of the partners and then what the state of the partnership became afterwards.
In this particular scenario I’m currently missing 2 key points of data, but that apart I can still make a reasonable hypothesis on the above information. We certainly can say that the blocking was caused by the fact that DBM was enabled to the due the blockers all waiting on the DBMIRROR_DBM_EVENT wait type. Since we know we didn’t flood the mirror with a large logged operation and that this deployment normally runs happily in this mode, we can exclude unusual large operations. This means that we have 2 potential candidates at this stage:
1. Hardware problems on the connectivity between some or all of the partners.
2. CPU exhaustion on the mirror server – simply unable to keep up with redos – the CPU exhaustion could itself be from a process outside of SQL Server or outside of the this mirror partnership.
3. A problem with the mirroring code itself (we’d really need some memory dumps to confirm this though).
Based upon experience I’d suspect 1 or 2, but I always keep an open mind about 3 as well, we’re trying to collect some more data now to look at this problem in more detail.
As a final note though I noted some similar scenarios on the forums, and one particular one of note where the poster said that everything worked fine until they enabled mirroring, at which point the system ground to a halt with this wait type. This is far more likely to be a non-optimal configuration where some section of the architecture can simply not keep up with the volume of requests required by the partnership, and is a perfect example of why one must benchmark mirror deployments very carefully. In that scenario you should review this KB first : http://support.microsoft.com/kb/2001270