SQL Server troubleshooting thoughts from Graham Kent, (ex) Senior Support Escalation Engineer at Microsoft Sweden.
I was looking at a problem this morning about long blocking chains caused by the wait type DBMIRROR_DBM_EVENT, and it makes an interesting case study about database mirroring, blocking, undocumented wait types and general troubleshooting techniques for database mirroring. The following information relates ONLY to SQL 2005, SQL 2008 and SQL 2008 R2. Since the wait type is undocumented and unsupported, it is not guaranteed to be relevant to versions after SQL 2008 R2, and considering the enhancements coming in this area of functionality, it may well be completely different there.
First off, some thoughts about root cause analysis of database mirroring problems. We see quite a few of these, and often people do not have the appropriate data with which to make a full analysis. Typical examples are unexpected failovers, slow performance and sometimes blocking. Obviously the more data you have the easier the analysis, but historically the key pieces of data for looking at DBM problems are these:
The perfmon counters in the SQLServer:Database Mirroring and SQLServer:Service Broker objects.
Default profiler trace files (people often forget these but they have very useful DBM state change audit information in them)
SQL Server Error Logs
Windows event logs
backup of the msdb database (this contains useful metadata such as the size of every log backup, and the DBM monitor metadata if you've enabled it)
memory dumps (if any were produced)
Obviously if you've also proactively collected MPSReports or SQLDIAG as well, then this will be no bad thing. Back to my specific scenario.....
In this example the symptom was that a huge blocking chain built on a high volume OLTP database where all the head blockers were waiting on DBMIRROR_DBM_EVENT. Here is the sequence of events I went through:
1. Review the blocking chain itself - no help here, as all we can see is that we're waiting on DBMIRROR_DBM_EVENT
2. Review the source for the undocumented wait type. Obviously you can't do this outside of MS, but I can say that at the time of writing this wait type represents the wait used when the principal is waiting for the mirror to harden an LSN, meaning that the transaction it's part of cannot commit. This immediately points quite specifically to the problem that the principal cannot commit transactions as it's waiting on the mirror. Now we need to investigate why the mirror is not committing transactions or why the principal doesn't know whether it is.
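To confirm that this wait type really is what the head blockers are stuck on, the waiting-tasks DMV on the principal can be queried directly. This is a sketch, not something from the original case notes; it assumes SQL 2005 or later and VIEW SERVER STATE permission:

```sql
-- Sketch: list tasks currently waiting on DBMIRROR_DBM_EVENT,
-- longest waiters first, with the blocking session if any.
SELECT wt.session_id,
       wt.wait_duration_ms,
       wt.wait_type,
       wt.blocking_session_id,
       r.command,
       r.database_id
FROM sys.dm_os_waiting_tasks AS wt
LEFT JOIN sys.dm_exec_requests AS r
       ON r.session_id = wt.session_id
WHERE wt.wait_type = 'DBMIRROR_DBM_EVENT'
ORDER BY wt.wait_duration_ms DESC;
```

If this returns rows with large and growing wait_duration_ms values, the principal is stalled waiting on the mirror exactly as described above.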
3. Review the msdb system tables
(a) Look at the [backupset] table to see if the size of the logs produced at the time of the problem was significantly higher than normal. If they were exceptionally large, it may be that the mirror was flooded with transactions and simply could not keep up with the volume. This is why Books Online will sometimes tell you to disable mirroring if you need to do an exceptionally large logged operation such as an index rebuild (reference for why this is at http://technet.microsoft.com/en-us/library/cc917681.aspx). Here I used the following TSQL:
SELECT backup_set_id, backup_start_date, database_name, has_bulk_logged_data, backup_size / 1000
FROM [backupset]
WHERE backup_start_date BETWEEN '2011-01-05 14:00:00' AND '2011-01-05 19:30:00'
GO

SELECT ROUND((AVG(backup_size) / 1000), 0)
FROM [backupset]
WHERE database_name = 'mydatabase'
(b) Secondly I looked at the data in the [dbm_monitor_data] table. The key here is to locate the timeframe in which we had the problem and then see whether we were experiencing significant changes in any of the following:
log_flush_rate, send_queue_size, send_rate, redo_queue_size, redo_rate
These are all indicators, similar to part (a), that might show a component or piece of architecture that wasn't responding. For example, if the send queue suddenly starts to grow but the redo queue doesn't, then it would imply that the principal cannot send the log records to the mirror, so you'd want to look at connectivity, or at the service broker queues dealing with the actual transmissions.
In this particular scenario we noted that all the counters appeared to have strange values, in that there were log backups going on of normal sizes, but there were no status changes, 0 send queue, 0 redo queue, a flat send rate and a flat redo rate. This is very strange as it implies that the DBM Monitor could not record any values from anywhere over the problem period.
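The monitor history can be pulled for the problem window with a query along these lines. This is a sketch under the assumption that the DBM monitor was enabled; the column names are those of msdb.dbo.dbm_monitor_data, which map onto the counters named above (unsent_log is the send queue, unrestored_log the redo queue, recovery_rate the redo rate):

```sql
-- Sketch: DBM monitor history for the problem window.
-- Flat rates and permanently-zero queues across the window are the
-- anomaly described in the text.
SELECT d.name,
       m.local_time,
       m.role,
       m.mirroring_state,
       m.log_generation_rate,
       m.unsent_log,        -- send queue
       m.send_rate,
       m.unrestored_log,    -- redo queue
       m.recovery_rate      -- redo rate
FROM msdb.dbo.dbm_monitor_data AS m
JOIN sys.databases AS d
     ON d.database_id = m.database_id
WHERE m.local_time BETWEEN '2011-01-05 14:00:00' AND '2011-01-05 19:30:00'
ORDER BY m.local_time;
```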
4. Review the SQL Server error logs. In this case there were no errors or information messages whatsoever, but in other scenarios like this it's very common for errors in the 1400 range to be reported; you can find examples of these elsewhere in my other mirroring blogs, such as this Error 1413 example
5. Review the default trace files – in this scenario I was not provided the default traces; however, they are fantastic sources of DBM problem information, as they record state change events on all the partners. This is documented here:
Database Mirroring State Change Event Class
This often gives you a great picture of scenarios such as when network connectivity failed between one or all of the partners and then what the state of the partnership became afterwards.
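If you do have the default trace files (or access to the live server), the state change events can be read back out with fn_trace_gettable. This is a hedged sketch: it joins on the event name in sys.trace_events rather than hard-coding the event class ID, and assumes the default trace is enabled:

```sql
-- Sketch: extract Database Mirroring State Change events
-- from the default trace.
DECLARE @path NVARCHAR(260);
SELECT @path = [path]
FROM sys.traces
WHERE is_default = 1;

SELECT t.StartTime,
       t.DatabaseName,
       t.TextData          -- describes the old and new mirroring states
FROM fn_trace_gettable(@path, DEFAULT) AS t
JOIN sys.trace_events AS e
     ON e.trace_event_id = t.EventClass
WHERE e.name = 'Database Mirroring State Change'
ORDER BY t.StartTime;
```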
In this particular scenario I'm currently missing 2 key pieces of data, but that apart I can still make a reasonable hypothesis from the above information. We can certainly say that the blocking was caused by the fact that DBM was enabled, due to the blockers all waiting on the DBMIRROR_DBM_EVENT wait type. Since we know we didn't flood the mirror with a large logged operation, and that this deployment normally runs happily in this mode, we can exclude unusual large operations. This leaves 3 potential candidates at this stage:
1. Hardware problems on the connectivity between some or all of the partners.
2. CPU exhaustion on the mirror server – simply unable to keep up with redos – the CPU exhaustion could itself be from a process outside of SQL Server or outside of this mirror partnership.
3. A problem with the mirroring code itself (we’d really need some memory dumps to confirm this though).
Based upon experience I'd suspect 1 or 2, but I always keep an open mind about 3 as well. We're trying to collect some more data now to look at this problem in more detail.
As a final note, I noticed some similar scenarios on the forums, one of particular note where the poster said that everything worked fine until they enabled mirroring, at which point the system ground to a halt with this wait type. That is far more likely to be a non-optimal configuration, where some section of the architecture simply cannot keep up with the volume of requests required by the partnership, and is a perfect example of why one must benchmark mirror deployments very carefully. In that scenario you should review this KB first: http://support.microsoft.com/kb/2001270
We've recently been looking at a problem which triggered this error:
[Microsoft][ODBC SQL Server Driver][SQL Server]To connect to this server you must use SQL Server Management Studio or SQL Server Management Objects (SMO)
In our case it was actually being thrown by the SQL Server Web Data Administrator, which is available on CodePlex, but it could just as well be any application that you've written yourselves. The application in question was running on an old Windows 2003 server and had been working fine for a long time when connecting to old SQL 2000 instances. However, when they started trying to connect to later versions of SQL, they were getting the above error with the following stack:
at SqlAdmin.ISqlServer.Connect(Object ServerName, Object Login, Object Password)
at SqlAdmin.SqlServer.Connect()
at SqlWebAdmin.databases.Page_Load(Object sender, EventArgs e)
at System.Web.UI.Control.OnLoad(EventArgs e)
at System.Web.UI.Control.LoadRecursive()
at System.Web.UI.Page.ProcessRequestMain()
Since the source of this application is available, I did check it out, but excluded it as the problem fairly early on. We came back to focus on the SQLDMO.dll itself, and via some rigorous structured testing in the lab we managed to reproduce the exact error and provide an appropriate solution. (If you wonder what I mean by rigorous structured troubleshooting: I built a Windows 2003 virtual machine in a lab, applied updates to it sequentially, checkpointing every time something changed, and tested the various hypotheses I had at each checkpoint. When you're working with DLL versioning problems I find this to be a very effective, and in my opinion necessary, methodology.)
Anyway, the customer web server in question was running the following version of SQLDMO.DLL:
Module [C:\PROGRAM FILES\WINDOWS\SYSTEM32\SQLDMO.DLL]
  Company Name:      Microsoft Corporation
  File Description:  Distributed Management Objects OLE DLL for SQL Enterprise Manager
  Product Version:   (8.0:760.0)
  File Version:      (2000.80:760.0)
  File Size (bytes): 4215360
  File Date:         Tue Dec 17 19:25:22 2002
  Module TimeDateStamp  = 0x3dffbfbc - Wed Dec 18 02:22:20 2002
  Module Checksum       = 0x00411b8a
  Module SizeOfImage    = 0x00405000
  Module Pointer to PDB = [dll\sqldmo.pdb]
  Module PDB Signature  = 0x3dff938b
  Module PDB Age        = 0x9e
(I was running MPSReports during my tests to extract and save exact configurations). This is a build from SQL Server 2000 SP3a, a fairly recognisable version number for those of us who have been around SQL for a few years!
To cut a long story short, all SQL 2000 versions of this DLL are incompatible with higher major versions of SQL Server, and you need to patch this DLL. There are a number of articles about installing SQLDMO on the web, but I didn't find any of them to be 100% reliable for all circumstances, which is why I tested it so thoroughly. I wanted to be sure that I could patch this DLL, and only this DLL, without changing anything else on a production machine. (For example, it was hypothesized that installing SQL 2005 or higher would resolve the problem - this in fact turned out to be true, but we didn't want to do this. I also didn't want to be manually registering DLLs.)
We instead used the backward compatibility MSI from SQL 2005, which can be downloaded and run independently here:
Feature Pack for Microsoft SQL Server 2005 - November 2005
When running this we removed all the features apart from SQLDMO, and this ensured that we correctly upgraded SQLDMO.DLL (and its related files) in the SQL 2000 directory here:
C:\Program Files\Microsoft SQL Server\80\Tools\Binn
This upgrades SQLDMO.DLL to a version of this format
The 05 is the key identifier here. Everything will then work fine.
The main point here is that one can install and run a version of DMO which is compatible with all major versions of SQL higher than 2000, and that you can do it in a controlled and easy manner, without resorting to the manual registrations listed in older KB articles.
Hope this helps.