SQL Server troubleshooting thoughts from Graham Kent, (ex) Senior Support Escalation Engineer at Microsoft Sweden.
Today is my last day at Microsoft. I’m off to do something completely different, joining a small Scandinavian tech company called Basefarm.
You can continue to find me online here:
http://se.linkedin.com/in/globalgoat
www.globalgoat.com
Enjoy your SQL Servers ! It’s been a pleasure.
Graham
I was looking at a problem this morning about long blocking chains caused by the wait type DBMIRROR_DBM_EVENT, and it makes an interesting case study about database mirroring, blocking, undocumented wait types and general troubleshooting techniques for database mirroring. The information following relates ONLY to SQL 2005, SQL 2008 and SQL 2008 R2. However since the undocumented wait type is not supported, it is not guaranteed to be relevant to future versions higher than SQL 2008 R2 and considering the enhancements coming in this area of functionality, it may well be completely different.
First off some thoughts about looking at root cause analysis of database mirroring problems. We see quite a few of these and often people do not have the appropriate data with which is make a full analysis. Typical examples being unexpected failovers, slow performance and sometimes blocking. Obviously the more data you have the easier, but some of key parts for looking at DBM problems historically are this:
The perfmon counters in the SQLServer:Database Mirroring and SQLServer:Service Broker objects.
Default profiler trace files (people often forget these but they have very useful DBM state change audit information in them)
SQL Server Error Logs
windows event logs
backup of msdb database (this contains useful meta data such as the size of every log backup and the DBM monitor meta data if you've enabled it)
memory dumps (if any were produced)
Obviously if you've also proactively collected MPSReports or SQLDIAG as well, then this will be no bad thing. Back to my specific scenario.....
In this example the symptom was that a huge blocking chain built on a high volume OLTP database where all the head blockers were waiting on DBMIRROR_DBM_EVENT. Here is the sequence of events I went through:
1. Review the blocking chain itself - ho help here as all we can see is that we're waiting on DBMIRROR_DBM_EVENT
2. Review the source for the undocumented wait type. Obviously you can't do this outside of MS, but I can say that at the time of writing this wait type represents the wait used when the principal is waiting for the mirror to harden an LSN, meaning that the transaction it's part of cannot commit. This immediately points quite specifically to the problem that the principal cannot commit transactions as it's waiting on the mirror. Now we need to investigate why the mirror is not committing transactions or why the principal doesn't know whether it is.
3. Review the msdb system tables
(a) Look at the [backupset] table to see if the size of the logs produced at the time of the problem are significantly higher then normal. If they were exceptionally large it may be that the mirror was flooded with transactions and could simply not keep up with the volume. This is why books online will tell you sometimes to disable mirroring if you need to do an exceptionally large logged operation such as an index rebuild. (reference for why this is at http://technet.microsoft.com/en-us/library/cc917681.aspx). Here i used the following TSQL
SELECT backup_set_id,backup_start_date,database_name,has_bulk_logged_data,backup_size / 1000 FROM [backupset] where backup_start_date between '2011-01-05 14:00:00' and '2011-01-05 19:30:00' go
select round((AVG(backup_size)/1000),0) FROM [backupset] where database_name = 'mydatabase'
(b) secondly I looked at the data in the tables [dbm_monitor_data]. The key here is to locate the timeframe in which we had a problem and then see if we were significant experiencing changes in any of the following:
log_flush_rate send_queue_size send_rate redo_queue_size redo_rate
These are all indicators similar to part (a) in that they might show a component or piece of architecture that wasn't responding. For example if the send_queue suddenly starts to grow but the re_do queue doesn't grow, then it would imply that the the principal cannot send the log records to the mirror so you'd want to look at connectivity maybe, or the service broker queues dealing with the actual transmissions.
In this particular scenario we noted that all the counters appeared to have strange values, in that there were log backups going on of normal sizes, but there were no status changes, 0 send queue, 0 redo queue, a flat send rate and a flat redo rate. This is very strange as it implies that the DBM Monitor could not record any values from anywhere over the problem period.
4. Review the SQL Server error logs. In this case there were no errors or information messages whatsoever, but in other scenarios such as this, it’s very common for errors in the 1400 range to be reported, examples of which you can find in other places in my other mirroring blogs, such as this Error 1413 example
5. Review the default trace files – in this scenario I was not provided the default traces, however they are fantastic sources of DBM problem information, as they record state change events on all the partners.This is documented here:
Database Mirroring State Change Event Class
This often gives you a great picture of scenarios such as when network connectivity failed between one or all of the partners and then what the state of the partnership became afterwards.
CONCLUSIONS:
In this particular scenario I’m currently missing 2 key points of data, but that apart I can still make a reasonable hypothesis on the above information. We certainly can say that the blocking was caused by the fact that DBM was enabled to the due the blockers all waiting on the DBMIRROR_DBM_EVENT wait type. Since we know we didn’t flood the mirror with a large logged operation and that this deployment normally runs happily in this mode, we can exclude unusual large operations. This means that we have 2 potential candidates at this stage:
1. Hardware problems on the connectivity between some or all of the partners.
2. CPU exhaustion on the mirror server – simply unable to keep up with redos – the CPU exhaustion could itself be from a process outside of SQL Server or outside of the this mirror partnership.
3. A problem with the mirroring code itself (we’d really need some memory dumps to confirm this though).
Based upon experience I’d suspect 1 or 2, but I always keep an open mind about 3 as well, we’re trying to collect some more data now to look at this problem in more detail.
As a final note though I noted some similar scenarios on the forums, and one particular one of note where the poster said that everything worked fine until they enabled mirroring, at which point the system ground to a halt with this wait type. This is far more likely to be a non-optimal configuration where some section of the architecture can simply not keep up with the volume of requests required by the partnership, and is a perfect example of why one must benchmark mirror deployments very carefully. In that scenario you should review this KB first : http://support.microsoft.com/kb/2001270
We've recently been looking at a problem which triggered this error:
[Microsoft][ODBC SQL Server Driver][SQL Server]To connect to this server you must use SQL Server Management Studio or SQL Server Management Objects (SMO)
In our case it was actually being thrown by the SQL Server Web Data Administrator which is available on codeplex, but it could just as well be any application that you've written yourselves. In this case the application in question was running on an old Windows 2003 server and had been working fine for a long time, when connecting to old SQL 2000 instances. However when they started trying to connect to later versions of SQL they were getting the above error within the following stack:
at SqlAdmin.ISqlServer.Connect(Object ServerName, Object Login, Object Password) at SqlAdmin.SqlServer.Connect() at SqlWebAdmin.databases.Page_Load(Object sender, EventArgs e) at System.Web.UI.Control.OnLoad(EventArgs e) at System.Web.UI.Control.LoadRecursive() at System.Web.UI.Page.ProcessRequestMain()
Being as you can check out the source of this application I did check it out, but excluded this as a problem fairly early on. We came back to focus on the SQLDMO.dll itself and via some rigorous structured testing in the lab we managed to reproduce the exact error and provide an appropriate solution. (If you wonder what I mean by rigorous structured troubleshooting, I mean that I build a windows 2003 virtual machine in a lab and then applied updates to it sequentially checkpointing everytime something changed, and testing various hypotheses that I had on each checkpoint. When you're working with DLL versioning problems I find this to be a very effective - and in my opinion necessary methodology).
Anyway the customer web server in question was runing the following version of SQLDMO.DLL on the web server
Module[945] [C:\PROGRAM FILES\WINDOWS\SYSTEM32\SQLDMO.DLL] Company Name: Microsoft Corporation File Description: Distributed Management Objects OLE DLL for SQL Enterprise Manager Product Version: (8.0:760.0) File Version: (2000.80:760.0) File Size (bytes): 4215360 File Date: ti joulu 17 19:25:22 2002 Module TimeDateStamp = 0x3dffbfbc - Wed Dec 18 02:22:20 2002 Module Checksum = 0x00411b8a Module SizeOfImage = 0x00405000 Module Pointer to PDB = [dll\sqldmo.pdb] Module PDB Signature = 0x3dff938b Module PDB Age = 0x9e
(I was running MPSReports during my tests to extract and save exact configurations). This is a build from SQL Server 2000 SP3a, a fairly recognisable version number for those of us who have been around SQL for a few years!
To cut a long story short all SQL 2000 versions of this DLL are incompatible with higher major versions of SQL Server, and you need to patch this DLL. There are a number of articles about installing SQLDMO on the web, but I didn;t find any of them to be particularly 100% reliable for all circumstances, which is why I tested it so thoroughly. I wanted to be sure that I could patch this DLL and only this DLL without changing anything else on a production machine. (For example it was hypothesized that installing SQL2005 or higher would resolve the problem - this in fact turned out to be true, but we didn;t want to do this. I also didn't want to be manually registering DLLs.)
We instead used the backward compatibility MSI from SQL 2005, which can be downloaded and run indepently here:
Feature Pack for Microsoft SQL Server 2005 - November 2005
When running this we removed all the features apart from SQLDMO, and this ensured that we correctly upgraded SQLDMO.DLL (and its related objects such as SQLDMO.DLL) in the SQL 2000 directory here
C:\Program Files\Microsoft SQL Server\80\Tools\Binn
This upgrades SQLDMO.DLL to a version of this format
8.05.2312
The 05 being the key identifier here. Everything will then work fine.
The main point being here that one can install and run a version of DMO which will be compatible with all major versions of SQL higher than 2000, and you can do it in a controlled and easy manner, without resorting to the manual registrations listed in older KBs like these
http://support.microsoft.com/kb/326613http://support.microsoft.com/kb/248241http://msdn.microsoft.com/en-us/library/aa274817(v=sql.80).aspx
Hope this helps.
This might sounds like a strange requirement, but it can be converted to do a number of different things. The scenario is that for reason I won't discuss here, I want to track any user / application which issues an SQL batch which contains invalid syntax. The value here is that this would be extremely difficult to do on this system with tools like profiler due to the large volumes going through the server. This XEVENT will happily sit and monitor all batches for the error codes 102, 447 and 448 which are various derivatives of invalid syntax or invalid language usage. The output is written to a binary file which can then be queried using standard TSQL / XQuery. Alternatively if you don't like the scripting part of this you could look to use the Extended Events SSMS Add in from code plex.
http://extendedeventmanager.codeplex.com/
and here is the piece of code
-- XEvent Session looking for errors 447,448 and 102-- log examples to files so can be correlated back to netmon traces
if exists(select * from sys.server_event_sessions where name='SharePoint_Errors') DROP EVENT SESSION SharePoint_Errors ON SERVERgo
CREATE EVENT SESSION SharePoint_Errors ON SERVERADD EVENT sqlserver.error_reported (ACTION (sqlserver.session_id,sqlserver.tsql_stack, sqlserver.sql_text,sqlserver.client_hostname,sqlserver.client_pid)WHERE error = 102OR error = 447OR error = 448)ADD TARGET package0.asynchronous_file_target(SET filename='c:\temp\SharePoint_Errors.xet', metadatafile='c:\temp\SharePoint_Errors.xem')WITH (MAX_MEMORY = 4096KB, EVENT_RETENTION_MODE = ALLOW_MULTIPLE_EVENT_LOSS, MAX_DISPATCH_LATENCY = 1 SECONDS, MAX_EVENT_SIZE = 0KB, MEMORY_PARTITION_MODE = NONE, TRACK_CAUSALITY = OFF, STARTUP_STATE = OFF)
go
ALTER EVENT SESSION SharePoint_Errors ON SERVERstate=startgo
--stop the session if necessary--ALTER EVENT SESSION SharePoint_Errors ON SERVER--state=stop
-- Cleanup the session --drop EVENT SESSION SharePoint_Errors ON SERVER
********and here is the sample script to query the output files********
--get values from XET binary files
--straight select, view the XML manuallyselect CAST(event_data as XML) , * from sys.fn_xe_file_target_read_file('C:\temp\SharePoint_Errors_0_129291097895820000.xet', 'C:\temp\SharePoint_Errors_0_129291097896520000.xem', null, null)
--details
SELECT --*xdata.value('(/event/data[3]/@name)[1]', 'varchar(100)') AS [name],xdata.value('(/event/action[@name="sql_text"]/value)[1]', 'varchar(max)') AS [name]
--insert other XML nodes from the output as required here--you can then add condition checks on their values as wellFROM ( select CAST(event_data as XML) from sys.fn_xe_file_target_read_file ( 'C:\temp\SharePoint_Errors_0_129291097895820000.xet', 'C:\temp\SharePoint_Errors_0_129291097896520000.xem', null, null ) ) as xmlr(xdata)
********* remember to change the file unique identifiers to those of your output files*********
It's very rare that I tell people to blindy apply service packs or hotfixes (especially windows ones) without first analyzing where their performance problem lies, but if you are running SQL Server 2008 on Windows Server 2008 R2 (specifically this OS only) then you should absolutely ensure that you've installed the following fix:
976700 An application stops responding, experiences low performance, or experiences high privileged CPU usage if many large I/O operations are performed in Windows 7 or Windows Server 2008 R2http://support.microsoft.com/default.aspx?scid=kb;EN-US;976700
In this scenario, the key symptom which you normally ought to notice is that you are getting high kernel times, which can you see in the followiing perfmon counter:
Process(sqlservr)\% Privileged Time
We've seen a trend of this issue in recent months, so it's something to look to apply in your next maintenance window if you can.
Following the success of my previous post about recruitment, we're advertising another position from a different support group also located here in Stockholm. You can read the details here
http://blogs.technet.com/b/isablog/archive/2010/09/15/forefront-tmg-uag-help-wanted-at-microsoft-in-stockholm.aspx
....and you might notice the common language to my previous post :-)
Please don't contact me about this one though, ensure you email the contact listed in the new article.
Update 23th August
This position has now been filled - thanks for everything who submitted their CVs.
The Microsoft SQL Server support team in Stockholm is looking to hire a contractor to help out over the next year or so, to cover some staff absences. If you’d be interested to come and work with us and learn more about SQL Server internals than you probably would anywhere else in the world (apart form Redmond maybe), then this might be the contract for you.
Although you’ll get some training and access to innumerable amounts of internal only documentation, we’re looking for someone very experienced who can hit the ground running and could be expected to work the majority of our easier support cases without too much assistance. Therefore you’d need a minimum of 5 years experience (preferably more) working with SQL Server day to day in an administrative capacity (probably as a production DBA). Specialities of particular interest to us would be:
Setup / configuration / migration
Administration of HA solutions (clusters, DBM, replication etc)
Performance
Crashes, Hangs and Unexpected behaviour
Beyond that if you have experience of other Microsoft server products (Windows, SharePoint, BizTalk, SCCM, Navision), windbg, c# or c++, it won’t do you any harm.
If you think you’d be interested, then please send a brief summary (it’ll come out in plain text so keep it simple) of your experience via the “email blog author” section (to reduce my spam levels!). If we’re interested then we’ll get back to you with proper email details so you can send in a full CV and talk to our agency. You need to be resident in Sweden (Swedish language or other Nordic languages would be of use, but are not essential) and available to work full time in Kista / Akalla from some point around August.
Here’s a script to play with which is useful for simple management tasks. I wrote it to help someone with a TokenAndPermUserStore cache issue (http://support.microsoft.com/kb/927396/) which are relatively uncommon nowadays. I actually sent it to someone today which is why it came to mind, but it occurred to me that the template of the script itself is quite useful for any such TSQL SQL Agent based task, i.e. it does these tasks:
Runs a DMV query
Checks results of query and if appropriate takes an action
Writes what it did to the SQL error log as an info message
I use the theory for many different tasks. Feel free to use and alter it as you will, subject to the usual legal caveats of course!
Should you try to configure an SSIS package which uses the Transfer Objects Task, and then execute this package through SQL Agent, when you choose to run the task as a proxy credential, as opposed to the normal SQL Agent user, you may encounter a failure. The typical full text of the error message from the SQL Agent history is shown here.
Executed as user: GRAHAMK20\SSIS_Proxy. Microsoft (R) SQL Server Execute Package Utility Version 10.0.2531.0 for 64-bit Copyright (C) Microsoft Corp 1984-2005. All rights reserved. Started: 4:55:19 PM Error: 2010-02-04 16:55:19.65 Code: 0xC0016016 Source: Description: Failed to decrypt protected XML node "PassWord" with error 0x80070002 "The system cannot find the file specified.". You may not be authorized to access this information. This error occurs when there is a cryptographic error. Verify that the correct key is available. End Error Error: 2010-02-04 16:55:19.67 Code: 0xC0016016 Source: Description: Failed to decrypt protected XML node "PassWord" with error 0x80070002 "The system cannot find the file specified.". You may not be authorized to access this information. This error occurs when there is a cryptographic error. Verify that the correct key is available. End Error Error: 2010-02-04 16:55:20.00 Code: 0xC002F325 Source: Transfer SQL Server Objects Task Transfer SQL Server Objects Task Description: Execution failed with the following error: "Access to the path 'C:\Users\Default\AppData\Local\Microsoft\SQL Server\Smo' is denied.". End Error DTExec: The package execution returned DTSER_FAILURE (1). Started: 4:55:19 PM Finished: 4:55:20 PM Elapsed: 0.562 seconds. The package execution failed. The step failed.
Depending on how you view this message, you could decide it as being a number of different causes, but the actual important point to note is the red highlighted section. This is the lowest level error and the actual cause of the failure, the other errors are thrown further up the stack. You can confirm this by running Process Monitor on the server where the package is executing. You’ll see failures here
- C:\Users\Default\AppData\Local\Microsoft\SQL Server (create file) - C:\Users\Default\AppData\Local\Microsoft\SQL Server\Smo (create file)
However if you run the same package, but in the context of the SQL Agent account, you will notice that the same create file runs on a different directory:
- C:\Users\<sqlagentuserid>\AppData\Local\Microsoft\SQL Server\Smo
The package requires a temporary location to process some files when it prepared to run and there is a bug in the way the proxy account credential decides where it should place these files. This has been passed to the product group and should be fixed in SQL 11.
The reason that it is not being fixed immediately is that there is a relatively simple workaround which is to manually create the directory structure in the default user profile directory, and grant write modify permissions to the user or group which represents the proxy credential account. So in my specific example I logged on with an adminstrator account to the server, created the following directory
C:\Users\Default\AppData\Local\Microsoft\SQL Server\Smo
and then granted the users group (of which my proxy credential account is a member of) the following permissions:
Modify, Write, Read & Execute
The package will then execute successfully under the proxy credential. This problem occurs on both SQL 2005 and SQL 2008.
(An alternate title for this post could be “How to debug SSMS managed code”)
In the first 2 posts of this topic we used SQL Profiler and some common sense to diagnose a problem within SSMS. Then we used DebugDiag to collect a memory dump when a specific exception was thrown. Now in the 3rd part we’ll analyze that dump used windbg and public symbols from the Microsoft symbols server.
This post assumes a certain familiarity with managed debugging using windbg. If you need some background on this topic, a good place to start is by reading the blog of Tess.
To understand the context of this debugging, you should really read the first 2 posts in the series, but as a brief reminder, we collect a dump of the SSMS process (ssms.exe in SQL 2008 and sqlwb.exe in SQL 2005) when a specific condition occurs – in this case an exception thrown when trying to view SQL Agent job history. The dump below is from a sqlwb.exe 2005 client, but the theory is identical. The main point which this post demonstrates is that unlike when debugging the database engine process of SQL Server (sqlservr.exe), using purely public symbols , interesting information can be gleaned from the dump because most of the interesting structures are in managed code which can be interpretted with the windbg extension sos.dll.
My base setup for this debugging:
A blank hyper-V machine running Windows 7 Ultimate, connected to the internet
A default installation of debugging tools for windows downloaded from here:
http://www.microsoft.com/whdc/devtools/debugging/installx86.Mspx
A symbol path set exactly as per the example “getting started” page here:
http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx
i.e. the symbol path used is:
SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
So here’s the debugging output with notes:
* First off we load the public sos.dll so we can do managed debugging
0:018> .load c:\windows\microsoft.net\framework\v2.0.50727\sos.dll
* We’ve opened directly to the thread where the exception occurred
* let’s look at the CLRSTACK
0:018> !clrstack
OS Thread Id: 0x6ac (18)
ESP EIP
0822ef3c 77e4bef7 [HelperMethodFrame: 0822ef3c]
0822efe0 799d766a System.DateTime.TimeToTicks(Int32, Int32, Int32)
0822eff0 2ec8e92c Microsoft.SqlServer.Management.Smo.PostProcessCreateDateTime.GetDateTime(System.Object, System.Object)
0822f010 2ec91584 Microsoft.SqlServer.Management.Smo.PostProcessJobActivity.GetColumnData(System.String, System.Object, Microsoft.SqlServer.Management.Smo.DataProvider)
0822f064 2ec86ced Microsoft.SqlServer.Management.Smo.DataProvider.ManipulateRowDataPostProcess()
0822f07c 2ec872b9 Microsoft.SqlServer.Management.Smo.DataProvider.GetTable()
0822f08c 2ec84436 Microsoft.SqlServer.Management.Smo.ExecuteSql.ExecuteWithResults(System.Collections.Specialized.StringCollection, System.Object, Microsoft.SqlServer.Management.Smo.StatementBuilder)
0822f098 2ec804db Microsoft.SqlServer.Management.Smo.SqlObjectBase.FillData(Microsoft.SqlServer.Management.Smo.ResultType, System.Collections.Specialized.StringCollection, System.Object, Microsoft.SqlServer.Management.Smo.StatementBuilder)
0822f0b8 2ec803c4 Microsoft.SqlServer.Management.Smo.SqlObjectBase.FillDataWithUseFailure(Microsoft.SqlServer.Management.Smo.SqlEnumResult, Microsoft.SqlServer.Management.Smo.ResultType)
0822f0f8 2ec80339 Microsoft.SqlServer.Management.Smo.SqlObjectBase.BuildResult(Microsoft.SqlServer.Management.Smo.EnumResult)
0822f138 2ec7fea3 Microsoft.SqlServer.Management.Smo.SqlObjectBase.GetData(Microsoft.SqlServer.Management.Smo.EnumResult)
0822f148 2eb5daf8 Microsoft.SqlServer.Management.Smo.Environment.GetData()
0822f178 2eb5da5c Microsoft.SqlServer.Management.Smo.Environment.GetData(Microsoft.SqlServer.Management.Smo.Request, System.Object)
0822f18c 2eb5caef Microsoft.SqlServer.Management.Smo.Enumerator.GetData(System.Object, Microsoft.SqlServer.Management.Smo.Request)
0822f1b0 2eb5ccd6 Microsoft.SqlServer.Management.Smo.Enumerator.Process(System.Object, Microsoft.SqlServer.Management.Smo.Request)
0822f1e4 30983b38 Microsoft.SqlServer.Management.SqlManagerUI.JobFetcher.FetchJobs(System.String)
0822f200 3098349a Microsoft.SqlServer.Management.SqlManagerUI.JobFetcher.FetchJobs(Microsoft.SqlServer.Management.SqlManagerUI.JobActivityFilter)
0822f214 3098230e Microsoft.SqlServer.Management.SqlManagerUI.JobsPanel.RefreshJobs()
0822f240 3097f028 Microsoft.SqlServer.Management.SqlManagerUI.JobActivityMonitor.Microsoft.SqlServer.Management.UI.ActivityMonitor.IActivityMonitor.RefreshSnapshot()
0822f244 053f08f7 Microsoft.SqlServer.Management.UI.ActivityMonitor.ActivityMonitorUserControl.WorkerEntryPoint(System.Object)
0822f280 792c9e4f System.Threading._ThreadPoolWaitCallback.WaitCallback_Context(System.Object)
0822f288 792f5681 System.Threading.ExecutionContext.runTryCode(System.Object)
0822f6b8 79e71b4c [HelperMethodFrame_PROTECTOBJ: 0822f6b8] System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode, CleanupCode, System.Object)
0822f720 792f5577 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0822f73c 792e01c5 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0822f754 792ca3b3 System.Threading._ThreadPoolWaitCallback.PerformWaitCallbackInternal(System.Threading._ThreadPoolWaitCallback)
0822f768 792ca249 System.Threading._ThreadPoolWaitCallback.PerformWaitCallback(System.Object)
0822f8f8 79e71b4c [GCFrame: 0822f8f8]
* this shows us some fairly expected output that we have a stack which indicates we are using SMO to populate some data to do with SQL Agent jobs (Microsoft.SqlServer.Management.SqlManagerUI.JobFetcher.FetchJobs)
* let’s look at the exception to check it’s what we wanted
* it shows the details of the exception which SSMS reported (Hour, Minute, and Second parameters describe an un-representable DateTime)
0:018> !pe
Exception object: 016588b4
Exception type: System.ArgumentOutOfRangeException
Message: Hour, Minute, and Second parameters describe an un-representable DateTime.
InnerException: <none>
StackTrace (generated):
<none>
StackTraceString: <none>
HResult: 80131502
* now we want to look at what these functions were actually doing
* dump all objects on the stack to look back through recent history (most recent is at the top)
* scanning down here to look for anything of interest
* we see that we have a System.DBNull sitting next to a System.String named NextRunDate – this might be of interest
* in addition to that we can see the server connection object which will contain the details of the connection and also the actual TSQL executed
0:018> !dso
ESP/REG Object Name
0822eed0 016588b4 System.ArgumentOutOfRangeException
0822ef1c 016588b4 System.ArgumentOutOfRangeException
0822ef60 016588b4 System.ArgumentOutOfRangeException
0822efbc 0156b85c System.String ArgumentOutOfRange_BadHourMinuteSecond
0822efc4 016588b4 System.ArgumentOutOfRangeException
0822efe0 016588b4 System.ArgumentOutOfRangeException
0822eff8 015cd46c Microsoft.SqlServer.Management.Smo.PostProcessJobActivity
0822effc 0165889c System.Int32
0822f00c 016588a8 System.Int32
0822f010 01658880 System.Data.RBTree`1+RBTreeEnumerator[[System.Data.DataRow, System.Data]]
0822f014 01603e84 System.Data.DataRow
0822f018 015cc064 System.String NextRunDate
0822f01c 015cd46c Microsoft.SqlServer.Management.Smo.PostProcessJobActivity
0822f044 015d34f0 System.Data.DataRow
0822f048 015cd46c Microsoft.SqlServer.Management.Smo.PostProcessJobActivity
0822f04c 015cdf84 Microsoft.SqlServer.Management.Smo.DataProvider
0822f05c 015cdf84 Microsoft.SqlServer.Management.Smo.DataProvider
0822f060 011ee95c System.DBNull
0822f064 015cc064 System.String NextRunDate
0822f068 01597a6c Microsoft.SqlServer.Management.SqlManagerUI.JobFetcher
0822f06c 015cdf84 Microsoft.SqlServer.Management.Smo.DataProvider
0822f094 015ccd18 Microsoft.SqlServer.Management.Smo.StatementBuilder
0822f0b0 01594dbc Microsoft.SqlServer.Management.Common.ServerConnection
0822f0c0 015cd964 System.Collections.Specialized.StringCollection
0822f0c4 015cca78 Microsoft.SqlServer.Management.Smo.SqlEnumResult
0822f0c8 015c8a00 Microsoft.SqlServer.Management.Smo.SqlObject
0822f0e0 01597a6c Microsoft.SqlServer.Management.SqlManagerUI.JobFetcher
0822f124 01597a6c Microsoft.SqlServer.Management.SqlManagerUI.JobFetcher
0822f148 015cc7bc System.Collections.ArrayList+ArrayListEnumeratorSimple
0822f14c 015cca78 Microsoft.SqlServer.Management.Smo.SqlEnumResult
0822f164 01597a6c Microsoft.SqlServer.Management.SqlManagerUI.JobFetcher
0822f168 015c462c Microsoft.SqlServer.Management.Smo.Environment
0822f188 01594dbc Microsoft.SqlServer.Management.Common.ServerConnection
0822f194 01594dbc Microsoft.SqlServer.Management.Common.ServerConnection
0822f19c 01597a6c Microsoft.SqlServer.Management.SqlManagerUI.JobFetcher
0822f1cc 01597a6c Microsoft.SqlServer.Management.SqlManagerUI.JobFetcher
0822f1e0 015c4590 Microsoft.SqlServer.Management.Smo.Request
0822f1e4 0150cb08 System.String /Job
0822f1e8 011d1198 System.String
0822f1ec 0158101c Microsoft.SqlServer.Management.SqlManagerUI.JobActivityFilter
0822f1f0 01597a6c Microsoft.SqlServer.Management.SqlManagerUI.JobFetcher
0822f204 01591e78 Microsoft.SqlServer.Management.SqlManagerUI.JobsPanel
0822f214 01591e78 Microsoft.SqlServer.Management.SqlManagerUI.JobsPanel
0822f21c 01217510 System.Runtime.CompilerServices.RuntimeHelpers+CleanupCode
0822f224 014f9c50 System.Threading.ExecutionContext
0822f250 015a5174 Microsoft.SqlServer.Management.UI.ActivityMonitor.ActivityMonitorUserControl+BackgroundRefreshJobDescription
0822f254 01581054 Microsoft.SqlServer.Management.UI.ActivityMonitor.ActivityMonitorUserControl
0822f344 012174f0 System.Runtime.CompilerServices.RuntimeHelpers+TryCode
0822f410 01217510 System.Runtime.CompilerServices.RuntimeHelpers+CleanupCode
0822f4b8 012174f0 System.Runtime.CompilerServices.RuntimeHelpers+TryCode
0822f638 01217510 System.Runtime.CompilerServices.RuntimeHelpers+CleanupCode
0822f6f4 012174f0 System.Runtime.CompilerServices.RuntimeHelpers+TryCode
0822f6f8 01217510 System.Runtime.CompilerServices.RuntimeHelpers+CleanupCode
0822f6fc 015a561c System.Threading.ExecutionContext+ExecutionContextRunData
0822f724 011f4ecc System.Threading.ContextCallback
0822f72c 014f9c50 System.Threading.ExecutionContext
0822f738 015a51cc System.Threading._ThreadPoolWaitCallback
0822f750 015a51cc System.Threading._ThreadPoolWaitCallback
* dump the server connection object – this is always a useful object to be able to locate and look at, because within it you can see all manner of properties about the connection and the batch executing
0:018> !do 01594dbc
Name: Microsoft.SqlServer.Management.Common.ServerConnection
MethodTable: 2e2bda28
EEClass: 2e283d48
Size: 144(0x90) bytes
(C:\WINDOWS\assembly\GAC_MSIL\Microsoft.SqlServer.ConnectionInfo\9.0.242.0__89845dcd8080cc91\Microsoft.SqlServer.ConnectionInfo.dll)
Fields:
MT Field Offset Type VT Attr Value Name
793044cc 4000036 40 System.Boolean 1 instance 1 m_BlockUpdates
79330a00 4000037 4 System.String 0 instance 01594eac m_ServerInstance
79330a00 4000038 8 System.String 0 instance 011d1198 m_Login
7995d420 4000039 c ...rity.SecureString 0 instance 00000000 m_Password
793044cc 400003a 41 System.Boolean 1 instance 1 m_LoginSecure
79330a00 400003b 10 System.String 0 instance 011d1198 m_ConnectAsUserName
79330a00 400003c 14 System.String 0 instance 011d1198 m_ConnectAsUserPassword
793044cc 400003d 42 System.Boolean 1 instance 0 m_ConnectAsUser
793044cc 400003e 43 System.Boolean 1 instance 1 m_NonPooledConnection
79332c4c 400003f 28 System.Int32 1 instance 0 m_PooledConnectionLifetime
79332c4c 4000040 2c System.Int32 1 instance 0 m_MinPoolSize
79332c4c 4000041 30 System.Int32 1 instance 100 m_MaxPoolSize
79332c4c 4000042 34 System.Int32 1 instance 15 m_ConnectTimeout
2e2bc5b4 4000043 38 System.Int32 1 instance 8 m_NetworkProtocol
79330a00 4000044 18 System.String 0 instance 01594e4c m_ApplicationName
79330a00 4000045 1c System.String 0 instance 011d1198 m_WorkstationId
79330a00 4000046 20 System.String 0 instance 011d1198 m_DatabaseName
79332c4c 4000047 3c System.Int32 1 instance 4096 m_PacketSize
7995d420 4000048 24 ...rity.SecureString 0 instance 00000000 m_ConnectionString
793044cc 4000049 44 System.Boolean 1 instance 0 m_MultipleActiveResultSets
793044cc 400004a 45 System.Boolean 1 instance 0 shouldEncryptConnection
6523d7a0 400004b 48 ...ent.SqlConnection 0 instance 01594f10 m_SqlConnectionObject
793044cc 400004c 46 System.Boolean 1 instance 1 m_InUse
79332c4c 400004d 68 System.Int32 1 instance 0 m_LoginFailedClients
2e2bc50c 400004e 4c ...ssageEventHandler 0 instance 00000000 m_RemoteLoginFailedHandler
656629a8 400004f 50 ...ssageEventHandler 0 instance 00000000 m_SqlInfoMessageHandler
2e2bd89c 4000050 54 ...ommon.CapturedSql 0 instance 01594ee0 m_CapuredSQL
2e2bc8d4 4000051 6c System.Int32 1 instance 1 m_AutoDisconnectMode
793044cc 4000052 47 System.Boolean 1 instance 0 bIsUserConnected
2e2be0b0 4000053 58 ...mon.ServerVersion 0 instance 015c172c m_serverVersion
79332c4c 4000054 70 System.Int32 1 instance 10 lockTimeout
2e2bc50c 4000055 5c ...ssageEventHandler 0 instance 00000000 ServerMessageInternal
2e2bc480 4000056 60 ...ementEventHandler 0 instance 00000000 StatementExecuted
2e2bc50c 4000057 64 ...ssageEventHandler 0 instance 00000000 RemoteLoginFailedInternal
79332c4c 400005b 80 System.Int32 1 instance 0 m_StatementTimeout
79330a00 400005c 74 System.String 0 instance 01428dcc m_BatchSeparator
79332c4c 400005d 84 System.Int32 1 instance 0 m_TransactionDepth
2e2bc6f4 400005e 88 System.Int32 1 instance 1 m_ExecutionMode
6523db4c 400005f 78 ...Client.SqlCommand 0 instance 01594f48 m_SqlCommand
79330a00 4000060 7c System.String 0 instance 015c1a70 m_TrueName
*dump the SqlCommand object
0:018> !do 01594f48
Name: System.Data.SqlClient.SqlCommand
MethodTable: 6523db4c
EEClass: 6515c250
Size: 132(0x84) bytes
(C:\WINDOWS\assembly\GAC_32\System.Data\2.0.0.0__b77a5c561934e089\System.Data.dll)
7933061c 400018a 4 System.Object 0 instance 00000000 __identity
7a5d2a58 40008c3 8 ...ponentModel.ISite 0 instance 00000000 site
7a5ecd2c 40008c4 c ....EventHandlerList 0 instance 00000000 events
7933061c 40008c2 108 System.Object 0 static 01201278 EventDisposed
79332c4c 40016e3 58 System.Int32 1 instance 58 ObjectID
79330a00 40016e4 10 System.String 0 instance 014e3850 _commandText
6523e698 40016e5 5c System.Int32 1 instance 0 _commandType
79332c4c 40016e6 60 System.Int32 1 instance 0 _commandTimeout
6565989c 40016e7 64 System.Int32 1 instance 3 _updatedRowSource
793044cc 40016e8 78 System.Boolean 1 instance 0 _designTimeInvisible
6567ecdc 40016e9 14 ...ent.SqlDependency 0 instance 00000000 _sqlDep
793044cc 40016ea 79 System.Boolean 1 instance 0 _inPrepare
79332c4c 40016eb 68 System.Int32 1 instance -1 _prepareHandle
793044cc 40016ec 7a System.Boolean 1 instance 0 _hiddenPrepare
6523e8b4 40016ed 18 ...rameterCollection 0 instance 00000000 _parameters
6523d7a0 40016ee 1c ...ent.SqlConnection 0 instance 01594f10 _activeConnection
793044cc 40016ef 7b System.Boolean 1 instance 0 _dirty
656616a4 40016f0 6c System.Int32 1 instance 0 _execType
793041d0 40016f1 20 System.Object[] 0 instance 00000000 _rpcArrayOf1
6524233c 40016f2 24 ...t._SqlMetaDataSet 0 instance 015f9100 _cachedMetaData
65241e94 40016f3 28 ...+CachedAsyncState 0 instance 015c19fc _cachedAsyncState
79332c4c 40016f4 70 System.Int32 1 instance -1 _rowsAffected
6567d4b8 40016f5 2c ...tificationRequest 0 instance 00000000 _notification
793044cc 40016f6 7c System.Boolean 1 instance 1 _notificationAutoEnlist
6567da8c 40016f7 30 ...nt.SqlTransaction 0 instance 00000000 _transaction
65659328 40016f8 34 ...letedEventHandler 0 instance 00000000 _statementCompletedEventHandler
65241814 40016f9 38 ...ParserStateObject 0 instance 00000000 _stateObj
793044cc 40016fa 7d System.Boolean 1 instance 0 _pendingCancel
793044cc 40016fb 7e System.Boolean 1 instance 0 _batchRPCMode
00000000 40016fc 3c 0 instance 00000000 _RPCList
793041d0 40016fd 40 System.Object[] 0 instance 00000000 _SqlRPCBatchArray
00000000 40016fe 44 0 instance 00000000 _parameterCollectionList
79332c4c 40016ff 74 System.Int32 1 instance 0 _currentlyExecutingBatch
656732b0 4001700 48 ...miRequestExecutor 0 instance 00000000 _smiRequest
65672df8 4001701 4c ...Server.SmiContext 0 instance 00000000 _smiRequestContext
6567e144 4001702 50 ...+CommandEventSink 0 instance 00000000 _smiEventSink
65672f4c 4001703 54 ...DeferedProcessing 0 instance 00000000 _outParamEventSink
79332c4c 40016e2 878 System.Int32 1 static 59 _objectTypeCount
793041d0 4001704 78c System.Object[] 0 static 01427968 PreKatmaiProcParamsNames
793041d0 4001705 790 System.Object[] 0 static 014279b4 KatmaiProcParamsNames
* dump the command text
0:018> !do 014e3850
Name: System.String
MethodTable: 79330a00
EEClass: 790ed64c
Size: 68(0x44) bytes
(C:\WINDOWS\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String: exec msdb.dbo.sp_help_job
79332c4c 4000096 4 System.Int32 1 instance 26 m_arrayLength
79332c4c 4000097 8 System.Int32 1 instance 25 m_stringLength
793316e0 4000098 c System.Char 1 instance 65 m_firstChar
79330a00 4000099 10 System.String 0 shared static Empty
>> Domain:Value 001831e0:011d1198 <<
79331630 400009a 14 System.Char[] 0 shared static WhitespaceChars
>> Domain:Value 001831e0:011d17d4 <<
* and we have now managed to get the actual TSQL batch executed by SSMS which caused the exception to happen
* this would be enough information to then leave the debugging and go and investigate this from another direction.
* this is effectively what we did in post of the series, by tracing the command through SQL Profiler
So we've seen that we've managed to get the code which is executing, and we've also noted that there is a System.DBNull on the stack next to the NextRunDate object. The point here really though is that because we're dealing with the managed code within SSMS alot more is exposed to us using a memory dump. In this case you could continue the debugging to look at the actual data table which is returned by the stored procedure, and then read through it see which row of the data table contains the Null, which then leads you right back to the source data corruption.
The thing that should be really occurring to you now though is that you can get all others sorts of information out , from this simple example, or from any dump which you might trigger of SSMS, if you ever get unexpected behavior.
For example, imagine the problem you had captured a dump for was more complex involving connections to multiple servers, you can review some of the other structures in the Connection object.
* dump the connection object again to remind us of its contents
* let’s assume we wanted to check and confirm what server we were actually connecting to
* and what version it was running
0:018> * dump the m_serverVersion object
0:018> !do 015c172c
Name: Microsoft.SqlServer.Management.Common.ServerVersion
MethodTable: 2e2be0b0
EEClass: 2e28425c
Size: 20(0x14) bytes
79332c4c 400008c 4 System.Int32 1 instance 9 m_nMajor
79332c4c 400008d 8 System.Int32 1 instance 0 m_nMinor
79332c4c 400008e c System.Int32 1 instance 4262 m_nBuildNumber
* now dump the m_ServerInstance object
0:018> !do 01594eac
Size: 50(0x32) bytes
String: grahamk20\gk2005
79332c4c 4000096 4 System.Int32 1 instance 17 m_arrayLength
79332c4c 4000097 8 System.Int32 1 instance 16 m_stringLength
793316e0 4000098 c System.Char 1 instance 67 m_firstChar
* and you can see the name of the server I ran this command against
and of course you can follow this theory on and on depending on the actual problem you are troubleshooting. I hope this provides some insights into some of things you can do when looking at SSMS problems. It’s not something I deal with that often as mostly I focus on the databas engine, but sometimes its interesting to do some managed debugging as opposed to native, and of course its always easier to demonstrate something when you can see real values and references with public symbols alone.
On Monday we released Cumulative update package 5 for SQL Server 2008 Service Pack 1, under KB975977. Contained in this was a very interesting fix to a long running issue with regards to the option recompile syntax and its changed behaviour in SQL Server 2008. This fix has its own KB article at KB976603.
In SQL Server 2008 RTM we introduced behaviour that allowed potentially better plans to be created when using the OPTION RECOMPILE syntax. The way this worked in principal was that if the OPTION RECOMPILE syntax was used, SQL could assume that the plan for the statement in question would not need to re-used, since it was always going to recompile. Therefore the query optimizer could safely use the actual parameter values passed to the statement when choosing a plan, as opposed to just using a parameterised template. In certain circumstances this can lead to dramatic performance gains, especially when you have parameter values which vary wildly and need different plans dependent upon the values passed.
The well known MVP Erland Sommarskog demonstrates these techniques (along with many others) in his excellent article
Dynamic Search Conditions in T-SQL
However currently this article has been edited to remove reference to this technique (but I believe that Erland will soon add it back). We were forced to withdraw this functionality as part of Cumulative update package 4 for SQL Server 2008. The reason for this was that on rare occasions batches using this syntax could return incorrect results if multiple connections were running the same batch simultaneously, due to the fact that the query optimizer associated such plans with the specific thread upon which they were compiled, whereas the actual execution of the batch could occur subsequently on a different thread.
Therefore if you run SQL Server 2008 RTM CU#4 or above, or SQL Server 2008 SP1 between release and CU#4, the OPTION RECOMPILE syntax works as it did in SQL Server 2005.
However this latest update releases a full and thorough fix for the original problem. It has re-instated the enhanced functionality of parameter embedding optimization, and resolves the incorrect results problem seen earlier. The reason that this fix took a while to come out in this form was down to the fact that the changes required in the query optimizer were relatively complex. Initially it was not viable to release them quickly due to the amount of testing required, so this was why we were forced to just withdraw the functionality, because the potential for incorrect results is one of the most serious types of problem SQL Server can encounter.
In summary, here is a matrix showing the different builds and functionality:
SQL 2005 – no parameter embedding optimization
SQL 2008 RTM to SQL 2008 CU#3 – new functionality , but potential for incorrect results
SQL 2008 CU#4 onwards - no parameter embedding optimization
SQL 2008 SP1 RTM to SQL 2008 SP1 CU#4 - no parameter embedding optimization
SQL 2008 SP1 CU#5 onwards – new functionality enabled again, no possibility of incorrect results
Remember that RTM and SP1 are separate branches, so even though the fix is now out in the SP1 branch, it is not in the latest cumulative update on the RTM branch, and it is unlikely to ever be put in there.
This is the second in a series of 3 posts which discuss troubleshooting SQL Server Management Studio when it encounters errors or unexpected behaviour.
In this post we collect a memory dump of SSMS based upon specific .NET exception codes, for subsequent analysis.
As a reminder, here are the symptoms of the problem we are looking at:
1. Unable to view Job Activity in the SQL Server Agent Job Activity Monitor
2. Progress bar shows “Cannot display activity data”
3. Click on more details throws a message box with the error
Failed to retrieve data for this request. (Microsoft.SqlServer.Management.Sdk.Sfc)
with the additional details
Hour, Minute, and Second parameters describe an un-representable DateTime. (mscorlib)
It actually looks like this (click to enlarge)
4. The error repros with all versions of SSMS (2005 and 2008)
5. The error repros for all users whatever workstation they connect from.
The debugging approach
Initially I was thinking that since I had a .NET error, I could run adplus and use a config file to configure the debugger to capture the .NET errors. This would work and I’ve done it before, but then Tess reminded me that I could use DebugDiag and it would probably be easier and quicker. So I tried this:
Firstly I installed the DebugDiag tool on the machine where I was repro’ing the problem. I started it up and created a new rule, the properties of which are shown below. Being as I didn’t know exactly what was going on at this stage, I just elected to log everything to a file to begin with, rather than creating dumps at first, so I could get a better idea of what was really happening in the process. I expected that I was getting some .NET errors based upon the message box shown, but if I just created a dump on all .NET exceptions, I might miss something else that came before it. The rule looked like this:
1. Type = Crash
2. Process = ssms.exe (i was using SQL 2008 tools, if I’d been using SQL 2005 tools the process name would be sqlwb.exe)
3. Action type for unconfigured first chance exceptions = Log Stack Trace
4. Exceptions - CLR (.NET) Exception - All Exception Types = Log Stack Trace
5. Action Limits = 0 (since I’m only logging to file at this stage)
Here’s what it looks like in the GUI (click to enlarge)
I saved the rule, enabled it straight away, and then repro’d the problem again. I then disabled the rule and browsed for the log file which in my case was held here:
C:\Program Files (x86)\DebugDiag\Logs\
it was called:
Ssms__PID__2912__Date__10_19_2009__Time_03_35_46PM__898__Log.txt
Review of this file showed that actually 3 different exceptions had occurred in this order:
[10/19/2009 3:35:55 PM] First chance exception - 0xe0434f4d caused by thread with system id 5824 [10/19/2009 3:35:55 PM] CLR Exception Type - 'System.ArgumentOutOfRangeException'
then
[10/19/2009 3:35:55 PM] First chance exception - 0xe0434f4d caused by thread with system id 5824 [10/19/2009 3:35:55 PM] CLR Exception Type - 'Microsoft.SqlServer.Management.Sdk.Sfc.EnumeratorException'
then finally
It may be in this case that all these errors are related, however the theory of only logging information first proved valuable, as I want to create a dump of when the very first error occurs. Therefore I now know that I can edit my rule to tell DebugDiag to create a full user dump, when it encounters a CLR Exception Type - 'System.ArgumentOutOfRangeException', and that it should only do the dump once. The theory of logging first is also backed up by the fact that repro’ing this problem on a x86 machine throws a different chain of .NET exceptions (although the first one is still 'System.ArgumentOutOfRangeException')
That said, I’m now ready to capture a full user dump of the ssms.exe process, based upon the above rule, so I can debug it properly, and see the true root of the problem. I can edit my rule created earlier, and change the Exceptions dialogue box to add an additional Exception sub-rule, which looks like this (click to enlarge)
and has the properties
1. Exceptions - CLR (.NET) Exception - .NET Exception Types = System.ArgumentOutOfRangeException
2. Action Type = Full Userdump
3. Action Limit = 1
I add this exception rule, click OK a couple of times and re-enable the rule. Then run the repro again, and I notice the SSMS dialogue hang for a couple of seconds whilst it opens. This is because DebugDiag is creating the memory dump.
I disable the rule again, and then open the log file.The output is similar to before, except that after the first exception is logged, and after the stack trace of this exception, I note like the following:
[10/20/2009 3:14:08 PM] Created dump file C:\Program Files (x86)\DebugDiag\Logs\Crash rule for all instances of Ssms.exe\Ssms__PID__2912__Date__10_20_2009__Time_03_14_02PM__806__First Chance System.ArgumentOutOfRangeException.dmp [10/20/2009 3:14:08 PM] Action limit of 1 reached for Exception 0xE0434F4D:SYSTEM.ARGUMENTOUTOFRANGEEXCEPTION.
This confirms that I have successfully created a dump, and I have created it at the point when the first exception was encountered. This technique ought to work for any CLR Exception that SSMS can encounter. As I mentioned above it would also be just as acceptable to use adplus with a config file to create the dump.
in Part 3, it’s time to analyze that dump.
This is the first in a series of 3 posts which discuss troubleshooting SQL Server Management Studio when it encounters errors or unexpected behaviour.
In this post we look at using profiler, in the 2nd and 3rd we debug SSMS using public Microsoft debugging tools.
Here are the symptoms of the problem we are looking at:
It actually looks like this
These last 2 points are particularly relevant. Whenever you are looking at an SSMS problem, you always need to be asking these 2 questions. Straight at the off, you have potentially then eliminated whole areas of research (such as corruptions of .NET framework on specific workstations, incompatibilities in processor architecture specific versions of the framework, and localising whether the problem is with the SSMS client or the server instance being managed, or a combination of both)
The initial approach – using SQL Profiler and some common sense
When I initially came to this problem, the customer was experiencing the errors above. Normally when I get any errors or problems in SSMS, rather than reaching for debugger, I reach for SQL Profiler first (having asked the above 2 questions about where and when the issue repros), and this is a good rule of thumb. At the end of the day SSMS is just a GUI which issues TSQL commands to achieve functions in SQL Server. Whenever I get unexpected results in SSMS, the first thing I therefore do is to run a profiler trace against the server which SSMS is connected to. This means that you can see the TSQL which is issued, and you can sometimes find the problem straight off, if for example it issues some piece of bad code which obviously causes your error.
In this case if you run the SQL profiler GUI tool (you can simply point it at the appropriate server instance and use the default options), you’ll note that the SSMS issues the following SQL batches.
SET LOCK_TIMEOUT 10000
select SERVERPROPERTY(N'servername')
declare @RegPathParams sysname
declare @Arg sysname
declare @Param sysname
declare @MasterPath nvarchar(512)
declare @LogPath nvarchar(512)
declare @ErrorLogPath nvarchar(512)
declare @n int
select @n=0
select @RegPathParams=N'Software\Microsoft\MSSQLServer\MSSQLServer'+'\Parameters'
select @Param='dummy'
while(not @Param is null)
begin
select @Param=null
select @Arg='SqlArg'+convert(nvarchar,@n)
exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', @RegPathParams, @Arg, @Param OUTPUT
if(@Param like '-d%')
select @Param=substring(@Param, 3, 255)
select @MasterPath=substring(@Param, 1, len(@Param) - charindex('\', reverse(@Param)))
end
else if(@Param like '-l%')
select @LogPath=substring(@Param, 1, len(@Param) - charindex('\', reverse(@Param)))
else if(@Param like '-e%')
select @ErrorLogPath=substring(@Param, 1, len(@Param) - charindex('\', reverse(@Param)))
select @n=@n+1
declare @SmoRoot nvarchar(512)
exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'SOFTWARE\Microsoft\MSSQLServer\Setup', N'SQLPath', @SmoRoot OUTPUT
SELECT
CAST(case when 'a' <> 'A' then 1 else 0 end AS bit) AS [IsCaseSensitive],
CAST(FULLTEXTSERVICEPROPERTY('IsFullTextInstalled') AS bit) AS [IsFullTextInstalled],
@LogPath AS [MasterDBLogPath],
@MasterPath AS [MasterDBPath],
@ErrorLogPath AS [ErrorLogPath],
@@MAX_PRECISION AS [MaxPrecision],
@SmoRoot AS [RootDirectory],
SERVERPROPERTY(N'ProductVersion') AS [VersionString],
CAST(SERVERPROPERTY(N'MachineName') AS sysname) AS [NetName],
CAST(SERVERPROPERTY(N'Edition') AS sysname) AS [Edition],
CAST(SERVERPROPERTY(N'ProductLevel') AS sysname) AS [ProductLevel],
CAST(SERVERPROPERTY('IsClustered') AS bit) AS [IsClustered],
CAST(SERVERPROPERTY('IsSingleUser') AS bit) AS [IsSingleUser],
CAST(SERVERPROPERTY('EngineEdition') AS int) AS [EngineEdition],
SERVERPROPERTY(N'ResourceVersion') AS [ResourceVersionString],
SERVERPROPERTY(N'ResourceLastUpdateDateTime') AS [ResourceLastUpdateDateTime],
SERVERPROPERTY(N'BuildClrVersion') AS [BuildClrVersionString],
SERVERPROPERTY(N'ComputerNamePhysicalNetBIOS') AS [ComputerNamePhysicalNetBIOS],
SERVERPROPERTY(N'CollationID') AS [CollationID],
SERVERPROPERTY(N'ComparisonStyle') AS [ComparisonStyle],
SERVERPROPERTY(N'SqlCharSet') AS [SqlCharSet],
SERVERPROPERTY(N'SqlCharSetName') AS [SqlCharSetName],
SERVERPROPERTY(N'SqlSortOrder') AS [SqlSortOrder],
SERVERPROPERTY(N'SqlSortOrderName') AS [SqlSortOrderName],
convert(sysname, serverproperty(N'collation')) AS [Collation]
exec sp_executesql N'SELECT
dtb.collation_name AS [Collation],
dtb.name AS [DatabaseName2]
FROM
master.sys.databases AS dtb
WHERE
(dtb.name=@_msparam_0)',N'@_msparam_0 nvarchar(4000)',@_msparam_0=N'msdb' Microsoft SQL Server Management Studio grahamk EUROPE\grahamk 0 2 0 0 2912 52 2009-10-19 14:28:13.143 2009-10-19 14:28:13.143 0X00000000040000001A00730070005F006500780065006300750074006500730071006C002E01000082001A00E7206E0076006100720063006800610072002800310033003100290006010000530045004C004500430054000A006400740062002E006300
create table #tmp_sp_help_category
(category_id int null, category_type tinyint null, name nvarchar(128) null)
insert into #tmp_sp_help_category exec msdb.dbo.sp_help_category
sv.name AS [Name],
CAST(sv.enabled AS bit) AS [IsEnabled],
tshc.name AS [Category],
sv.category_id AS [CategoryID],
tshc.category_type AS [CategoryType],
null AS [CurrentRunStatus],
null AS [CurrentRunStep],
null AS [HasSchedule],
null AS [HasStep],
null AS [HasServer],
null AS [LastRunDate],
null AS [NextRunDate],
null AS [LastRunOutcome],
CAST(sv.job_id AS nvarchar(100)) AS [job_id]
msdb.dbo.sysjobs_view AS sv
INNER JOIN #tmp_sp_help_category AS tshc ON sv.category_id = tshc.category_id
drop table #tmp_sp_help_category
and finally:
exec msdb.dbo.sp_help_job
These batches don’t appear to return any errors or warnings to profiler. The logical next step was to take a look at what this code does. Once you read through the various batches, you notice 2 types of batch. The first 5 batches are just required to get certain properties about the server instance and to populate properties of the tree view in SSMS Object Explorer. Whenever you profile SSMS you see lots of batches like this, which tend to precede the “real” work.
The you see the final single batch, which executes msdb.dbo.sp_help_job. This is the real work that is required to populate the dialogue box that SSMS is opening. Since these batches are all just queries, and nothing is being updated, we can easily re-run them in SSMS query windows to see if anything else is thrown up, or see if weird results are returned that might explain our strange date time error.
We tried this and it returned a wide record set of several hundred rows, which on initial glance was all well and good. At this stage, the pragmatists amongst you may feel it’s appropriate to just scan through this data and see if there are any glaring inconsistencies which can be noticed which might cause a DateTime related error. The truth is that there was such a piece of data, if you could be bothered to read all the data manually, but it was only a hypothesis. The stored procedure in question returns some columns that are related to dates and times in a true DateTime representation. However it also returns some columns as text representations of date parts. In this record set, these columns should always be populated with values, but in this case there were a number of rows which showed the value “0” in these columns. We therefore made the guess that this data was the problem. We removed this data and tried the repro again, and everything worked ok.
We were lucky in this situation that the record set to review was only a few hundreds rows, and that the data corruption was obvious. On the basis that neither of these situations is always true, part 2 will show how to continue to troubleshoot this with debugging tools, to categorically prove what the problem was.
I noticed that we have posted a really good summary of the above points through the new fast publish system for KBs. (Note the disclaimer at the bottom which is not normally shown on full KBs). There are some really good points here to consider about setting up and configuring a mirroring environment.
Things to consider when setting up database mirroring in SQL Server.
In addition to this of course, you should remember my previous points made on this blog about Virtual Log File fragmentation which I recently saw in a production environment again.
1413 error when starting Database Mirroring – How many virtual log files is too many?
Slow recovery times and slow performance due to large numbers of Virtual Log Files
This is the follow up (part 2) to the following post:
http://blogs.msdn.com/grahamk/archive/2009/09/23/realistic-troubleshooting-example-of-extended-events-xevent-usage-in-sql-server-2008-part-1.aspx
In the original post we configured an extended event session and issued a very simple query to review the results as XML. Here are some more intelligent XQuery examples. I’m documenting these purely because I hate XQuery so much, and it takes so damn long to do (well it does for me anyway). Please feel free to post in the comments section if you think XQuery is the best damn query language ever invented :-)
Hopefully the code comment explain everything, I have attempted to use very descriptive alias names as well
-- get information from XET binary files representing extended event session date
-- as per part 1 blog post:
-- straight select, view the XML manually in an SSMS window by clicking on the XML results column
select CAST(event_data as XML) , *
from sys.fn_xe_file_target_read_file
(
'C:\temp\ArithmeticErrors_0_128981071730370000.xet',
'C:\temp\ArithmeticErrors_0_128981071730390000.xem',
null,
null
)
-- Get only specific nodes
-- 2 different methods shown of the xquery, you can specify the name of the node
-- or you can specify its location in the structure
-- this means that you return every row for every event captured in the file
-- the results are then converted to SQL data types
xdata.value('(/event/data[4]/value)[1]', 'varchar(100)') AS [error],
xdata.value('(/event/data[@name="severity"]/value)[1]', 'int') AS [severity],
xdata.value('(/event/action[@name="sql_text"]/value)[1]', 'varchar(max)') AS [sql_text]
select CAST(event_data as XML)
) as xmlr(xdata)
-- using the above technique, you can now wrap the entire query and give it an alias
-- and then perform more useful T-SQL language queries on the results
-- here is an example of an aggregate
convert(varchar(100), outer_results.[error]) as 'error',
convert(int, outer_results.[severity]) as 'severity',
convert(varchar(max), outer_results.[sql_text]) as 'sql_text',
count(*) as [count_occurances]
) as outer_results
group by error, severity,sql_text
-- finally using the same technique as above , here we have simple predicates as well
-- add a predicate into the outer query
-- here is the extra predicate we have added
where outer_results.severity < 17
I fully accept that this may not be the most reasonable way, or the most efficient way to do this, but it’s the way that I’ll be using for now.
If you are trying to create extended stored procedures in SQL Server 2008, you will probably have read the BOL article here:
Creating Extended Stored Procedures
Hopefully you will have noticed that this feature is deprecated, and so firstly you’ll only be using this functionality for legacy requirements. That said you may have noticed that you will require the file opends60.lib to be able to create your extended stored procedure DLL. You may have noticed that you cannot find this after installing SQL Server 2008 (even if you have installed the SDK).
The default location for the SDK directory in SQL Server 2008 is here:
C:\Program Files\Microsoft SQL Server\100\SDK\
However it does not contain the opends60.lib. If you need to acquire a copy of this file, currently you need to obtain it from an alternate source, namely a previous version of SQL Server SDK or from an install of a previous version of visual studio. For example the x64 version of the file can be found in the SQL Server 2005 SDK directory here:
C:\Program Files\Microsoft SQL Server\90\SDK\Lib\x64
You should be able to use one of these other versions instead.
Here’s a script I was using last week. I wrote this so that it could be used in a custom PSSDIAG / SQLDIAG task, but you can also run it in other automated methods as well.
To include it in SQLDIAG / PSSDIAG I added a custom group section into the XML config file (see previous posts on this blog if you don’t know how to do this), and saved the file in the root of the SQLDIAG directory as exec_userstore1.sql.
<CustomGroup name="userstore" enabled="true" />
<CustomTask enabled="true" groupname="userstore" taskname="detailed_scripts" type="TSQL_Script" point="Startup" wait="No" cmd="exec_userstore1.sql"/>
Here’s the script itself:
/*
detailed USERSTORE_TOKENPERM analysis for SQL 2005
script designed to be packaged into PSSDIAG and run over long time frame to monitor
but can be run individually at a fixed point to capture a snapshot
can be used in association with script number 2 which is verbose and does a snapshot
of all logins and token in cache, so can be HUGE.
graham kent - microsoft css - 15th September 2009
*/
SET NOCOUNT ON
SET QUOTED_IDENTIFIER ON
GO
-- get the token type distribution
WHILE 1=1
BEGIN
select 'start time:', GETDATE();
convert(varchar(50),[name]) as 'SOS StoreName',
convert(varchar(50),[TokenName]) as 'TokenName',
convert(varchar(50),[Class]) as 'Class',
convert(varchar(50),[SubClass]) as 'SubClass',
count(*) as [Num Entries]
(SELECT name,
x.value('(//@name)[1]', 'varchar (100)') AS [TokenName],
x.value('(//@class)[1]', 'varchar (100)') AS [Class],
x.value('(//@subclass)[1]', 'varchar (100)') AS [SubClass]
(SELECT CAST (entry_data as xml),name
FROM sys.dm_os_memory_cache_entries
WHERE type = 'USERSTORE_TOKENPERM')
AS R(x,name)
) a
group by a.name,a.TokenName,a.Class,a.SubClass
-- loop on 1 minute basis
WAITFOR DELAY '00:01:00'
END
When run in this format you’ll get a text file created in the output directory used by SQLDIAG which will contain the results from the script. You can then monitor the individual items in the cache over long periods of time to see if you’re encountering any of the known issues around this security cache.
I guess everyone has their own set of DMV scripts to hand when working with SQL Server nowadays. I thought I’d post a few of mine. Pretty much as the title says: comment in whichever clauses suit you and then run the query. Returns a result set with an XML column that can be clicked on if you want, to view the plan itself in SSMS. Works with 2005 and 2008.
select
bucketid,
a.plan_handle,
refcounts,
usecounts,
execution_count,
size_in_bytes,
cacheobjtype,
objtype,
text,
query_plan,
creation_time,
last_execution_time,
total_elapsed_time,
last_elapsed_time
from sys.dm_exec_cached_plans a
inner join sys.dm_exec_query_stats b on a.plan_handle=b.plan_handle
cross apply sys.dm_exec_sql_text(b.sql_handle) as sql_text
cross apply sys.dm_exec_query_plan(a.plan_handle) as query_plan
where 1=1
and text like '%mybadproc%'
-- and a.plan_handle = 0x06000B00C96DEC2AB8A16D06000000000000000000000000
and b.last_execution_time between '2009-09-24 09:00' and '2009-09-25 17:00'
order by last_execution_time desc
Here’s an example of how to collect and consume extended events in SQL Server 2008. It’s something that we in CSS are starting to use more and more frequently with the uptake of SQL 2008, and I believe it’s something that everyone who troubleshoots SQL can gain benefit from.
Here’s a relatively simple example which is based upon real world experiences of looking at this bug in Connect, which I was looking at recently. The main thing I want to show here is how to review the results of collection yourself, as obviously the setup of the actual event collection is already described on the connect site.
The scenario is relatively simple, and a typical one for extended event usage. You have a process which produces an arithmetic overflow at an unknown point. In the example of this bug, it’s within the SSIS package managing data upload for the performance data warehouse. It could be any large process application or tool though, which would prove more difficult to trace with something like profiler or any of the other troubleshooting tools.
From the connect site, a colleague in the product group has asked the reporting customer to configure and run the following event session:
-- XEvent Session Definition Arithmetic error, Log additional info
-- Sample Query: select convert(bigint,21568194 * 100)
if exists(select * from sys.server_event_sessions where name='ArithmeticError')
DROP EVENT SESSION ArithmeticError ON SERVER
CREATE EVENT SESSION ArithmeticError ON SERVER
ADD EVENT sqlos.exception_ring_buffer_recorded (
ACTION (sqlserver.tsql_stack, sqlserver.sql_text)
WHERE (error = 8115 ))
ADD TARGET package0.ring_buffer,
ADD TARGET package0.asynchronous_file_target(
SET filename='c:\temp\ArithmeticErrors.xet', metadatafile='c:\temp\ArithmeticErrors.xem')
WITH (MAX_MEMORY = 4096KB,
EVENT_RETENTION_MODE = ALLOW_MULTIPLE_EVENT_LOSS,
MAX_DISPATCH_LATENCY = 1 SECONDS,
MAX_EVENT_SIZE = 0KB,
MEMORY_PARTITION_MODE = NONE,
TRACK_CAUSALITY = OFF,
STARTUP_STATE = OFF)
ALTER EVENT SESSION ArithmeticError ON SERVER
state=start
-- Cleanup the session
-- drop EVENT SESSION ArithmeticError ON SERVER
You can run this code as is on a test server to try out the demo. You’ll need to ensure that you have a c:\temp directory for the results to output to a file. There are several different ways that you can consume or record the output from an extended event, but here we are writing to a file as said files can then easily be uploaded to us here at Microsoft. You can read about some of the other alternatives in the Books Online link above if you want.
Once you have the event set up, you can trigger the 8115 error, by un-commenting the “sample query” line at the top of the script. When you run it in SSMS, you should encounter the following error:
Msg 8115, Level 16, State 2, Line 1 Arithmetic overflow error converting expression to data type int.
If you then look in the c:\temp directory, you ought to notice the files the session is writing to have been created, however at this stage, the .xet file (which actually holds the event data as opposed to the xem file which holds the event meta data required to decode the event itself) is actually empty. This is because the session is still active and the data has not been flushed to the file. If you run the following to stop (but not drop) the event
state=stop
drop EVENT SESSION ArithmeticError ON SERVER
So the question then comes, what to do with the files. I have to admit it took me a while to work this out, as it wasn’t obviously documented. I also needed the help of my good friend and pool partner, Spike, to write some XQuery in a nice way, as I don’t really like XQuery. The function is documented in Books Online though to be fair (although not the XQuery!)
There are 2 different ways you can go about reading this type of file in my opinion. Firstly you can just run the following query in SSMS and you’ll get an XML column in the results pain, if you output to grid. You can then just click on the column and you’ll get to look at the XML document in a new SSMS window. Quick and undeniably dirty yes, but it works quite effectively for a simple scenario like this:
-- straight select, view the XML manually
'C:\temp\ArithmeticErrors_0_128981030796910000.xet',
'C:\temp\ArithmeticErrors_0_128981030796920000.xem',
Note that the actual file names have been modified automatically by the event engine to guarantee uniqueness. When you run this, you should get an output grid which looks like this:
If you then click on the first XML column, you’ll open up a new tab with the XML document for viewing, which will look something like this:
<event name="exception_ring_buffer_recorded" package="sqlos" id="21" version="1" timestamp="2009-09-22T14:33:06.537Z">
<data name="id">
<value>0</value>
<text />
</data>
<data name="timestamp">
<data name="task_address">
<value>0x00000000000ffdc8</value>
<data name="error">
<value>8115</value>
<data name="severity">
<value>16</value>
<data name="state">
<value>2</value>
<data name="user_defined">
<value>false</value>
<data name="call_stack">
<value />
<action name="tsql_stack" package="sqlserver">
<value><frame level='1' handle='0x01000100FC030E3850E97786000000000000000000000000' line='1' offsetStart='0' offsetEnd='0'/></value>
</action>
<action name="sql_text" package="sqlserver">
<value>select convert(bigint,21568194 * 100)</value>
</event>
Most of this XML ought to be self explanatory. The bit that is most interesting to us here is the “sql_text” section which shows the actual text being executed. In a real world scenario where we didn’t know what code was causing our error, we’ve now tracked it down.
The tsql_stack section is also useful. You can pass the handle you get back from this section into a DMV to get further details about the object or statement being executed (if it’s part of a stored procedure for example). A typical syntax for this would be
SELECT * FROM sys.dm_exec_sql_text(0x01000100FC030E3850E97786000000000000000000000000)
However, you should note that in this very simple example, this will return nothing. This is because a plan has not been cached for our simple select statement that only does a mathematical calculation. This is of course possible in other scenarios where the plan which was used at the time the event captured it, is no longer cached, due to it being aged out, or the server being re-started.
If you want to follow this example and be able to look at the sql_text plan handle, then you need to write an actual query which fetches data and has an arithmetic overflow. For example, simple procedure like this
create procedure mybadproc
as
select *
from sys.servers
select convert(bigint,999999999 * 100)
will cache a plan when executed, and will be captured by the same event session we set up above. If you execute this, or other more complex statements / procedures, you can get back multiple items in the tsql_stack, each with separate handles which you can then pass into DMV based queries which accept plan handles. If you capture the execution of the above procedure in this event session, you’ll get back a slightly different XML document, where the tsql_stack has multiple values and the sql_text is now the name of the procedure that we executed. You can then pass any of the handles into DMV queries which accept such parameters.
<value><frame level='1' handle='0x03000D00E7A4787D959020018C9C00000100000000000000' line='8' offsetStart='132' offsetEnd='-1'/>
<frame level='2' handle='0x01000D00D8E5380F60E77786000000000000000000000000' line='1' offsetStart='0' offsetEnd='-1'/></value>
<value>mybadproc</value>
/event>
So to summarise part 1 of this post, we’ve looked at how to setup an extended event session and capture the output to a file. We’ve then seen a very simple way of looking at the output to manually identify a condition captured by our session. In the next part I’ll show some XQuery to review the output in more detail.
I recently received a question through this blog about the use of the sys.dm_os_ring_buffers DMV with regard to troubleshooting performance problems and low memory conditions. Here are some excerpts – edits from my reply which might provide some ideas for you.
Here's an interesting article you might like to read about monitoring memory usage with ring buffers:
http://blogs.msdn.com/slavao/archive/2005/02/19/376714.aspx
In terms of my personal preferences regarding performance monitoring in SQL Server 2005, I find it rare that I have to use the ring buffers. They are of course undocumented externally (which is why you are having trouble finding information on them!), and so I tend to discuss with customers the usage of published documented tools as in 99% of cases these are enough to diagnose memory problems (and most others).
I would potentially take a different approach. As I often write in my blog I like to use the SQLDIAG tool to collect a variety of data when I am experiencing / monitoring a problem. In this case it might be useful to follow my post about collecting the DBCC MEMORYSTATUS output, in addition to the standard data when you are experiencing this problem. If you have experiencing a memory issue somewhere, this will give you the breakdown to diagnose it pretty much in all cases, and all the results of this output are publically documented. Alternatively though you could change the script which collects DBCC MEMORYSTATUS to collect some other hybrid selection of DMV based output, or even the ring buffers output.
The other benefit of this approach is that you collect a whole set of data which allows you to keep an open mind (which I like to do) about what the cause of the performance problem really is.
So in summary, should a customer come to me with a rough issue description like this (I have performance problems and I believe I am experiencing low memory conditions, my initial action plan would be as follows:
Run SQLDIAG for 30 minutes when the problem was occurring
Collect the SQL 2005 perfstats scripts
Collect DBCC MEMORYSTATUS add in
Collect a lightweight profiler (do not include statement level events or showplans)
Collect perfmon with all SQL Server counters like Buffer Manager added in .
And then I would review all the data. You could of course consider using SQLNEXUS to do this, or alternatively review the output manually as its all from documented in sources.
Here is a particularly good article which covers a lot of these documented issues, and has an excellent section of memory troubleshooting.
http://technet.microsoft.com/en-us/library/cc966540.aspx
Finally I just remembered that there is this public article as well which does document a good use for the sy.dm_os_ring_buffer output for a specific condition.
How to reduce paging of buffer pool memory in the 64-bit version of SQL Server
Update: another more detailed post about resource manager and the ring buffer records it writes is now available here:
http://blogs.msdn.com/psssql/archive/2009/09/17/how-it-works-what-are-the-ring-buffer-resource-monitor-telling-me.aspx
That said my comments still remain the same when you are going into a more generic performance problem.
I was discussing with Elisabeth back in this blog post about the above problem, and made comments that there were no plans currently to backport this to SQL 2005 or 2008. This changed following feedback both from customers and from us within CSS, and the fix was made part of SQL Server 2008 SP1. You'll note if you look at the KB article for SP1, that it doesn't have it's own KB, but it does get a mention in the fix list.
Enjoy!
I’ve worked on a lot of cases this year involving SharePoint. Here are some example problems I encountered recently. If you want to read the background or the formal advice first, here are some good starting links
Planning and Monitoring SQL Server Storage for Office SharePoint Server: Performance Recommendations and Best Practices (white paper)
Plan for performance and capacity (Office SharePoint Server)
In addition to this, you need to consider what you can and can’t do from a SQL Server perspective, if you’re trying to optimize or troubleshoot your SharePoint deployment. You need to pay detailed attention to these articles:
http://support.microsoft.com/kb/932744/
http://support.microsoft.com/kb/841057/
In simple terms, don’t go amending the schema or the objects! So back to my main point…
Things to do first:
1. Collect some decent SQL Server diagnostic data – my preferred option,as ever, SQLDIAG with perfstats scripts
2. Turn on trace flag 1222 to output deadlock information to the errorlog, or capture deadlock graphs in your profiler traces.
3. Collect this data at period of both good and bad performance, so you have a benchmark to compare against
4. Aggregate the profiler output. Personally I use readtrace.exe directly from RMLUtilities, but other like to use SQLNexus as well.
Here are the 2 real examples of problems I encountered recently, and diagnosed using the above data collection:
1. Are you experiencing significant parallel waits – SQL Server installations will have a default Max Degree of Parallelism set to 0. I have observed certain MOSS deployments this year where this is not the best MAXDOP, dependent upon the specifics of the data distribution and deployment. If you look in the perfstats output you might see evidence like this:
-- requests --
runtime session_id request_id ecid blocking_session_id blocking_ecid task_state wait_type
------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- ---------
2009-05-27T15:06:38.060 68 0 4 0 0 RUNNING CXPACKET
2009-05-27T15:06:38.060 68 0 0 0 4 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 96 0 0 0 0 SUSPENDED WAITFOR
2009-05-27T15:06:38.060 116 0 0 0 0 RUNNING
2009-05-27T15:06:38.060 128 0 12 0 0 RUNNING CXPACKET
2009-05-27T15:06:38.060 128 0 9 0 0 RUNNABLE CXPACKET
2009-05-27T15:06:38.060 128 0 0 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 5 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 11 0 0 RUNNABLE CXPACKET
2009-05-27T15:06:38.060 128 0 6 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 2 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 3 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 4 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 7 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 8 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 1 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 128 0 10 0 0 RUNNABLE CXPACKET
2009-05-27T15:06:38.060 138 0 10 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 12 0 0 RUNNABLE CXPACKET
2009-05-27T15:06:38.060 138 0 11 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 9 0 0 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 0 0 3 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 3 0 8 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 1 0 8 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 2 0 8 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 4 0 8 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 7 0 12 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 5 0 12 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 6 0 12 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 138 0 8 0 12 SUSPENDED CXPACKET
2009-05-27T15:06:38.060 163 0 0 0 0 RUNNABLE
2009-05-27T15:06:38.060 170 0 0 163 0 SUSPENDED LCK_M_X
This is a classic example, where a long blocking chain starts to build due to CXPACKET waits. Note that the same sessionid is reported on multiple lines, but the ecid column changes. This indicates that the batch (sessionid) is being processed in parallel by more than one worked (ecid). This extract is very early on in the problem, from here the chains grow and grow, and eventually can result in multiple deadlocks. In this scenario the batches being executed by sessions 128 and 138 (which are both running PROC_LISTUNGHOSTEDFIELDTEMPLATESINLIST ) are picking parallel plans inappropriately. We worked around this by changing MAXDOP on the server to a value of 1. This is definitely something that you should test in your environment, since it’s a server wide setting. Currently according to the above KB articles there is no fixed recommendation about MAXDOP for MOSS deployments, but things can change, so you should check the fixed requirements of your version. In this scenario we saw significant other number of procedures which encountered the same issue, procedures such as proc_CheckDirectoryTree, proc_DeplGetListItemLinks and PROC_DEPLGETLISTITEMANDRESERVEID amongst others. Typical errors you might see in the application logs for SharePoint would be
Unexpected query execution failure, error code 1205. Additional error information from SQL Server is included below. "Transaction (Process ID 163) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."
A global change of MAXDOP = 1 resolved this specific situation.
2. Are you encountering long blocking chains accessing the AllDocs table? This object is commonly accessed by a number of different stored procedures, and I’ve seen some scenarios where it becomes a point of contention. The most problematic one I came across recently resulted in numerous deadlocks, which firstly throw the above 1205 error to the SharePoint application log, and then in addition they throw the following error:
InitializePublishingWebDefaults failed. Microsoft.SharePoint.SPException: Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT)) ---> System.Runtime.InteropServices.COMException (0x80004004): Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT)) at Microsoft.SharePoint.Library.SPRequestInternalClass.SetWebProps
This second error occurs as SharePoint does not roll back properly after encountering the deadlock. If you have collected your SQLDIAG data as noted above, you’ll see symptoms like this in the perfstats script output:
a) long blocking chains waiting on LCK_M_U
runtime session_id request_id ecid blocking_session_id blocking_ecid task_state wait_type wait_duration_ms
------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- -------------------------------------------------- --------------------
2009-05-28T10:01:40.973 60 0 0 197 0 SUSPENDED LCK_M_X 3201
2009-05-28T10:01:40.973 72 0 0 60 0 SUSPENDED LCK_M_S 3224
2009-05-28T10:01:40.973 77 0 0 0 0 SUSPENDED WAITFOR 9087
2009-05-28T10:01:40.973 84 0 0 72 0 SUSPENDED LCK_M_S 1018
2009-05-28T10:01:40.973 197 0 0 0 0 RUNNING 0
2009-05-28T10:01:40.973 268 0 0 270 0 SUSPENDED LCK_M_U 36299
2009-05-28T10:01:40.973 270 0 0 197 0 SUSPENDED LCK_M_U 96753
2009-05-28T10:01:40.973 277 0 0 270 0 SUSPENDED LCK_M_U 44290
2009-05-28T10:01:40.973 280 0 0 270 0 SUSPENDED LCK_M_U 60461
2009-05-28T10:01:40.973 292 0 0 270 0 SUSPENDED LCK_M_U 52389
b) the notableactivequeries part of the perfstats script will potentially note that the plan_total_exec_count is very low. This indicates that you appear to be running a freshly compiled plan for the blocking statement. This will be all the more likely if your system was running well, and then suddenly started degrading performance and encountering errors
-- notableactivequeries --
runtime session_id request_id plan_total_exec_count plan_total_cpu_ms plan_total_duration_ms plan_total_physical_reads plan_total_logical_writes plan_total_logical_reads
------------------------------ ---------- ----------- --------------------- -------------------- ---------------------- ------------------------- ------------------------- ------------------------
2009-05-28T10:01:40.973 197 0 5 33057 406079 2550 0 2352015
2009-05-28T10:01:40.973 268 0 5 33057 406079 2550 0 2352015
2009-05-28T10:01:40.973 270 0 5 33057 406079 2550 0 2352015
2009-05-28T10:01:40.973 277 0 5 33057 406079 2550 0 2352015
2009-05-28T10:01:40.973 280 0 5 33057 406079 2550 0 2352015
2009-05-28T10:01:40.973 292 0 5 33057 406079 2550 0 2352015
c) when reviewing the error logs you note that the same statement shown below, being executed within the procedure proc_UpdateTpWebMetaData, is always involved in the deadlocking
UPDATE
Docs
SET
Docs.BuildDependencySet = NULL
Docs INNER JOIN BuildDependencies
ON
BuildDependencies.SiteId = Docs.SiteId AND
BuildDependencies.LeafName = Docs.LeafName AND
BuildDependencies.DirName = Docs.DirName
BuildDependencies.SiteId = @SiteId AND
BuildDependencies.TargetDirName = @PreviousCustomMasterDir AND
BuildDependencies.TargetLeafName = @PreviousCustomMasterLeaf AND
Docs.BuildDependencySet IS NOT NULL
The reason for this problem, in this particular scenario, is that a new plan has been compiled recently for this statement which is using a HASH join, as opposed to a nested loop join which it was using earlier. This is caused by a lack of up to date statistics on the AllDocs table. We noted scenarios where the regular update statistics jobs (issued by SharePoint as part of the changes in 2007 SP2) were not thorough enough, and we were forced to run update statistics with the fullscan option on the AllDocs table.
The bottom line here being that although SharePoint does now do database maintenance, as shown in KB932744, there are still times potentially where you will need to do more detailed targeted maintenance.
I refer to this many times in my other posts, so here’s a direct link to the download location of the perfstats scripts for SQLDIAG which you can use yourself. I use this everyday, you should too!
http://blogs.msdn.com/psssql/archive/2007/02/21/sql-server-2005-performance-statistics-script.aspx
Here’s another interesting example which we worked recently in the Stockholm office, which relates also back to one of my previous entries. We had a customer whose error log was filled with a variety of critical errors such as these:
Error: 701, Severity: 17, State: 193. There is insufficient system memory to run this query. Error: 1204, Severity: 19, State: 2. The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions. Error: 17803, Severity: 20, State: 13. There was a memory allocation failure during connection establishment. Reduce nonessential memory load, or increase system memory. The connection has been closed. [CLIENT: <local machine>]
Error: 701, Severity: 17, State: 193. There is insufficient system memory to run this query.
Error: 1204, Severity: 19, State: 2. The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.
Error: 17803, Severity: 20, State: 13. There was a memory allocation failure during connection establishment. Reduce nonessential memory load, or increase system memory. The connection has been closed. [CLIENT: <local machine>]
This was a fairly high end system with 64GB RAM, 16 logical CPUs and was generally well configured in terms of its Max Server Memory and other basic sp_configure settings. As you may well be aware, there are innumerable causes that can actually cause you to receive a memory related error such as this, and some of the more basic troubleshooting tips are shown in the following MSDN article
http://msdn.microsoft.com/en-us/library/aa337311(SQL.90).aspx
However in this case none of the basics were really of any help. Typical starting questions to ask yourself in this scenario are:
1. Is there a pattern to when this occurs
2. Do I know of any particular workloads which are running when I see this problem
…just to understand if you have any particular potential causes in your own code. Our customer in this scenario was not aware of any, and unfortunately we didn’t have any dumps created which might allow us to see retrospectively what batches were running when the problem occurred.
However we did have the partial output of from DBCC MEMORYSTATUS which SQL had kindly written to the error log for us. An examination of this showed the following suspicious looking section:
OBJECTSTORE_LOCK_MANAGER (node 2) VM Reserved = 0 KB VM Committed = 0 KB AWE Allocated = 0 KB SM Reserved = 0 KB SM Committed = 0 KB SinglePage Allocator = 29398856 KB MultiPage Allocator = 0 KB Global Memory Objects: Resource= 709 Locks= 3676149 SE Schema Mgr= 1646 SQLCache= 261 Replication= 2 ServerGlobal= 109 XPGlobal= 2 Xact= 101 SETLS= 16 DatasetMemObjs= 32 SubpDescPmos= 16 SortTables= 2
OBJECTSTORE_LOCK_MANAGER (node 2) VM Reserved = 0 KB VM Committed = 0 KB AWE Allocated = 0 KB SM Reserved = 0 KB SM Committed = 0 KB SinglePage Allocator = 29398856 KB MultiPage Allocator = 0 KB
Global Memory Objects: Resource= 709 Locks= 3676149 SE Schema Mgr= 1646 SQLCache= 261 Replication= 2 ServerGlobal= 109 XPGlobal= 2 Xact= 101 SETLS= 16 DatasetMemObjs= 32 SubpDescPmos= 16 SortTables= 2
From this we can infer that we seemed to experience an equivalent problem to the one I wrote about here
http://blogs.msdn.com/grahamk/archive/2009/02/03/compilation-bottlenecks-error-8628-severity-17-state-0-part-1.aspx
although manifesting distinctly different symptoms.
At the end of the day one of the NUMA nodes has taken up about 29GB of memory just managing lock resources, and this is confirmed by the number of locks which are reported in the global memory objects. Remember that the majority of the output from DBCC MEMORYSTATUS is publically documented here:
How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005
http://support.microsoft.com/kb/907877
This 29GB of memory purely managing locks appears to be placing an undue load of the memory resources of the system, and what we appear to have here is a runaway or huge batch of some type, or maybe a very large open transaction.
The way to resolve this problem is to identify the batch and stop it from running. One good way to identify this problem proactively might be to use the tiny DMV query I wrote in my previous post about this type of situation and attach a monitor to it, warning you if the value goes above 1 million say, or some other high number of your choice.
select request_session_id, count(*) from sys.dm_tran_locks group by request_session_id
You could then easily identify the problem session_id before it got completely out of hand. There are of course many different alternative ways of capturing such data or doing such monitoring. Maybe you’d prefer to create a custom SQLDIAG package to monitor your server in more detail.
The moral of this story being though, that even in the most critical system failures such as this, there may well be enough info to get a hint of where the problem lies, if you make sure you go through the error logs in detail. Remember not to be overwhelmed by the sheer volume of errors that might be reported; ensure you go through all the lines removing the duplicates to see whether something of interest is hidden amongst all the noise.
Here's a little something that me and my colleagues were looking at this week. Our customer was executing queries against a DB2 linked server using the latest provider and occasionally got these errors. One of my colleagues suggested to set a dump trigger on the error 7320 so that we could capture a sqlservr.exe dump when the problem next occurred. We achieved this by running
dbcc dumptrigger('SET',7320)
on the server. By default this will collect a mini-dump, unless you have enabled any other trace flags previously. When we had encountered the error and analyzed the dump, it showed up the following stack
sqlservr!ex_raise+0x52sqlservr!COledbError::HandleExecuteError+0x11bsqlservr!COledbRangeRowset::FDoPerformSchemaCheckForQuery+0xc7sqlservr!COledbRangeRowset::FPerformSchemaCheck+0x1fbsqlservr!COledbRangeRowset::FCheckSchema+0x2asqlservr!CEnvOledbRange::XretSchemaChanged+0x16sqlservr!CXStmtDML::XretDMLExecute+0x41037bsqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0xacesqlservr!CMsqlExecContext::FExecute+0x439sqlservr!CSQLSource::Execute+0x355sqlservr!CStmtExecStr::XretExecStrExecute+0x48dsqlservr!CXStmtExecStr::XretExecute+0x20sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x39esqlservr!CMsqlExecContext::FExecute+0x439sqlservr!CSQLSource::Execute+0x355sqlservr!CStmtExecProc::XretLocalExec+0x125sqlservr!CStmtExecProc::XretExecExecute+0x132sqlservr!CXStmtExec::XretExecute+0x20sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x39esqlservr!CMsqlExecContext::FExecute+0x439sqlservr!CSQLSource::Execute+0x355sqlservr!process_request+0x312sqlservr!process_commands+0x3fd
Before going into any in depth debugging which would involve monitoring the remote server and the components in between, my colleague thought it would be sensible to check the setting surrounding Schema Validation as we were obviously throwing the error in a routine related to this functionality (FDoPerformSchemaCheckForQuery). So we changed the option of the linked server to say enable "Lazy Schema Validation", using this syntax:
exec sp_serveroption xxxxx, 'lazy schema validation', true
If you look this option up in MSDN under "Configuring OLE DB Providers for Distributed Queries" you'll note the following regarding this option:
"If this option is set to false, the default value, SQL Server checks for schema changes that have occurred since compilation in remote tables. This check occurs before query execution. If there is a change in the schema, SQL Server recompiles the query with the new schema. If this option is set to true, schema checking of remote tables is delayed until execution. This can cause a distributed query to fail with an error if the schema of a remote table has changed between query compilation and execution. You may want to set this option to true when distributed partitioned views are being used against a linked server that is running SQL Server. A specified table that participates in the partitioned view may not be actually used in a specific execution of a query against the view. Therefore, deferring the schema validation may improve performance."
"If this option is set to false, the default value, SQL Server checks for schema changes that have occurred since compilation in remote tables. This check occurs before query execution. If there is a change in the schema, SQL Server recompiles the query with the new schema.
If this option is set to true, schema checking of remote tables is delayed until execution. This can cause a distributed query to fail with an error if the schema of a remote table has changed between query compilation and execution.
You may want to set this option to true when distributed partitioned views are being used against a linked server that is running SQL Server. A specified table that participates in the partitioned view may not be actually used in a specific execution of a query against the view. Therefore, deferring the schema validation may improve performance."
Now in all truth this was quite a pragmatic approach, in that we hadn't fully debugged the problem, we were just taking an educated guess, that we were experiencing a problem surrounding schema validation, and so far, since implementing this option, the errors have not re-occurred, so we've managed to workaround the problem.
For reference the customer in question was also experiencing other errors in the 7300 to 7320 range, which have also stopped occurring since this option was set. Errors such as
Error: 7350 Cannot get the column information from OLE DB provider "DB2OLEDB" for linked server "xxx" Error: 7321 An error occurred while preparing the query "xxx" for execution against OLE DB provider "DB2OLEDB" for linked server "xxx" Error: 7303 Cannot initialize the data source object of OLE DB provider "DB2OLEDB" for linked server "xxx"
Error: 7350 Cannot get the column information from OLE DB provider "DB2OLEDB" for linked server "xxx"
Error: 7321 An error occurred while preparing the query "xxx" for execution against OLE DB provider "DB2OLEDB" for linked server "xxx"
Error: 7303 Cannot initialize the data source object of OLE DB provider "DB2OLEDB" for linked server "xxx"
The moral of this story is that linked servers can be very difficult and time consuming to troubleshoot, due to the number of distributed components, including ones outside of the SQL Server area completely. This is one time where I am a fan of testing some of the different settings in the provider to see if the problem can be worked around initially, before doing some in depth analysis. In most other cases I'd far prefer to know why, and we may yet revisit this particular situation to fully debug it, but sometimes a workaround and an error free production environment is also nice for a quick win.
For future reference another interesting quick diagnostic that can be performed on linked servers is to run Process Monitor on the server in question to see whether you are having any obscure permission of registry problems. I've seen this give some quick wins as well.