Welcome to MSDN Blogs Sign in | Join | Help

Troubleshooting SQL Server Management Studio with SQL Profiler and Debugging Tools – Part 2

This is the second in a series of 3 posts which discuss troubleshooting SQL Server Management Studio when it encounters errors or unexpected behaviour.

In this post we collect a memory dump of SSMS based upon specific .NET exception codes, for subsequent analysis.

As a reminder, here are the symptoms of the problem we are looking at:

1. Unable to view Job Activity in the SQL Server Agent Job Activity Monitor

2. Progress bar shows “Cannot display activity data”

3. Click on more details throws a message box with the error

Failed to retrieve data for this request. (Microsoft.SqlServer.Management.Sdk.Sfc)

with the additional details

Hour, Minute, and Second parameters describe an un-representable DateTime. (mscorlib)

It actually looks like this (click to enlarge)

image

4. The error repros with all versions of SSMS (2005 and 2008)

5. The error repros for all users whatever workstation they connect from.

The debugging approach

Initially I was thinking that since I had a .NET error, I could run adplus and use a config file to configure the debugger to capture the .NET errors. This would work and I’ve done it before, but then Tess reminded me that I could use DebugDiag and it would probably be easier and quicker. So I tried this:

Firstly I installed the DebugDiag tool on the machine where I was repro’ing the problem. I started it up and created a new rule, the properties of which are shown below. Being as I didn’t know exactly what was going on at this stage, I just elected to log everything to a file to begin with, rather than creating dumps at first, so I could get a better idea of what was really happening in the process. I expected that I was getting some .NET errors based upon the message box shown, but if I just created a dump on all .NET exceptions, I might miss something else that came before it. The rule looked like this:

1. Type = Crash

2. Process = ssms.exe (i was using SQL 2008 tools, if I’d been using SQL 2005 tools the process name would be sqlwb.exe)

3. Action type for unconfigured first chance exceptions = Log Stack Trace

4. Exceptions - CLR (.NET) Exception - All Exception Types = Log Stack Trace

5. Action Limits = 0 (since I’m only logging to file at this stage)

Here’s what it looks like in the GUI (click to enlarge)

image

I saved the rule, enabled it straight away, and then repro’d the problem again. I then disabled the rule and browsed for the log file which in my case was held here:

C:\Program Files (x86)\DebugDiag\Logs\

it was called:

Ssms__PID__2912__Date__10_19_2009__Time_03_35_46PM__898__Log.txt

Review of this file showed that actually 3 different exceptions had occurred in this order:

[10/19/2009 3:35:55 PM] First chance exception - 0xe0434f4d caused by thread with system id 5824
[10/19/2009 3:35:55 PM] CLR Exception Type - 'System.ArgumentOutOfRangeException'

then

[10/19/2009 3:35:55 PM] First chance exception - 0xe0434f4d caused by thread with system id 5824
[10/19/2009 3:35:55 PM] CLR Exception Type - 'Microsoft.SqlServer.Management.Sdk.Sfc.EnumeratorException'

then finally

[10/19/2009 3:35:55 PM] First chance exception - 0xe0434f4d caused by thread with system id 5824
[10/19/2009 3:35:55 PM] CLR Exception Type - 'Microsoft.SqlServer.Management.Sdk.Sfc.EnumeratorException'

It may be in this case that all these errors are related, however the theory of only logging information first proved valuable, as I want to create a dump of when the very first error occurs. Therefore I now know that I can edit my rule to tell DebugDiag to create a full user dump, when it encounters a CLR Exception Type - 'System.ArgumentOutOfRangeException', and that it should only do the dump once. The theory of logging first is also backed up by the fact that repro’ing this problem on a x86 machine throws a different chain of .NET exceptions (although the first one is still 'System.ArgumentOutOfRangeException')

That said, I’m now ready to capture a full user dump of the ssms.exe process, based upon the above rule, so I can debug it properly, and see the true root of the problem. I can edit my rule created earlier, and change the Exceptions dialogue box to add an additional Exception sub-rule, which looks like this (click to enlarge)

image

and has the properties

1. Exceptions - CLR (.NET) Exception - .NET Exception Types = System.ArgumentOutOfRangeException

2. Action Type = Full Userdump

3. Action Limit = 1

I add this exception rule, click OK a couple of times and re-enable the rule. Then run the repro again, and I notice the SSMS dialogue hang for a couple of seconds whilst it opens. This is because DebugDiag is creating the memory dump.

I disable the rule again, and then open the log file.The output is similar to before, except that after the first exception is logged, and after the stack trace of this exception, I note like the following:

[10/20/2009 3:14:08 PM] Created dump file C:\Program Files (x86)\DebugDiag\Logs\Crash rule for all instances of Ssms.exe\Ssms__PID__2912__Date__10_20_2009__Time_03_14_02PM__806__First Chance System.ArgumentOutOfRangeException.dmp
[10/20/2009 3:14:08 PM] Action limit of 1 reached for Exception 0xE0434F4D:SYSTEM.ARGUMENTOUTOFRANGEEXCEPTION.

This confirms that I have successfully created a dump, and I have created it at the point when the first exception was encountered. This technique ought to work for any CLR Exception that SSMS can encounter. As I mentioned above it would also be just as acceptable to use adplus with a config file to create the dump.

in Part 3, it’s time to analyze that dump.

Posted by grahamk | 0 Comments
Filed under: ,

Troubleshooting SQL Server Management Studio with SQL Profiler and Debugging Tools – Part 1

This is the first in a series of 3 posts which discuss troubleshooting SQL Server Management Studio when it encounters errors or unexpected behaviour.

In this post we look at using profiler, in the 2nd and 3rd we debug SSMS using public Microsoft debugging tools.

Here are the symptoms of the problem we are looking at:

1. Unable to view Job Activity in the SQL Server Agent Job Activity Monitor

2. Progress bar shows “Cannot display activity data”

3. Click on more details throws a message box with the error

Failed to retrieve data for this request. (Microsoft.SqlServer.Management.Sdk.Sfc)

with the additional details

Hour, Minute, and Second parameters describe an un-representable DateTime. (mscorlib)

It actually looks like this

image

4. The error repros with all versions of SSMS (2005 and 2008)

5. The error repros for all users whatever workstation they connect from.

These last 2 points are particularly relevant. Whenever you are looking at an SSMS problem, you always need to be asking these 2 questions. Straight at the off, you have potentially then eliminated whole areas of research (such as corruptions of .NET framework on specific workstations, incompatibilities in processor architecture specific versions of the framework, and localising whether the problem is with the SSMS client or the server instance being managed, or a combination of both)

The initial approach – using SQL Profiler and some common sense

When I initially came to this problem, the customer was experiencing the errors above. Normally when I get any errors or problems in SSMS, rather than reaching for debugger, I reach for SQL Profiler first (having asked the above 2 questions about where and when the issue repros), and this is a good rule of thumb. At the end of the day SSMS is just a GUI which issues TSQL commands to achieve functions in SQL Server. Whenever I get unexpected results in SSMS, the first thing I therefore do is to run a profiler trace against the server which SSMS is connected to. This means that you can see the TSQL which is issued, and you can sometimes find the problem straight off, if for example it issues some piece of bad code which obviously causes your error.

In this case if you run the SQL profiler GUI tool (you can simply point it at the appropriate server instance and use the default options), you’ll note that the SSMS issues the following SQL batches.

SET LOCK_TIMEOUT 10000
select SERVERPROPERTY(N'servername')
declare @RegPathParams sysname
                declare @Arg sysname
                declare @Param sysname
                declare @MasterPath nvarchar(512)
 
                declare @LogPath nvarchar(512)
                declare @ErrorLogPath nvarchar(512)
                declare @n int
 
                select @n=0
                select @RegPathParams=N'Software\Microsoft\MSSQLServer\MSSQLServer'+'\Parameters'
                select @Param='dummy'
                while(not @Param is null)
                begin
                    select @Param=null
                    select @Arg='SqlArg'+convert(nvarchar,@n)
 
                    exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', @RegPathParams, @Arg, @Param OUTPUT
                    if(@Param like '-d%')
                    begin
                        select @Param=substring(@Param, 3, 255)
                        select @MasterPath=substring(@Param, 1, len(@Param) - charindex('\', reverse(@Param)))
                    end
                    else if(@Param like '-l%')
                    begin
                        select @Param=substring(@Param, 3, 255)
                        select @LogPath=substring(@Param, 1, len(@Param) - charindex('\', reverse(@Param)))
                    end
                    else if(@Param like '-e%')
                    begin
                        select @Param=substring(@Param, 3, 255)
                        select @ErrorLogPath=substring(@Param, 1, len(@Param) - charindex('\', reverse(@Param)))
                    end
                    
                    select @n=@n+1
                end
            
 
 
                declare @SmoRoot nvarchar(512)
                exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'SOFTWARE\Microsoft\MSSQLServer\Setup', N'SQLPath', @SmoRoot OUTPUT
            
 
 
SELECT
CAST(case when 'a' <> 'A' then 1 else 0 end AS bit) AS [IsCaseSensitive],
CAST(FULLTEXTSERVICEPROPERTY('IsFullTextInstalled') AS bit) AS [IsFullTextInstalled],
@LogPath AS [MasterDBLogPath],
@MasterPath AS [MasterDBPath],
@ErrorLogPath AS [ErrorLogPath],
@@MAX_PRECISION AS [MaxPrecision],
@SmoRoot AS [RootDirectory],
SERVERPROPERTY(N'ProductVersion') AS [VersionString],
CAST(SERVERPROPERTY(N'MachineName') AS sysname) AS [NetName],
CAST(SERVERPROPERTY(N'Edition') AS sysname) AS [Edition],
CAST(SERVERPROPERTY(N'ProductLevel') AS sysname) AS [ProductLevel],
CAST(SERVERPROPERTY('IsClustered') AS bit) AS [IsClustered],
CAST(SERVERPROPERTY('IsSingleUser') AS bit) AS [IsSingleUser],
CAST(SERVERPROPERTY('EngineEdition') AS int) AS [EngineEdition],
SERVERPROPERTY(N'ResourceVersion') AS [ResourceVersionString],
SERVERPROPERTY(N'ResourceLastUpdateDateTime') AS [ResourceLastUpdateDateTime],
SERVERPROPERTY(N'BuildClrVersion') AS [BuildClrVersionString],
SERVERPROPERTY(N'ComputerNamePhysicalNetBIOS') AS [ComputerNamePhysicalNetBIOS],
SERVERPROPERTY(N'CollationID') AS [CollationID],
SERVERPROPERTY(N'ComparisonStyle') AS [ComparisonStyle],
SERVERPROPERTY(N'SqlCharSet') AS [SqlCharSet],
SERVERPROPERTY(N'SqlCharSetName') AS [SqlCharSetName],
SERVERPROPERTY(N'SqlSortOrder') AS [SqlSortOrder],
SERVERPROPERTY(N'SqlSortOrderName') AS [SqlSortOrderName],
convert(sysname, serverproperty(N'collation')) AS [Collation]
exec sp_executesql N'SELECT
dtb.collation_name AS [Collation],
dtb.name AS [DatabaseName2]
FROM
master.sys.databases AS dtb
WHERE
(dtb.name=@_msparam_0)',N'@_msparam_0 nvarchar(4000)',@_msparam_0=N'msdb'    Microsoft SQL Server Management Studio    grahamk    EUROPE\grahamk    0    2    0    0    2912    52    2009-10-19 14:28:13.143    2009-10-19 14:28:13.143    0X00000000040000001A00730070005F006500780065006300750074006500730071006C002E01000082001A00E7206E0076006100720063006800610072002800310033003100290006010000530045004C004500430054000A006400740062002E006300    
create table #tmp_sp_help_category
(category_id int null, category_type tinyint null, name nvarchar(128) null)
insert into #tmp_sp_help_category exec msdb.dbo.sp_help_category
SELECT
sv.name AS [Name],
CAST(sv.enabled AS bit) AS [IsEnabled],
tshc.name AS [Category],
sv.category_id AS [CategoryID],
tshc.category_type AS [CategoryType],
null AS [CurrentRunStatus],
null AS [CurrentRunStep],
null AS [HasSchedule],
null AS [HasStep],
null AS [HasServer],
null AS [LastRunDate],
null AS [NextRunDate],
null AS [LastRunOutcome],
CAST(sv.job_id AS nvarchar(100)) AS [job_id]
FROM
msdb.dbo.sysjobs_view AS sv
INNER JOIN #tmp_sp_help_category AS tshc ON sv.category_id = tshc.category_id
 
drop table #tmp_sp_help_category
        

and finally:

exec msdb.dbo.sp_help_job

 

These batches don’t appear to return any errors or warnings to profiler. The logical next step was to take a look at what this code does. Once you read through the various batches, you notice 2 types of batch. The first 5 batches are just required to get certain properties about the server instance and to populate properties of the tree view in SSMS Object Explorer. Whenever you profile SSMS you see lots of batches like this, which tend to precede the “real” work.

The you see the final single batch, which executes msdb.dbo.sp_help_job. This is the real work that is required to populate the dialogue box that SSMS is opening. Since these batches are all just queries, and nothing is being updated, we can easily re-run them in SSMS query windows to see if anything else is thrown up, or see if weird results are returned that might explain our strange date time error.

We tried this and it returned a wide record set of several hundred rows, which on initial glance was all well and good. At this stage, the pragmatists amongst you may feel it’s appropriate to just scan through this data and see if there are any glaring inconsistencies which can be noticed which might cause a DateTime related error. The truth is that there was such a piece of data, if you could be bothered to read all the data manually, but it was only a hypothesis. The stored procedure in question returns some columns that are related to dates and times in a true DateTime representation. However it also returns some columns as text representations of date parts. In this record set, these columns should always be populated with values, but in this case there were a number of rows which showed the value “0” in these columns. We therefore made the guess that this data was the problem. We removed this data and tried the repro again, and everything worked ok.

We were lucky in this situation that the record set to review was only a few hundreds rows, and that the data corruption was obvious. On the basis that neither of these situations is always true, part 2 will show how to continue to troubleshoot this with debugging tools, to categorically prove what the problem was.

SQL Server Database Mirroring Pre-requisites and guidelines

I noticed that we have posted a really good summary of the above points through the new fast publish system for KBs. (Note the disclaimer at the bottom which is not normally shown on full KBs). There are some really good points here to consider about setting up and configuring a mirroring environment.

Things to consider when setting up database mirroring in SQL Server.

In addition to this of course, you should remember my previous points made on this blog about Virtual Log File fragmentation which I recently saw in a production environment again.

1413 error when starting Database Mirroring – How many virtual log files is too many?

Slow recovery times and slow performance due to large numbers of Virtual Log Files

Using Xquery to query Extended Events asynchronous file target results

This is the follow up (part 2) to the following post:

http://blogs.msdn.com/grahamk/archive/2009/09/23/realistic-troubleshooting-example-of-extended-events-xevent-usage-in-sql-server-2008-part-1.aspx

In the original post we configured an extended event session and issued a very simple query to review the results as XML. Here are some more intelligent XQuery examples. I’m documenting these purely because I hate XQuery so much, and it takes so damn long to do (well it does for me anyway). Please feel free to post in the comments section if you think XQuery is the best damn query language ever invented :-)

Hopefully the code comment explain everything, I have attempted to use very descriptive alias names as well

-- get information from XET binary files representing extended event session date
 
-- as per part 1 blog post:
-- straight select, view the XML manually in an SSMS window by clicking on the XML results column
select CAST(event_data as XML) , * 
from sys.fn_xe_file_target_read_file
(
'C:\temp\ArithmeticErrors_0_128981071730370000.xet', 
'C:\temp\ArithmeticErrors_0_128981071730390000.xem', 
null, 
null
)
 
-- Get only specific nodes
-- 2 different methods shown of the xquery, you can specify the name of the node
-- or you can specify its location in the structure
-- this means that you return every row for every event captured in the file
-- the results are then converted to SQL data types
 
SELECT
xdata.value('(/event/data[4]/value)[1]', 'varchar(100)') AS [error],
xdata.value('(/event/data[@name="severity"]/value)[1]', 'int') AS [severity],
xdata.value('(/event/action[@name="sql_text"]/value)[1]', 'varchar(max)') AS [sql_text]
 
FROM
    (
        select CAST(event_data as XML)
        from sys.fn_xe_file_target_read_file
        (
        'C:\temp\ArithmeticErrors_0_128981071730370000.xet', 
        'C:\temp\ArithmeticErrors_0_128981071730390000.xem', 
        null, 
        null
        )
    ) as xmlr(xdata)
 
-- using the above technique, you can now wrap the entire query and give it an alias
-- and then perform more useful T-SQL language queries on the results
-- here is an example of an aggregate
 
SELECT
convert(varchar(100), outer_results.[error]) as 'error',
convert(int, outer_results.[severity]) as 'severity',
convert(varchar(max), outer_results.[sql_text]) as 'sql_text',
count(*) as [count_occurances]
FROM
    (
    SELECT
    xdata.value('(/event/data[4]/value)[1]', 'varchar(100)') AS [error],
    xdata.value('(/event/data[@name="severity"]/value)[1]', 'int') AS [severity],
    xdata.value('(/event/action[@name="sql_text"]/value)[1]', 'varchar(max)') AS [sql_text]
 
    FROM
        (
            select CAST(event_data as XML)
            from sys.fn_xe_file_target_read_file
            (
            'C:\temp\ArithmeticErrors_0_128981071730370000.xet', 
            'C:\temp\ArithmeticErrors_0_128981071730390000.xem', 
            null, 
            null
            )
        ) as xmlr(xdata)
    ) as outer_results
group by error, severity,sql_text
 
-- finally using the same technique as above , here we have simple predicates as well
 
-- add a predicate into the outer query
SELECT
convert(varchar(100), outer_results.[error]) as 'error',
convert(int, outer_results.[severity]) as 'severity',
convert(varchar(max), outer_results.[sql_text]) as 'sql_text',
count(*) as [count_occurances]
FROM
    (
    SELECT
    xdata.value('(/event/data[4]/value)[1]', 'varchar(100)') AS [error],
    xdata.value('(/event/data[@name="severity"]/value)[1]', 'int') AS [severity],
    xdata.value('(/event/action[@name="sql_text"]/value)[1]', 'varchar(max)') AS [sql_text]
 
    FROM
        (
            select CAST(event_data as XML)
            from sys.fn_xe_file_target_read_file
            (
            'C:\temp\ArithmeticErrors_0_128981071730370000.xet', 
            'C:\temp\ArithmeticErrors_0_128981071730390000.xem', 
            null, 
            null
            )
        ) as xmlr(xdata)
    ) as outer_results
 
-- here is the extra predicate we have added
where outer_results.severity < 17
group by error, severity,sql_text
 

I fully accept that this may not be the most reasonable way, or the most efficient way to do this, but it’s the way that I’ll be using for now.

SQL Server 2008 SDK directory does not contain the opends60.lib file required for the creation of extended stored procedures

If you are trying to create extended stored procedures in SQL Server 2008, you will probably have read the BOL article here:

Creating Extended Stored Procedures

Hopefully you will have noticed that this feature is deprecated, and so firstly you’ll only be using this functionality for legacy requirements. That said you may have noticed that you will require the file opends60.lib to be able to create your extended stored procedure DLL. You may have noticed that you cannot find this after installing SQL Server 2008 (even if you have installed the SDK).

The default location for the SDK directory in SQL Server 2008 is here:

C:\Program Files\Microsoft SQL Server\100\SDK\

However it does not contain the opends60.lib. If you need to acquire a copy of this file, currently you need to obtain it from an alternate source, namely a previous version of SQL Server SDK or from an install of a previous version of visual studio. For example the x64 version of the file can be found in the SQL Server 2005 SDK directory here:

C:\Program Files\Microsoft SQL Server\90\SDK\Lib\x64

You should be able to use one of these other versions instead.

Posted by grahamk | 0 Comments

SQL Server DMV script for monitoring USERSTORE_TOKENPERM size by token type and number

Here’s a script I was using last week. I wrote this so that it could be used in a custom PSSDIAG / SQLDIAG task, but you can also run it in other automated methods as well.

To include it in SQLDIAG / PSSDIAG I added a custom group section into the XML config file (see previous posts on this blog if you don’t know how to do this), and saved the file in the root of the SQLDIAG directory as exec_userstore1.sql.

<CustomGroup name="userstore" enabled="true" />
<CustomTask enabled="true" groupname="userstore" taskname="detailed_scripts" type="TSQL_Script" point="Startup" wait="No" cmd="exec_userstore1.sql"/>

Here’s the script itself:

/*
detailed USERSTORE_TOKENPERM analysis for SQL 2005
script designed to be packaged into PSSDIAG and run over long time frame to monitor
but can be run individually at a fixed point to capture a snapshot
can be used in association with script number 2 which is verbose and does a snapshot
of all logins and token in cache, so can be HUGE.
graham kent - microsoft css - 15th September 2009
*/
 
SET NOCOUNT ON
SET QUOTED_IDENTIFIER ON
GO
 
-- get the token type distribution
 
WHILE 1=1
BEGIN
 
select 'start time:', GETDATE();
 
SELECT 
convert(varchar(50),[name]) as 'SOS StoreName', 
convert(varchar(50),[TokenName]) as 'TokenName',
convert(varchar(50),[Class]) as 'Class',
convert(varchar(50),[SubClass]) as 'SubClass',
count(*) as [Num Entries]
FROM
(SELECT name,
x.value('(//@name)[1]', 'varchar (100)') AS [TokenName],
x.value('(//@class)[1]', 'varchar (100)') AS [Class],
x.value('(//@subclass)[1]', 'varchar (100)') AS [SubClass]
FROM
(SELECT CAST (entry_data as xml),name
FROM sys.dm_os_memory_cache_entries
WHERE type = 'USERSTORE_TOKENPERM') 
AS R(x,name)
) a
group by a.name,a.TokenName,a.Class,a.SubClass
 
-- loop on 1 minute basis
WAITFOR DELAY '00:01:00'
 
END

When run in this format you’ll get a text file created in the output directory used by SQLDIAG which will contain the results from the script. You can then monitor the individual items in the cache over long periods of time to see if you’re encountering any of the known issues around this security cache.

Posted by grahamk | 0 Comments

SQL Server DMV script for searching the plan cache by query text, plan handle or execution time

I guess everyone has their own set of DMV scripts to hand when working with SQL Server nowadays. I thought I’d post a few of mine. Pretty much as the title says: comment in whichever clauses suit you and then run the query. Returns a result set with an XML column that can be clicked on if you want, to view the plan itself in SSMS. Works with 2005 and 2008.

select 
bucketid,
a.plan_handle,
refcounts, 
usecounts,
execution_count,
size_in_bytes,
cacheobjtype,
objtype,
text,
query_plan,
creation_time,
last_execution_time,
execution_count,
total_elapsed_time,
last_elapsed_time
 
from sys.dm_exec_cached_plans a 
      inner join sys.dm_exec_query_stats b on a.plan_handle=b.plan_handle
    cross apply sys.dm_exec_sql_text(b.sql_handle) as sql_text
    cross apply sys.dm_exec_query_plan(a.plan_handle) as query_plan
where 1=1
and text like '%mybadproc%'
-- and a.plan_handle = 0x06000B00C96DEC2AB8A16D06000000000000000000000000
and b.last_execution_time between '2009-09-24 09:00' and '2009-09-25 17:00'
order by last_execution_time desc
Posted by grahamk | 0 Comments

Realistic troubleshooting example of extended events (XEvent) usage in SQL Server 2008 – Part 1

Here’s an example of how to collect and consume extended events in SQL Server 2008. It’s something that we in CSS are starting to use more and more frequently with the uptake of SQL 2008, and I believe it’s something that everyone who troubleshoots SQL can gain benefit from.

Here’s a relatively simple example which is based upon real world experiences of looking at this bug in Connect, which I was looking at recently. The main thing I want to show here is how to review the results of collection yourself, as obviously the setup of the actual event collection is already described on the connect site.

The scenario is relatively simple, and a typical one for extended event usage. You have a process which produces an arithmetic overflow at an unknown point. In the example of this bug, it’s within the SSIS package managing data upload for the performance data warehouse. It could be any large process application or tool though, which would prove more difficult to trace with something like profiler or any of the other troubleshooting tools.

From the connect site, a colleague in the product group has asked the reporting customer to configure and run the following event session:

-- XEvent Session Definition Arithmetic error, Log additional info
 
-- Sample Query: select convert(bigint,21568194 * 100)
if exists(select * from sys.server_event_sessions where name='ArithmeticError')
     DROP EVENT SESSION ArithmeticError ON SERVER
go
 
CREATE EVENT SESSION ArithmeticError ON SERVER
ADD EVENT sqlos.exception_ring_buffer_recorded (
ACTION (sqlserver.tsql_stack, sqlserver.sql_text)
                 WHERE (error = 8115 ))
ADD TARGET package0.ring_buffer,
ADD TARGET package0.asynchronous_file_target(
SET filename='c:\temp\ArithmeticErrors.xet', metadatafile='c:\temp\ArithmeticErrors.xem')
WITH (MAX_MEMORY = 4096KB, 
EVENT_RETENTION_MODE = ALLOW_MULTIPLE_EVENT_LOSS, 
MAX_DISPATCH_LATENCY = 1 SECONDS, 
MAX_EVENT_SIZE = 0KB, 
MEMORY_PARTITION_MODE = NONE, 
TRACK_CAUSALITY = OFF, 
STARTUP_STATE = OFF)
 
go
 
ALTER EVENT SESSION ArithmeticError ON SERVER
state=start
go
 
-- Cleanup the session 
-- drop EVENT SESSION ArithmeticError ON SERVER
 
 

You can run this code as is on a test server to try out the demo. You’ll need to ensure that you have a c:\temp directory for the results to output to a file. There are several different ways that you can consume or record the output from an extended event, but here we are writing to a file as said files can then easily be uploaded to us here at Microsoft. You can read about some of the other alternatives in the Books Online link above if you want.

Once you have the event set up, you can trigger the 8115 error, by un-commenting the “sample query” line at the top of the script. When you run it in SSMS, you should encounter the following error:

Msg 8115, Level 16, State 2, Line 1
Arithmetic overflow error converting expression to data type int.

If you then look in the c:\temp directory, you ought to notice the files the session is writing to have been created, however at this stage, the .xet file (which actually holds the event data as opposed to the xem file which holds the event meta data required to decode the event itself) is actually empty. This is because the session is still active and the data has not been flushed to the file. If you run the following to stop (but not drop) the event

 
ALTER EVENT SESSION ArithmeticError ON SERVER
state=stop
go
Then the data will be written to the file. You’ve now capture enough data for the purposes of this test, so you can drop the event session, the syntax for which is
drop EVENT SESSION ArithmeticError ON SERVER

So the question then comes, what to do with the files. I have to admit it took me a while to work this out, as it wasn’t obviously documented. I also needed the help of my good friend and pool partner, Spike, to write some XQuery in a nice way, as I don’t really like XQuery. The function is documented in Books Online though to be fair (although not the XQuery!)

There are 2 different ways you can go about reading this type of file in my opinion. Firstly you can just run the following query in SSMS and you’ll get an XML column in the results pain, if you output to grid. You can then just click on the column and you’ll get to look at the XML document in a new SSMS window. Quick and undeniably dirty yes, but it works quite effectively for a simple scenario like this:

-- straight select, view the XML manually
select CAST(event_data as XML) , * 
from sys.fn_xe_file_target_read_file
(
'C:\temp\ArithmeticErrors_0_128981030796910000.xet', 
'C:\temp\ArithmeticErrors_0_128981030796920000.xem', 
null, 
null
)

Note that the actual file names have been modified automatically by the event engine to guarantee uniqueness. When you run this, you should get an output grid which looks like this:

image

If you then click on the first XML column, you’ll open up a new tab with the XML document for viewing, which will look something like this:

<event name="exception_ring_buffer_recorded" package="sqlos" id="21" version="1" timestamp="2009-09-22T14:33:06.537Z">
  <data name="id">
    <value>0</value>
    <text />
  </data>
  <data name="timestamp">
    <value>0</value>
    <text />
  </data>
  <data name="task_address">
    <value>0x00000000000ffdc8</value>
    <text />
  </data>
  <data name="error">
    <value>8115</value>
    <text />
  </data>
  <data name="severity">
    <value>16</value>
    <text />
  </data>
  <data name="state">
    <value>2</value>
    <text />
  </data>
  <data name="user_defined">
    <value>false</value>
    <text />
  </data>
  <data name="call_stack">
    <value />
    <text />
  </data>
  <action name="tsql_stack" package="sqlserver">
    <value>&lt;frame level='1' handle='0x01000100FC030E3850E97786000000000000000000000000' line='1' offsetStart='0' offsetEnd='0'/&gt;</value>
    <text />
  </action>
  <action name="sql_text" package="sqlserver">
    <value>select convert(bigint,21568194 * 100)</value>
    <text />
  </action>
</event>

Most of this XML ought to be self explanatory. The bit that is most interesting to us here is the “sql_text” section which shows the actual text being executed. In a real world scenario where we didn’t know what code was causing our error, we’ve now tracked it down.

The tsql_stack section is also useful. You can pass the handle you get back from this section into a DMV to get further details about the object or statement being executed (if it’s part of a stored procedure for example). A typical syntax for this would be

SELECT * FROM sys.dm_exec_sql_text(0x01000100FC030E3850E97786000000000000000000000000)

However, you should note that in this very simple example, this will return nothing. This is because a plan has not been cached for our simple select statement that only does a mathematical  calculation. This is of course possible in other scenarios where the plan which was used at the time the event captured it, is no longer cached, due to it being aged out, or the server being re-started.

If you want to follow this example and be able to look at the sql_text plan handle, then you need to write an actual query which fetches data and has an arithmetic overflow. For example, simple procedure like this

create procedure mybadproc
 
as
 
select *
from sys.servers
 
select convert(bigint,999999999 * 100)

will cache a plan when executed, and will be captured by the same event session we set up above. If you execute this, or other more complex statements / procedures, you can get back multiple items in the tsql_stack, each with separate handles which you can then pass into DMV based queries which accept plan handles. If you capture the execution of the above procedure in this event session, you’ll get back a slightly different XML document, where the tsql_stack has multiple values and the sql_text is now the name of the procedure that we executed. You can then pass any of the handles into DMV queries which accept such parameters.

<action name="tsql_stack" package="sqlserver">
   <value>&lt;frame level='1' handle='0x03000D00E7A4787D959020018C9C00000100000000000000' line='8' offsetStart='132' offsetEnd='-1'/&gt;
          &lt;frame level='2' handle='0x01000D00D8E5380F60E77786000000000000000000000000' line='1' offsetStart='0' offsetEnd='-1'/&gt;</value>
   <text />
 </action>
 <action name="sql_text" package="sqlserver">
   <value>mybadproc</value>
   <text />
 </action>
/event>

So to summarise part 1 of this post, we’ve looked at how to setup an extended event session and capture the output to a file. We’ve then seen a very simple way of looking at the output to manually identify a condition captured by our session. In the next part I’ll show some XQuery to review the output in more detail.

Troubleshooting SQL Server performance problems associated with low memory conditions – should I use sys.dm_os_ring_buffers

I recently received a question through this blog about the use of the sys.dm_os_ring_buffers DMV with regard to troubleshooting performance problems and low memory conditions. Here are some excerpts – edits from my reply which might provide some ideas for you.

Here's an interesting article you might like to read about monitoring memory usage with ring buffers:

http://blogs.msdn.com/slavao/archive/2005/02/19/376714.aspx

In terms of my personal preferences regarding performance monitoring in SQL Server 2005, I find it rare that I have to use the ring buffers. They are of course undocumented externally (which is why you are having trouble finding information on them!), and so I tend to discuss with customers the usage of published documented tools as in 99% of cases these are enough to diagnose memory problems (and most others).

I would potentially take a different approach. As I often write in my blog I like to use the SQLDIAG tool to collect a variety of data when I am experiencing / monitoring a problem. In this case it might be useful to follow my post about collecting the DBCC MEMORYSTATUS output, in addition to the standard data when you are experiencing this problem. If you have experiencing a memory issue somewhere, this will give you the breakdown to diagnose it pretty much in all cases, and all the results of this output are publically documented. Alternatively though you could change the script which collects DBCC MEMORYSTATUS to collect some other hybrid selection of DMV based output, or even the ring buffers output.

The other benefit of this approach is that you collect a whole set of data which allows you to keep an open mind (which I like to do) about what the cause of the performance problem really is.

So in summary, should a customer come to me with a rough issue description like this (I have performance problems and I believe I am experiencing low memory conditions, my initial action plan would be as follows:

Run SQLDIAG for 30 minutes when the problem was occurring

Collect the SQL 2005 perfstats scripts

Collect DBCC MEMORYSTATUS add in

Collect a lightweight profiler (do not include statement level events or showplans)

Collect perfmon with all SQL Server counters like Buffer Manager added in .

And then I would review all the data. You could of course consider using SQLNEXUS to do this, or alternatively review the output manually as its all from documented in sources.

Here is a particularly good article which covers a lot of these documented issues, and has an excellent section of memory troubleshooting.

http://technet.microsoft.com/en-us/library/cc966540.aspx

Finally I just remembered that there is this public article as well which does document a good use for the sy.dm_os_ring_buffer output for a specific condition.

How to reduce paging of buffer pool memory in the 64-bit version of SQL Server

Update: another more detailed post about resource manager and the ring buffer records it writes is now available here:

http://blogs.msdn.com/psssql/archive/2009/09/17/how-it-works-what-are-the-ring-buffer-resource-monitor-telling-me.aspx

That said my comments still remain the same when you are going into a more generic performance problem.

Graham

Posted by grahamk | 0 Comments

Fixed - incorrect CPU time reported by sys.dm_exec_query_stats for parallel plans

I was discussing with Elisabeth back in this blog post about the above problem, and made comments that there were no plans currently to backport this to SQL 2005 or 2008. This changed following feedback both from customers and from us within CSS, and the fix was made part of SQL Server 2008 SP1. You'll note if you look at the KB article for SP1, that it doesn't have it's own KB, but it does get a mention in the fix list.

Enjoy!

Posted by grahamk | 0 Comments
Filed under: ,

Troubleshooting SQL Server deadlocks and performance when running SharePoint

I’ve worked on a lot of cases this year involving SharePoint. Here are some example problems I encountered recently. If you want to read the background or the formal advice first, here are some good starting links

Planning and Monitoring SQL Server Storage for Office SharePoint Server: Performance Recommendations and Best Practices (white paper)

Plan for performance and capacity (Office SharePoint Server)

In addition to this, you need to consider what you can and can’t do from a SQL Server perspective, if you’re trying to optimize or troubleshoot your SharePoint deployment. You need to pay detailed attention to these articles:

http://support.microsoft.com/kb/932744/

http://support.microsoft.com/kb/841057/

In simple terms, don’t go amending the schema or the objects! So back to my main point…

Things to do first:

1. Collect some decent SQL Server diagnostic data – my preferred option,as ever, SQLDIAG with perfstats scripts

2. Turn on trace flag 1222 to output deadlock information to the errorlog, or capture deadlock graphs in your profiler traces.

3. Collect this data at period of both good and bad performance, so you have a benchmark to compare against

4. Aggregate the profiler output. Personally I use readtrace.exe directly from RMLUtilities, but other like to use SQLNexus as well.

Here are the 2 real examples of problems I encountered recently, and diagnosed using the above data collection:

1. Are you experiencing significant parallel waits – SQL Server installations will have a default Max Degree of Parallelism set to 0. I have observed certain MOSS deployments this year where this is not the best MAXDOP, dependent upon the specifics of the data distribution and deployment. If you look in the perfstats output you might see evidence like this:

-- requests --
runtime                        session_id request_id  ecid        blocking_session_id blocking_ecid task_state      wait_type
------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- ---------
2009-05-27T15:06:38.060                68           0           4                   0             0 RUNNING         CXPACKET 
2009-05-27T15:06:38.060                68           0           0                   0             4 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060                96           0           0                   0             0 SUSPENDED       WAITFOR  
2009-05-27T15:06:38.060               116           0           0                   0             0 RUNNING                  
2009-05-27T15:06:38.060               128           0          12                   0             0 RUNNING         CXPACKET 
2009-05-27T15:06:38.060               128           0           9                   0             0 RUNNABLE        CXPACKET 
2009-05-27T15:06:38.060               128           0           0                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           5                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0          11                   0             0 RUNNABLE        CXPACKET 
2009-05-27T15:06:38.060               128           0           6                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           2                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           3                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           4                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           7                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           8                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0           1                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               128           0          10                   0             0 RUNNABLE        CXPACKET 
2009-05-27T15:06:38.060               138           0          10                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0          12                   0             0 RUNNABLE        CXPACKET 
2009-05-27T15:06:38.060               138           0          11                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           9                   0             0 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           0                   0             3 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           3                   0             8 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           1                   0             8 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           2                   0             8 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           4                   0             8 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           7                   0            12 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           5                   0            12 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           6                   0            12 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               138           0           8                   0            12 SUSPENDED       CXPACKET 
2009-05-27T15:06:38.060               163           0           0                   0             0 RUNNABLE                 
2009-05-27T15:06:38.060               170           0           0                 163             0 SUSPENDED       LCK_M_X  

This is a classic example, where a long blocking chain starts to build due to CXPACKET waits. Note that the same sessionid is reported on multiple lines, but the ecid column changes. This indicates that the batch (sessionid) is being processed in parallel by more than one worked (ecid). This extract is very early on in the problem, from here the chains grow and grow, and eventually can result in multiple deadlocks. In this scenario the batches being executed by sessions 128 and 138 (which are both running PROC_LISTUNGHOSTEDFIELDTEMPLATESINLIST ) are picking parallel plans inappropriately. We worked around this by changing MAXDOP on the server to a value of 1. This is definitely something that you should test in your environment, since it’s a server wide setting. Currently according to the above KB articles there is no fixed recommendation about MAXDOP for MOSS deployments, but things can change, so you should check the fixed requirements of your version. In this scenario we saw significant other number of procedures which encountered the same issue, procedures such as proc_CheckDirectoryTree, proc_DeplGetListItemLinks and PROC_DEPLGETLISTITEMANDRESERVEID amongst others. Typical errors you might see in the application logs for SharePoint would be

Unexpected query execution failure, error code 1205. Additional error information from SQL Server is included below. "Transaction (Process ID 163) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."

A global change of MAXDOP = 1 resolved this specific situation.

2. Are you encountering long blocking chains accessing the AllDocs table? This object is commonly accessed by a number of different stored procedures, and I’ve seen some scenarios where it becomes a point of contention. The most problematic one I came across recently resulted in numerous deadlocks, which firstly throw the above 1205 error to the SharePoint application log, and then in addition they throw the following error:

InitializePublishingWebDefaults failed. Microsoft.SharePoint.SPException: Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT)) ---> System.Runtime.InteropServices.COMException (0x80004004): Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT))     at Microsoft.SharePoint.Library.SPRequestInternalClass.SetWebProps

This second error occurs as SharePoint does not roll back properly after encountering the deadlock. If you have collected your SQLDIAG data as noted above, you’ll see symptoms like this in the perfstats script output:

a) long blocking chains waiting on LCK_M_U

-- requests --
runtime                        session_id request_id  ecid        blocking_session_id blocking_ecid task_state      wait_type                                          wait_duration_ms     
------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- -------------------------------------------------- -------------------- 
2009-05-28T10:01:40.973                60           0           0                 197             0 SUSPENDED       LCK_M_X                                                            3201 
2009-05-28T10:01:40.973                72           0           0                  60             0 SUSPENDED       LCK_M_S                                                            3224 
2009-05-28T10:01:40.973                77           0           0                   0             0 SUSPENDED       WAITFOR                                                            9087 
2009-05-28T10:01:40.973                84           0           0                  72             0 SUSPENDED       LCK_M_S                                                            1018 
2009-05-28T10:01:40.973               197           0           0                   0             0 RUNNING                                                                               0 
2009-05-28T10:01:40.973               268           0           0                 270             0 SUSPENDED       LCK_M_U                                                           36299 
2009-05-28T10:01:40.973               270           0           0                 197             0 SUSPENDED       LCK_M_U                                                           96753 
2009-05-28T10:01:40.973               277           0           0                 270             0 SUSPENDED       LCK_M_U                                                           44290 
2009-05-28T10:01:40.973               280           0           0                 270             0 SUSPENDED       LCK_M_U                                                           60461 
2009-05-28T10:01:40.973               292           0           0                 270             0 SUSPENDED       LCK_M_U                                                           52389 

b) the notableactivequeries part of the perfstats script will potentially note that the plan_total_exec_count is very low. This indicates that you appear to be running a freshly compiled plan for the blocking statement. This will be all the more likely if your system was running well, and then suddenly started degrading performance and encountering errors

-- notableactivequeries --
runtime                        session_id request_id  plan_total_exec_count plan_total_cpu_ms    plan_total_duration_ms plan_total_physical_reads plan_total_logical_writes plan_total_logical_reads 
------------------------------ ---------- ----------- --------------------- -------------------- ---------------------- ------------------------- ------------------------- ------------------------ 
2009-05-28T10:01:40.973               197           0                     5                33057                 406079                      2550                         0                  2352015 
2009-05-28T10:01:40.973               268           0                     5                33057                 406079                      2550                         0                  2352015 
2009-05-28T10:01:40.973               270           0                     5                33057                 406079                      2550                         0                  2352015 
2009-05-28T10:01:40.973               277           0                     5                33057                 406079                      2550                         0                  2352015 
2009-05-28T10:01:40.973               280           0                     5                33057                 406079                      2550                         0                  2352015 
2009-05-28T10:01:40.973               292           0                     5                33057                 406079                      2550                         0                  2352015 

 

 

 

 

c) when reviewing the error logs you note that the same statement shown below, being executed within the procedure proc_UpdateTpWebMetaData, is always involved in the deadlocking

UPDATE
    Docs
SET
    Docs.BuildDependencySet = NULL
FROM
    Docs INNER JOIN BuildDependencies
ON
    BuildDependencies.SiteId = Docs.SiteId AND
    BuildDependencies.LeafName = Docs.LeafName AND
    BuildDependencies.DirName = Docs.DirName
WHERE
    BuildDependencies.SiteId = @SiteId AND
    BuildDependencies.TargetDirName = @PreviousCustomMasterDir AND
    BuildDependencies.TargetLeafName = @PreviousCustomMasterLeaf AND
    Docs.BuildDependencySet IS NOT NULL

The reason for this problem, in this particular scenario, is that a new plan has been compiled recently for this statement which is using a HASH join, as opposed to a nested loop join which it was using earlier. This is caused by a lack of up to date statistics on the AllDocs table. We noted scenarios where the regular update statistics jobs (issued by SharePoint as part of the changes in 2007 SP2) were not thorough enough, and we were forced to run update statistics with the fullscan option on the AllDocs table.

The bottom line here being that although SharePoint does now do database maintenance, as shown in KB932744, there are still times potentially where you will need to do more detailed targeted maintenance.

Download location for SQLDIAG Perfstats DMV scripts

I refer to this many times in my other posts, so here’s a direct link to the download location of the perfstats scripts for SQLDIAG which you can use yourself. I use this everyday, you should too!

http://blogs.msdn.com/psssql/archive/2007/02/21/sql-server-2005-performance-statistics-script.aspx

Posted by grahamk | 1 Comments
Filed under: ,

SQL Server 2005 Error: 701, Severity: 17, State: 193

Here’s another interesting example which we worked recently in the Stockholm office, which relates also back to one of my previous entries. We had a customer whose error log was filled with a variety of critical errors such as these:

Error: 701, Severity: 17, State: 193. There is insufficient system memory to run this query.

Error: 1204, Severity: 19, State: 2. The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

Error: 17803, Severity: 20, State: 13. There was a memory allocation failure during connection establishment. Reduce nonessential memory load, or increase system memory. The connection has been closed. [CLIENT: <local machine>]

This was a fairly high end system with 64GB RAM, 16 logical CPUs and was generally well configured in terms of its Max Server Memory and other basic sp_configure settings. As you may well be aware, there are innumerable causes that can actually cause you to receive a memory related error such as this, and some of the more basic troubleshooting tips are shown in the following MSDN article

http://msdn.microsoft.com/en-us/library/aa337311(SQL.90).aspx

However in this case none of the basics were really of any help. Typical starting questions to ask yourself in this scenario are:

1. Is there a pattern to when this occurs

2. Do I know of any particular workloads which are running when I see this problem

…just to understand if you have any particular potential causes in your own code. Our customer in this scenario was not aware of any, and unfortunately we didn’t have any dumps created which might allow us to see retrospectively what batches were running when the problem occurred.

However we did have the partial output of from DBCC MEMORYSTATUS which SQL had kindly written to the error log for us. An examination of this showed the following suspicious looking section:

OBJECTSTORE_LOCK_MANAGER (node 2)
    VM Reserved = 0 KB
    VM Committed = 0 KB
    AWE Allocated = 0 KB
    SM Reserved = 0 KB
    SM Committed = 0 KB
    SinglePage Allocator = 29398856 KB
    MultiPage Allocator = 0 KB

Global Memory Objects:
        Resource=    709
        Locks=    3676149
        SE Schema Mgr=    1646
        SQLCache=    261
        Replication=    2
        ServerGlobal=    109
        XPGlobal=    2
        Xact=    101
        SETLS=    16
        DatasetMemObjs=    32
        SubpDescPmos=    16
        SortTables=    2

From this we can infer that we seemed to experience an equivalent problem to the one I wrote about here

http://blogs.msdn.com/grahamk/archive/2009/02/03/compilation-bottlenecks-error-8628-severity-17-state-0-part-1.aspx

although manifesting distinctly different symptoms.

At the end of the day one of the NUMA nodes has taken up about 29GB of memory just managing lock resources, and this is confirmed by the number of locks which are reported in the global memory objects. Remember that the majority of the output from DBCC MEMORYSTATUS is publically documented here:

How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005

http://support.microsoft.com/kb/907877

This 29GB of memory purely managing locks appears to be placing an undue load of the memory resources of the system, and what we appear to have here is a runaway or huge batch of some type, or maybe a very large open transaction.

The way to resolve this problem is to identify the batch and stop it from running. One good way to identify this problem proactively might be to use the tiny DMV query I wrote in my previous post about this type of situation and attach a monitor to it, warning you if the value goes above 1 million say, or some other high number of your choice.

select request_session_id, count(*)
from sys.dm_tran_locks
group by request_session_id

You could then easily identify the problem session_id before it got completely out of hand. There are of course many different alternative ways of capturing such data or doing such monitoring. Maybe you’d prefer to create a custom SQLDIAG package to monitor your server in more detail.

The moral of this story being though, that even in the most critical system failures such as this, there may well be enough info to get a hint of where the problem lies, if you make sure you go through the error logs in detail. Remember not to be overwhelmed by the sheer volume of errors that might be reported; ensure you go through all the lines removing the duplicates to see whether something of interest is hidden amongst all the noise.

Posted by grahamk | 1 Comments

SQL Server Error: 7320 Cannot execute the query "xxx" against OLE DB provider "DB2OLEDB" for linked server "xxx"

Here's a little something that me and my colleagues were looking at this week. Our customer was executing queries against a DB2 linked server using the latest provider and occasionally got these errors. One of my colleagues suggested to set a dump trigger on the error 7320 so that we could capture a sqlservr.exe dump when the problem next occurred. We achieved this by running

dbcc dumptrigger('SET',7320)

on the server. By default this will collect a mini-dump, unless you have enabled any other trace flags previously. When we had encountered the error and analyzed the dump, it showed up the following stack

sqlservr!ex_raise+0x52
sqlservr!COledbError::HandleExecuteError+0x11b
sqlservr!COledbRangeRowset::FDoPerformSchemaCheckForQuery+0xc7
sqlservr!COledbRangeRowset::FPerformSchemaCheck+0x1fb
sqlservr!COledbRangeRowset::FCheckSchema+0x2a
sqlservr!CEnvOledbRange::XretSchemaChanged+0x16
sqlservr!CXStmtDML::XretDMLExecute+0x41037b
sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0xace
sqlservr!CMsqlExecContext::FExecute+0x439
sqlservr!CSQLSource::Execute+0x355
sqlservr!CStmtExecStr::XretExecStrExecute+0x48d
sqlservr!CXStmtExecStr::XretExecute+0x20
sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x39e
sqlservr!CMsqlExecContext::FExecute+0x439
sqlservr!CSQLSource::Execute+0x355
sqlservr!CStmtExecProc::XretLocalExec+0x125
sqlservr!CStmtExecProc::XretExecExecute+0x132
sqlservr!CXStmtExec::XretExecute+0x20
sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x39e
sqlservr!CMsqlExecContext::FExecute+0x439
sqlservr!CSQLSource::Execute+0x355
sqlservr!process_request+0x312
sqlservr!process_commands+0x3fd

 

Before going into any in depth debugging which would involve monitoring the remote server and the components in between, my colleague thought it would be sensible to check the setting surrounding Schema Validation as we were obviously throwing the error in a routine related to this functionality (FDoPerformSchemaCheckForQuery). So we changed the option of the linked server to say enable "Lazy Schema Validation", using this syntax:

exec sp_serveroption xxxxx, 'lazy schema validation', true

If you look this option up in MSDN under "Configuring OLE DB Providers for Distributed Queries" you'll note the following regarding this option:

"If this option is set to false, the default value, SQL Server checks for schema changes that have occurred since compilation in remote tables. This check occurs before query execution. If there is a change in the schema, SQL Server recompiles the query with the new schema.

If this option is set to true, schema checking of remote tables is delayed until execution. This can cause a distributed query to fail with an error if the schema of a remote table has changed between query compilation and execution.

You may want to set this option to true when distributed partitioned views are being used against a linked server that is running SQL Server. A specified table that participates in the partitioned view may not be actually used in a specific execution of a query against the view. Therefore, deferring the schema validation may improve performance."

Now in all truth this was quite a pragmatic approach, in that we hadn't fully debugged the problem, we were just taking an educated guess, that we were experiencing a problem surrounding schema validation, and so far, since implementing this option, the errors have not re-occurred, so we've managed to workaround the problem.

For reference the customer in question was also experiencing other errors in the 7300 to 7320 range, which have also stopped occurring since this option was set. Errors such as

Error: 7350 Cannot get the column information from OLE DB provider "DB2OLEDB" for linked server "xxx"

Error: 7321 An error occurred while preparing the query "xxx" for execution against OLE DB provider "DB2OLEDB" for linked server "xxx"

Error: 7303 Cannot initialize the data source object of OLE DB provider "DB2OLEDB" for linked server "xxx"

The moral of this story is that linked servers can be very difficult and time consuming to troubleshoot, due to the number of distributed components, including ones outside of the SQL Server area completely. This is one time where I am a fan of testing some of the different settings in the provider to see if the problem can be worked around initially, before doing some in depth analysis. In most other cases I'd far prefer to know why, and we may yet revisit this particular situation to fully debug it, but sometimes a workaround and an error free production environment is also nice for a quick win.

For future reference another interesting quick diagnostic that can be performed on linked servers is to run Process Monitor on the server in question to see whether you are having any obscure permission of registry problems. I've seen this give some quick wins as well.

New SQL Server 2008 DMV which I noticed this week - sys.dm_os_sys_memory

In the continuing series of "things I notice in SQL Server which I didn't know were there", I noticed the very simple but very useful sys.dm_os_sys_memory earlier this week.

Why do I like this? Well obviously you can extract the memory information exposed by the OS in a number of different places, but as much as anything, I like the simple record of the state of the CreateMemoryResourceNotification Function , from the system_low_memory_signal_state column in the DMV.

We've added this into the internel PSSDIAG / SQLDIAG perfstats scripts for 2008 (which is where I noticed it). For those of you that use proactive monitoring and record the state of your system over time, this is a good simple set of data to record.

Posted by grahamk | 1 Comments
Filed under: ,
More Posts Next page »
 
Page view tracker