SQL in Stockholm

SQL Server thoughts from the support team in Stockholm

SQL Server troubleshooting thoughts from Graham Kent, (ex) Senior Support Escalation Engineer at Microsoft Sweden.

  • SQL in Stockholm

    Performance Problems in SQL Server - What data should I collect?

    • 5 Comments

    If you ever talk to me or any of my colleagues in the SQL Server PSS support team about a suspected SQL Server performance problem, you’ll undoubtedly be asked to run the PSSDIAG tool, which usually collects the data that we need. This tool acts as a one-stop shop for us to pick up a variety of data sets, including SQL Profiler, Performance Monitor, SQL Server configuration information, SQL Server logs, blocking scripts and many more custom scripts of our choosing. We published a version of the tool for SQL Server 7 and 2000 under http://support.microsoft.com/kb/830232. Although we still maintain a version for SQL Server 2005, the tool was such a success that it began to be incorporated into the core product itself, enabling people to analyze their own performance problems more easily, or to collect the data that PSS would need to help them.

    Nowadays the public version of the tool is named SQLDIAG and installs by default with all SQL Server 2005 installations. You can read about its options in Books Online http://msdn2.microsoft.com/en-us/library/ms162833.aspx.

    It works very simply in default mode; try it now on a test box. Open a command prompt and switch to this directory:

    C:\Program Files\Microsoft SQL Server\90\Tools\binn

    And run the following:

    sqldiag.exe

    You’ll see something like the following

    [screenshot: sqldiag.exe console output]

    By default it connects to the local machine name, and to every SQL 2005 instance it can find. There are many switches and options, and a detailed configuration file that can be used to tailor the process (more on this later), but if you’re experiencing a problem and you want to grab some data quickly without thinking too much, you can just do this. The collector is now writing out data to a sub-directory named sqldiag, and as it says, when you’re ready to stop, simply press Ctrl+C. The collector might take a few minutes to shut down, depending upon the speed of your server. When it’s finished, take a look in the sqldiag sub-directory and you’ll see a collection of files. By default you’ll have collected the following:

    Default Trace files (if server option enabled)

    sqldiag output text file (contains SQL Server error logs, configuration information, internal performance counters, the output from several key DMVs and much more)

    msinfo32.txt

    sqldumper error log

    log and control files for the sqldiag process.

    I’ll talk about interpreting this data at another time, but you’ve already collected some interesting data about your SQL Server instance and what it was up to whilst the collector was running.
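
    As a small taster of what you can do with this output, the Default Trace files it collects can be opened in Profiler, or queried directly with fn_trace_gettable. The file path below is purely an illustration; substitute the .trc file name from your own sqldiag output directory:

    -- Where is the default trace writing on this instance?
    select path from sys.traces where is_default = 1;

    -- Load a collected default trace file into a result set
    -- (replace the path with a .trc file from your sqldiag output directory)
    select StartTime, EventClass, DatabaseName, ApplicationName, LoginName, TextData
    from fn_trace_gettable('C:\temp\sqldiag\MyServer_default_trace.trc', default);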

    What you should also notice is that when you ran the tool in this mode for the first time, it unpackaged several .XML files which it uses for configuration purposes. These are as follows:

    SQLDiag.XML

    SD_Detailed.XML

    SD_General.XML

    By default if you don’t use the /I switch (to specify an input file) the tool uses the SQLDiag.XML config file and you get the results as above. The true power of SQLDiag in my opinion though is in amending these config files and collecting data sets of your choice. The extra config files automatically created have the following collections turned on:

    SD_Detailed.XML

    Windows Event Logs

    Perfmon

    SQL Profiler (including several verbose events such as performance statistics, SQL:BatchStarted and SQL:BatchCompleted)

    SD_General.XML

    Windows Event Logs

    Perfmon

    SQL Profiler (with less verbose events)

    Within both the perfmon and SQL Profiler events, you have the ability to configure every potential sub event on or off, based upon the situation that you wish to monitor. All you do is edit a copy of the XML file and change individual elements like these from “true” to “false” and vice versa.

    <PerfmonObject name="\MSSQL$%s:Buffer Manager" enabled="true">
    <PerfmonCounter name="\*" enabled="true" />
    <PerfmonCounter name="\Buffer cache hit ratio" enabled="true" />
    
    <EventType name="Locks">
    <Event id="148" name="Deadlock Graph" enabled="true" description="Occurs when an attempt to acquire a lock is ......" />

    In many cases though, one of the default templates will contain enough information for you to make an initial assessment of the situation. To run sqldiag with one of the other config files (or one you create yourself), use the following syntax:

    sqldiag.exe /ISD_General.XML

    and you might also want to add a new output directory:

    sqldiag.exe /ISD_General.XML /Oc:\sqldiagoutput

    One thing that the default configs don’t enable though is the blocking collector, which I do think is particularly useful. To enable this collector you need to create a copy of the SQLDiag.XML and edit the following section:

    <BlockingCollector enabled="false" pollinginterval="5" maxfilesize="350"/>

    Changing the BlockingCollector enabled value to “true” will turn on the blocking collector, and provide a blocked process report for any blocking experienced whilst the collector is running. This is a particularly common cause of application performance issues and well worth enabling.
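
    As a side note, if you want a quick manual look at blocking while the collector is running (or before you have it configured), a simple DMV query along these lines shows who is blocked and by whom. This is only a rough sketch of the kind of information the blocking collector gathers for you in a far more complete form:

    -- Which requests are currently blocked, and by which session?
    select r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time,
           t.text as current_statement
    from sys.dm_exec_requests as r
    cross apply sys.dm_exec_sql_text(r.sql_handle) as t
    where r.blocking_session_id <> 0;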

    In addition to these configurations, one of my colleagues in the US has posted the internal performance statistics T-SQL script which we use to capture all sorts of data from the SQL Server DMVs. This is the principal script that I use every day to troubleshoot performance problems. It doesn’t install by default with a standard SQL Server setup like the rest of SQLDIAG, but you can download it from the PSS SQL Server Engineers blog here. He also helpfully provided an XML config file with all the required entries to make it run. I found the easiest way to deploy it in a default install of SQL Server was to unzip the 2 SQL scripts and the 2 XML files into the directory where sqldiag.exe lives. You can then run the same diagnostics as above, but include these extra T-SQL scripts, by running something like this:

    sqldiag /ISQLDiagPerfStats_Trace.XML /Oc:\sqldiag

    You can also use the cmd files he provided for registering and running SQLDiag as a service. Either way you get 3 extra text files in the output.

    Over time you’ll find it useful to build up a selection of your own config files; then you can just copy them to any server you wish to monitor, based upon the situation you want to look at. For example, I keep custom ones to hand which focus on the following subjects:

    Memory Usage

    Blocking and general Profiler Trace

    Disk Performance

    Profiler Trace

    Recompilation

    Remember though, if you enable a custom configuration file like the ones above, your collections can be extremely verbose. You should therefore consider the following:

    1. Disk space for the output: potentially many GBs on heavily loaded machines

    2. Disk drive chosen for the output: if you want to monitor disk performance, don’t write the output to the drive you are monitoring, and don’t ever write the output to a network drive

    3. Verbose events such as performance statistics can place a heavy CPU load on a server. I only enable these if I’m certain that I need them.

    Finally you might wish to configure the collector to only run against specific machines or instances, and the following XML elements allow you to do this:

    a) Change the machine name element to the name of the server you wish to connect to (the default of “.” means connect to the local machine)

    <Machines>
    <Machine name=".">

    (Or use the /M switch when running from the command line.)

    b) Change the instance name element to the SQL Server instance that you wish to connect to (the default of “*” connects to every instance it can on the machine specified above).

    <Instance name="*" windowsauth="true" ssver="9" user="">

    I hope this all makes sense. If you have any questions, just post a comment. Over the coming posts I'll talk about how to interpret the data you collect.

  • SQL in Stockholm

    Database Mirroring Monitor can cause large tempdb growth in SQL Server 2005 - changed behavior in SQL Server 2008

    • 3 Comments

    On extremely rare occasions in SQL Server 2005, when the following conditions are true:

     

    1. You have a database configured for database mirroring

    2. You have replication configured on the same database

    3. You are running the database mirroring monitor

    4. The SQL Agent job created by the database mirroring monitor is enabled

     

    It was possible to encounter a situation where tempdb grew suddenly and significantly, leading to all the complications that such a situation causes, including but not limited to:

     

    Msg 1105, Level 17, State 2, Line 3

    Could not allocate space for object xxxxxxx in database 'tempdb' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

     

    Error: 17053, Severity: 16, State: 1.

    C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\tempdb.mdf: Operating system error 112(There is not enough space on the disk.) encountered.

     

    Error: 1101, Severity: 17, State: 2.

    Could not allocate a new page for database 'tempdb' because of insufficient disk space in filegroup 'PRIMARY'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

     

    If you were able to capture this situation and have some monitoring running, you would notice the following event captured in profiler:

     

    An execution of the following which is the main entry point stored procedure for the mirroring monitor:

     

    exec sys.sp_dbmmonitorupdate @database_name

     

    followed by an execution of :

     

    exec @retcode = sys.sp_dbmmonitorMSgetthelatestlsn @database_name, @end_of_log_lsn output

     

    which was the last procedure executed by the mirroring monitor before tempdb started to grow. If you were to trace the details of this procedure, you could see the statements which it executes. What you would see is that the last statement actually executed before tempdb starts filling is this:

     

    dbcc dbtable(blog_mirror) with tableresults, no_infomsgs

     

    Since this procedure is actually held in the resource database (the others are in msdb, so you can take a look easily if you want to), it is protected from normal viewing, so I won't reproduce it here, but it does exactly what its name would imply: it gets the latest LSN and then uses this to calculate the amount of time required to synchronize, among other things.

     

    The problem is that in extremely rare occurrences this procedure can cause huge tempdb growth, due to the way it calculates the latest LSN using dbcc dbtable (an undocumented internal DBCC command) and stores its temporary calculations in a temp table. Unfortunately the only way to resolve this problem in SQL Server 2005 is to disable the mirroring monitor job in SQL Agent. You should only consider or need to do this if you are definitely experiencing this specific problem, i.e. you have all the pre-requisite conditions and you have a profiler trace proving that the dbcc command was the command executed before tempdb filled up.
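
    If you do find yourself in that situation, the job can be disabled through SQL Server Agent in Management Studio, or with T-SQL along these lines. The job name below is the one the monitor normally creates, but verify the exact name in msdb.dbo.sysjobs on your own server before running it:

    -- Disable the Database Mirroring Monitor job in SQL Agent
    -- (check the exact job name in msdb.dbo.sysjobs first)
    exec msdb.dbo.sp_update_job
         @job_name = N'Database Mirroring Monitor Job',
         @enabled = 0;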

     

    However the good news is that in SQL Server 2008 we made a design change to collect this data in a far more efficient way. If you look at the results returned from a query of the sys.database_mirroring DMV in SQL 2008:

     

    select *

    from sys.database_mirroring

     

    you will note 2 columns have been added to the output:

     

    mirroring_end_of_log_lsn

    and

    mirroring_replication_lsn

     

    This means that the information the mirroring monitor requires is now available in a much simpler form. Therefore if you look at the SQL 2008 version of the msdb stored procedure sys.sp_dbmmonitorupdate, you will note that the execution of sys.sp_dbmmonitorMSgetthelatestlsn has been replaced with a simple select from the new DMV columns:

     

    select @role = (mirroring_role - 1),
           @status = mirroring_state,
           @witness_status = mirroring_witness_state,
           @failover_lsn = mirroring_failover_lsn,
           @end_of_log_lsn = mirroring_end_of_log_lsn
    from sys.database_mirroring
    where database_id = @database_id

     

    which is simpler, faster, and avoids any possibility of the tempdb problem occurring.

     

  • SQL in Stockholm

    Interesting things to watch out for when using the autoclose database option

    • 3 Comments

    Here's something which I've been looking at this week, surrounding the autoclose database option.

    When you restore any database to a SQL Server 2005 SP2 or later instance, you'll more than likely see messages of this nature written to the log.

    2008-11-12 17:24:16.58 spid13s     SQL Server has encountered 5 occurrence(s) of cachestore flush for the 'Object Plans' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.
    2008-11-12 17:24:16.58 spid13s     SQL Server has encountered 5 occurrence(s) of cachestore flush for the 'SQL Plans' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.
    2008-11-12 17:24:16.58 spid13s     SQL Server has encountered 5 occurrence(s) of cachestore flush for the 'Bound Trees' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.

    This is expected behavior and is documented in Books Online

    "Restoring a database clears the plan cache for the instance of SQL Server. Clearing the plan cache causes a recompilation of all subsequent execution plans and can cause a sudden, temporary decrease in query performance. In SQL Server 2005 Service Pack 2, for each cleared cachestore in the plan cache, the SQL Server error log contains the following informational message: "SQL Server has encountered %d occurrence(s) of cachestore flush for the '%s' cachestore (part of plan cache) due to some database maintenance or reconfigure operations". This message is logged every five minutes as long as the cache is flushed within that time interval."

    I'd been looking at some problems which at first glance appeared to be related to this, in that after restoring a database there were lots of occurrences of cachestore flushes, which were causing performance problems.

    The answer in this case was that this database was in autoclose mode. A quick check of sp_dboption confirmed this. Another quick check of Books Online confirms that autoclose databases will cause a cache flush event when they shut down.

    "When the database is set to AUTOCLOSE = ON, an operation that initiates an automatic database shutdown clears the plan cache for the instance of SQL Server. Clearing the plan cache causes a recompilation of all subsequent execution plans and can cause a sudden, temporary decrease in query performance. In SQL Server 2005 Service Pack 2, for each cleared cachestore in the plan cache, the SQL Server error log contains the following informational message: "SQL Server has encountered %d occurrence(s) of cachestore flush for the '%s' cachestore (part of plan cache) due to some database maintenance or reconfigure operations". This message is logged every five minutes as long as the cache is flushed within that time interval."

    I'd forgotten about this behavior, as it's very rare that I deal with autoclose databases; most of the systems I tend to work with don't use it, since autoclose is designed more for desktop systems, as Books Online states.

    The act of the cache flush in itself is not particularly problematic, unless of course you constantly access the database and then disconnect from it, causing numerous instance-wide cache flushes, which will severely impact performance on any server.

    Another indication of this type of behavior is that your error log will be full of messages like these:

    2008-11-13 16:10:16.07 spid51      Starting up database 'test'.
    2008-11-13 16:10:16.22 spid51      CHECKDB for database 'test' finished without errors on 2008-11-12 16:11:55.453 (local time). This is an informational message only; no user action is required.
    2008-11-13 16:10:18.46 spid51      Starting up database 'test'.
    2008-11-13 16:10:18.50 spid51      CHECKDB for database 'test' finished without errors on 2008-11-12 16:11:55.453 (local time). This is an informational message only; no user action is required.

    The thing that I didn't realize, and that really surprised me, was that it is possible to cause such flushes just by accessing the database through Management Studio. If you are the only user of the database and you decide to browse its contents in SSMS, you open a connection, and then every time you expand an object explorer tree control, for example, you issue a query to the server to generate the list of objects in that part of the tree (this is easy enough to trace in profiler). However, when such a query finishes, if you are the only user, the database closes as it deems all active users to be disconnected, and therefore a cache flush event takes place. This of course is a bad thing, especially if you happen to restore your autoclose database to an instance which is running lots of other production databases maintaining large numbers of cache entries on your server.
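
    If you want to check whether any databases on an instance are running with this option, and turn it off where it isn't appropriate, something along these lines will do it (the database name is just the example from the log output above):

    -- Which databases have autoclose turned on?
    select name, is_auto_close_on
    from sys.databases
    where is_auto_close_on = 1;

    -- Turn it off for a specific database
    alter database [test] set AUTO_CLOSE off;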

    Just something to think about....

  • SQL in Stockholm

    Database Mirroring Error 1418 Troubleshooter

    • 3 Comments

    I've been posting in the MSDN and TECHNET mirroring forums in the last few days and I've noticed a lot of occurrences of the 1418 error which I have referred to briefly in this blog before. Therefore I thought I'd post a 1418 specific troubleshooter for reference. The problem we're trying to solve is that you try to start database mirroring and you receive this error:

    Msg 1418, Level 16, State 1, Line 1
    The server network address "TCP://myserver.domain.domain.domain.com:5022" can not be reached or does not exist. Check the network address name and reissue the command.

    The following assumes that:

    1. You are using separate physical servers for the mirror partners. If you're not, you might want to read this about configuring mirroring on local instances and a typical problem you can come across.

    2. You're using domain accounts for SQL Server services

    3. You're using Windows authentication and you're not using certificates

    4. You've already read this, or are familiar with the overall principle.

    So here are the things that I would check:

    1. Have you restored a fairly recent full backup to the planned mirror server, with NORECOVERY? Have you then restored a transaction log, also with NORECOVERY? If you haven't, you won't be going anywhere with mirroring. Maybe you should read this first :-)

    2. Have you configured your endpoints correctly? To be specific, this means:

    a) Are they started?

    b) Are they using the same form of authentication?

    c) Are they using the same form of encryption?

    d) Does the service account have permission to use them?

    Even if you think or assume you have done this correctly (including if you've used the wizard to configure them), don't assume they are correct. You've encountered an error and you need to be sure. You need to script each endpoint and check that all the settings are correct (a query sketch for these checks follows after this list). I've seen this happen too many times to mention: people thought they were OK (because they "ought" to have been), but it was something simple like one of the endpoints not being started. Using the wizard does not guarantee that everything will always work. Remember, all the wizard does is execute T-SQL, so mistakes or problems are still possible.

    3. Are the ports you have selected valid, and are they open? Are they being used by any other processes? Can you telnet to the port?

    4. Is there a firewall (hardware or software)? Have you created an exception / rule to the firewall? Windows firewall is no exception to this rule. It will block mirroring traffic just as much as any physical firewall will.

    5. Is the FQDN valid? Can it be resolved? Again, can you telnet to it? If you're not using an FQDN and you're using direct IP addresses, the same rules still apply. Can you telnet to it? If you're not using an FQDN or IP, have you checked the above post for resolving names locally?
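
    As mentioned in point 2, here is a sketch of the kind of endpoint checks to run on each partner. The endpoint name and service account below are just placeholders; use the names from your own configuration:

    -- Review the mirroring endpoint definition on each partner
    select e.name,
           e.state_desc,
           e.role_desc,
           e.connection_auth_desc,
           e.encryption_algorithm_desc,
           t.port
    from sys.database_mirroring_endpoints as e
    join sys.tcp_endpoints as t on e.endpoint_id = t.endpoint_id;

    -- If the endpoint is stopped, start it
    -- alter endpoint [Mirroring] state = started;

    -- Make sure the service account of the other partner can connect
    -- grant connect on endpoint::[Mirroring] to [DOMAIN\SQLServiceAccount];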

    I hope this helps. It's my experience that these points will solve the majority of 1418 errors. If you have any questions about this topic feel free to post here or in the forums.

    Graham

  • SQL in Stockholm

    Problems executing maintenance plans in SQL Server 2008

    • 3 Comments

    In the RTM build of SQL Server 2008, if you install the tools in any edition, you are able to create maintenance plans. However, depending upon your edition and your exact installation components, when you execute the maintenance plan you created, you may receive the following message:

    Executed as user: <domain>\<user>. Microsoft (R) SQL Server Execute Package Utility  Version 10.0.1600.22 for 32-bit  Copyright (C) Microsoft Corp 1984-2005. All rights reserved.    The SQL Server Execute Package Utility requires Integration Services to be installed by one of these editions of SQL Server 2008: Standard, Enterprise, Developer, or Evaluation. To install Integration Services, run SQL Server Setup and select Integration Services.  The package execution failed.  The step failed.

    This is caused by the fact that you need to have the full SSIS component installed to execute the package that you have created. Many people will not come across this message as they have just installed the SSIS component by default, but you should consider the following scenario:

    If you are running a lower level edition, such as web or workgroup, then you cannot install SSIS as part of that setup and configuration, as it is not available in that edition. Therefore to be able to run the packages that you have created, you would, as the message says, need to install the component, but from a licensed edition that supports SSIS.

    If you are running Standard or Enterprise edition, you will need to re-run setup and add the SSIS component. This will allow you to run the maintenance plan.

    This behavior is by design as SSIS is only supported in the higher editions of SQL Server. However we understand that for some customers this is not the optimal type of behavior with regard to maintenance plans (this is a good example of Microsoft Connect feedback), and as such it has been logged and is under review by the development team.

  • SQL in Stockholm

    Which sessionid has the most locks in a SQL Server instance

    • 2 Comments

    Quick post today. I was playing around with something on Friday night, thinking how to answer this question quickly whilst looking at a live performance problem. I have a huge selection of ready-to-go DMV based queries, aimed at targeting all types of performance problems, but this was one I didn't have. The answer of course is beautifully simple through the magic of DMVs. Do you remember when these didn't exist? Sometimes it's easy to forget :-) The answer, and a good one to have in your troubleshooting code toolbox:

    select request_session_id, count(*) as lock_count
    from sys.dm_tran_locks
    group by request_session_id
    order by count(*) desc

  • SQL in Stockholm

    SQL Server Compilation Bottlenecks – Error: 8628, Severity: 17, State: 0. – Part 1

    • 2 Comments

    [alternative title: Memory Troubleshooting using DBCC MEMORYSTATUS]

    The following samples are taken from a live problem I was working on for most of last week. The steps might well help you resolve an occurrence of the above error, but the methodology might help you with some general performance and memory troubleshooting as well, hence the alternative title.

    Symptoms:

    • A previously well running server starts exhibiting extremely slow performance for all users during the online day.
    • CPU time is maxed out at 100%
    • The issue is intermittent and has no correlation to batch business processing or periods of peak usage

    Initial Actions:

    • Configure a PSSDIAG (but it could just as easily have been SQLDIAG) to collect the perfstats script and perfmon counters
    • Don’t include profiler trace to begin with, as the server is already maxed out for CPU and you might skew the results or make the situation worse, especially as going into the problem you don’t actually know what the problem is related to
    • Configure the collection for long term running, as you don’t know when the error will next occur

    Initial Results

    • When reviewing the error logs you note that Error: 8628, Severity: 17, State: 0. was logged repeatedly when the problem occurred in the past.
    • RESOURCE_SEMAPHORE_QUERY_COMPILE is the top aggregate wait stat in sys.dm_os_wait_stats (a quick query for checking this is shown after this list)
    • The server was clustered and a failover occurred when the problem last happened
    • No other items of note were recorded in the past
    • When the issue is not occurring everything looks fine on the server, there are no indications of dangerously long running queries or memory pressure etc.
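
    As referenced in the results list above, a quick way to look at the aggregate wait picture yourself is a simple query against sys.dm_os_wait_stats; in this case RESOURCE_SEMAPHORE_QUERY_COMPILE was right at the top:

    -- Top aggregate waits since the instance started (or since the wait stats were last cleared)
    select top 10
           wait_type,
           waiting_tasks_count,
           wait_time_ms,
           signal_wait_time_ms
    from sys.dm_os_wait_stats
    order by wait_time_ms desc;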

    This is a common scenario that I come across. You know something bad has happened, you can't reproduce it on demand, but you suspect it will happen again. At this point I always like to do the above: review the general health of the server with PSSDIAG / SQLDIAG and then do another round of monitoring, once I know what I'm looking for. In this case I'm targeting the 8628 error as it seems pretty conclusively tied to the outage.

    One of the principal hits you get if you search microsoft.com sites for error 8628 is the following page

    How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005

    http://support.microsoft.com/kb/907877

    and it points out that this error is thrown when there are too many queries trying to compile simultaneously, and that the gateway structures are throttling the compilations. You can read more background about this theory in the above article and there’s also some reference to it in the following white paper

    Troubleshooting Performance Problems in SQL Server 2005

    http://technet.microsoft.com/en-gb/library/cc966540.aspx

    which includes an excellent section on internal physical memory pressure.

    So now we’re on the track of the problem. We reconfigure the PSSDIAG to include a script which runs DBCC MEMORYSTATUS regularly so that we can be sure to capture this output when the problem next occurs. We also make sure that we keep a copy of the data from this query when the system is running well, so that we can have a baseline to compare against. At this point I still didn’t have a profiler running as I didn’t want to have to manage the disk space and have the performance overhead, and the perfstats script add-in gives you so much information anyway.
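
    If you prefer not to rework the PSSDIAG/SQLDIAG configuration, a crude alternative is a simple polling loop run from a sqlcmd session with its output redirected to a file; a minimal sketch, purely as an illustration:

    -- Poll DBCC MEMORYSTATUS once a minute until you stop the session
    -- (run from sqlcmd with the -o switch to capture the output to a file)
    while 1 = 1
    begin
        print '-- snapshot taken at: ' + convert(varchar(30), getdate(), 121);
        dbcc memorystatus;
        waitfor delay '00:01:00';
    end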

    So the problem eventually occurred again, and here are some key outputs from the DBCC MEMORYSTATUS related to the compilation gateways:

    Baseline:

    Procedure Cache                Value     
    ------------------------------ -----------
    TotalProcs                            7614
    TotalPages                          570060
    InUsePages                           32736

    Small Gateway                  Value              
    ------------------------------ --------------------
    Configured Units                                 32
    Available Units                                  32
    Acquires                                          0
    Waiters                                           0
    Threshold Factor                             500000
    Threshold                                    500000

    Problem Occurring:

    Procedure Cache                Value     
    ------------------------------ -----------
    TotalProcs                             397
    TotalPages                           66888
    InUsePages                           44924

    Small Gateway                  Value              
    ------------------------------ --------------------
    Configured Units                                 32
    Available Units                                   0
    Acquires                                         32
    Waiters                                          59
    Threshold Factor                             500000
    Threshold                                    500000

    So what we can see here is that we are (as expected) encountering a big bottleneck on the gateways (the small one in this case), and at this moment in time 59 compilations are queued. We can also note that the reason for this appears to be that the procedure cache is significantly smaller than its normal size, about 85% smaller in fact.

    This, therefore, is why so many queries are suddenly trying to compile. For reasons unknown at this point, 85% of the cached plans have been thrown away from the cache, and so the many thousands of connections on this system are now all having to compile new plans.

    Now you might be thinking that sometimes you see this type of behaviour when the cache is flushed. This is true, but in this case that wasn't the problem. We sat and monitored the output above for quite a while; there were constant compilation queues but the cache was not growing. More to the point, there were no cache flush events in the error logs, and we had not performed any maintenance operations which cause a cache flush. I talked about a classic example of this behaviour in this post.

    So we need to look elsewhere, and ask ourselves the question, “Why is my procedure cache 85% of its normal size?”

    That’s for my next post.

  • SQL in Stockholm

    Non-yielding IOCP Listener - Stack Dump analysis

    • 2 Comments

     

    Here's something I was looking at yesterday which is worth a quick look. It's quite a common scenario and one that's relatively easy to address. Your SQL Server 2005 instance suddenly throws the following to the error log (apparently without warning):

    2008-09-09 08:30:44.810 Server       Using 'dbghelp.dll' version '4.0.5'
    2008-09-09 08:30:44.960 Server       **Dump thread - spid = 0, PSS = 0x0000000000000000, EC = 0x0000000000000000
    2008-09-09 08:30:45.000 Server       ***Stack Dump being sent to F:\MSSQL.1\MSSQL\LOG\SQLDump0001.txt                 
    2008-09-09 08:30:45.000 Server       * *******************************************************************************
    2008-09-09 08:30:45.010 Server       *                                                                                
    2008-09-09 08:30:45.010 Server       * BEGIN STACK DUMP:                                                              
    2008-09-09 08:30:45.010 Server       *   09/09/08 08:30:45 spid 0                                                     
    2008-09-09 08:30:45.010 Server       *                                                                                
    2008-09-09 08:30:45.010 Server       * Non-yielding IOCP Listener                                                     
    2008-09-09 08:30:45.010 Server       *                                                                                
    2008-09-09 08:30:45.010 Server       * *******************************************************************************
    2008-09-09 08:30:45.010 Server       * -------------------------------------------------------------------------------
    2008-09-09 08:30:45.010 Server       * Short Stack Dump                                                               
    2008-09-09 08:30:45.320 Server       Stack Signature for the dump is 0x000000000000020B                               
    2008-09-09 04:47:06.180 spid3s       LazyWriter: warning, no free buffers found.

    Following this, several hundred lines of output similar to the following format are shown:

    2008-09-09 08:31:08.750 spid3s      
    Memory Manager
        VM Reserved = 16963064 KB
        VM Committed = 16942872 KB
        AWE Allocated = 0 KB
        Reserved Memory = 1024 KB
        Reserved Memory In Use = 0 KB 
    2008-09-09 08:31:08.750 spid3s      
    Memory node Id = 0
        VM Reserved = 16957304 KB
        VM Committed = 16937200 KB
        AWE Allocated = 0 KB
        SinglePage Allocator = 2519016 KB
        MultiPage Allocator = 39632 KB 
    2008-09-09 08:31:08.750 spid3s      
    MEMORYCLERK_SQLGENERAL (Total)
        VM Reserved = 0 KB
        VM Committed = 0 KB
        AWE Allocated = 0 KB
        SM Reserved = 0 KB
        SM Committed = 0 KB
        SinglePage Allocator = 48576 KB
        MultiPage Allocator = 5136 KB


    You may well recognize that this latter output is basically the same as you would get if you run:
    DBCC MEMORYSTATUS

    You'll also have noted that a small dump file has been produced in the default error log directory of this SQL instance (the path of which is noted in the error log output above).

    In normal circumstances you can submit this dump file to us in support, and we can take a look at it and, all being well, tell you what the problem is. The thing you have to remember here is that the production of the dump file does not necessarily mean that you have encountered a bug in SQL Server (as some people often believe). In the more recent versions of SQL Server there are innumerable situations, well known to us, which we catch and control and for which we produce a dump file automatically.

    This particular scenario was caused by inappropriate configuration of SQL Server memory settings, and should be easily rectified, so I thought it a good example to show why.

    Although there is little that can normally be gleaned from debugging a SQL Server dump in the public domain, this one does contain a few hints. We don't publish private symbols for SQL Server, meaning that the amount you can do yourself with a dump file is quite limited (since you only have access to retail symbols); however, you can look at the thread stacks and see if anything of note stands out. Anyone who has read Ken Henderson's fantastic book about SQL Server internals will have used these techniques before. Sometimes it's just interesting to look a little deeper after all.

    I'll assume that you're already familiar with installing and configuring the debugger, but if you're not, here's a good place to start: Debugging Tools and Symbols: Getting Started. This example just uses the retail symbol server address from this page:

    SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    If you open this dump, you'll find yourself starting with the context set to the thread that generated the dump, in this case:

    0:004> kn
    # Child-SP          RetAddr           Call Site
    00 00000000`0474de58 00000000`77d704bf ntdll!NtWaitForSingleObject+0xa
    01 00000000`0474de60 00000000`016be0f9 kernel32!WaitForSingleObjectEx+0x130
    02 00000000`0474df00 00000000`016bdaeb sqlservr!CDmpDump::DumpInternal+0x4a3
    03 00000000`0474dff0 00000000`01c7c7b3 sqlservr!CDmpDump::Dump+0x3b
    04 00000000`0474e040 00000000`01d3af67 sqlservr!CImageHelper::DoMiniDump+0x413
    05 00000000`0474e1a0 00000000`025f8cf5 sqlservr!stackTrace+0x6e7
    06 00000000`0474f6d0 00000000`01a3b025 sqlservr!SQL_SOSNonYieldIOCPCallback+0x1ef
    07 00000000`0476f7e0 00000000`018662b6 sqlservr!SOS_OS::ExecuteNonYieldIOCPCallbacks+0x44
    08 00000000`0476f830 00000000`010b9628 sqlservr!alloca_probe+0x220815
    09 00000000`0476f990 00000000`029002c9 sqlservr!SchedulerMonitor::Run+0x164
    0a 00000000`0476fac0 00000000`010079fe sqlservr!SchedulerMonitor::EntryPoint+0x9
    0b 00000000`0476faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
    0c 00000000`0476fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
    0d 00000000`0476fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
    0e 00000000`0476fd00 00000000`014789ca sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
    0f 00000000`0476fda0 00000000`01478911 sqlservr!SystemThread::RunWorker+0x59
    10 00000000`0476fde0 00000000`014795d9 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
    11 00000000`0476fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
    12 00000000`0476ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
    13 00000000`0476ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84

    This is the thread that is actually doing the monitoring and generating the dump. You can see in frame 02 the name of the function doing this. If you had the private symbols at this point, you could look through some of the earlier frames and determine which thread caused this condition to occur, but I can't show that here :-)

    The thread in question here was actually thread 8; we can switch to this and take a look:

    0:008> kn 50
    # Child-SP          RetAddr           Call Site
    00 00000000`04f6f008 00000000`01750bd0 sqlservr!NumaNode::RampUp
    01 00000000`04f6f010 00000000`012f1523 sqlservr!alloca_probe+0xbf420
    02 00000000`04f6f110 00000000`0103f136 sqlservr!BPool::Steal+0x3ec
    03 00000000`04f6f1c0 00000000`0103f32e sqlservr!SQLSinglePageAllocator::AllocatePages+0x25
    04 00000000`04f6f200 00000000`0103f1ce sqlservr!MemoryNode::AllocatePagesInternal+0xce
    05 00000000`04f6f4a0 00000000`0103f7ba sqlservr!MemoryClerkInternal::AllocatePages+0x6e
    06 00000000`04f6f500 00000000`010d4bb4 sqlservr!CMemThread::PbGetNewPages+0x6a
    07 00000000`04f6f540 00000000`01068bba sqlservr!CFixedPageMgr::PbAllocate+0xcd
    08 00000000`04f6f5b0 00000000`01004d37 sqlservr!CMemThread::Alloc+0x1d7
    09 00000000`04f6f640 00000000`0147dc3a sqlservr!operator new+0x26
    0a 00000000`04f6f680 00000000`01045589 sqlservr!SOS_Node::EnqueueTaskDirectInternal+0x587
    0b 00000000`04f6f730 00000000`01045b03 sqlservr!SOS_Node::EnqueueTask+0x49
    0c 00000000`04f6f780 00000000`01046581 sqlservr!CNetConnection::EnqueuePacket+0x8f
    0d 00000000`04f6f890 00000000`01045cab sqlservr!TDSSNIClient::ReadHandler+0x171
    0e 00000000`04f6f900 00000000`010459bc sqlservr!SNIReadDone+0xb7
    0f 00000000`04f6f960 00000000`010079fe sqlservr!SOS_Node::ListenOnIOCompletionPort+0x251
    10 00000000`04f6faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
    11 00000000`04f6fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
    12 00000000`04f6fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
    13 00000000`04f6fd00 00000000`014789ca sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
    14 00000000`04f6fda0 00000000`01478911 sqlservr!SystemThread::RunWorker+0x59
    15 00000000`04f6fde0 00000000`014795d9 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
    16 00000000`04f6fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
    17 00000000`04f6ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
    18 00000000`04f6ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84
    19 00000000`04f6ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a

    So again, just by looking at the function names, we can make a guess that some internal memory routines related to the buffer pool are at work here, particularly from frames 02 and 03. If you ever attach a debugger to a running SQL Server instance (obviously never in production!) you'd see functions like this occurring regularly as SQL Server manages the buffer pool pages.

    There's not much more that can be seen from this stack, so we'll take a look through the rest of the threads, in this case by running

    ~* kn

    When browsing through these threads, we notice lots that also appear to be involved in memory-related work, just one example of which is:

      53  Id: 554.a6c Suspend: 1 Teb: 000007ff`fffd9000 Unfrozen
    # Child-SP          RetAddr           Call Site
    00 00000000`1521e140 00000000`012f1523 sqlservr!BPool::ReplenishFreeList+0x10d
    01 00000000`1521e240 00000000`0103f136 sqlservr!BPool::Steal+0x3ec
    02 00000000`1521e2f0 00000000`0103f32e sqlservr!SQLSinglePageAllocator::AllocatePages+0x25
    03 00000000`1521e330 00000000`0103f1ce sqlservr!MemoryNode::AllocatePagesInternal+0xce
    04 00000000`1521e5d0 00000000`0103f676 sqlservr!MemoryClerkInternal::AllocatePages+0x6e
    05 00000000`1521e630 00000000`0103f737 sqlservr!CVarPageMgr::PviNewVarPage+0x56
    06 00000000`1521e6a0 00000000`0100ca58 sqlservr!CVarPageMgr::PbAllocate+0x3ce
    07 00000000`1521e710 00000000`01004d37 sqlservr!CMemObj::Alloc+0xcc
    08 00000000`1521e780 00000000`0109b203 sqlservr!operator new+0x26
    09 00000000`1521e7c0 00000000`014607a1 sqlservr!operator new[]+0x23
    0a 00000000`1521e800 00000000`0145d4c8 sqlservr!CErrorReportingManager::CwchFormatToBuffer+0x180
    0b 00000000`1521e970 00000000`0128d202 sqlservr!CErrorReportingManager::CwchCallFormatToBuffer+0x18
    0c 00000000`1521e9b0 00000000`0128d0f8 sqlservr!CLoginClientInfo::CLoginClientInfo+0x71
    0d 00000000`1521ea30 00000000`01285265 sqlservr!IssueLoginSuccessReport+0x67
    0e 00000000`1521eb40 00000000`0128494f sqlservr!login+0x584
    0f 00000000`1521f480 00000000`01285610 sqlservr!process_login+0xee
    10 00000000`1521f840 00000000`010079fe sqlservr!process_commands+0x40d
    11 00000000`1521faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
    12 00000000`1521fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
    13 00000000`1521fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4

    We note that sqlservr!BPool::ReplenishFreeList is common to a lot of the threads, and other BPool functions crop up frequently as well.

    At this point there is not much else to be gleaned from the dump without private symbols, however it's clear that there appears to be a significant number of threads waiting on what appear to be internal memory routines. In this case, we also have the fact that SQL Server itself has detected that it believes that memory allocations were the cause of the problem, which is why it has dumped out the DBCC MEMORYSTATUS to help us. The error log specifically states that it has "no free buffers".

    If you were to look through the whole of the DBCC MEMORYSTATUS and decode some of the latter items using this article as a guide

    http://support.microsoft.com/kb/907877

    then you would note that the buffer pool has recently been grown, and that it is issuing a notification that it wishes to grow further.

    So what next, is the question. You've noted that SQL Server appears to be having memory management problems, so maybe you think that you should add some more memory to alleviate this? Well, in a word: "NO".

    The answer to this problem is that we have to look at the configuration of the hardware, and the configuration of SQL Server first (which in any troubleshooting scenario is always a good idea).

    What we saw here at the time of the dump was:

    MemoryLoad = 95%                    
    Total Physical = 16382 MB           
    Available Physical = 792 MB    

    And in fact over the preceding minutes before the dump we saw even lower memory availability than this, with physical utilization up to 98%. (This is something that we can extract from the dump with private symbols.)

    So we do appear short of physical memory, but when we run sp_configure on the server, we see this:

    name                                minimum     maximum     config_value run_value
    ----------------------------------- ----------- ----------- ------------ -----------
    max server memory (MB)              16          2147483647  2147483647   2147483647

    Therefore the max server memory setting has not been set. Although SQL Server will dynamically attempt to manage its memory, in large memory installations where it runs unlimited it can end up fighting with other processes or the OS for memory, and situations like this one can occur. This is why we recommend setting "max server memory" in this type of scenario, to limit the size of the buffer pool to an amount lower than the total memory on the server itself. The most referenced article on this is probably here:

    http://blogs.msdn.com/slavao/archive/2006/11/13/q-a-does-sql-server-always-respond-to-memory-pressure.aspx

    We see numerous cases revolving around this scenario and derivatives of it, and many can simply be avoided by limiting the memory of the SQL Server buffer pool with this setting, so that you don't ever get into this memory-bound scenario. So the bottom line here is to change the max server memory setting to something around the 13000 MB mark, and then monitor your memory usage to see if you're still encountering this type of problem, making further adjustments if necessary.
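
    For reference, capping the buffer pool at roughly that value is a straightforward sp_configure change; the 13000 figure comes from this particular 16 GB server, so arrive at your own value by monitoring your own workload:

    -- Cap the buffer pool, leaving headroom for the OS and other processes
    exec sp_configure 'show advanced options', 1;
    reconfigure;
    exec sp_configure 'max server memory (MB)', 13000;
    reconfigure;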

    If you have any questions surrounding this topic, feel free to post them.

  • SQL in Stockholm

    Troubleshooting SQL Server deadlocks and performance when running SharePoint

    • 2 Comments

    I’ve worked on a lot of cases this year involving SharePoint. Here are some example problems I encountered recently. If you want to read the background or the formal advice first, here are some good starting links

    Planning and Monitoring SQL Server Storage for Office SharePoint Server: Performance Recommendations and Best Practices (white paper)

    Plan for performance and capacity (Office SharePoint Server)

    In addition to this, you need to consider what you can and can’t do from a SQL Server perspective, if you’re trying to optimize or troubleshoot your SharePoint deployment. You need to pay detailed attention to these articles:

    http://support.microsoft.com/kb/932744/

    http://support.microsoft.com/kb/841057/

    In simple terms, don’t go amending the schema or the objects! So back to my main point…

    Things to do first:

    1. Collect some decent SQL Server diagnostic data – my preferred option, as ever, is SQLDIAG with the perfstats scripts

    2. Turn on trace flag 1222 to output deadlock information to the errorlog, or capture deadlock graphs in your profiler traces (an example of enabling the trace flag follows after this list)

    3. Collect this data at periods of both good and bad performance, so you have a benchmark to compare against

    4. Aggregate the profiler output. Personally I use readtrace.exe directly from RML Utilities, but others like to use SQLNexus as well.
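
    For point 2 in the list above, the trace flag can be turned on without restarting the instance, or added as a -T1222 startup parameter; for example:

    -- Write deadlock graph information to the SQL Server error log for all sessions
    dbcc traceon (1222, -1);

    -- Confirm which trace flags are active globally
    dbcc tracestatus (-1);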

    Here are the 2 real examples of problems I encountered recently, and diagnosed using the above data collection:

    1. Are you experiencing significant parallel waits? SQL Server installations have a default Max Degree of Parallelism setting of 0. I have observed certain MOSS deployments this year where this is not the best MAXDOP, dependent upon the specifics of the data distribution and deployment. If you look in the perfstats output you might see evidence like this:

    -- requests --
    runtime                        session_id request_id  ecid        blocking_session_id blocking_ecid task_state      wait_type
    ------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- ---------
    2009-05-27T15:06:38.060                68           0           4                   0             0 RUNNING         CXPACKET 
    2009-05-27T15:06:38.060                68           0           0                   0             4 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060                96           0           0                   0             0 SUSPENDED       WAITFOR  
    2009-05-27T15:06:38.060               116           0           0                   0             0 RUNNING                  
    2009-05-27T15:06:38.060               128           0          12                   0             0 RUNNING         CXPACKET 
    2009-05-27T15:06:38.060               128           0           9                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               128           0           0                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           5                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0          11                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               128           0           6                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           2                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           3                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           4                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           7                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           8                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           1                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0          10                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               138           0          10                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0          12                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               138           0          11                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           9                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           0                   0             3 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           3                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           1                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           2                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           4                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           7                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           5                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           6                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           8                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               163           0           0                   0             0 RUNNABLE                 
    2009-05-27T15:06:38.060               170           0           0                 163             0 SUSPENDED       LCK_M_X  

    This is a classic example, where a long blocking chain starts to build due to CXPACKET waits. Note that the same sessionid is reported on multiple lines, but the ecid column changes. This indicates that the batch (sessionid) is being processed in parallel by more than one worker (ecid). This extract is from very early on in the problem; from here the chains grow and grow, and can eventually result in multiple deadlocks. In this scenario the batches being executed by sessions 128 and 138 (which are both running PROC_LISTUNGHOSTEDFIELDTEMPLATESINLIST) are picking parallel plans inappropriately. We worked around this by changing MAXDOP on the server to a value of 1. This is definitely something that you should test in your environment, since it’s a server-wide setting. Currently, according to the above KB articles, there is no fixed recommendation about MAXDOP for MOSS deployments, but things can change, so you should check the current requirements for your version. In this scenario we saw a significant number of other procedures which encountered the same issue, such as proc_CheckDirectoryTree, proc_DeplGetListItemLinks and PROC_DEPLGETLISTITEMANDRESERVEID amongst others. A typical error you might see in the SharePoint application logs would be:

    Unexpected query execution failure, error code 1205. Additional error information from SQL Server is included below. "Transaction (Process ID 163) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."

    A global change of MAXDOP = 1 resolved this specific situation.
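
    For reference, the server-wide change was made with sp_configure along these lines; as noted, test this in your own environment before applying it:

    -- Set max degree of parallelism to 1 for the whole instance
    exec sp_configure 'show advanced options', 1;
    reconfigure;
    exec sp_configure 'max degree of parallelism', 1;
    reconfigure;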

    2. Are you encountering long blocking chains accessing the AllDocs table? This object is commonly accessed by a number of different stored procedures, and I’ve seen some scenarios where it becomes a point of contention. The most problematic one I came across recently resulted in numerous deadlocks, which first throw the above 1205 error to the SharePoint application log, and then in addition throw the following error:

    InitializePublishingWebDefaults failed. Microsoft.SharePoint.SPException: Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT)) ---> System.Runtime.InteropServices.COMException (0x80004004): Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT))     at Microsoft.SharePoint.Library.SPRequestInternalClass.SetWebProps

    This second error occurs as SharePoint does not roll back properly after encountering the deadlock. If you have collected your SQLDIAG data as noted above, you’ll see symptoms like this in the perfstats script output:

    a) long blocking chains waiting on LCK_M_U

    -- requests --
    runtime                        session_id request_id  ecid        blocking_session_id blocking_ecid task_state      wait_type                                          wait_duration_ms     
    ------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- -------------------------------------------------- -------------------- 
    2009-05-28T10:01:40.973                60           0           0                 197             0 SUSPENDED       LCK_M_X                                                            3201 
    2009-05-28T10:01:40.973                72           0           0                  60             0 SUSPENDED       LCK_M_S                                                            3224 
    2009-05-28T10:01:40.973                77           0           0                   0             0 SUSPENDED       WAITFOR                                                            9087 
    2009-05-28T10:01:40.973                84           0           0                  72             0 SUSPENDED       LCK_M_S                                                            1018 
    2009-05-28T10:01:40.973               197           0           0                   0             0 RUNNING                                                                               0 
    2009-05-28T10:01:40.973               268           0           0                 270             0 SUSPENDED       LCK_M_U                                                           36299 
    2009-05-28T10:01:40.973               270           0           0                 197             0 SUSPENDED       LCK_M_U                                                           96753 
    2009-05-28T10:01:40.973               277           0           0                 270             0 SUSPENDED       LCK_M_U                                                           44290 
    2009-05-28T10:01:40.973               280           0           0                 270             0 SUSPENDED       LCK_M_U                                                           60461 
    2009-05-28T10:01:40.973               292           0           0                 270             0 SUSPENDED       LCK_M_U                                                           52389 

    b) the notableactivequeries part of the perfstats script will potentially show that the plan_total_exec_count is very low. This indicates that you appear to be running a freshly compiled plan for the blocking statement. This is all the more likely if your system was running well, and then suddenly started to degrade in performance and encounter errors.

    -- notableactivequeries --
    runtime                        session_id request_id  plan_total_exec_count plan_total_cpu_ms    plan_total_duration_ms plan_total_physical_reads plan_total_logical_writes plan_total_logical_reads 
    ------------------------------ ---------- ----------- --------------------- -------------------- ---------------------- ------------------------- ------------------------- ------------------------ 
    2009-05-28T10:01:40.973               197           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               268           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               270           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               277           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               280           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               292           0                     5                33057                 406079                      2550                         0                  2352015 

    c) when reviewing the error logs you note that the same statement shown below, being executed within the procedure proc_UpdateTpWebMetaData, is always involved in the deadlocking

    UPDATE
        Docs
    SET
        Docs.BuildDependencySet = NULL
    FROM
        Docs INNER JOIN BuildDependencies
    ON
        BuildDependencies.SiteId = Docs.SiteId AND
        BuildDependencies.LeafName = Docs.LeafName AND
        BuildDependencies.DirName = Docs.DirName
    WHERE
        BuildDependencies.SiteId = @SiteId AND
        BuildDependencies.TargetDirName = @PreviousCustomMasterDir AND
        BuildDependencies.TargetLeafName = @PreviousCustomMasterLeaf AND
        Docs.BuildDependencySet IS NOT NULL

    The reason for this problem, in this particular scenario, is that a new plan had recently been compiled for this statement which used a HASH join, as opposed to the nested loop join it was using earlier. This is caused by a lack of up-to-date statistics on the AllDocs table. We noted scenarios where the regular update statistics jobs (issued by SharePoint as part of the changes in 2007 SP2) were not thorough enough, and we were forced to run update statistics with the fullscan option on the AllDocs table.
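
    For reference, the manual maintenance we ended up running was along the following lines. This is a minimal sketch; the content database name is purely illustrative and you should substitute the name of the SharePoint content database you are troubleshooting:

    -- content database name is illustrative only
    USE [WSS_Content]
    GO
    -- rebuild the statistics on the contended table with a full scan
    UPDATE STATISTICS dbo.AllDocs WITH FULLSCAN
    GO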

    The bottom line being that although SharePoint does now do database maintenance, as shown in KB932744, there are still times when you will potentially need to do more detailed, targeted maintenance.

  • SQL in Stockholm

    How to Customise SQLDIAG in SQL Server 2008 to collect DBCC MEMORYSTATUS

    • 1 Comments

    Although Books Online makes some mention of SQLDIAG customizations, it doesn’t give any explicit examples, so here’s one for you. I previously documented here how to customize your XML configuration file to connect to different machines and collect output from the standard collectors, and I also mentioned how you could download the PerfStats scripts and use the XML files that they come supplied with. The following example though shows how to insert a custom script of your own, which can be anything of your choice. I’ll be using DBCC MEMORYSTATUS output to demonstrate. The thing to remember here is that I’m not saying that this is the only way to collect such data, but that as ever, SQLDIAG allows you to do everything in one place, in a controlled and easy to administer way.

    Firstly we need a script which will collect DBCC MEMORYSTATUS repeatedly. The following one won’t win any design prizes but will do the job just nicely:

    WHILE 1=1

    BEGIN

    select 'start time:', GETDATE();

    DBCC MEMORYSTATUS

    WAITFOR DELAY '00:00:10'

    END;

    This will run DBCC MEMORYSTATUS every 10 seconds until you kill it.

    Next we need to create an XML configuration to run this script using SQLDIAG. I’ll take the default SQLDIAG.XML which unpacks when you run the sqldiag.exe for the first time. If you need any explanations of these pre-requisites then go back to the SQLDIAG link above and read the background.

    <CustomDiagnostics>
    <CustomGroup name="msinfo" enabled="true" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get MSINFO32" type="Utility" point="Startup" wait="OnlyOnShutdown" cmd="start /B /WAIT MSInfo32.exe /computer %server% /report &quot;%output_path%%server%_MSINFO32.TXT&quot; /categories +SystemSummary+ResourcesConflicts+ResourcesIRQS+ComponentsNetwork+ComponentsStorage+ComponentsProblemDevices+SWEnvEnvVars+SWEnvNetConn+SWEnvServices+SWEnvProgramGroup+SWEnvStartupPrograms" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get default traces" type="Copy_File" point="Startup" wait="OnlyOnShutdown" cmd="&quot;%sspath%log\log*.trc&quot; &quot;%output_path%&quot;" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get SQLDumper log" type="Copy_File" point="Startup" wait="OnlyOnShutdown" cmd="&quot;%sspath%log\SQLDUMPER_ERRORLOG*.*&quot; &quot;%output_path%&quot;" />
    </CustomDiagnostics>

    If you look in the above section we already have one custom group, called msinfo, which has 3 sub-tasks. (The format of these is worth noting if you wish to do any file manipulation or run command line tools.) What we want to do is add another CustomGroup, which will contain a single task, which is our DBCC script from above. The script itself needs to be saved into a file for the collector to access it, so in this scenario we’ll save the script into the same root directory as the sqldiag.exe. If you’re planning on doing this for many different scripts and scenarios you might be better organising these in appropriately named sub-directories. So I’ve saved my script above as the following

    C:\Program Files\Microsoft SQL Server\100\Tools\Binn\exec_dbcc_memorystatus.sql

    Then I need to add the following XML to tell SQLDIAG that I have a new custom group of diagnostics (called SQLscripts), which currently contains 1 custom task (our DBCC MEMORYSTATUS script). The new section looks like this:

    <CustomDiagnostics>
    <CustomGroup name="msinfo" enabled="true" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get MSINFO32" type="Utility" point="Startup" wait="OnlyOnShutdown" cmd="start /B /WAIT MSInfo32.exe /computer %server% /report &quot;%output_path%%server%_MSINFO32.TXT&quot; /categories +SystemSummary+ResourcesConflicts+ResourcesIRQS+ComponentsNetwork+ComponentsStorage+ComponentsProblemDevices+SWEnvEnvVars+SWEnvNetConn+SWEnvServices+SWEnvProgramGroup+SWEnvStartupPrograms" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get default traces" type="Copy_File" point="Startup" wait="OnlyOnShutdown" cmd="&quot;%sspath%log\log*.trc&quot; &quot;%output_path%&quot;" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get SQLDumper log" type="Copy_File" point="Startup" wait="OnlyOnShutdown" cmd="&quot;%sspath%log\SQLDUMPER_ERRORLOG*.*&quot; &quot;%output_path%&quot;" />
    <CustomGroup name="SQLscripts" enabled="true" />
    <CustomTask enabled="true" groupname="SQLscripts" taskname="dbcc memorystatus" type="TSQL_Script" point="Startup" wait="No" cmd="exec_dbcc_memorystatus.sql"/>
    </CustomDiagnostics>

    Next I save this XML file under a new name, let’s say SQLDIAG_memstatus.XML, and then I just execute the sqldiag tool with an input file switch like this:

    sqldiag /ISQLDIAG_memstatus.XML

    and away we go. If you look in your output directory, you’ll find the following file:

    <yourservername>__exec_dbcc_memorystatus_Startup.OUT

    which contains all the DBCC MEMORYSTATUS output.

    All you need to do now, is create custom XML files and scripts for as many scenarios as you can think of! Hopefully if you examine the entries for the MSINFO task, this might give you a few ideas as well.

  • SQL in Stockholm

    Unable to use forced parameterization when deploying plan guides in SQL Server 2005

    • 1 Comments

    I was working a case last week where I wanted to put in place some fairly simple plan guides for a query that had picked a bad plan on a particular deployment of an application. I was thinking it was going to be an easy process. The query in question executed several tens of thousands of times, and had 2 parameters being generated by the application. On the server I was analyzing, performance had degraded due to a bad plan, but I had a trace from another server using the same application, where I had captured the good plan. This ought to have been a classic case for a plan guide (since I couldn’t change the SQL batch coming from the application, and the server had been managed properly, so I couldn’t address the issue with a simple update of statistics or some such). Anyway it drove me mad for a few hours as I couldn’t get it to work, so I thought I would share what turned out to be the simple reason why :-)

    I needed to do 2 things:

    1. Create a plan guide which forced parameterization of the query, since I had noted it wasn’t being parameterized by default.

    2. Create a plan guide which forced a specific plan on the parameterized form of the query

    Both of these are fairly standard steps; step 1 is documented here:

    Specifying Query Parameterization Behavior by Using Plan Guides

    and step 2 is documented here:

    Using the USE PLAN Query Hint
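
    To give a flavour of step 1, here’s a minimal sketch based on the Books Online approach. The SELECT statement below is purely illustrative; in practice the query text passed to sp_get_query_template must be the exact statement text your application sends:

    DECLARE @stmt nvarchar(max)
    DECLARE @params nvarchar(max)

    -- derive the parameterized template form of the statement (illustrative query text)
    EXEC sp_get_query_template
        N'SELECT * FROM dbo.Orders WHERE CustomerID = 12345 AND Status = 2',
        @stmt OUTPUT,
        @params OUTPUT

    -- create a TEMPLATE plan guide which forces parameterization for that query form
    EXEC sp_create_plan_guide
        @name = N'ForceParamGuide',
        @stmt = @stmt,
        @type = N'TEMPLATE',
        @module_or_batch = NULL,
        @params = @params,
        @hints = N'OPTION (PARAMETERIZATION FORCED)'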

    I’ve done this many times before, but whatever I did, I couldn’t get step 1 to fly, so I couldn’t go any further. After much double checking of what I was doing (including trying to set forced parameterization at a database level), I decided to take a step back and check out the pre-requisites of forced parameterization at a database level. Books Online listed exceptions where parameterization could not be forced, so I went back and checked them all, even though my SQL batch seemed extremely simple. I found that there were 2 problems:

    a) Deprecated query constructs

    b) Any statement that is run in the context of ANSI_PADDING or ANSI_NULLS set to OFF

    It turned out that the connections the application was using were issuing SET ANSI_NULLS OFF statements manually, just after they connected. This construct is itself deprecated, which meant that forced parameterization was a no-no, and ironically it ruled it out twice over, since running with ANSI_NULLS OFF also stops parameterization being forced.
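
    Incidentally, if you want a quick way to check what SET options a connection is actually running with, SESSIONPROPERTY is handy (1 = ON, 0 = OFF):

    SELECT
        SESSIONPROPERTY('ANSI_NULLS') AS ansi_nulls,
        SESSIONPROPERTY('ANSI_PADDING') AS ansi_padding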

    A little something to watch out for.

  • SQL in Stockholm

    New SQL Server 2008 DMV which I noticed this week - sys.dm_os_sys_memory

    • 1 Comments

    In the continuing series of "things I notice in SQL Server which I didn't know were there", I noticed the very simple but very useful sys.dm_os_sys_memory earlier this week.

    Why do I like this? Well obviously you can extract the memory information exposed by the OS in a number of different places, but as much as anything, I like the simple record of the state of the CreateMemoryResourceNotification function, from the system_low_memory_signal_state column in the DMV.

    We've added this into the internal PSSDIAG / SQLDIAG perfstats scripts for 2008 (which is where I noticed it). For those of you that use proactive monitoring and record the state of your system over time, this is a good simple set of data to record.
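
    As a minimal example, here's the sort of snapshot you might record; the column names are as per the SQL Server 2008 documentation for this DMV:

    SELECT
        total_physical_memory_kb,
        available_physical_memory_kb,
        system_memory_state_desc,
        system_high_memory_signal_state,
        system_low_memory_signal_state
    FROM sys.dm_os_sys_memory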

  • SQL in Stockholm

    Troubleshooting SQL Server Management Studio with SQL Profiler and Debugging Tools – Part 1

    • 1 Comments

    This is the first in a series of 3 posts which discuss troubleshooting SQL Server Management Studio when it encounters errors or unexpected behaviour.

    In this post we look at using Profiler; in the 2nd and 3rd we debug SSMS using public Microsoft debugging tools.

    Here are the symptoms of the problem we are looking at:

    1. Unable to view Job Activity in the SQL Server Agent Job Activity Monitor

    2. Progress bar shows “Cannot display activity data”

    3. Clicking on more details throws a message box with the error

    Failed to retrieve data for this request. (Microsoft.SqlServer.Management.Sdk.Sfc)

    with the additional details

    Hour, Minute, and Second parameters describe an un-representable DateTime. (mscorlib)

    It actually looks like this

    image

    4. The error repros with all versions of SSMS (2005 and 2008)

    5. The error repros for all users, whichever workstation they connect from.

    These last 2 points are particularly relevant. Whenever you are looking at an SSMS problem, you always need to be asking these 2 questions. Straight off, you have then potentially eliminated whole areas of research (such as corruption of the .NET Framework on specific workstations, or incompatibilities in processor-architecture-specific versions of the framework), and you have localised whether the problem is with the SSMS client, the server instance being managed, or a combination of both.

    The initial approach – using SQL Profiler and some common sense

    When I initially came to this problem, the customer was experiencing the errors above. Normally when I get any errors or problems in SSMS, rather than reaching for the debugger, I reach for SQL Profiler first (having asked the above 2 questions about where and when the issue repros), and this is a good rule of thumb. At the end of the day SSMS is just a GUI which issues TSQL commands to achieve functions in SQL Server. Whenever I get unexpected results in SSMS, the first thing I therefore do is to run a Profiler trace against the server which SSMS is connected to. This means that you can see the TSQL which is issued, and you can sometimes find the problem straight off, if for example it issues some piece of bad code which obviously causes your error.

    In this case if you run the SQL Profiler GUI tool (you can simply point it at the appropriate server instance and use the default options), you’ll note that SSMS issues the following SQL batches.

    SET LOCK_TIMEOUT 10000
    select SERVERPROPERTY(N'servername')
    declare @RegPathParams sysname
                    declare @Arg sysname
                    declare @Param sysname
                    declare @MasterPath nvarchar(512)
     
                    declare @LogPath nvarchar(512)
                    declare @ErrorLogPath nvarchar(512)
                    declare @n int
     
                    select @n=0
                    select @RegPathParams=N'Software\Microsoft\MSSQLServer\MSSQLServer'+'\Parameters'
                    select @Param='dummy'
                    while(not @Param is null)
                    begin
                        select @Param=null
                        select @Arg='SqlArg'+convert(nvarchar,@n)
     
                        exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', @RegPathParams, @Arg, @Param OUTPUT
                        if(@Param like '-d%')
                        begin
                            select @Param=substring(@Param, 3, 255)
                            select @MasterPath=substring(@Param, 1, len(@Param) - charindex('\', reverse(@Param)))
                        end
                        else if(@Param like '-l%')
                        begin
                            select @Param=substring(@Param, 3, 255)
                            select @LogPath=substring(@Param, 1, len(@Param) - charindex('\', reverse(@Param)))
                        end
                        else if(@Param like '-e%')
                        begin
                            select @Param=substring(@Param, 3, 255)
                            select @ErrorLogPath=substring(@Param, 1, len(@Param) - charindex('\', reverse(@Param)))
                        end
                        
                        select @n=@n+1
                    end
                
     
     
                    declare @SmoRoot nvarchar(512)
                    exec master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'SOFTWARE\Microsoft\MSSQLServer\Setup', N'SQLPath', @SmoRoot OUTPUT
                
     
     
    SELECT
    CAST(case when 'a' <> 'A' then 1 else 0 end AS bit) AS [IsCaseSensitive],
    CAST(FULLTEXTSERVICEPROPERTY('IsFullTextInstalled') AS bit) AS [IsFullTextInstalled],
    @LogPath AS [MasterDBLogPath],
    @MasterPath AS [MasterDBPath],
    @ErrorLogPath AS [ErrorLogPath],
    @@MAX_PRECISION AS [MaxPrecision],
    @SmoRoot AS [RootDirectory],
    SERVERPROPERTY(N'ProductVersion') AS [VersionString],
    CAST(SERVERPROPERTY(N'MachineName') AS sysname) AS [NetName],
    CAST(SERVERPROPERTY(N'Edition') AS sysname) AS [Edition],
    CAST(SERVERPROPERTY(N'ProductLevel') AS sysname) AS [ProductLevel],
    CAST(SERVERPROPERTY('IsClustered') AS bit) AS [IsClustered],
    CAST(SERVERPROPERTY('IsSingleUser') AS bit) AS [IsSingleUser],
    CAST(SERVERPROPERTY('EngineEdition') AS int) AS [EngineEdition],
    SERVERPROPERTY(N'ResourceVersion') AS [ResourceVersionString],
    SERVERPROPERTY(N'ResourceLastUpdateDateTime') AS [ResourceLastUpdateDateTime],
    SERVERPROPERTY(N'BuildClrVersion') AS [BuildClrVersionString],
    SERVERPROPERTY(N'ComputerNamePhysicalNetBIOS') AS [ComputerNamePhysicalNetBIOS],
    SERVERPROPERTY(N'CollationID') AS [CollationID],
    SERVERPROPERTY(N'ComparisonStyle') AS [ComparisonStyle],
    SERVERPROPERTY(N'SqlCharSet') AS [SqlCharSet],
    SERVERPROPERTY(N'SqlCharSetName') AS [SqlCharSetName],
    SERVERPROPERTY(N'SqlSortOrder') AS [SqlSortOrder],
    SERVERPROPERTY(N'SqlSortOrderName') AS [SqlSortOrderName],
    convert(sysname, serverproperty(N'collation')) AS [Collation]
    exec sp_executesql N'SELECT
    dtb.collation_name AS [Collation],
    dtb.name AS [DatabaseName2]
    FROM
    master.sys.databases AS dtb
    WHERE
    (dtb.name=@_msparam_0)',N'@_msparam_0 nvarchar(4000)',@_msparam_0=N'msdb'
    create table #tmp_sp_help_category
    (category_id int null, category_type tinyint null, name nvarchar(128) null)
    insert into #tmp_sp_help_category exec msdb.dbo.sp_help_category
    SELECT
    sv.name AS [Name],
    CAST(sv.enabled AS bit) AS [IsEnabled],
    tshc.name AS [Category],
    sv.category_id AS [CategoryID],
    tshc.category_type AS [CategoryType],
    null AS [CurrentRunStatus],
    null AS [CurrentRunStep],
    null AS [HasSchedule],
    null AS [HasStep],
    null AS [HasServer],
    null AS [LastRunDate],
    null AS [NextRunDate],
    null AS [LastRunOutcome],
    CAST(sv.job_id AS nvarchar(100)) AS [job_id]
    FROM
    msdb.dbo.sysjobs_view AS sv
    INNER JOIN #tmp_sp_help_category AS tshc ON sv.category_id = tshc.category_id
     
    drop table #tmp_sp_help_category
            

    and finally:

    exec msdb.dbo.sp_help_job

     

    These batches don’t appear to return any errors or warnings to profiler. The logical next step was to take a look at what this code does. Once you read through the various batches, you notice 2 types of batch. The first 5 batches are just required to get certain properties about the server instance and to populate properties of the tree view in SSMS Object Explorer. Whenever you profile SSMS you see lots of batches like this, which tend to precede the “real” work.

    Then you see the final single batch, which executes msdb.dbo.sp_help_job. This is the real work that is required to populate the dialogue box that SSMS is opening. Since these batches are all just queries, and nothing is being updated, we can easily re-run them in SSMS query windows to see if anything else is thrown up, or see if weird results are returned that might explain our strange date time error.

    We tried this and it returned a wide record set of several hundred rows, which at first glance was all well and good. At this stage, the pragmatists amongst you may feel it’s appropriate to just scan through this data and see if there are any glaring inconsistencies which might cause a DateTime related error. The truth is that there was such a piece of data, if you could be bothered to read through it all manually, but it was only a hypothesis. The stored procedure in question returns some columns that are related to dates and times in a true DateTime representation. However it also returns some columns as text representations of date parts. In this record set, these columns should always be populated with values, but in this case there were a number of rows which showed the value “0” in these columns. We therefore made the guess that this data was the problem. We removed this data and tried the repro again, and everything worked OK.
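
    If you want a shortcut to look for similar rows on your own server, one place to check is the job history summary held in msdb. The following is a sketch only, and it assumes the offending values surface from msdb.dbo.sysjobservers, which is not something the trace above proves by itself:

    -- look for jobs whose last run date/time is recorded as 0 rather than a real value
    SELECT j.name, js.last_run_date, js.last_run_time, js.last_run_outcome
    FROM msdb.dbo.sysjobs AS j
    INNER JOIN msdb.dbo.sysjobservers AS js ON js.job_id = j.job_id
    WHERE js.last_run_date = 0 OR js.last_run_time = 0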

    We were lucky in this situation that the record set to review was only a few hundred rows, and that the data corruption was obvious. On the basis that neither of these situations is always true, part 2 will show how to continue to troubleshoot this with debugging tools, to categorically prove what the problem was.

  • SQL in Stockholm

    SQL Server 2005 Error: 701, Severity: 17, State: 193

    • 1 Comments

    Here’s another interesting example which we worked recently in the Stockholm office, which also relates back to one of my previous entries. We had a customer whose error log was filled with a variety of critical errors such as these:

    Error: 701, Severity: 17, State: 193. There is insufficient system memory to run this query.

    Error: 1204, Severity: 19, State: 2. The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

    Error: 17803, Severity: 20, State: 13. There was a memory allocation failure during connection establishment. Reduce nonessential memory load, or increase system memory. The connection has been closed. [CLIENT: <local machine>]

    This was a fairly high end system with 64GB RAM, 16 logical CPUs and was generally well configured in terms of its Max Server Memory and other basic sp_configure settings. As you may well be aware, there are innumerable causes that can actually cause you to receive a memory related error such as this, and some of the more basic troubleshooting tips are shown in the following MSDN article

    http://msdn.microsoft.com/en-us/library/aa337311(SQL.90).aspx

    However in this case none of the basics were really of any help. Typical starting questions to ask yourself in this scenario are:

    1. Is there a pattern to when this occurs?

    2. Do I know of any particular workloads which are running when I see this problem?

    …just to understand if you have any particular potential causes in your own code. Our customer in this scenario was not aware of any, and unfortunately we didn’t have any dumps created which might allow us to see retrospectively what batches were running when the problem occurred.

    However we did have the partial output from DBCC MEMORYSTATUS which SQL had kindly written to the error log for us. An examination of this showed the following suspicious looking section:

    OBJECTSTORE_LOCK_MANAGER (node 2)
        VM Reserved = 0 KB
        VM Committed = 0 KB
        AWE Allocated = 0 KB
        SM Reserved = 0 KB
        SM Committed = 0 KB
        SinglePage Allocator = 29398856 KB
        MultiPage Allocator = 0 KB

    Global Memory Objects:
            Resource=    709
            Locks=    3676149
            SE Schema Mgr=    1646
            SQLCache=    261
            Replication=    2
            ServerGlobal=    109
            XPGlobal=    2
            Xact=    101
            SETLS=    16
            DatasetMemObjs=    32
            SubpDescPmos=    16
            SortTables=    2

    From this we can infer that we seemed to experience an equivalent problem to the one I wrote about here

    http://blogs.msdn.com/grahamk/archive/2009/02/03/compilation-bottlenecks-error-8628-severity-17-state-0-part-1.aspx

    although manifesting distinctly different symptoms.

    At the end of the day one of the NUMA nodes has taken up about 29GB of memory just managing lock resources, and this is confirmed by the number of locks which are reported in the global memory objects. Remember that the majority of the output from DBCC MEMORYSTATUS is publicly documented here:

    How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005

    http://support.microsoft.com/kb/907877

    This 29GB of memory purely managing locks appears to be placing an undue load on the memory resources of the system, and what we appear to have here is a runaway or huge batch of some type, or maybe a very large open transaction.

    The way to resolve this problem is to identify the batch and stop it from running. One good way to identify this problem proactively might be to use the tiny DMV query I wrote in my previous post about this type of situation and attach a monitor to it, warning you if the value goes above 1 million say, or some other high number of your choice.

    select request_session_id, count(*)
    from sys.dm_tran_locks
    group by request_session_id

    You could then easily identify the problem session_id before it got completely out of hand. There are of course many different alternative ways of capturing such data or doing such monitoring. Maybe you’d prefer to create a custom SQLDIAG package to monitor your server in more detail.
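
    If you want a slightly richer version which also shows what the offending session is currently running, something along these lines works. This is just a sketch; the 1,000,000 threshold is an arbitrary example, and the current batch text will only be available while a request is actually executing:

    SELECT locks.request_session_id,
           locks.lock_count,
           st.text AS current_batch
    FROM
    (
        SELECT request_session_id, COUNT(*) AS lock_count
        FROM sys.dm_tran_locks
        GROUP BY request_session_id
    ) AS locks
    LEFT JOIN sys.dm_exec_requests AS r ON r.session_id = locks.request_session_id
    OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
    WHERE locks.lock_count > 1000000
    ORDER BY locks.lock_count DESC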

    The moral of this story being though, that even in the most critical system failures such as this, there may well be enough info to get a hint of where the problem lies, if you make sure you go through the error logs in detail. Remember not to be overwhelmed by the sheer volume of errors that might be reported; ensure you go through all the lines removing the duplicates to see whether something of interest is hidden amongst all the noise.

  • SQL in Stockholm

    Download location for SQLDIAG Perfstats DMV scripts

    • 1 Comments

    I refer to this many times in my other posts, so here’s a direct link to the download location of the perfstats scripts for SQLDIAG which you can use yourself. I use this every day, you should too!

    http://blogs.msdn.com/psssql/archive/2007/02/21/sql-server-2005-performance-statistics-script.aspx

  • SQL in Stockholm

    Using Xquery to query Extended Events asynchronous file target results

    • 1 Comments

    This is the follow up (part 2) to the following post:

    http://blogs.msdn.com/grahamk/archive/2009/09/23/realistic-troubleshooting-example-of-extended-events-xevent-usage-in-sql-server-2008-part-1.aspx

    In the original post we configured an extended event session and issued a very simple query to review the results as XML. Here are some more intelligent XQuery examples. I’m documenting these purely because I hate XQuery so much, and it takes so damn long to do (well it does for me anyway). Please feel free to post in the comments section if you think XQuery is the best damn query language ever invented :-)
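
    If you don’t have part 1 to hand, a session along roughly these lines will produce a pair of .xet/.xem files that you can query in exactly the same way. This is purely illustrative: the severity filter and exact file paths are my own placeholders rather than a copy of the original session, and the syntax shown is the SQL Server 2008 flavour of the file target:

    CREATE EVENT SESSION ArithmeticErrors ON SERVER
    ADD EVENT sqlserver.error_reported
    (
        ACTION (sqlserver.sql_text)
        WHERE (severity >= 16)
    )
    ADD TARGET package0.asynchronous_file_target
    (
        SET filename = 'C:\temp\ArithmeticErrors.xet',
            metadatafile = 'C:\temp\ArithmeticErrors.xem'
    )
    GO

    ALTER EVENT SESSION ArithmeticErrors ON SERVER STATE = START
    GO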

    Hopefully the code comments explain everything; I have attempted to use very descriptive alias names as well.

    -- get information from XET binary files representing extended event session data
     
    -- as per part 1 blog post:
    -- straight select, view the XML manually in an SSMS window by clicking on the XML results column
    select CAST(event_data as XML) , * 
    from sys.fn_xe_file_target_read_file
    (
    'C:\temp\ArithmeticErrors_0_128981071730370000.xet', 
    'C:\temp\ArithmeticErrors_0_128981071730390000.xem', 
    null, 
    null
    )
     
    -- Get only specific nodes
    -- 2 different methods shown of the xquery, you can specify the name of the node
    -- or you can specify its location in the structure
    -- this means that you return every row for every event captured in the file
    -- the results are then converted to SQL data types
     
    SELECT
    xdata.value('(/event/data[4]/value)[1]', 'varchar(100)') AS [error],
    xdata.value('(/event/data[@name="severity"]/value)[1]', 'int') AS [severity],
    xdata.value('(/event/action[@name="sql_text"]/value)[1]', 'varchar(max)') AS [sql_text]
     
    FROM
        (
            select CAST(event_data as XML)
            from sys.fn_xe_file_target_read_file
            (
            'C:\temp\ArithmeticErrors_0_128981071730370000.xet', 
            'C:\temp\ArithmeticErrors_0_128981071730390000.xem', 
            null, 
            null
            )
        ) as xmlr(xdata)
     
    -- using the above technique, you can now wrap the entire query and give it an alias
    -- and then perform more useful T-SQL language queries on the results
    -- here is an example of an aggregate
     
    SELECT
    convert(varchar(100), outer_results.[error]) as 'error',
    convert(int, outer_results.[severity]) as 'severity',
    convert(varchar(max), outer_results.[sql_text]) as 'sql_text',
    count(*) as [count_occurances]
    FROM
        (
        SELECT
        xdata.value('(/event/data[4]/value)[1]', 'varchar(100)') AS [error],
        xdata.value('(/event/data[@name="severity"]/value)[1]', 'int') AS [severity],
        xdata.value('(/event/action[@name="sql_text"]/value)[1]', 'varchar(max)') AS [sql_text]
     
        FROM
            (
                select CAST(event_data as XML)
                from sys.fn_xe_file_target_read_file
                (
                'C:\temp\ArithmeticErrors_0_128981071730370000.xet', 
                'C:\temp\ArithmeticErrors_0_128981071730390000.xem', 
                null, 
                null
                )
            ) as xmlr(xdata)
        ) as outer_results
    group by error, severity,sql_text
     
    -- finally using the same technique as above , here we have simple predicates as well
     
    -- add a predicate into the outer query
    SELECT
    convert(varchar(100), outer_results.[error]) as 'error',
    convert(int, outer_results.[severity]) as 'severity',
    convert(varchar(max), outer_results.[sql_text]) as 'sql_text',
    count(*) as [count_occurances]
    FROM
        (
        SELECT
        xdata.value('(/event/data[4]/value)[1]', 'varchar(100)') AS [error],
        xdata.value('(/event/data[@name="severity"]/value)[1]', 'int') AS [severity],
        xdata.value('(/event/action[@name="sql_text"]/value)[1]', 'varchar(max)') AS [sql_text]
     
        FROM
            (
                select CAST(event_data as XML)
                from sys.fn_xe_file_target_read_file
                (
                'C:\temp\ArithmeticErrors_0_128981071730370000.xet', 
                'C:\temp\ArithmeticErrors_0_128981071730390000.xem', 
                null, 
                null
                )
            ) as xmlr(xdata)
        ) as outer_results
     
    -- here is the extra predicate we have added
    where outer_results.severity < 17
    group by error, severity,sql_text
     

    I fully accept that this may not be the most reasonable way, or the most efficient way to do this, but it’s the way that I’ll be using for now.

  • SQL in Stockholm

    SQL Server performance problems due to high privileged CPU time when running Windows Server 2008 R2

    • 1 Comments

    It's very rare that I tell people to blindly apply service packs or hotfixes (especially Windows ones) without first analyzing where their performance problem lies, but if you are running SQL Server 2008 on Windows Server 2008 R2 (specifically this OS only) then you should absolutely ensure that you've installed the following fix:

    976700  An application stops responding, experiences low performance, or experiences high privileged CPU usage if many large I/O operations are performed in Windows 7 or Windows Server 2008 R2
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;976700

    In this scenario, the key symptom which you normally ought to notice is that you are getting high kernel times, which you can see in the following perfmon counter:

    Process(sqlservr)\% Privileged Time

    We've seen a trend of this issue in recent months, so it's something to look to apply in your next maintenance window if you can.

     

  • SQL in Stockholm

    Compatibility problems between DMO and SMO connectivity on newer versions of SQL Server

    • 1 Comments

    We've recently been looking at a problem which triggered this error:

    [Microsoft][ODBC SQL Server Driver][SQL Server]To connect to this server you must use SQL Server Management Studio or SQL Server Management Objects (SMO)

    In our case it was actually being thrown by the SQL Server Web Data Administrator which is available on codeplex, but it could just as well be any application that you've written yourselves. In this case the application in question was running on an old Windows 2003 server and had been working fine for a long time, when connecting to old SQL 2000 instances. However when they started trying to connect to later versions of SQL they were getting the above error within the following stack:

    at SqlAdmin.ISqlServer.Connect(Object ServerName, Object Login, Object Password)
    at SqlAdmin.SqlServer.Connect()
    at SqlWebAdmin.databases.Page_Load(Object sender, EventArgs e)
    at System.Web.UI.Control.OnLoad(EventArgs e)
    at System.Web.UI.Control.LoadRecursive()
    at System.Web.UI.Page.ProcessRequestMain()

    Since you can check out the source of this application, I did check it out, but excluded it as a problem fairly early on. We came back to focus on the SQLDMO.dll itself and via some rigorous structured testing in the lab we managed to reproduce the exact error and provide an appropriate solution. (If you wonder what I mean by rigorous structured troubleshooting, I mean that I built a Windows 2003 virtual machine in a lab and then applied updates to it sequentially, checkpointing every time something changed, and testing the various hypotheses that I had on each checkpoint. When you're working with DLL versioning problems I find this to be a very effective - and in my opinion necessary - methodology.)

    Anyway, the customer web server in question was running the following version of SQLDMO.DLL:

    Module[945] [C:\PROGRAM FILES\WINDOWS\SYSTEM32\SQLDMO.DLL]
      Company Name:      Microsoft Corporation
      File Description:  Distributed Management Objects OLE DLL for SQL Enterprise Manager
      Product Version:   (8.0:760.0)
      File Version:      (2000.80:760.0)
      File Size (bytes): 4215360
      File Date:         ti joulu 17 19:25:22 2002
        Module TimeDateStamp = 0x3dffbfbc - Wed Dec 18 02:22:20 2002
        Module Checksum      = 0x00411b8a
        Module SizeOfImage   = 0x00405000
      Module Pointer to PDB = [dll\sqldmo.pdb]
        Module PDB Signature = 0x3dff938b
        Module PDB Age = 0x9e

    (I was running MPSReports during my tests to extract and save exact configurations). This is a build from SQL Server 2000 SP3a, a fairly recognisable version number for those of us who have been around SQL for a few years!

    To cut a long story short, all SQL 2000 versions of this DLL are incompatible with higher major versions of SQL Server, and you need to patch this DLL. There are a number of articles about installing SQLDMO on the web, but I didn't find any of them to be particularly 100% reliable for all circumstances, which is why I tested it so thoroughly. I wanted to be sure that I could patch this DLL and only this DLL without changing anything else on a production machine. (For example it was hypothesized that installing SQL 2005 or higher would resolve the problem - this in fact turned out to be true, but we didn't want to do this. I also didn't want to be manually registering DLLs.)

    We instead used the backward compatibility MSI from SQL 2005, which can be downloaded and run independently here:

    Feature Pack for Microsoft SQL Server 2005 - November 2005

    When running this we removed all the features apart from SQLDMO, and this ensured that we correctly upgraded SQLDMO.DLL (and its related resource files) in the SQL 2000 directory here

    C:\Program Files\Microsoft SQL Server\80\Tools\Binn

    This upgrades SQLDMO.DLL to a version of this format

    8.05.2312

    The 05 being the key identifier here. Everything will then work fine.

    The main point here being that one can install and run a version of DMO which will be compatible with all major versions of SQL higher than 2000, and you can do it in a controlled and easy manner, without resorting to the manual registrations listed in older KBs like these

    http://support.microsoft.com/kb/326613
    http://support.microsoft.com/kb/248241
    http://msdn.microsoft.com/en-us/library/aa274817(v=sql.80).aspx

    Hope this helps.

    Graham

     

  • SQL in Stockholm

    SQL Server Performance Assessment – What are my top 20 expensive batches

    • 1 Comments

    Previously here, I discussed the nature of collecting data when you are experiencing a SQL Server performance problem. A performance problem sometimes intimidates people, as they are not sure where to start. This post details one of my preferred ways to take a quick overview of the situation.

    Let’s assume that using one of the techniques described, you have some SQL Profiler data. One of the first things that I like to do is aggregate the data together into a database so that I can then simply query it for both an overview of aggregated problems or fetch details of individual batches quickly and easily, using TSQL.

    My current preferred method of doing this is to use the readtrace tool that installs as part of RML Utilities. You might choose to run readtrace.exe from the command line (as I do), or you might prefer to also install SQLNexus and set the options there to have it run readtrace for you. Either way at the end of it you’ll be left with a database that contains the readtrace schema.

    What I like to discover first off is the answer to the question

    “Are there actually SQL batches that are taking a long time to run?”

    This may sound obvious, but considering the number of different types of performance problems, and the different sub sets of data you might want to analyze, this provides an immediate split of the problem into separate areas. With this data you can follow a line of questioning like this:

    1. Do I have long running SQL batches?
    2. If yes – is it the same batches that are long running all the time?
    3. Is it a small number or even a single batch that is long running?
    4. Can I recognize a pattern to this behavior? Is it the result of a recent application change?
    5. Is this a crazy query being issued manually by a user?

    And so on….the point being as well that if the answer to the first question is “no”, then maybe you should be looking elsewhere within your application architecture for the problem, or maybe you should be collecting additional or different diagnostics. There’s no hard and fast rule here, but it’s always nice to have some sort of an idea of what direction you might wish to proceed in.

    Consider the following example. You make a small change to a previously healthy application and suddenly all your users start complaining of unbearably slow performance. You collect the profiler data and run it through readtrace. When you query the readtrace data you find that a single SQL batch is always at the top of every list of the most expensive batches, under every category. This is a stored procedure which acts as a common entry point for all application functionality. Knowing this you can now work out that every user is experiencing application slowness because this common stored procedure always runs slowly. You can now focus on why this specific piece of code is going slow and deal with it as necessary.

    So anyway, the script is attached to this post. It’s based upon one of the old scripts which we used to use internally before the advent of SQLNexus, and has been amended by me to suit my requirements. I often use it to send customers summaries of their badly performing code by email, and so it’s designed to have results output to text. This is also why it only prints the initial 100 characters of each batch, but you can change this easily enough if you have long SQL batches.
    You might note that it only does batches, and not statements. This is because, first off, when I have a random performance problem I don’t tend to collect statement level data in profiler, as it’s so much more expensive to collect and can sometimes exaggerate the situation. I work on the basis of taking a summary collection and then going back for more targeted data if need be. readtrace does know how to analyze statement level data if you collect it (in fact you’ll note that if you don’t collect it, it warns you) so you can always write your own version for statement level data by following this principle and looking at the readtrace schema which comes with the documentation.
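
    For those without the attachment to hand, the general shape of such a query against the readtrace schema is something like this. Treat it as a sketch only, and check the table and column names against the readtrace schema documentation shipped with your version of RML Utilities:

    -- top 20 batches by total duration, grouped on the first 100 characters of the normalised batch text
    SELECT TOP 20
        CONVERT(nvarchar(100), ub.NormText) AS batch_text,
        COUNT(*)        AS executions,
        SUM(b.Duration) AS total_duration,
        SUM(b.CPU)      AS total_cpu,
        SUM(b.Reads)    AS total_reads,
        SUM(b.Writes)   AS total_writes
    FROM ReadTrace.tblBatches AS b
    INNER JOIN ReadTrace.tblUniqueBatches AS ub ON ub.HashID = b.HashID
    GROUP BY CONVERT(nvarchar(100), ub.NormText)
    ORDER BY SUM(b.Duration) DESC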

    Finally you might ask, why not do this with DMVs and look at the data real time. Well you can do this, and sometimes I would, but that’s not what this post is about! There are both advantages and disadvantages to this method, and I do have an archive of DMV queries which I’m ready to run should I ever do online troubleshooting of this nature. These are the type of queries which you can also see if you download the performance dashboard reports. But as I say, I really like readtrace because once you have the data in this format, you can run your stock queries like the one attached here, but if you find something out of the ordinary, then you can write something custom to analyze the data in more detail.

  • SQL in Stockholm

    1413 error when starting Database Mirroring – How many virtual log files is too many?

    • 1 Comments

    Recently in Cumulative update package 6 for SQL Server 2005 Service Pack 2 we released a fix for a problem where you receive a 1413 error when starting database mirroring. The error that you get looks like this:

    Error: 1413, Severity: 16, State: 1.

    Communications to the remote server instance 'TCP://ComputerName:PortNumber' failed before database mirroring was fully started. The ALTER DATABASE command failed. Retry the command when the remote database is started.

    In addition to this, if you look in the error logs themselves you will normally see the following 2 errors reported as well:

    Error: 1443, Severity: 16, State: 2.

    Database mirroring has been terminated for database 'xxx'. This is an informational message only. No user action is required.

    And you also note:

    Error: 1479, Severity: 16, State: 1.

    The mirroring connection to "TCP://ComputerName:PortNumber" has timed out for database "xxx" after 10 seconds without a response. Check the service and network connections.

    Although you can install the hotfix, I think it’s worth considering why this problem actually occurs and how you can potentially avoid it happening at all through good database management practices, which have benefits far beyond the configuration of database mirroring.

    Hopefully you are familiar with the architecture of a SQL Server transaction log file. In principle the primary reason that this 1413 and its associated time out errors occur is that when you start database mirroring SQL Server needs to read through the transaction log file to bring the 2 databases into a synchronized state, following the restore of the last transaction log backup to the mirror partner. The time it takes to do this is dependent upon the number of virtual log files contained within the physical log file. This sometimes leads to confusion, as a relatively small physical log file can contain thousands of virtual log files if it’s been badly managed. This in turn means that a relatively small physical log file (say 1GB) can be the cause of this 1413 error, while the exact same database configured for mirroring on the exact same hardware will set up successfully when it has an identically sized physical log, but one that has been better managed and doesn’t contain so many virtual log files.

    To demonstrate this, try the following reproduction examples. To run these you’ll need 2 SQL Server instances with mirroring endpoints already set up (I’ll assume if you’re reading this far that you already understand this). A good place to try this repro is a test server environment where you already have a mirroring partnership configured. The repros use the undocumented DBCC LOGINFO command to examine the number of virtual log files in the physical log. Remember as with all undocumented commands, use of it is not supported, and you use it at your own risk.

    The example uses a common scenario where a log has been left to grow and runs out of space, but actually there are many different ways that you can cause your log to grow. Some are documented here. This is just a quick way to run the demo.

    Script 1: produces the 1413 failure – note the thousands of VLFs

    --CONNECT to the primary server instance

    use master

    go

     

    drop database vlf --(if you need to)

     

    --create a database on the root of C with a small starting log size

    --which then grows in small increments

    create database vlf

    on (name=vlf, filename='c:\vlf.mdf')

    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1MB,maxsize=1GB,filegrowth=1MB)

    go

     

    use vlf

    go

     

    --create a dummy table to fill up

    create table vlf_test

    (

    a int identity(1,1) not null,

    b char(8000) not null,

    c int not null

    )

    go

     

    insert vlf_test values ('vlftest',0)

    go

     

    --create a starting point of backups

    backup database vlf to disk = 'c:\vlf.bak' with stats = 1

    go

    backup log vlf to disk = 'c:\vlf1.bak'

    go

     

    set nocount on

    go

     

    --grow the log until it runs out of space

    declare @x int

    declare @a char(8000)

    declare @b char(8000)

    set @a = replicate('a',8000)

    set @b = replicate('b',8000)

    set @x = 1

    while 1=1

    begin

    update vlf_test set b = @b, c = @x where a = 1

    update vlf_test set b = @a where a = 1

    set @x = @x + 2

    end

    go

     

    --now alter the log size and backup the log

    alter database vlf modify file (name=vlf_log, maxsize=2GB)

    go

    backup log [vlf] to disk = 'c:\vlf2.bak' with stats=1

    go

     

    --use the undocumented DBCC LOGINFO command to check the number of VLFs in our log

    dbcc loginfo (vlf)

    go

    --note the number of VLFs, it should be several thousand

    --for me it was 4100

     

    --connect to the other server which you wish to be the mirror

    --copy the 3 backup files across to its C drive and run the following

    use master

    go

     

    restore database [vlf] from disk = 'c:\vlf.bak' with norecovery, replace, file=1,

    move 'vlf' to 'c:\vlf.mdf',

    move 'vlf_log' to 'c:\vlf_log.LDF'

    go

    restore log [vlf] from disk = 'c:\vlf1.bak' with norecovery

    go

    restore log [vlf] from disk = 'c:\vlf2.bak' with norecovery, stats=1

    go

     

    --you now have a pair of databases ready to have mirroring started

    --try to start database mirroring using the SSMS GUI and you will receive the 1413 error

    --the only exception to this might be if you run it on particularly fast hardware

    Script 2: works fine - identical except that the log has been pre-grown to the same size. This is good practice for all database files, whatever their type.

    --CONNECT to the primary server instance

    --remember to delete any old backup files of the same name from previous tests

    use master

    go

     

    drop database vlf --(if you need to)

     

    create database vlf

    on (name=vlf, filename='c:\vlf.mdf')

    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1GB,maxsize=1GB,filegrowth=250MB)

    go

     

    use vlf

    go

     

    create table vlf_test

    (

    a int identity(1,1) not null,

    b char(8000) not null,

    c int not null

    )

    go

     

    insert vlf_test values ('vlftest',0)

    go

    backup database vlf to disk = 'c:\vlf.bak' with stats = 1

    go

    backup log vlf to disk = 'c:\vlf1.bak'

    go

    set nocount on

    go

     

    --this will grow the log until it runs out of space

    declare @x int

    declare @a char(8000)

    declare @b char(8000)

    set @a = replicate('a',8000)

    set @b = replicate('b',8000)

    set @x = 1

    while 1=1

    begin

    update vlf_test set b = @b, c = @x where a = 1

    update vlf_test set b = @a where a = 1

    set @x = @x + 2

    end

    go

     

    --now alter the log size and backup the log

    alter database vlf modify file (name=vlf_log, maxsize=2GB)

    go

    backup log [vlf] to disk = 'c:\vlf2.bak' with stats=1

    go

     

    --use the undocumented DBCC LOGINFO command to check the number of VLFs in our log

    dbcc loginfo (vlf)

    go

    --note the number of VLFs, it should be only about 16 or so

     

    --connect to the other server which you wish to be the mirror

    --copy the 3 backup files across to its C drive

     

    restore database [vlf] from disk = 'c:\vlf.bak' with norecovery, replace, file=1,

    move 'vlf' to 'c:\vlf.mdf',

    move 'vlf_log' to 'c:\vlf_log.LDF'

    go

    restore log [vlf] from disk = 'c:\vlf1.bak' with norecovery

    go

    restore log [vlf] from disk = 'c:\vlf2.bak' with norecovery, stats=1

    go

     

    --you now have a pair of databases ready to have mirroring started

    --try to start database mirroring using the SSMS GUI and it should work

    --the only exception to this might be if you run it on particularly slow hardware

    I ran my test on 2 small single processor servers. You might get slightly different results as the scripts note, depending upon your exact configuration. I used SQL Server 2005 build 3186 (CU#3) for this test.

    My main point here is that you should look to control your transaction log proactively. There are obviously times when this is not possible, or other mitigating circumstances arise, which is why we have released a hotfix to improve the speed with which the database mirroring configuration reads the log files. This improvement is also included in SQL Server 2008. However there are other issues that you can encounter if you let your virtual log files grow unchecked, such as excessive recovery times, so as with all SQL Server file management tasks, proactive is best. We always recommend that our customers analyze their expected usage and pre-grow their database files.
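
    By way of example, pre-growing the log file used in the repro above to its full working size in one step is as simple as the following (adjust the file name and sizes to your own database; the new SIZE must be larger than the current size):

    -- pre-grow the existing log file in one operation rather than relying on small auto-growths
    ALTER DATABASE vlf
    MODIFY FILE (NAME = vlf_log, SIZE = 1GB, FILEGROWTH = 250MB)
    GO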

    I hope this helps to explain the database mirroring error in more depth, but feel free to post a comment if you have a question.

  • SQL in Stockholm

    Slow recovery times and slow performance due to large numbers of Virtual Log Files

    • 1 Comments

    In my previous post about database mirroring configuration problems caused by large numbers of virtual log files, I mentioned that such a scenario can have other unwanted side effects, with or without database mirroring being enabled. Here’s an example that demonstrates one of these scenarios. It shows how recovery times after a disaster can be significantly extended by allowing your log file to grow in small increments, thereby creating thousands of VLFs. It’s worth noting as well that this scenario also demonstrates the performance impact of doing online file growth during periods of heavy activity. This scenario can affect you with or without mirroring enabled; it’s just quicker to demonstrate than the mirroring configuration problem.

    I mentioned before that to fully understand this topic you need to be familiar with Transaction Log Physical Architecture and also Factors That Can Delay Log Truncation.

    There is one simple lesson to be learnt here, so I’ll write it early on in the article:

    Pre-grow your log files (and your data files) in advance. Avoid repeated online auto-growth if at all possible.

    Consider the following example scripts which are based upon the same routines as used in the previous post, but with some small changes.

    These scripts do the following actions, one for a good scenario and one for the bad scenario.

    1. Create a database named VLF

    2. Create a starting backup point

    3. Open a transaction in this database to ensure that the transaction cannot be truncated

    4. Fill up the log with thousands of transactions until the log reaches the space limit set for it – in this case we test with a 5GB log. The test of the log running out of space is used just to guarantee log files of equivalent sizes for the recovery time comparison. The findings hold equally for logs that haven’t run out of space.

    5. Restart the server without taking a backup or dealing with the log space. This forces SQL to recover the entire contents of the active portion of the log, which is in effect, all of it.

    We then just start the SQL Server service again and monitor the recovery times and look in the SQL Server error logs.

    We notice a number of interesting points here:

    1. The runtime required to fill up the 5GB log file is significantly shorter when using the pre-grown log file: less than 2 minutes for the pre-grown log file compared to over 9 minutes for the incrementally auto-grown file. This demonstrates why auto-growths can be bad. They are expensive to manage and can harm your performance. The fact that the log fills quicker indicates that you are managing to push through the transactions faster.

    2. There is a delay of about 2 minutes when nothing is logged about the recovery phases of the VLF database in the bad scenario. This is the period in between the database starting up and the analysis phase beginning. In reality there is a large amount of work going on here; SQL Server is processing the 20000 VLFs that are present in the log file. This delay can be extended in proportion to the number of VLFs that you allow your log file to hold. I have seen real world examples where hundreds of thousands of VLFs have significantly delayed a recovery time.

    3. The actual recovery times are similar (about 2 minutes) once the recovery actually begins. These recovery times are this length due to the fact that we created the old open transaction in the test meaning that SQL is forced to recover the entirety of the log.

    4. As well as it being significantly faster to fill up the log (in terms of runtime), which indicates that the processing of transactions is faster, when we look at the log in detail we see that we have processed about 6500 more transactions in the shorter period. Further evidence of how much better performance the pre-grown log file gives you. We’ve managed to process 25% more transactions in a quarter of the time!

    The scripts and error log output are shown below. The scripts use the undocumented DBCC LOGINFO command to examine the number of virtual log files in the physical log. Remember, as with all undocumented commands, its use is not supported, and you run it at your own risk. The number of rows that this command outputs represents the number of VLFs currently in the physical log.
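
    If you want to capture the VLF count programmatically rather than eyeballing the DBCC LOGINFO output, a quick sketch like the following works against the current database. The column list matches the SQL Server 2005/2008 output of DBCC LOGINFO; later versions add an extra leading column, so treat this as a sketch rather than a portable script:

    -- capture DBCC LOGINFO into a temp table and count the rows (one row = one VLF)
    CREATE TABLE #loginfo
    (
        FileId      int,
        FileSize    bigint,
        StartOffset bigint,
        FSeqNo      int,
        Status      int,
        Parity      tinyint,
        CreateLSN   numeric(25,0)
    )

    INSERT INTO #loginfo
    EXEC ('DBCC LOGINFO')

    SELECT COUNT(*) AS vlf_count FROM #loginfo

    DROP TABLE #loginfo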

    SCRIPT 1 – Bad scenario

    --CONNECT to the test server instance

     

    use master

    go

     

    if exists (select name from sys.databases where name = 'vlf')

    begin

    drop database vlf --(if you need to)

    end

     

     

    --create a database on the root of C with a small starting log size

    --which then grows in small increments

    create database vlf

    on (name=vlf, filename='c:\vlf.mdf')

    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1MB,maxsize=5GB,filegrowth=1MB)

    go

     

    use vlf

    go

     

    --create a dummy table to fill up

    create table vlf_test

    (

    a int identity(1,1) not null,

    b char(8000) not null,

    c int not null

    )

    go

     

    insert vlf_test values ('vlftest',0)

    go

     

    --create a starting point of backups

    backup database vlf to disk = 'c:\vlf.bak' with stats = 1

    go

    backup log vlf to disk = 'c:\vlf1.bak'

    go

     

    --open a 2nd new connection and open an explicit uncommitted transaction

    --to ensure that the transaction log cannot truncate

     

    use vlf

    go

     

    begin tran

    create table opentran

    (

    opentran int

    )

    Go

     

    --switch back to the first connection

    --now run the following to grow the log to create lots of VLFs

    --run until the log runs out of space - a convenient stopping point

     

    use vlf

    go

    set nocount on

    go

     

    declare @x int

    declare @a char(8000)

    declare @b char(8000)

    set @a = replicate('a',8000)

    set @b = replicate('b',8000)

    set @x = 1

    while 1=1

    begin

    update vlf_test set b = @b, c = @x where a = 1

    update vlf_test set b = @a where a = 1

    set @x = @x + 2

    end

    go

     

    --let this run until the log runs out of space

    --on the 2nd connection, use dbcc loginfo to check how many VLFs you have

     

    DBCC LOGINFO (vlf)

     

    --when you have reached 5GB of log file with several thousand VLFs

    --simulate a server disaster by shutting SQL Server down ungracefully

    --alternatively you could just power off the server

     

    shutdown with nowait

     

    --now restart the SQL service or power on the server

    I get the following results from the error log when I run this on my test server.

    Runtime of filling the log: 9 minutes 21 seconds

    Number of VLFs when log full: 20480

    Time for VLF to recover fully: 3 minutes 40 seconds

    Excerpts from the SQL Server log:

    Server starts:

    2008-05-16 13:22:40.85 Server Microsoft SQL Server 2005 - 9.00.3186.00 (X64)

    VLF db starts up

    2008-05-16 13:22:47.97 spid18s Starting up database 'vlf'.

    A 2-minute pause during which nothing is logged about the vlf database, then analysis starts. It is during this period that SQL Server is processing the large number of VLFs.

    2008-05-16 13:24:34.28 spid18s Analysis of database 'vlf' (19) is 3% complete (approximately 8 seconds remain). This is an informational message only. No user action is required.

    Redo has started

    2008-05-16 13:24:45.80 spid18s Recovery of database 'vlf' (19) is 3% complete (approximately 108 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

    Note how many transactions have been rolled forward – this is how many we did in the test

    2008-05-16 13:26:21.36 spid18s 26533 transactions rolled forward in database 'vlf' (19). This is an informational message only. No user action is required.

    Undo has started

    2008-05-16 13:26:21.39 spid18s Recovery of database 'vlf' (19) is 99% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

    Recovery complete

    2008-05-16 13:26:21.49 spid5s Recovery is complete. This is an informational message only. No user action is required.

    SCRIPT 2 – Good scenario

    The script is identical to the bad scenario shown above, apart from the fact that we create the log at a pre-expanded size of 5GB, using this command:

    create database vlf

    on (name=vlf, filename='c:\vlf.mdf')

    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=5GB,maxsize=5GB,filegrowth=1MB)

    This gives the following results on the identical server.

    Runtime of filling the log: 1 minute 54 seconds

    Number of VLFs when log full: 16

    Time for VLF to recover fully: 2 minutes 2 seconds

    Excerpts from the SQL Server log:

    Server starts:

    2008-05-16 13:40:16.10 Server Microsoft SQL Server 2005 - 9.00.3186.00 (X64)

    VLF db starts up

    2008-05-16 13:40:22.64 spid18s Starting up database 'vlf'.

    A 6-second pause during which nothing is logged about the vlf database, then analysis starts. It is during this period that SQL Server is processing the VLFs, but because there are only 16, it completes much more quickly. This is the fundamental difference between the two examples.

    2008-05-16 13:40:28.99 spid18s Analysis of database 'vlf' (19) is 2% complete (approximately 157 seconds remain). This is an informational message only. No user action is required.

    Redo has started

    2008-05-16 13:40:44.22 spid18s Recovery of database 'vlf' (19) is 4% complete (approximately 106 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

    Note how many transactions have been rolled forward – this is how many we did in the test

    2008-05-16 13:42:18.69 spid18s 33216 transactions rolled forward in database 'vlf' (19). This is an informational message only. No user action is required.

    Undo has started

    2008-05-16 13:42:18.69 spid18s Recovery of database 'vlf' (19) is 99% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

    Recovery complete

    2008-05-16 13:42:18.89 spid5s Recovery is complete. This is an informational message only. No user action is required.

    I hope this demo gives further weight to the case for pre-growing your SQL Server files in advance. Feel free to post a comment if you have a question.

  • SQL in Stockholm

    How to configure SQL Server database mirroring between 2 instances on a standalone server

    • 1 Comments

    There are lots of troubleshooting articles about how to diagnose connectivity problems when setting up database mirroring, especially errors such as these:

    Msg 1418, Level 16, State 1, Line 2

    The server network address "tcp://mirrortest.com:5023" can not be reached or does not exist. Check the network address name and that the ports for the local and remote endpoints are operational.

    or

    Msg 1456, Level 16, State 3, Line 1

    The ALTER DATABASE command could not be sent to the remote server instance

    but here's something a little different.

    Imagine that you have a server which is not connected to a domain. It might just be a standalone machine, or a virtual machine that you've just built. I've done this a few times in virtual environments to do demos on laptops and the like. It's not a production scenario that you'd really come across, but it works well in test.

    The question that usually comes up is how to configure the partners (and endpoints) without a FQDN, since the machine isn't connected to a domain and is simply in "WORKGROUP". SQL Server still wants you to use FQDN formatting for the partner addresses. Depending on exactly how you did the setup, you might even have encountered an error like this when trying to start mirroring:

    Database Mirroring login attempt failed with error: 'Connection handshake failed.

    There is no compatible authentication protocol. State 21

    What you need to do to be able to set this up is as follows:

    1. adding "local" as the primary DNS suffix as in detailed this article here (ok it's about exchange server but it works the same)

    http://technet.microsoft.com/en-us/library/aa997119.aspx

    (remember to reboot)

    2. You can now specify FQDN formatted names for the local machine in the format:

    For the principal:

    TCP://MIRRORTEST.LOCAL:5022

    For the mirror

    TCP://MIRRORTEST.LOCAL:5023

    (Or other port numbers of your choice, but remember that they'll need to be different)

    3. You'll need to add these ports as exceptions in Windows Firewall if you're running it (I was, even though it was a local machine!)

    4. My endpoints are set up like this (I just created them through the GUI); there's nothing exceptional here:

    CREATE ENDPOINT [Mirroring]

    AUTHORIZATION [MIRRORTEST\Administrator]

    STATE=STARTED

    AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)

    FOR DATABASE_MIRRORING (ROLE = PARTNER, AUTHENTICATION = WINDOWS NEGOTIATE, ENCRYPTION = DISABLED)
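
    For completeness, here is a hedged sketch of how the partner addresses are then set, assuming a hypothetical database named mirrordemo that has already been backed up on the principal and restored WITH NORECOVERY on the mirror instance:

    -- on the mirror instance, point at the principal's endpoint first
    ALTER DATABASE mirrordemo SET PARTNER = 'TCP://MIRRORTEST.LOCAL:5022'
    -- then on the principal instance, point at the mirror's endpoint
    ALTER DATABASE mirrordemo SET PARTNER = 'TCP://MIRRORTEST.LOCAL:5023'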

    You can now run some cool tests and demos without ever being connected to a network.

  • SQL in Stockholm

    Compilation Bottlenecks – Error: 8628, Severity: 17, State: 0. – Part 2

    • 1 Comments

    [follow up to RESOURCE_SEMAPHORE_QUERY_COMPILE diagnosis]

    Last week I wrote about the initial diagnosis of this problem, and now here's the follow-up. If you remember, we had collected DBCC MEMORYSTATUS output along with a load of other PSSDIAG data, and we knew that the procedure cache seemed to be flushing as fast as we could put plans into it.

    When you come to a situation like this, there's no perfect rule for where to look for the answer. I certainly checked a few different things before I noticed the problem. The bottom line is that we have some data, we have a baseline from when the system was operating normally, and we need to review all of it to find something abnormal. We know that the cache is being flushed, so we're looking for events or circumstances which potentially use up large amounts of buffer pool memory, or cause a memory flush. I referred to one example of this in an earlier post, so I checked the error logs for that type of situation - nothing doing.

    Since we were thinking about memory-related problems, I used a text tool to compare the current DBCC MEMORYSTATUS output with previous captures. (Out of interest, my preferred text tool at the moment is notepad++; I can't say enough good things about it.)

    In doing so, I noticed the following significant change in one of the many values that the DBCC command reports.

    Problem occurring snippet:

    OBJECTSTORE_LOCK_MANAGER (Total)                                  KB
    ---------------------------------------------------------------- --------------------
    VM Reserved                                                     131072
    VM Committed                                                    131072
    AWE Allocated                                                   0
    SM Reserved                                                     0
    SM Commited                                                     0
    SinglePage Allocator                                            5769576
    MultiPage Allocator                                             0

    Baseline snippet:

    OBJECTSTORE_LOCK_MANAGER (Total)                                  KB
    ---------------------------------------------------------------- --------------------
    VM Reserved                                                     131072
    VM Committed                                                    131072
    AWE Allocated                                                   0
    SM Reserved                                                     0
    SM Commited                                                     0
    SinglePage Allocator                                            33232
    MultiPage Allocator                                             0

    In our baselines, the single page allocator for the lock manager consistently shows less than 50000KB, even in times of extreme user load. Now, however, we suddenly had 5769576KB, or about 5.5GB, which would pretty much explain where all our memory went. We monitored the situation for a few minutes and the value kept climbing, so it seemed likely that a rogue process or user was out there running a fairly crazy query.
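
    As an alternative to diffing raw MEMORYSTATUS output by hand, a sketch of the same check against the sys.dm_os_memory_clerks DMV is shown below; the single_pages_kb column name is from SQL Server 2005/2008, and later versions consolidate it into pages_kb.

    -- sum the single page allocations of the lock manager memory clerk
    SELECT type, SUM(single_pages_kb) AS single_pages_kb
    FROM sys.dm_os_memory_clerks
    WHERE type = 'OBJECTSTORE_LOCK_MANAGER'
    GROUP BY type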

    So all we had to do now was identify this person or batch. There were about 3,000 connections to the system, so the easiest way I could think of (which I mentioned in a very short post at the time) was to run a query against the sys.dm_tran_locks DMV and count the total number of locks per session id. This showed that one particular session was holding over 39 million key locks, which pretty much explained the problem!
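
    A minimal sketch of that kind of query looks like this:

    -- count locks held per session; the runaway session stands out at the top
    SELECT request_session_id, COUNT(*) AS lock_count
    FROM sys.dm_tran_locks
    GROUP BY request_session_id
    ORDER BY lock_count DESC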

    After this it was pretty simple to track down this user, ask them what they were doing, and stop them doing it. Once this batch was terminated the lock manager released its memory and everything soon went back to normal.

  • SQL in Stockholm

    Kill with rollback estimation

    • 1 Comments

    It's always amazing when you find something new in something so common that you thought you knew well. When resolving the compilation bottleneck issue I mentioned in the previous two posts, I had to use the KILL command. It's actually incredibly rare that I use this command, as I don't believe in using it in production if at all avoidable, as a matter of principle. Anyway, in that particular situation it just had to be done. The session we were killing was doing a huge update, so it had to roll back, and that was going to take a while (I could guess this from the fact that it held 39 million locks!). However, I'd never noticed that the KILL command now comes with rollback estimation syntax. Definitely a good feature, but not one to abuse :-)

    From BOL, it looks like this:

    KILL 54 WITH STATUSONLY;
    GO

    --This is the progress report.
    spid 54: Transaction rollback in progress. Estimated rollback completion: 80% Estimated time left: 10 seconds.

  • SQL in Stockholm

    New SQL Blogger - a must read

    • 1 Comments

    An ex-colleague of mine Elisabeth Redei, has just opened a new blog at http://sqlblog.com/blogs/elisabeth_redei/default.aspx 

    Elisabeth is extremely knowledgeable and has spent many years in Microsoft SQL Server support roles. You should definitely add her to your reading list, if you haven't already seen her at sqlblog.com

    She's just posted a good article here about statistics:

    http://sqlblog.com/blogs/elisabeth_redei/archive/2009/03/01/lies-damned-lies-and-statistics-part-i.aspx
