SQL in Stockholm

SQL Server thoughts from the support team in Stockholm

SQL Server troubleshooting thoughts from Graham Kent, (ex) Senior Support Escalation Engineer at Microsoft Sweden.

  • SQL in Stockholm

    Database Mirroring Error 1418 Troubleshooter

    • 3 Comments

    I've been posting in the MSDN and TECHNET mirroring forums in the last few days and I've noticed a lot of occurrences of the 1418 error which I have referred to briefly in this blog before. Therefore I thought I'd post a 1418 specific troubleshooter for reference. The problem we're trying to solve is that you try to start database mirroring and you receive this error:

    Msg 1418, Level 16, State 1, Line 1
The server network address "TCP://myserver.domain.domain.domain.com:5022" can not be reached or does not exist. Check the network address name and reissue the command.

    The following assumes that:

1. You are using separate physical servers for the mirror partners. If you're not, you might want to read this post about configuring mirroring on local instances and a typical problem you can come across.

    2. You're using domain accounts for SQL Server services

    3. You're using windows authentication and you're not using certificates

4. You've already read this, or are familiar with the overall principle.

    So here are the things that I would check:

1. Have you restored a fairly recent full backup to the planned mirror server, with NORECOVERY? Have you then restored a transaction log, also with NORECOVERY? If you haven't, you won't be going anywhere with mirroring. Maybe you should read this first :-)

2. Have you configured your endpoints correctly? To be specific, this means:

a) Are they started?

b) Are they using the same form of authentication?

c) Are they using the same form of encryption?

d) Does the service account have permission to use them?

Even if you think you have done this correctly (including if you've used the wizard to configure them), don't assume the endpoints are correct. You've encountered an error and you need to be sure. Script each endpoint and check that all the settings are correct; a query to check these settings is sketched after this list. I've seen this happen too many times to mention, where people thought they were OK (because they "ought" to have been) but it was something simple like one of them not being started. Using the wizard does not guarantee that everything will always work. Remember, all the wizard does is execute T-SQL, so mistakes or problems are still possible.

    3. Are the ports you have selected valid, and are they open? Are they being used by any other processes? Can you telnet to the port?

4. Is there a firewall (hardware or software)? Have you created an exception / rule in the firewall? Windows Firewall is no exception to this rule. It will block mirroring traffic just as much as any physical firewall will.

5. Is the FQDN valid? Can it be resolved? Again, can you telnet to it? If you're not using an FQDN and you're using direct IP addresses, the same rules still apply. Can you telnet to it? If you're not using an FQDN or IP, have you checked the above post about resolving names locally?
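If you want to double check the endpoint settings mentioned in point 2, here is a minimal sketch of a query you can run on each partner (the endpoint name and login below are placeholders, not necessarily what your setup uses):

-- verify that state, role, authentication and encryption match on both partners
SELECT e.name, e.state_desc, e.role_desc,
       e.connection_auth_desc, e.encryption_algorithm_desc, t.port
FROM sys.database_mirroring_endpoints AS e
JOIN sys.tcp_endpoints AS t ON e.endpoint_id = t.endpoint_id;

-- if the other partner's service account maps to a different login,
-- that login also needs CONNECT permission on the endpoint
GRANT CONNECT ON ENDPOINT::Mirroring TO [MYDOMAIN\sqlserviceaccount];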

    I hope this helps. It's my experience that these points will solve the majority of 1418 errors. If you have any questions about this topic feel free to post here or in the forums.

    Graham

  • SQL in Stockholm

    Troubleshooting SQL Server deadlocks and performance when running SharePoint

    • 2 Comments

I’ve worked on a lot of cases this year involving SharePoint. Here are some example problems I encountered recently. If you want to read the background or the formal advice first, here are some good starting links:

    Planning and Monitoring SQL Server Storage for Office SharePoint Server: Performance Recommendations and Best Practices (white paper)

    Plan for performance and capacity (Office SharePoint Server)

    In addition to this, you need to consider what you can and can’t do from a SQL Server perspective, if you’re trying to optimize or troubleshoot your SharePoint deployment. You need to pay detailed attention to these articles:

    http://support.microsoft.com/kb/932744/

    http://support.microsoft.com/kb/841057/

    In simple terms, don’t go amending the schema or the objects! So back to my main point…

    Things to do first:

1. Collect some decent SQL Server diagnostic data – my preferred option, as ever, is SQLDIAG with the perfstats scripts

2. Turn on trace flag 1222 to output deadlock information to the errorlog (a sketch of this is shown after this list), or capture deadlock graphs in your profiler traces.

3. Collect this data at periods of both good and bad performance, so you have a benchmark to compare against

4. Aggregate the profiler output. Personally I use readtrace.exe directly from RMLUtilities, but others like to use SQLNexus as well.
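For point 2, here is a minimal sketch of enabling the trace flag globally (it can also be added as a -T1222 startup parameter so that it survives a service restart):

-- write deadlock details to the SQL Server error log
DBCC TRACEON (1222, -1);

-- confirm which trace flags are currently active
DBCC TRACESTATUS (-1);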

Here are 2 real examples of problems I encountered recently and diagnosed using the above data collection:

1. Are you experiencing significant parallelism waits? SQL Server installations have a default max degree of parallelism (MAXDOP) setting of 0. I have observed certain MOSS deployments this year where this is not the best MAXDOP, depending upon the specifics of the data distribution and deployment. If you look in the perfstats output you might see evidence like this:

    -- requests --
    runtime                        session_id request_id  ecid        blocking_session_id blocking_ecid task_state      wait_type
    ------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- ---------
    2009-05-27T15:06:38.060                68           0           4                   0             0 RUNNING         CXPACKET 
    2009-05-27T15:06:38.060                68           0           0                   0             4 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060                96           0           0                   0             0 SUSPENDED       WAITFOR  
    2009-05-27T15:06:38.060               116           0           0                   0             0 RUNNING                  
    2009-05-27T15:06:38.060               128           0          12                   0             0 RUNNING         CXPACKET 
    2009-05-27T15:06:38.060               128           0           9                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               128           0           0                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           5                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0          11                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               128           0           6                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           2                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           3                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           4                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           7                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           8                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0           1                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               128           0          10                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               138           0          10                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0          12                   0             0 RUNNABLE        CXPACKET 
    2009-05-27T15:06:38.060               138           0          11                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           9                   0             0 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           0                   0             3 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           3                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           1                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           2                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           4                   0             8 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           7                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           5                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           6                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               138           0           8                   0            12 SUSPENDED       CXPACKET 
    2009-05-27T15:06:38.060               163           0           0                   0             0 RUNNABLE                 
    2009-05-27T15:06:38.060               170           0           0                 163             0 SUSPENDED       LCK_M_X  

This is a classic example, where a long blocking chain starts to build due to CXPACKET waits. Note that the same session_id is reported on multiple lines, but the ecid column changes. This indicates that the batch (session_id) is being processed in parallel by more than one worker (ecid). This extract is from very early on in the problem; from here the chains grow and grow, and eventually can result in multiple deadlocks. In this scenario the batches being executed by sessions 128 and 138 (which are both running PROC_LISTUNGHOSTEDFIELDTEMPLATESINLIST) are picking parallel plans inappropriately. We worked around this by changing MAXDOP on the server to a value of 1. This is definitely something that you should test in your environment, since it’s a server-wide setting. Currently, according to the above KB articles, there is no fixed recommendation about MAXDOP for MOSS deployments, but things can change, so you should check the fixed requirements of your version. In this scenario we saw a significant number of other procedures which encountered the same issue, such as proc_CheckDirectoryTree, proc_DeplGetListItemLinks and PROC_DEPLGETLISTITEMANDRESERVEID amongst others. A typical error you might see in the SharePoint application logs would be:

    Unexpected query execution failure, error code 1205. Additional error information from SQL Server is included below. "Transaction (Process ID 163) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction."

    A global change of MAXDOP = 1 resolved this specific situation.
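For reference, here is a minimal sketch of that server-wide change (test it carefully in your own environment first, since it affects every query on the instance):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;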

2. Are you encountering long blocking chains when accessing the AllDocs table? This object is commonly accessed by a number of different stored procedures, and I’ve seen some scenarios where it becomes a point of contention. The most problematic one I came across recently resulted in numerous deadlocks, which first throw the above 1205 error to the SharePoint application log, and then in addition throw the following error:

    InitializePublishingWebDefaults failed. Microsoft.SharePoint.SPException: Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT)) ---> System.Runtime.InteropServices.COMException (0x80004004): Operation aborted (Exception from HRESULT: 0x80004004 (E_ABORT))     at Microsoft.SharePoint.Library.SPRequestInternalClass.SetWebProps

    This second error occurs as SharePoint does not roll back properly after encountering the deadlock. If you have collected your SQLDIAG data as noted above, you’ll see symptoms like this in the perfstats script output:

    a) long blocking chains waiting on LCK_M_U

    -- requests --
    runtime                        session_id request_id  ecid        blocking_session_id blocking_ecid task_state      wait_type                                          wait_duration_ms     
    ------------------------------ ---------- ----------- ----------- ------------------- ------------- --------------- -------------------------------------------------- -------------------- 
    2009-05-28T10:01:40.973                60           0           0                 197             0 SUSPENDED       LCK_M_X                                                            3201 
    2009-05-28T10:01:40.973                72           0           0                  60             0 SUSPENDED       LCK_M_S                                                            3224 
    2009-05-28T10:01:40.973                77           0           0                   0             0 SUSPENDED       WAITFOR                                                            9087 
    2009-05-28T10:01:40.973                84           0           0                  72             0 SUSPENDED       LCK_M_S                                                            1018 
    2009-05-28T10:01:40.973               197           0           0                   0             0 RUNNING                                                                               0 
    2009-05-28T10:01:40.973               268           0           0                 270             0 SUSPENDED       LCK_M_U                                                           36299 
    2009-05-28T10:01:40.973               270           0           0                 197             0 SUSPENDED       LCK_M_U                                                           96753 
    2009-05-28T10:01:40.973               277           0           0                 270             0 SUSPENDED       LCK_M_U                                                           44290 
    2009-05-28T10:01:40.973               280           0           0                 270             0 SUSPENDED       LCK_M_U                                                           60461 
    2009-05-28T10:01:40.973               292           0           0                 270             0 SUSPENDED       LCK_M_U                                                           52389 

b) the notableactivequeries part of the perfstats script will potentially show that the plan_total_exec_count is very low. This indicates that you appear to be running a freshly compiled plan for the blocking statement. This is all the more likely if your system was running well, and then suddenly started degrading in performance and encountering errors.

    -- notableactivequeries --
    runtime                        session_id request_id  plan_total_exec_count plan_total_cpu_ms    plan_total_duration_ms plan_total_physical_reads plan_total_logical_writes plan_total_logical_reads 
    ------------------------------ ---------- ----------- --------------------- -------------------- ---------------------- ------------------------- ------------------------- ------------------------ 
    2009-05-28T10:01:40.973               197           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               268           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               270           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               277           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               280           0                     5                33057                 406079                      2550                         0                  2352015 
    2009-05-28T10:01:40.973               292           0                     5                33057                 406079                      2550                         0                  2352015 


c) when reviewing the error logs you note that the same statement shown below, being executed within the procedure proc_UpdateTpWebMetaData, is always involved in the deadlocking:

    UPDATE
        Docs
    SET
        Docs.BuildDependencySet = NULL
    FROM
        Docs INNER JOIN BuildDependencies
    ON
        BuildDependencies.SiteId = Docs.SiteId AND
        BuildDependencies.LeafName = Docs.LeafName AND
        BuildDependencies.DirName = Docs.DirName
    WHERE
        BuildDependencies.SiteId = @SiteId AND
        BuildDependencies.TargetDirName = @PreviousCustomMasterDir AND
        BuildDependencies.TargetLeafName = @PreviousCustomMasterLeaf AND
        Docs.BuildDependencySet IS NOT NULL

The reason for this problem, in this particular scenario, is that a new plan had recently been compiled for this statement which used a HASH join, as opposed to the nested loop join it was using earlier. This was caused by a lack of up-to-date statistics on the AllDocs table. We noted scenarios where the regular update statistics jobs (issued by SharePoint as part of the changes in 2007 SP2) were not thorough enough, and we were forced to run update statistics with the fullscan option on the AllDocs table.
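Here is a minimal sketch of that kind of targeted maintenance, assuming the AllDocs table lives in the default dbo schema of the affected content database:

-- run in the affected SharePoint content database
UPDATE STATISTICS dbo.AllDocs WITH FULLSCAN;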

The bottom line is that although SharePoint now does database maintenance of its own, as shown in KB932744, there may still be times when you need to do more detailed, targeted maintenance.

  • SQL in Stockholm

    Non-yielding IOCP Listener - Stack Dump analysis

    • 2 Comments

    Here's something I was looking at yesterday which is worth a quick look. It's quite a common scenario and one that's relatively easy to address. Your SQL Server 2005 instance suddenly throws the following to the error log (apparently without warning):

    2008-09-09 08:30:44.810 Server       Using 'dbghelp.dll' version '4.0.5'
    2008-09-09 08:30:44.960 Server       **Dump thread - spid = 0, PSS = 0x0000000000000000, EC = 0x0000000000000000
    2008-09-09 08:30:45.000 Server       ***Stack Dump being sent to F:\MSSQL.1\MSSQL\LOG\SQLDump0001.txt                 
    2008-09-09 08:30:45.000 Server       * *******************************************************************************
    2008-09-09 08:30:45.010 Server       *                                                                                
    2008-09-09 08:30:45.010 Server       * BEGIN STACK DUMP:                                                              
    2008-09-09 08:30:45.010 Server       *   09/09/08 08:30:45 spid 0                                                     
    2008-09-09 08:30:45.010 Server       *                                                                                
    2008-09-09 08:30:45.010 Server       * Non-yielding IOCP Listener                                                     
    2008-09-09 08:30:45.010 Server       *                                                                                
    2008-09-09 08:30:45.010 Server       * *******************************************************************************
    2008-09-09 08:30:45.010 Server       * -------------------------------------------------------------------------------
    2008-09-09 08:30:45.010 Server       * Short Stack Dump                                                               
    2008-09-09 08:30:45.320 Server       Stack Signature for the dump is 0x000000000000020B                               
    2008-09-09 04:47:06.180 spid3s       LazyWriter: warning, no free buffers found.

Following this, several hundred lines of output in a format similar to the following are shown:

    2008-09-09 08:31:08.750 spid3s      
    Memory Manager
        VM Reserved = 16963064 KB
        VM Committed = 16942872 KB
        AWE Allocated = 0 KB
        Reserved Memory = 1024 KB
        Reserved Memory In Use = 0 KB 
    2008-09-09 08:31:08.750 spid3s      
    Memory node Id = 0
        VM Reserved = 16957304 KB
        VM Committed = 16937200 KB
        AWE Allocated = 0 KB
        SinglePage Allocator = 2519016 KB
        MultiPage Allocator = 39632 KB 
    2008-09-09 08:31:08.750 spid3s      
    MEMORYCLERK_SQLGENERAL (Total)
        VM Reserved = 0 KB
        VM Committed = 0 KB
        AWE Allocated = 0 KB
        SM Reserved = 0 KB
        SM Committed = 0 KB
        SinglePage Allocator = 48576 KB
        MultiPage Allocator = 5136 KB


    You may well recognize that this latter output is basically the same as you would get if you run:
    DBCC MEMORYSTATUS

    You'll also have noted that a small dump file has been produced in the default error log directory of this SQL instance (the path of which is noted in the error log output above).

In normal circumstances you can submit this dump file to us in support, and we can take a look at it and, all being well, tell you what the problem is. The thing you have to remember here is that the production of a dump file does not necessarily mean that you have encountered a bug in SQL Server (as some people often believe). In more recent versions of SQL Server there are numerous conditions, well known to us, which the engine detects and handles itself, producing a dump file automatically.

    This particular scenario was caused by inappropriate configuration of SQL Server memory settings, and should be easily rectified, so I thought it a good example to show why.

Although there is little that can normally be gleaned from debugging a SQL Server dump in the public domain, this one does contain a few hints. We don't publish private symbols for SQL Server, meaning that the amount you can do yourself with a dump file is quite limited (since you only have access to retail symbols); however, you can look at the thread stacks and see if anything of note stands out. Anyone who has read Ken Henderson's fantastic book about SQL Server internals will have used these techniques before. Sometimes it's just interesting to look a little deeper after all.

I'll assume that you're already familiar with installing and configuring the debugger, but if you're not, here's a good place to start: Debugging Tools and Symbols: Getting Started. This example just uses the retail symbol server address from that page:

    SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    If you open this dump, you'll find yourself starting with the context set to the thread that generated the dump, in this case:

    0:004> kn
    # Child-SP          RetAddr           Call Site
    00 00000000`0474de58 00000000`77d704bf ntdll!NtWaitForSingleObject+0xa
    01 00000000`0474de60 00000000`016be0f9 kernel32!WaitForSingleObjectEx+0x130
    02 00000000`0474df00 00000000`016bdaeb sqlservr!CDmpDump::DumpInternal+0x4a3
    03 00000000`0474dff0 00000000`01c7c7b3 sqlservr!CDmpDump::Dump+0x3b
    04 00000000`0474e040 00000000`01d3af67 sqlservr!CImageHelper::DoMiniDump+0x413
    05 00000000`0474e1a0 00000000`025f8cf5 sqlservr!stackTrace+0x6e7
    06 00000000`0474f6d0 00000000`01a3b025 sqlservr!SQL_SOSNonYieldIOCPCallback+0x1ef
    07 00000000`0476f7e0 00000000`018662b6 sqlservr!SOS_OS::ExecuteNonYieldIOCPCallbacks+0x44
    08 00000000`0476f830 00000000`010b9628 sqlservr!alloca_probe+0x220815
    09 00000000`0476f990 00000000`029002c9 sqlservr!SchedulerMonitor::Run+0x164
    0a 00000000`0476fac0 00000000`010079fe sqlservr!SchedulerMonitor::EntryPoint+0x9
    0b 00000000`0476faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
    0c 00000000`0476fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
    0d 00000000`0476fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
    0e 00000000`0476fd00 00000000`014789ca sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
    0f 00000000`0476fda0 00000000`01478911 sqlservr!SystemThread::RunWorker+0x59
    10 00000000`0476fde0 00000000`014795d9 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
    11 00000000`0476fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
    12 00000000`0476ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
    13 00000000`0476ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84

This is the thread that is actually doing the monitoring and generating the dump. You can see in frame 02 the name of the function doing this. If you had the private symbols at this point, you could look through some of the earlier frames and determine which thread caused this condition to occur, but I can't show that here :-)

The thread in question here was actually thread 8; we can switch to it and take a look:

    0:008> kn 50
    # Child-SP          RetAddr           Call Site
    00 00000000`04f6f008 00000000`01750bd0 sqlservr!NumaNode::RampUp
    01 00000000`04f6f010 00000000`012f1523 sqlservr!alloca_probe+0xbf420
    02 00000000`04f6f110 00000000`0103f136 sqlservr!BPool::Steal+0x3ec
    03 00000000`04f6f1c0 00000000`0103f32e sqlservr!SQLSinglePageAllocator::AllocatePages+0x25
    04 00000000`04f6f200 00000000`0103f1ce sqlservr!MemoryNode::AllocatePagesInternal+0xce
    05 00000000`04f6f4a0 00000000`0103f7ba sqlservr!MemoryClerkInternal::AllocatePages+0x6e
    06 00000000`04f6f500 00000000`010d4bb4 sqlservr!CMemThread::PbGetNewPages+0x6a
    07 00000000`04f6f540 00000000`01068bba sqlservr!CFixedPageMgr::PbAllocate+0xcd
    08 00000000`04f6f5b0 00000000`01004d37 sqlservr!CMemThread::Alloc+0x1d7
    09 00000000`04f6f640 00000000`0147dc3a sqlservr!operator new+0x26
    0a 00000000`04f6f680 00000000`01045589 sqlservr!SOS_Node::EnqueueTaskDirectInternal+0x587
    0b 00000000`04f6f730 00000000`01045b03 sqlservr!SOS_Node::EnqueueTask+0x49
    0c 00000000`04f6f780 00000000`01046581 sqlservr!CNetConnection::EnqueuePacket+0x8f
    0d 00000000`04f6f890 00000000`01045cab sqlservr!TDSSNIClient::ReadHandler+0x171
    0e 00000000`04f6f900 00000000`010459bc sqlservr!SNIReadDone+0xb7
    0f 00000000`04f6f960 00000000`010079fe sqlservr!SOS_Node::ListenOnIOCompletionPort+0x251
    10 00000000`04f6faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
    11 00000000`04f6fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
    12 00000000`04f6fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
    13 00000000`04f6fd00 00000000`014789ca sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
    14 00000000`04f6fda0 00000000`01478911 sqlservr!SystemThread::RunWorker+0x59
    15 00000000`04f6fde0 00000000`014795d9 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
    16 00000000`04f6fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
    17 00000000`04f6ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
    18 00000000`04f6ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84
    19 00000000`04f6ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a

So again, just by looking at the function names, we can make a reasonable guess that internal memory routines related to the buffer pool are running here, particularly from frames 02 and 03. If you ever attach a debugger to a running SQL Server instance (obviously never in production!) you'd see functions like this occurring regularly as SQL Server manages the buffer pool pages.

There's not much more that can be seen from this stack, so we'll take a look through the rest of the threads, in this case by running:

    ~* kn

When browsing through these threads, we notice many that also appear to be involved in memory-related operations, just one example of which is:

      53  Id: 554.a6c Suspend: 1 Teb: 000007ff`fffd9000 Unfrozen
    # Child-SP          RetAddr           Call Site
    00 00000000`1521e140 00000000`012f1523 sqlservr!BPool::ReplenishFreeList+0x10d
    01 00000000`1521e240 00000000`0103f136 sqlservr!BPool::Steal+0x3ec
    02 00000000`1521e2f0 00000000`0103f32e sqlservr!SQLSinglePageAllocator::AllocatePages+0x25
    03 00000000`1521e330 00000000`0103f1ce sqlservr!MemoryNode::AllocatePagesInternal+0xce
    04 00000000`1521e5d0 00000000`0103f676 sqlservr!MemoryClerkInternal::AllocatePages+0x6e
    05 00000000`1521e630 00000000`0103f737 sqlservr!CVarPageMgr::PviNewVarPage+0x56
    06 00000000`1521e6a0 00000000`0100ca58 sqlservr!CVarPageMgr::PbAllocate+0x3ce
    07 00000000`1521e710 00000000`01004d37 sqlservr!CMemObj::Alloc+0xcc
    08 00000000`1521e780 00000000`0109b203 sqlservr!operator new+0x26
    09 00000000`1521e7c0 00000000`014607a1 sqlservr!operator new[]+0x23
    0a 00000000`1521e800 00000000`0145d4c8 sqlservr!CErrorReportingManager::CwchFormatToBuffer+0x180
    0b 00000000`1521e970 00000000`0128d202 sqlservr!CErrorReportingManager::CwchCallFormatToBuffer+0x18
    0c 00000000`1521e9b0 00000000`0128d0f8 sqlservr!CLoginClientInfo::CLoginClientInfo+0x71
    0d 00000000`1521ea30 00000000`01285265 sqlservr!IssueLoginSuccessReport+0x67
    0e 00000000`1521eb40 00000000`0128494f sqlservr!login+0x584
    0f 00000000`1521f480 00000000`01285610 sqlservr!process_login+0xee
    10 00000000`1521f840 00000000`010079fe sqlservr!process_commands+0x40d
    11 00000000`1521faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
    12 00000000`1521fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
    13 00000000`1521fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4

    We note that sqlservr!BPool::ReplenishFreeList is common to a lot of the threads, and other BPool functions crop up frequently as well.

At this point there is not much else to be gleaned from the dump without private symbols; however, it's clear that a significant number of threads are waiting on what appear to be internal memory routines. In this case, we also have the fact that SQL Server itself has detected that it believes memory allocations were the cause of the problem, which is why it has dumped out the DBCC MEMORYSTATUS to help us. The error log specifically states that it has "no free buffers".

If you were to look through the whole of the DBCC MEMORYSTATUS output and decode some of the latter items using this article as a guide

    http://support.microsoft.com/kb/907877

then you would note that the buffer pool has recently been grown, and that it is issuing a notification that it wishes to grow further.

So what next, is the question. You've noted that SQL Server appears to be having memory management problems, so maybe you think you should add some more memory to alleviate this? Well, in a word - "NO".

    The answer to this problem is that we have to look at the configuration of the hardware, and the configuration of SQL Server first (which in any troubleshooting scenario is always a good idea).

    What we saw here at the time of the dump was:

    MemoryLoad = 95%                    
    Total Physical = 16382 MB           
    Available Physical = 792 MB    

And in fact over the minutes preceding the dump we saw even lower memory availability than this, with physical utilization up to 98%. (This is something that we can extract from the dump with private symbols.)

    So we do appear short of physical memory, but when we run sp_configure on the server, we see this:

    name                                minimum     maximum     config_value run_value
    ----------------------------------- ----------- ----------- ------------ -----------
    max server memory (MB)              16          2147483647  2147483647   2147483647

Therefore the max server memory setting has not been set. Although SQL Server will dynamically attempt to manage its memory, in large-memory installations where it runs unlimited it can end up fighting with other processes or the OS for memory, and situations like this one can occur. This is why we recommend setting "max server memory" in this type of scenario, to limit the size of the buffer pool to an amount lower than the total memory on the server itself. The most referenced article on this is probably here:

    http://blogs.msdn.com/slavao/archive/2006/11/13/q-a-does-sql-server-always-respond-to-memory-pressure.aspx

We see numerous cases revolving around this scenario and derivatives of it, and many can simply be avoided by limiting the memory of the SQL Server buffer pool with this setting, so that you never get into this memory-bound state. So the bottom line here is to change the max server memory setting to something around the 13000 mark, and then monitor your memory usage to see if you're still encountering this type of problem, making further adjustments if necessary. For example:
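Here is a minimal sketch of that change for this particular 16GB server (pick a value appropriate to your own server and keep monitoring afterwards):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 13000;
RECONFIGURE;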

    If you have any questions surrounding this topic, feel free to post them.

  • SQL in Stockholm

    SQL Server blocking caused by database mirroring wait type DBMIRROR_DBM_EVENT

    • 0 Comments

I was looking at a problem this morning about long blocking chains caused by the wait type DBMIRROR_DBM_EVENT, and it makes an interesting case study about database mirroring, blocking, undocumented wait types and general troubleshooting techniques for database mirroring. The information below relates ONLY to SQL 2005, SQL 2008 and SQL 2008 R2. Since the wait type is undocumented and not supported, it is not guaranteed to be relevant to versions after SQL 2008 R2, and considering the enhancements coming in this area of functionality, it may well be completely different.

First off, some thoughts about root cause analysis of database mirroring problems. We see quite a few of these and often people do not have the appropriate data with which to make a full analysis. Typical examples are unexpected failovers, slow performance and sometimes blocking. Obviously the more data you have the easier it is, but some of the key pieces of data for looking at DBM problems historically are these:

    The perfmon counters in the SQLServer:Database Mirroring and SQLServer:Service Broker objects.

    Default profiler trace files (people often forget these but they have very useful DBM state change audit information in them)

    SQL Server Error Logs

Windows event logs

backup of the msdb database (this contains useful metadata such as the size of every log backup and the DBM monitor metadata if you've enabled it)

    memory dumps (if any were produced)

Obviously if you've proactively collected MPSReports or SQLDIAG as well, then this will be no bad thing. Back to my specific scenario.....

    In this example the symptom was that a huge blocking chain built on a high volume OLTP database where all the head blockers were waiting on DBMIRROR_DBM_EVENT. Here is the sequence of events I went through:

1. Review the blocking chain itself - no help here, as all we can see is that we're waiting on DBMIRROR_DBM_EVENT

    2. Review the source for the undocumented wait type. Obviously you can't do this outside of MS, but I can say that at the time of writing this wait type represents the wait used when the principal is waiting for the mirror to harden an LSN, meaning that the transaction it's part of cannot commit. This immediately points quite specifically to the problem that the principal cannot commit transactions as it's waiting on the mirror. Now we need to investigate why the mirror is not committing transactions or why the principal doesn't know whether it is.

    3. Review the msdb system tables

(a) Look at the [backupset] table to see if the sizes of the logs produced at the time of the problem are significantly higher than normal. If they were exceptionally large, it may be that the mirror was flooded with transactions and simply could not keep up with the volume. This is why Books Online will sometimes tell you to disable mirroring if you need to do an exceptionally large logged operation such as an index rebuild (the reference for why is at http://technet.microsoft.com/en-us/library/cc917681.aspx). Here I used the following TSQL:

    SELECT backup_set_id,backup_start_date,database_name,has_bulk_logged_data,backup_size / 1000
    FROM [backupset]
    where backup_start_date between '2011-01-05 14:00:00' and '2011-01-05 19:30:00'
    go

    select round((AVG(backup_size)/1000),0)
    FROM [backupset]
    where database_name = 'mydatabase'

(b) Secondly I looked at the data in the [dbm_monitor_data] table. The key here is to locate the timeframe in which we had the problem and then see if we were experiencing significant changes in any of the following (a starting query is sketched after this list):

    log_flush_rate
    send_queue_size
    send_rate
    redo_queue_size
    redo_rate

These are all indicators, similar to part (a), that might show a component or piece of the architecture that wasn't responding. For example, if the send queue suddenly starts to grow but the redo queue doesn't, then it would imply that the principal cannot send the log records to the mirror, so you'd want to look at connectivity maybe, or the service broker queues dealing with the actual transmissions.
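As a starting point, here is a minimal sketch, assuming the mirroring monitor job has been enabled so that the table is actually being populated; narrow the results down to the problem timeframe and the counters listed above:

-- review the history collected by the database mirroring monitor
SELECT *
FROM msdb.dbo.dbm_monitor_data;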

    In this particular scenario we noted that all the counters appeared to have strange values, in that there were log backups going on of normal sizes, but there were no status changes, 0 send queue, 0 redo queue, a flat send rate and a flat redo rate. This is very strange as it implies that the DBM Monitor could not record any values from anywhere over the problem period.

4. Review the SQL Server error logs. In this case there were no errors or informational messages whatsoever, but in other scenarios such as this it’s very common for errors in the 1400 range to be reported, examples of which you can find elsewhere in my other mirroring posts, such as this Error 1413 example.

5. Review the default trace files – in this scenario I was not provided with the default traces, however they are fantastic sources of DBM problem information, as they record state change events on all the partners. This is documented here:

    Database Mirroring State Change Event Class

    This often gives you a great picture of scenarios such as when network connectivity failed between one or all of the partners and then what the state of the partnership became afterwards.

    CONCLUSIONS:

In this particular scenario I’m currently missing 2 key pieces of data, but that apart I can still make a reasonable hypothesis on the above information. We can certainly say that the blocking was caused by the fact that DBM was enabled, due to the blockers all waiting on the DBMIRROR_DBM_EVENT wait type. Since we know we didn’t flood the mirror with a large logged operation and that this deployment normally runs happily in this mode, we can exclude unusual large operations. This means that we have 3 potential candidates at this stage:

    1. Hardware problems on the connectivity between some or all of the partners.

2. CPU exhaustion on the mirror server – simply unable to keep up with redo – the CPU exhaustion could itself be from a process outside of SQL Server or outside of this mirror partnership.

    3. A problem with the mirroring code itself (we’d really need some memory dumps to confirm this though).

Based upon experience I’d suspect 1 or 2, but I always keep an open mind about 3 as well; we’re trying to collect some more data now to look at this problem in more detail.

As a final note, I noticed some similar scenarios on the forums, one of particular note where the poster said that everything worked fine until they enabled mirroring, at which point the system ground to a halt with this wait type. That is far more likely to be a non-optimal configuration where some section of the architecture simply cannot keep up with the volume of requests required by the partnership, and is a perfect example of why you must benchmark mirror deployments very carefully. In that scenario you should review this KB first: http://support.microsoft.com/kb/2001270

    Graham

  • SQL in Stockholm

    How to configure SQL Server database mirroring between 2 instances on a standalone server

    • 1 Comments

    There are lots of troubleshooting articles about how to diagnose connectivity problems when setting up database mirroring, especially errors such as these:

    Msg 1418, Level 16, State 1, Line 2

    The server network address "tcp://mirrortest.com:5023" can not be reached or does not exist. Check the network address name and that the ports for the local and remote endpoints are operational.

    or

    Msg 1456, Level 16, State 3, Line 1

    The ALTER DATABASE command could not be sent to the remote server instance

    but here's something a little different.

Imagine that you have a server which is not connected to a domain. It might just be a standalone machine, or it could be a virtual machine that you've just built. I've done this a few times in virtual environments to do demos on laptops and the like. It's not a production scenario that you'd really come across, but it works really well in test.

The question that usually comes up is how to configure the partners (and endpoints) without having an FQDN, since you're not connected to a domain, you're just in "WORKGROUP". SQL Server still wants you to use FQDN formatting for the partner addresses. When you're trying to configure it, you might even have encountered an error like this when trying to start mirroring (depending on exactly how you did the setup):

    Database Mirroring login attempt failed with error: 'Connection handshake failed.

    There is no compatible authentication protocol. State 21

    What you need to do to be able to set this up is as follows:

    1. adding "local" as the primary DNS suffix as in detailed this article here (ok it's about exchange server but it works the same)

    http://technet.microsoft.com/en-us/library/aa997119.aspx

    (remember to reboot)

    2. You can now specify FQDN formatted names for the local machine in the format:

    For the principal:

    TCP://MIRRORTEST.LOCAL:5022

    For the mirror

    TCP://MIRRORTEST.LOCAL:5023

    (Or other port numbers of your choice, but remember that they'll need to be different)

3. You'll need to add these ports as exceptions in Windows Firewall if you're running it (I was, even though it was a local machine!)

4. My endpoints are set up like this (I just did them through the GUI); there's nothing exceptional here:

CREATE ENDPOINT [Mirroring]
AUTHORIZATION [MIRRORTEST\Administrator]
STATE=STARTED
AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
FOR DATABASE_MIRRORING (ROLE = PARTNER, AUTHENTICATION = WINDOWS NEGOTIATE, ENCRYPTION = DISABLED)
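With the endpoints in place, the partners can then be pointed at each other using the .LOCAL names. A minimal sketch, assuming a test database called mirrortest that has already been restored WITH NORECOVERY on the mirror instance:

-- on the mirror instance (listening on 5023), point at the principal:
ALTER DATABASE mirrortest SET PARTNER = 'TCP://MIRRORTEST.LOCAL:5022';

-- then on the principal instance (listening on 5022), point at the mirror:
ALTER DATABASE mirrortest SET PARTNER = 'TCP://MIRRORTEST.LOCAL:5023';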

    You can now run some cool tests and demos without ever being connected to a network.

  • SQL in Stockholm

    SQL Server 2005 Error: 701, Severity: 17, State: 193

    • 1 Comments

Here’s another interesting example which we worked on recently in the Stockholm office, and which also relates back to one of my previous entries. We had a customer whose error log was filled with a variety of critical errors such as these:

    Error: 701, Severity: 17, State: 193. There is insufficient system memory to run this query.

    Error: 1204, Severity: 19, State: 2. The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

    Error: 17803, Severity: 20, State: 13. There was a memory allocation failure during connection establishment. Reduce nonessential memory load, or increase system memory. The connection has been closed. [CLIENT: <local machine>]

This was a fairly high-end system with 64GB RAM and 16 logical CPUs, and it was generally well configured in terms of its max server memory and other basic sp_configure settings. As you may well be aware, there are innumerable issues that can cause you to receive a memory-related error such as this, and some of the more basic troubleshooting tips are shown in the following MSDN article:

    http://msdn.microsoft.com/en-us/library/aa337311(SQL.90).aspx

    However in this case none of the basics were really of any help. Typical starting questions to ask yourself in this scenario are:

1. Is there a pattern to when this occurs?

2. Do I know of any particular workloads which are running when I see this problem?

…just to understand whether you have any particular potential causes in your own code. Our customer in this scenario was not aware of any, and unfortunately we didn’t have any dumps created which might have allowed us to see retrospectively what batches were running when the problem occurred.

However we did have the partial output from DBCC MEMORYSTATUS which SQL Server had kindly written to the error log for us. An examination of this showed the following suspicious looking section:

    OBJECTSTORE_LOCK_MANAGER (node 2)
        VM Reserved = 0 KB
        VM Committed = 0 KB
        AWE Allocated = 0 KB
        SM Reserved = 0 KB
        SM Committed = 0 KB
        SinglePage Allocator = 29398856 KB
        MultiPage Allocator = 0 KB

    Global Memory Objects:
            Resource=    709
            Locks=    3676149
            SE Schema Mgr=    1646
            SQLCache=    261
            Replication=    2
            ServerGlobal=    109
            XPGlobal=    2
            Xact=    101
            SETLS=    16
            DatasetMemObjs=    32
            SubpDescPmos=    16
            SortTables=    2

    From this we can infer that we seemed to experience an equivalent problem to the one I wrote about here

    http://blogs.msdn.com/grahamk/archive/2009/02/03/compilation-bottlenecks-error-8628-severity-17-state-0-part-1.aspx

    although manifesting distinctly different symptoms.

At the end of the day one of the NUMA nodes has taken up about 29GB of memory just managing lock resources, and this is confirmed by the number of locks which are reported in the global memory objects. Remember that the majority of the output from DBCC MEMORYSTATUS is publicly documented here:

    How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005

    http://support.microsoft.com/kb/907877

This 29GB of memory purely managing locks appears to be placing an undue load on the memory resources of the system, and what we appear to have here is a runaway or huge batch of some type, or maybe a very large open transaction.

The way to resolve this problem is to identify the batch and stop it from running. One good way to identify this problem proactively might be to use the tiny DMV query I wrote in my previous post about this type of situation and attach a monitor to it, warning you if the value goes above, say, 1 million, or some other high number of your choice.

    select request_session_id, count(*)
    from sys.dm_tran_locks
    group by request_session_id
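As a variation, here is a minimal sketch of the same query filtered so that it only returns sessions holding an unusually large number of locks (the threshold is just an example):

select request_session_id, count(*) as lock_count
from sys.dm_tran_locks
group by request_session_id
having count(*) > 1000000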

    You could then easily identify the problem session_id before it got completely out of hand. There are of course many different alternative ways of capturing such data or doing such monitoring. Maybe you’d prefer to create a custom SQLDIAG package to monitor your server in more detail.

The moral of this story, though, is that even in the most critical system failures such as this, there may well be enough information to get a hint of where the problem lies, if you make sure you go through the error logs in detail. Remember not to be overwhelmed by the sheer volume of errors that might be reported; go through all the lines, removing the duplicates, to see whether something of interest is hidden amongst all the noise.

  • SQL in Stockholm

    SQL Server Database Mirroring error 1443 - Connection Handshake Failed

    • 0 Comments

    Here's another error that I picked from the forums today, which I see a fair number of occurrences of, both in the public forums on the web and internally in MS support as well.

    You attempt to enable database mirroring and you receive the following:

    Database Mirroring login attempt failed with error: 'Connection handshake failed. An OS call failed: (c0000413) 0xc0000413(error not found). State 67.'.

    Error: 1443, Severity: 16, State: 2.

    If you translate error 0xc0000413 you come up with the following:

    STATUS_AUTHENTICATION_FIREWALL_FAILED

This may seem slightly strange, as you may not have a firewall in between these servers (well, you might, but even if you do this is not the problem; if your firewall were blocking the traffic you would normally encounter error 1418). In truth this error comes from somewhere different. If you are experiencing this error, you are most likely running mirroring with one of the following configurations:

    1. Using domain user logins for the service accounts, but different ones across the mirror partners

2. Using multiple domains, as per the forum poster that I saw today.

The chances are that you have not explicitly created logins and granted them CONNECT permission on the endpoints, as per these 2 articles:

    http://msdn.microsoft.com/en-us/library/ms178029(SQL.90).aspx

    http://msdn.microsoft.com/en-us/library/ms179306(SQL.90).aspx

    If you are running in the same domain, with the same domain account as the SQL Server service account, then this is not required, but if you want to do your authentication in a slightly more complex way, then these are the articles for you.
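As an illustration of what those articles describe (the domain, account and endpoint names below are placeholders), here is a minimal sketch to run on each partner for the service account of the other partner:

CREATE LOGIN [OTHERDOMAIN\sqlserviceaccount] FROM WINDOWS;
GRANT CONNECT ON ENDPOINT::Mirroring TO [OTHERDOMAIN\sqlserviceaccount];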

    Alternatively of course you could run mirroring using certificates, if for some reason the above solution didn't fit in with your security policies.

    http://technet.microsoft.com/en-us/library/ms191477(SQL.90).aspx

This is also valid if you want to run your service accounts outside of a domain environment.

    Graham

  • SQL in Stockholm

    Slow recovery times and slow performance due to large numbers of Virtual Log Files

    • 1 Comments

In my previous post about database mirroring configuration problems caused by large numbers of virtual log files, I mentioned that such a scenario can have other unwanted side effects, with or without database mirroring being enabled. Here’s an example that demonstrates one of these scenarios. It shows how recovery times after a disaster can be significantly extended by allowing your log file to grow in small increments, thereby creating thousands of VLFs. It’s worth noting as well that this scenario also demonstrates the performance impact of doing online file growth during periods of heavy activity. This scenario can affect you with or without mirroring enabled; it’s just quicker to demonstrate by configuring mirroring.

I mentioned before that to fully understand this topic you need to be familiar with Transaction Log Physical Architecture and also Factors That Can Delay Log Truncation.

    There is one simple lesson to be learnt here, so I’ll write it early on in the article:

    Pre-grow your log files (and your data files) in advance. Avoid repeated online auto-growth if at all possible.
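For an existing database the same principle applies: grow the log to its full planned size in one step rather than letting it auto-grow repeatedly. A minimal sketch, using the vlf example database from the scripts below:

-- grow the existing log file to its full planned size in a single operation
ALTER DATABASE vlf
MODIFY FILE (NAME = vlf_log, SIZE = 5GB);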

    Consider the following example scripts which are based upon the same routines as used in the previous post, but with some small changes.

These scripts perform the following actions, one set for the good scenario and one for the bad scenario.

    1. Create a database named VLF

    2. Create a starting backup point

    3. Open a transaction in this database to ensure that the transaction cannot be truncated

    4. Fill up the log with thousands of transactions until the log reaches the space limit set for it – in this case we test with a 5GB log. The test of the log running out of space is used just to guarantee log files of equivalent sizes for the recovery time comparison. This situation holds true for logs that haven’t run out of space.

5. Restart the server without taking a backup or dealing with the log space. This forces SQL Server to recover the entire contents of the active portion of the log, which is, in effect, all of it.

    We then just start the SQL Server service again and monitor the recovery times and look in the SQL Server error logs.

    We notice a number of interesting points here:

1. The runtime required to fill up the 5GB log file is significantly shorter when using the pre-grown log file: less than 2 minutes for the pre-grown log file compared to over 9 minutes for the incrementally auto-grown file. This demonstrates why auto-growths can be bad. They are expensive to manage and can harm your performance. The fact that the log fills quicker indicates that you are managing to push through the transactions faster.

    2. There is a delay of about 2 minutes when nothing is logged about the recovery phases of the VLF database in the bad scenario. This is the period in between the database starting up and the analysis phase beginning. In reality there is a large amount of work going on here; SQL Server is processing the 20000 VLFs that are present in the log file. This delay can be extended in proportion to the number of VLFs that you allow your log file to hold. I have seen real world examples where hundreds of thousands of VLFs have significantly delayed a recovery time.

3. The actual recovery times are similar (about 2 minutes) once the recovery actually begins. These recovery times are this length because we created the old open transaction in the test, meaning that SQL is forced to recover the entirety of the log.

4. As well as filling up the log significantly faster (in terms of runtime), which indicates that the transactions are being processed faster, when we look at the log in detail we see that we processed about 6500 more transactions in the shorter period. This is further evidence of how much better performance the pre-grown log file gives you. We’ve managed to process 25% more transactions in a quarter of the time!

The scripts and error log output are shown below. The scripts use the undocumented DBCC LOGINFO command to examine the number of virtual log files in the physical log. Remember, as with all undocumented commands, its use is not supported and you use it at your own risk. The number of rows that this command outputs represents the number of VLFs currently in the physical log.

    SCRIPT 1 – Bad scenario

--CONNECT to the test server instance
use master
go

if exists (select name from sys.databases where name = 'vlf')
begin
drop database vlf --(if you need to)
end
go

--create a database on the root of C with a small starting log size
--which then grows in small increments
create database vlf
on (name=vlf, filename='c:\vlf.mdf')
log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1MB,maxsize=5GB,filegrowth=1MB)
go

use vlf
go

--create a dummy table to fill up
create table vlf_test
(
a int identity(1,1) not null,
b char(8000) not null,
c int not null
)
go

insert vlf_test values ('vlftest',0)
go

--create a starting point of backups
backup database vlf to disk = 'c:\vlf.bak' with stats = 1
go
backup log vlf to disk = 'c:\vlf1.bak'
go

--open a 2nd new connection and open an explicit uncommitted transaction
--to ensure that the transaction log cannot truncate
use vlf
go

begin tran
create table opentran
(
opentran int
)
go

--switch back to the first connection
--now run the following to grow the log to create lots of VLFs
--run until the log runs out of space - a convenient stopping point
use vlf
go
set nocount on
go

declare @x int
declare @a char(8000)
declare @b char(8000)
set @a = replicate('a',8000)
set @b = replicate('b',8000)
set @x = 1
while 1=1
begin
update vlf_test set b = @b, c = @x where a = 1
update vlf_test set b = @a where a = 1
set @x = @x + 2
end
go

--let this run until the log runs out of space
--on the 2nd connection, use dbcc loginfo to check how many VLFs you have
DBCC LOGINFO (vlf)

--when you have reached 5GB of log file with several thousands VLFs
--simulate a server disaster by cutting the power to the principal server
--alternatively you could just power off your server ungracefully
shutdown with nowait

--now restart the SQL service or power on the server

    I get the following results from the error log when I run this on my test server.

    Runtime of filling the log: 9 minutes 21 seconds

    Number of VLFs when log full: 20480

    Time for VLF to recover fully: 3 minutes 40 seconds

    Excerpts from the SQL Server log:

    Server starts:

    2008-05-16 13:22:40.85 Server Microsoft SQL Server 2005 - 9.00.3186.00 (X64)

    VLF db starts up

    2008-05-16 13:22:47.97 spid18s Starting up database 'vlf'.

    A 2 minute pause follows during which nothing is logged about the vlf database, then analysis starts. It is during this period that SQL Server is processing the large number of VLFs.

    2008-05-16 13:24:34.28 spid18s Analysis of database 'vlf' (19) is 3% complete (approximately 8 seconds remain). This is an informational message only. No user action is required.

    Redo has started

    2008-05-16 13:24:45.80 spid18s Recovery of database 'vlf' (19) is 3% complete (approximately 108 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

    Note how many transactions have been rolled forward – this is how many we did in the test

    2008-05-16 13:26:21.36 spid18s 26533 transactions rolled forward in database 'vlf' (19). This is an informational message only. No user action is required.

    Undo has started

    2008-05-16 13:26:21.39 spid18s Recovery of database 'vlf' (19) is 99% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

    Recovery complete

    2008-05-16 13:26:21.49 spid5s Recovery is complete. This is an informational message only. No user action is required.

    SCRIPT 2 – Good scenario

    The script is identical to the bad scenario shown above, apart from the fact that we create the log at its full 5GB size up front, using this command:

    create database vlf
    on (name=vlf, filename='c:\vlf.mdf')
    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=5GB,maxsize=5GB,filegrowth=1MB)

    This gives the following results on the identical server.

    Runtime of filling the log: 1 minute 54 seconds

    Number of VLFs when log full: 16

    Time for VLF to recover fully: 2 minutes 2 seconds

    Excerpts from the SQL Server log:

    Server starts:

    2008-05-16 13:40:16.10 Server Microsoft SQL Server 2005 - 9.00.3186.00 (X64)

    VLF db starts up

    2008-05-16 13:40:22.64 spid18s Starting up database 'vlf'.

    A 6 second pause follows during which nothing is logged about the vlf database, then analysis starts. It is during this period that SQL Server is processing the VLFs, but because there are only 16 in this case, it completes much more quickly. This is the fundamental difference between the examples.

    2008-05-16 13:40:28.99 spid18s Analysis of database 'vlf' (19) is 2% complete (approximately 157 seconds remain). This is an informational message only. No user action is required.

    Redo has started

    2008-05-16 13:40:44.22 spid18s Recovery of database 'vlf' (19) is 4% complete (approximately 106 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

    Note how many transactions have been rolled forward – this is how many we did in the test

    2008-05-16 13:42:18.69 spid18s 33216 transactions rolled forward in database 'vlf' (19). This is an informational message only. No user action is required.

    Undo has started

    2008-05-16 13:42:18.69 spid18s Recovery of database 'vlf' (19) is 99% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

    Recovery complete

    2008-05-16 13:42:18.89 spid5s Recovery is complete. This is an informational message only. No user action is required.

    I hope this demo gives further weight to the case for pre-growing your SQL Server files in advance.
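
    For completeness, pre-growing is simply a matter of giving the file its full working size up front, either at creation time or afterwards; a minimal sketch (database names, paths and sizes are placeholders to adapt):

    --size the log correctly when the database is created
    create database mydb
    on (name=mydb, filename='c:\mydb.mdf', size=500MB)
    log on (name=mydb_log, filename='c:\mydb_log.ldf', size=5GB, filegrowth=512MB)
    go
    --or grow an existing log to its expected working size in one operation
    alter database mydb modify file (name=mydb_log, size=5GB)
    go

    Feel free to post a comment if you have a question.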

  • SQL in Stockholm

    SQL Server Error: 7320 Cannot execute the query "xxx" against OLE DB provider "DB2OLEDB" for linked server "xxx"

    • 0 Comments

    Here's a little something that my colleagues and I were looking at this week. Our customer was executing queries against a DB2 linked server using the latest provider and occasionally got these errors. One of my colleagues suggested setting a dump trigger on error 7320 so that we could capture a sqlservr.exe dump when the problem next occurred. We achieved this by running

    dbcc dumptrigger('SET',7320)

    on the server. By default this will collect a mini-dump, unless you have previously enabled any other trace flags. When we next encountered the error and analyzed the dump, it showed the following stack

    sqlservr!ex_raise+0x52
    sqlservr!COledbError::HandleExecuteError+0x11b
    sqlservr!COledbRangeRowset::FDoPerformSchemaCheckForQuery+0xc7
    sqlservr!COledbRangeRowset::FPerformSchemaCheck+0x1fb
    sqlservr!COledbRangeRowset::FCheckSchema+0x2a
    sqlservr!CEnvOledbRange::XretSchemaChanged+0x16
    sqlservr!CXStmtDML::XretDMLExecute+0x41037b
    sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0xace
    sqlservr!CMsqlExecContext::FExecute+0x439
    sqlservr!CSQLSource::Execute+0x355
    sqlservr!CStmtExecStr::XretExecStrExecute+0x48d
    sqlservr!CXStmtExecStr::XretExecute+0x20
    sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x39e
    sqlservr!CMsqlExecContext::FExecute+0x439
    sqlservr!CSQLSource::Execute+0x355
    sqlservr!CStmtExecProc::XretLocalExec+0x125
    sqlservr!CStmtExecProc::XretExecExecute+0x132
    sqlservr!CXStmtExec::XretExecute+0x20
    sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x39e
    sqlservr!CMsqlExecContext::FExecute+0x439
    sqlservr!CSQLSource::Execute+0x355
    sqlservr!process_request+0x312
    sqlservr!process_commands+0x3fd
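
    As a side note, once the dump has been captured you will want to remove the trigger again. dumptrigger is undocumented and unsupported, so verify this on your own build first, but to the best of my knowledge the corresponding clear command looks like this:

    dbcc dumptrigger('CLEAR', 7320)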

     

    Before going into any in-depth debugging, which would involve monitoring the remote server and the components in between, my colleague thought it would be sensible to check the settings surrounding schema validation, as we were obviously throwing the error in a routine related to this functionality (FDoPerformSchemaCheckForQuery). So we changed the linked server option to enable "Lazy Schema Validation", using this syntax:

    exec sp_serveroption xxxxx, 'lazy schema validation', true

    If you look this option up in MSDN under "Configuring OLE DB Providers for Distributed Queries" you'll note the following regarding this option:

    "If this option is set to false, the default value, SQL Server checks for schema changes that have occurred since compilation in remote tables. This check occurs before query execution. If there is a change in the schema, SQL Server recompiles the query with the new schema.

    If this option is set to true, schema checking of remote tables is delayed until execution. This can cause a distributed query to fail with an error if the schema of a remote table has changed between query compilation and execution.

    You may want to set this option to true when distributed partitioned views are being used against a linked server that is running SQL Server. A specified table that participates in the partitioned view may not be actually used in a specific execution of a query against the view. Therefore, deferring the schema validation may improve performance."
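
    To double-check that the option has taken effect, you can query the linked server metadata; a quick sketch (assuming the lazy_schema_validation column in the sys.servers catalog view, with a placeholder linked server name):

    select name, is_linked, lazy_schema_validation
    from sys.servers
    where name = 'xxxxx'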

    In all truth this was quite a pragmatic approach: we hadn't fully debugged the problem, we were just taking an educated guess that we were experiencing a problem surrounding schema validation. So far, since implementing this option, the errors have not re-occurred, so we've managed to work around the problem.

    For reference, the customer in question was also experiencing other errors in the 7300 to 7350 range, which have also stopped occurring since this option was set. Errors such as

    Error: 7350 Cannot get the column information from OLE DB provider "DB2OLEDB" for linked server "xxx"

    Error: 7321 An error occurred while preparing the query "xxx" for execution against OLE DB provider "DB2OLEDB" for linked server "xxx"

    Error: 7303 Cannot initialize the data source object of OLE DB provider "DB2OLEDB" for linked server "xxx"

    The moral of this story is that linked servers can be very difficult and time consuming to troubleshoot, due to the number of distributed components involved, including ones completely outside the SQL Server area. This is one time where I am a fan of testing some of the different provider settings to see if the problem can be worked around initially, before doing any in-depth analysis. In most other cases I'd far prefer to know why, and we may yet revisit this particular situation to fully debug it, but sometimes a workaround and an error-free production environment is also nice for a quick win.

    For future reference, another interesting quick diagnostic that can be performed on linked servers is to run Process Monitor on the server in question to see whether you are hitting any obscure permission or registry problems. I've seen this give some quick wins as well.

  • SQL in Stockholm

    1413 error when starting Database Mirroring – How many virtual log files is too many?

    • 1 Comments

    Recently in Cumulative update package 6 for SQL Server 2005 Service Pack 2 we released a fix for a problem where you receive a 1413 error when starting database mirroring. The error that you get looks like this:

    Error: 1413, Severity: 16, State: 1.

    Communications to the remote server instance 'TCP://ComputerName:PortNumber' failed before database mirroring was fully started. The ALTER DATABASE command failed. Retry the command when the remote database is started.

    In addition to this, if you look in the error logs themselves you will normally see the following 2 errors reported as well:

    Error: 1443, Severity: 16, State: 2.

    Database mirroring has been terminated for database 'xxx'. This is an informational message only. No user action is required.

    And you also note:

    Error: 1479, Severity: 16, State: 1.

    The mirroring connection to "TCP://ComputerName:PortNumber" has timed out for database "xxx" after 10 seconds without a response. Check the service and network connections.

    Although you can install the hotfix, I think it’s worth considering why this problem actually occurs and how you can potentially avoid it happening at all through good database management practices, which have benefits far and beyond the configuration of database mirroring.

    Hopefully you are familiar with the architecture of a SQL Server transaction log file. In principle, the primary reason that this 1413 error and its associated timeout errors occur is that when you start database mirroring, SQL Server needs to read through the transaction log file to bring the 2 databases into a synchronized state, following the restore of the last transaction log backup to the mirror partner. The time it takes to do this depends upon the number of virtual log files contained within the physical log file. This sometimes leads to confusion, as a relatively small physical log file can contain thousands of virtual log files if it has been badly managed. This in turn means that a relatively small physical log file (say 1GB) can be the cause of this 1413 error, while the exact same database configured for mirroring on the exact same hardware will set up successfully when it has an identically sized physical log that has been better managed and doesn't contain so many virtual log files.

    To demonstrate this, try the following reproduction examples. To run these you’ll need 2 SQL Server instances with mirroring endpoints already set up (I’ll assume if you’re reading this far that you already understand this). A good place to try this repro is a test server environment where you already have a mirroring partnership configured. The repros use the undocumented DBCC LOGINFO command to examine the number of virtual log files in the physical log. Remember as with all undocumented commands, use of it is not supported, and you use it at your own risk.

    The example uses a common scenario where a log has been left to grow and runs out of space, but actually there are many different ways that you can cause your log to grow. Some are documented here. This is just a quick way to run the demo.

    Script 1: produces the 1413 failure – note the thousands of VLFs

    --CONNECT to the primary server instance
    use master
    go

    drop database vlf --(if you need to)

    --create a database on the root of C with a small starting log size
    --which then grows in small increments
    create database vlf
    on (name=vlf, filename='c:\vlf.mdf')
    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1MB,maxsize=1GB,filegrowth=1MB)
    go

    use vlf
    go

    --create a dummy table to fill up
    create table vlf_test
    (
    a int identity(1,1) not null,
    b char(8000) not null,
    c int not null
    )
    go

    insert vlf_test values ('vlftest',0)
    go

    --create a starting point of backups
    backup database vlf to disk = 'c:\vlf.bak' with stats = 1
    go
    backup log vlf to disk = 'c:\vlf1.bak'
    go

    set nocount on
    go

    --grow the log until it runs out of space
    declare @x int
    declare @a char(8000)
    declare @b char(8000)
    set @a = replicate('a',8000)
    set @b = replicate('b',8000)
    set @x = 1
    while 1=1
    begin
    update vlf_test set b = @b, c = @x where a = 1
    update vlf_test set b = @a where a = 1
    set @x = @x + 2
    end
    go

    --now alter the log size and backup the log
    alter database vlf modify file (name=vlf_log, maxsize=2GB)
    go
    backup log [vlf] to disk = 'c:\vlf2.bak' with stats=1
    go

    --use the undocumented DBCC LOGINFO command to check the number of VLFs in our log
    dbcc loginfo (vlf)
    go
    --note the number of VLFs, it should be several thousand
    --for me it was 4100

    --connect to the other server which you wish to be the mirror
    --copy the 3 backup files across to its C drive and run the following
    use master
    go

    restore database [vlf] from disk = 'c:\vlf.bak' with norecovery, replace, file=1,
    move 'vlf' to 'c:\vlf.mdf',
    move 'vlf_log' to 'c:\vlf_log.LDF'
    go
    restore log [vlf] from disk = 'c:\vlf1.bak' with norecovery
    go
    restore log [vlf] from disk = 'c:\vlf2.bak' with norecovery, stats=1
    go

    --you now have a pair of databases ready to have mirroring started
    --try to start database mirroring using the SSMS GUI and you will receive the 1413 error
    --the only exception to this might be if you run it on particularly fast hardware
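
    If you prefer T-SQL to the SSMS GUI for starting mirroring, a minimal sketch looks like this (the server FQDNs and port are placeholders; the command must be run on the mirror first, and in the bad scenario it is the ALTER DATABASE on the principal that fails with the 1413 error):

    --on the mirror server
    alter database vlf set partner = 'TCP://principalserver.domain.com:5022'
    go
    --on the principal server
    alter database vlf set partner = 'TCP://mirrorserver.domain.com:5022'
    go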

    Script 2: works fine - identical except that the log has been pre-grown to the same size. This is good practice for all database files, whatever their type.

    --CONNECT to the primary server instance
    --remember to delete any old backup files of the same name from previous tests
    use master
    go

    drop database vlf --(if you need to)

    create database vlf
    on (name=vlf, filename='c:\vlf.mdf')
    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1GB,maxsize=1GB,filegrowth=250MB)
    go

    use vlf
    go

    create table vlf_test
    (
    a int identity(1,1) not null,
    b char(8000) not null,
    c int not null
    )
    go

    insert vlf_test values ('vlftest',0)
    go
    backup database vlf to disk = 'c:\vlf.bak' with stats = 1
    go
    backup log vlf to disk = 'c:\vlf1.bak'
    go
    set nocount on
    go

    --this will grow the log until it runs out of space
    declare @x int
    declare @a char(8000)
    declare @b char(8000)
    set @a = replicate('a',8000)
    set @b = replicate('b',8000)
    set @x = 1
    while 1=1
    begin
    update vlf_test set b = @b, c = @x where a = 1
    update vlf_test set b = @a where a = 1
    set @x = @x + 2
    end
    go

    --now alter the log size and backup the log
    alter database vlf modify file (name=vlf_log, maxsize=2GB)
    go
    backup log [vlf] to disk = 'c:\vlf2.bak' with stats=1
    go

    --use the undocumented DBCC LOGINFO command to check the number of VLFs in our log
    dbcc loginfo (vlf)
    go
    --note the number of VLFs, it should be only about 16 or so

    --connect to the other server which you wish to be the mirror
    --copy the 3 backup files across to its C drive

    restore database [vlf] from disk = 'c:\vlf.bak' with norecovery, replace, file=1,
    move 'vlf' to 'c:\vlf.mdf',
    move 'vlf_log' to 'c:\vlf_log.LDF'
    go
    restore log [vlf] from disk = 'c:\vlf1.bak' with norecovery
    go
    restore log [vlf] from disk = 'c:\vlf2.bak' with norecovery, stats=1
    go

    --you now have a pair of databases ready to have mirroring started
    --try to start database mirroring using the SSMS GUI and it should work
    --the only exception to this might be if you run it on particularly slow hardware

    I ran my test on 2 small single processor servers. You might get slightly different results as the scripts note, depending upon your exact configuration. I used SQL Server 2005 build 3186 (CU#3) for this test.

    My main point here is that you should look to control your transaction log proactively. There are obviously times when this is not possible, or other mitigating circumstances arise, which is why we have released a hotfix to improve the speed with which the database mirroring configuration reads the log files. This improvement is also included in SQL Server 2008. However, there are other issues that you can encounter if you let your virtual log files grow unchecked, such as excessive recovery times, so as with all SQL Server file management tasks, proactive is best. We always recommend that our customers analyze their expected usage and pre-grow their database files.
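
    As an illustration of what proactive log management can look like, here is a minimal sketch of reducing the VLF count of an existing, badly fragmented log and then pre-growing it back to its working size in one step (the database name, logical file name and sizes are placeholders, and on a production system this needs to be scheduled with care):

    --back up the log so that it can be shrunk (full recovery model assumed)
    backup log [mydb] to disk = 'c:\mydb_log.bak'
    go
    --shrink the log file, collapsing the unused VLFs
    use [mydb]
    go
    dbcc shrinkfile (mydb_log, 100) --target size in MB
    go
    --grow it back to its working size in one operation,
    --which creates a small, predictable number of VLFs
    alter database [mydb] modify file (name = mydb_log, size = 5GB)
    go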

    I hope this helps to explain the database mirroring error in more depth, but feel free to post a comment if you have a question.

  • SQL in Stockholm

    SQL Server Compilation Bottlenecks – Error: 8628, Severity: 17, State: 0. – Part 1

    • 2 Comments

    [alternative title: Memory Troubleshooting using DBCC MEMORYSTATUS]

    The following samples are taken from a live problem I was working on for most of last week. The steps might well help you resolve an occurrence of the above error, but the methodology might help you with some general performance and memory troubleshooting as well, hence the alternative title.

    Symptoms:

    • A previously well running server starts exhibiting extremely slow performance for all users during the online day.
    • CPU time is maxed out at 100%
    • The issue is intermittent and has no correlation to batch business processing or periods of peak usage

    Initial Actions:

    • Configure a PSSDIAG (it could just as easily have been SQLDIAG) to collect the perfstats script and perfmon counters
    • Don't include a profiler trace to begin with; the server is already maxed out on CPU and you might skew the results or make the situation worse, especially as, going into the problem, you don't actually know what it is related to
    • Configure the collection for long term running, as you don’t know when the error will next occur

    Initial Results

    • When reviewing the error logs you note that Error: 8628, Severity: 17, State: 0. was logged repeatedly when the problem occurred in the past.
    • RESOURCE_SEMAPHORE_QUERY_COMPILE is the top aggregate wait stat in sys.dm_os_wait_stats (see the query sketch after this list)
    • The server was clustered and a failover occurred when the problem last happened
    • No other items of note were recorded in the past
    • When the issue is not occurring everything looks fine on the server, there are no indications of dangerously long running queries or memory pressure etc.
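
    For reference, a minimal sketch of the kind of query used to review the aggregate waits (remember that sys.dm_os_wait_stats is cumulative since the instance last started, or since the stats were manually cleared):

    select top 10 wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    from sys.dm_os_wait_stats
    order by wait_time_ms desc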

    This is a common scenario that I come across. You know something bad has happened, you can't reproduce it on demand, but you suspect it will happen again. At this point I always like to do the above: review the general health of the server with PSSDIAG / SQLDIAG and then do another round of monitoring, once I know what I'm looking for. In this case I'm targeting the 8628 error as it seems pretty conclusively tied to the outage.

    One of the principal hits you get if you search microsoft.com sites for error 8628 is the following page

    How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005

    http://support.microsoft.com/kb/907877

    and it points out that this error is thrown when there are too many queries trying to compile simultaneously, and that the gateway structures are throttling the compilations. You can read more background about this theory in the above article and there’s also some reference to it in the following white paper

    Troubleshooting Performance Problems in SQL Server 2005

    http://technet.microsoft.com/en-gb/library/cc966540.aspx

    which includes an excellent section on internal physical memory pressure.

    So now we’re on the track of the problem. We reconfigure the PSSDIAG to include a script which runs DBCC MEMORYSTATUS regularly so that we can be sure to capture this output when the problem next occurs. We also make sure that we keep a copy of the data from this query when the system is running well, so that we can have a baseline to compare against. At this point I still didn’t have a profiler running as I didn’t want to have to manage the disk space and have the performance overhead, and the perfstats script add-in gives you so much information anyway.

    So the problem eventually occurred again, and here are some key outputs from the DBCC MEMORYSTATUS related to the compilation gateways:

    Baseline:

    Procedure Cache                Value     
    ------------------------------ -----------
    TotalProcs                            7614
    TotalPages                          570060
    InUsePages                           32736

    Small Gateway                  Value              
    ------------------------------ --------------------
    Configured Units                                 32
    Available Units                                  32
    Acquires                                          0
    Waiters                                           0
    Threshold Factor                             500000
    Threshold                                    500000

    Problem Occurring:

    Procedure Cache                Value
    ------------------------------ -----------
    TotalProcs                             397
    TotalPages                           66888
    InUsePages                           44924

    Small Gateway                  Value
    ------------------------------ --------------------
    Configured Units                                 32
    Available Units                                   0
    Acquires                                         32
    Waiters                                          59
    Threshold Factor                             500000
    Threshold                                    500000

    So what we can see here is that we are (as expected) encountering a big bottleneck on the gateways (the small one in this case), and at this moment in time 59 compilations are queued. We can also note that the reason this appears to be happening is that the procedure cache is significantly smaller than its normal size, about 85% smaller in fact.

    This, therefore, is why so many queries are suddenly trying to compile. For reasons unknown at this point, 85% of the cached plans have been thrown out of the cache, and so the many thousands of connections on this system are now all having to compile new plans.

    Now you might be thinking that sometimes you see this type of behaviour when the cache is flushed. This is true, but in this case that wasn't the problem. We sat and monitored the output above for quite a while; there were constant compilation queues but the cache was not growing. More to the point, there were no cache flush events in the error logs, and we had not performed any maintenance operations which cause a cache flush. I talked about a classic example of this behaviour in this post.

    So we need to look elsewhere, and ask ourselves the question, “Why is my procedure cache 85% of its normal size?”

    That’s for my next post.

  • SQL in Stockholm

    Non-yielding Scheduler, IO Warnings and Cluster Failover

    • 0 Comments

    Here's another example of SQL Server producing a dump file which can help you diagnose critical problems. The scenario here is that a clustered SQL Server 2005 instance fails over and we need to investigate why.

    A good place to start with such problems (as with many problems) is to look into the SQL Server error logs, specifically the log leading up to the failure (this will not be the most recent error log, as the SQL Server service restarted when it failed over to the other node). If we do that we see the following

    2008-10-10 08:43:34.72 spid2s      SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Z:\DATA\MSSQL.1\MSSQL\DATA\tempdb.mdf] in database [tempdb] (2).  The OS file handle is 0x000006e3.  The offset of the latest long I/O is: 0x00000000b80000
    2008-10-10 08:45:44.72 spid2s      SQL Server has encountered 5 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Z:\DATA\MSSQL.1\MSSQL\DATA\templog.ldf] in database [tempdb] (2).  The OS file handle is 0x000006b4.  The offset of the latest long I/O is: 0x00000000080000
    2008-10-10 08:47:01.63 Server      Using 'dbghelp.dll' version '4.0.5'
    2008-10-10 08:49:14.66 Server      ***Unable to get thread context - no pss
    2008-10-10 08:49:14.66 Server      * *******************************************************************************
    2008-10-10 08:49:14.66 Server      *
    2008-10-10 08:49:14.66 Server      * BEGIN STACK DUMP:
    2008-10-10 08:49:14.66 Server      *   10/10/08 08:49:14 spid 0
    2008-10-10 08:49:14.66 Server      *
    2008-10-10 08:49:14.66 Server      * Non-yielding Scheduler
    2008-10-10 08:49:14.66 Server      *
    2008-10-10 08:49:14.66 Server      * *******************************************************************************
    2008-10-10 08:49:14.66 Server      Stack Signature for the dump is 0x00000319

    2008-10-10 08:49:25.51 spid61      Autogrow of file 'templog' in database 'tempdb' took 238266 milliseconds.  Consider using ALTER DATABASE to set a smaller FILEGROWTH for this file.
    2008-10-10 08:49:26.30 Server      External dump process return code 0x20000001.
    External dump process returned no errors.

    2008-10-10 08:49:26.30 Server      Process 2:0:0 (0x11ec) Worker 0x0433C0E8 appears to be non-yielding on Scheduler 7. Thread creation time: 12865910449631. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 99%. Interval: 70001 ms.


    About one minute after these events, the clustered instance failed over to the other node because, according to the cluster log, the IsAlive check had failed.

    Now in this type of situation you have the option of reviewing the dump file yourself, as per my previous post, checking the stacks and seeing if something is obvious, or alternatively sending the dump file to us here at Microsoft and seeing what we can find in it using our private symbols.

    However in this case there is more information to be had from the error logs themselves, which can be of use in this situation.

    1. We are receiving the classic "15 second IO warning" error that was introduced in SQL Server 2000 SP4. Think about what this is actually saying: it is saying that certain IO requests are taking at least 1000 times longer to complete than a normally acceptable benchmark for IO.

    (When I say acceptable benchmark we could obviously discuss what is acceptable for what disk system, fixed or network storage or whatever, but the value I'm using is 15ms, because it's a nice round number compared to the warning message!)

    SQL Server is an IO dependent application (it likes memory as well) but with an IO system responding like that, you will often notice performance or other types of bad problems.

    2. We can clearly see that we have a non-yielding scheduler, meaning that somewhere inside the SQL Server OS one of our schedulers has completely stopped working.

    3. We are getting the IO warning whilst doing an autogrow of tempdb.

    When I look at these facts I have the following thoughts:

    a) The IO could be a root cause of our problem, but it could also be an indicator. If tempdb is suddenly growing, what have we attempted to put into tempdb that requires a grow?

    b) Is the IO system working normally at other times, and do we have any benchmarks of how this IO system normally performs?

    c) Why haven't we grown tempdb in advance at startup, as per standard best practices? Or have we pre-grown it, but it has just tried to grow massively again because the server is involved in some unexpectedly huge operation which requires an enormous amount of tempdb space?

    d) What was the non-yielding scheduler doing? (I needed to look at the dump file in this regard, but I'm excluding the details here as the findings were inconclusive and there is lots of other public information.)

    These are the typical types of questions that you should always be asking yourself and your database teams when scenarios of this nature are encountered.
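
    To answer a question like (c), a quick check of how the tempdb files are currently sized and configured to grow; a sketch using the sys.master_files catalog view (database_id 2 is tempdb):

    select name, physical_name,
           size * 8 / 1024 as size_mb,
           case when is_percent_growth = 1
                then cast(growth as varchar(10)) + ' percent'
                else cast(growth * 8 / 1024 as varchar(10)) + ' MB'
           end as growth_setting,
           max_size
    from sys.master_files
    where database_id = 2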

    The 15 second IO warning message is one that I come across particularly frequently, and it should not be ignored. It can be reflective of all sorts of problems such as:

    Hardware Fault

    Driver Problems

    Overloading of disk system capacity

    Badly performing SQL batches issuing an unreasonable number of IO requests

    Mismanagement of data and log files

    and the list goes on like this...

    If you are experiencing this issue a lot on your production servers, you need to work with your storage support people to work out why. It can be harmless and just indicative of a sudden burst of load which the disk cannot respond to, but it can also mean many worse things as well! My point here is, I guess: ignore it at your peril!

    So back to the example in hand anyway....

    Since we can see some significant problems occurring around the time of the dump and the failover, you can easily investigate these using standard techniques, without needing to debug the dump yourself or send it to Microsoft. In this case I would want to analyze tempdb usage and file sizes in detail, and also monitor or review my disk performance baselines.

    It is not always possible to attribute a non-yielding scheduler to one single simple cause. Sometimes you find a very obvious problem on the thread that is hung, or a very obvious IO problem, but it is also possible for it to be a combination of complex problems. When you find yourself in this situation, take note of the indicators which your log files show and consider implementing some proactive monitoring based upon what you see, to try to catch or pre-empt the problem next time.

    Or just send us a case at Microsoft of course, and get us to look at the dump files :-)

  • SQL in Stockholm

    Database Mirroring Monitor can cause large tempdb growth in SQL Server 2005 - changed behavior in SQL Server 2008

    • 3 Comments

    On extremely rare occasions in SQL Server 2005, when the following conditions are true:

     

    1. You have a database configured for database mirroring
    2. You have replication configured on the same database
    3. You are running the database mirroring monitor
    4. The SQL Agent job created by the database mirroring monitor is enabled

     

    It was possible to encounter a situation where tempdb grew suddenly and significantly, leading to all the complications that such a situation causes, including but not limited to:

     

    Msg 1105, Level 17, State 2, Line 3

    Could not allocate space for object xxxxxxx in database 'tempdb' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

     

    Error: 17053, Severity: 16, State: 1.

    C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\tempdb.mdf: Operating system error 112(There is not enough space on the disk.) encountered.

     

    Error: 1101, Severity: 17, State: 2.

    Could not allocate a new page for database 'tempdb' because of insufficient disk space in filegroup 'PRIMARY'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

     

    If you were able to capture this situation and have some monitoring running, you would notice the following event captured in profiler:

     

    An execution of the following, which is the main entry-point stored procedure for the mirroring monitor:

     

    exec sys.sp_dbmmonitorupdate @database_name

     

    followed by an execution of :

     

    exec @retcode = sys.sp_dbmmonitorMSgetthelatestlsn @database_name, @end_of_log_lsn output

     

    which was the last procedure executed by the mirroring monitor before tempdb started to grow. If you were to trace the details of this procedure, you could see the statements which it executes. What you will see is that the last statement actually executed before tempdb starts filling is this:

     

    dbcc dbtable(blog_mirror) with tableresults, no_infomsgs

     

    Because this procedure is actually held in the resource database (the others are in msdb, so you can take a look easily if you want to), it is protected from normal viewing, so I won't reproduce it here. It does exactly what its name implies: it gets the latest LSN and then uses this to calculate, amongst other things, the amount of time required to synchronize.

     

    The problem lies in the fact that, in extremely rare circumstances, this procedure can cause huge tempdb growth due to the way it calculates the latest LSN using dbcc dbtable (an undocumented internal DBCC command) and stores its temporary calculations in a temp table. Unfortunately the only way to resolve this problem in SQL Server 2005 is to disable the mirroring monitor job in SQL Agent. You should only consider or need to do this, though, if you are definitely experiencing this specific problem, i.e. you have all the pre-requisite conditions and you have a profiler trace proving that the dbcc command was the last command executed before tempdb filled up.
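
    If you do find yourself needing to do that, a minimal sketch of disabling the job via SQL Agent (the job name shown is the default one that the mirroring monitor creates; check the exact name on your own instance before running this):

    use msdb
    go
    exec dbo.sp_update_job
        @job_name = N'Database Mirroring Monitor Job',
        @enabled = 0
    go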

     

    However the good news is that in SQL Server 2008 we made a design change to collect this data in a far more efficient way. If you look at the results returned from a query of the sys.database_mirroring DMV in SQL 2008:

     

    select *
    from sys.database_mirroring

     

    you will note 2 columns have been added to the output:

     

    mirroring_end_of_log_lsn

    and

    mirroring_replication_lsn

     

    Meaning that the information that the mirroring monitor requires is now available in a much simpler format. Therefore if you look at the SQL 2008 version of the msdb stored procedure sys.sp_dbmmonitorupdate, you will note that the execution of sys.sp_dbmmonitorMSgetthelatestlsn has been replaced with a simple select from the new DMV columns:

     

    select @role = (mirroring_role - 1),
           @status = mirroring_state,
           @witness_status = mirroring_witness_state,
           @failover_lsn = mirroring_failover_lsn,
           @end_of_log_lsn = mirroring_end_of_log_lsn
    from sys.database_mirroring where database_id = @database_id

     

    which is simpler and faster, and avoids any possibility of the tempdb problem occurring.

     

  • SQL in Stockholm

    SQL Server performance problems due to high privileged CPU time when running Windows Server 2008 R2

    • 1 Comments

    It's very rare that I tell people to blindly apply service packs or hotfixes (especially Windows ones) without first analyzing where their performance problem lies, but if you are running SQL Server 2008 on Windows Server 2008 R2 (specifically this OS only) then you should absolutely ensure that you've installed the following fix:

    976700  An application stops responding, experiences low performance, or experiences high privileged CPU usage if many large I/O operations are performed in Windows 7 or Windows Server 2008 R2
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;976700

    In this scenario, the key symptom which you normally ought to notice is that you are getting high kernel times, which you can see in the following perfmon counter:

    Process(sqlservr)\% Privileged Time

    We've seen a trend of this issue in recent months, so it's something to look to apply in your next maintenance window if you can.

     

  • SQL in Stockholm

    Performance Problems in SQL Server - What data should I collect?

    • 5 Comments

    If you ever talk to me or any of my colleagues in the SQL Server PSS support team, and you have a suspected SQL Server performance problem, you’ll undoubtedly be asked to run the PSSDIAG tool which usually collects the data that we need. This tool acts as a one stop shop for us to pick up a variety of data sets, including SQL Profiler, Performance Monitor, SQL Server configuration information, SQL Server logs, blocking scripts and many more custom scripts of our choosing. We published a version of the tool for SQL Server 7 and 2000 under http://support.microsoft.com/kb/830232 . Although we still maintain a version for SQL Server 2005, the tool was such a success that it began to be incorporated into the core product itself thereby enabling people to analyze their own performance problems more easily, or collect the data that PSS would need to help them.

    Nowadays the public version of the tool is named SQLDIAG and installs by default with all SQL Server 2005 installations. You can read about its options in Books Online http://msdn2.microsoft.com/en-us/library/ms162833.aspx.

    It works very simply in default mode, try it now on a test box: simply open a command prompt, switch to this directory

    C:\Program Files\Microsoft SQL Server\90\Tools\binn

    And run the following:

    sqldiag.exe

    You’ll see something like the following

    [Screenshot: sqldiag.exe collecting data in a command prompt window]

    By default it connects to the local machine name, and every SQL 2005 instance it can find. There are many switches and options and a detailed configuration file that can be used to tailor the process (more on this later), but if you're experiencing a problem and you want to grab some data super quickly without thinking too much, you can just do this. The collector is now writing out data to a sub-directory named sqldiag, and as it says, when you're ready to stop, simply press Ctrl+C. The collector might take a few minutes to shut down depending upon the speed of your server. When it's finished, take a look in the sqldiag sub-directory and you'll see a collection of files. By default you'll have collected the following:

    Default Trace files (if server option enabled)

    sqldiag output text file (contains SQL error logs, configuration information, internal performance counters, the output from several key DMVs and much more)

    msinfo32.txt

    sqldumper error log

    log and control files for the sqldiag process.

    I’ll talk about interpreting this data at another time, but you’ve already collected some interesting data about your SQL Server instance and what it was up to whilst the collector was running.

    What you should also notice is that when you ran the tool in this mode for the first time, it unpackaged several .XML files which it uses for configuration purposes. These are as follows:

    SQLDiag.XML

    SD_Detailed.XML

    SD_General.XML

    By default if you don’t use the /I switch (to specify an input file) the tool uses the SQLDiag.XML config file and you get the results as above. The true power of SQLDiag in my opinion though is in amending these config files and collecting data sets of your choice. The extra config files automatically created have the following collections turned on:

    SD_Detailed.XML

    Windows Event Logs

    Perfmon

    SQL Profiler (including several verbose events such as performance statistics, SQL:BatchStarted and SQL:BatchCompleted)

    SD_General.XML

    Windows Event Logs

    Perfmon

    SQL Profiler (with less verbose events)

    Within both the perfmon and SQL Profiler events, you have the ability to configure every potential sub event on or off, based upon the situation that you wish to monitor. All you do is edit a copy of the XML file and change individual elements like these from “true” to “false” and vice versa.

    <PerfmonObject name="\MSSQL$%s:Buffer Manager" enabled="true">
    <PerfmonCounter name="\*" enabled="true" />
    <PerfmonCounter name="\Buffer cache hit ratio" enabled="true" />
    
    <EventType name="Locks">
    <Event id="148" name="Deadlock Graph" enabled="true" description="Occurs when an attempt to acquire a lock is ......" />

    In many cases though, one of the default templates will contain enough info for you to make an initial assessment of the situation. To run sqldiag with one of the other config files (or one you create yourself) use the following syntax:

    sqldiag.exe /ISD_General.XML

    and you might also want to add a new output directory:

    sqldiag.exe /ISD_General.XML /Oc:\sqldiagoutput

    One thing that the default configs don’t enable though is the blocking collector, which I do think is particularly useful. To enable this collector you need to create a copy of the SQLDiag.XML and edit the following section:

    <BlockingCollector enabled="false" pollinginterval="5" maxfilesize="350"/>

    Changing the BlockingCollector enabled value to “true” will turn on the blocking collector, and provide a blocked process report for any blocking experienced whilst the collector is running. This is a particularly common cause of application performance issues and well worth enabling.

    In addition to these configurations, one of my colleagues in the US has posted the internal performance statistic T-SQL script which we use to capture all sorts of data from the SQL Server DMVs. This is the principal script that I use every day to troubleshoot performance problems. This script doesn’t install by default with a standard SQL Server setup like the rest of SQLDIAG, but you can obtain the script and download it from the PSS SQL Server Engineers blog here. He also helpfully provided an XML config file with all the required entries to make it run. I found the easiest way to deploy it in a default install of SQL Server was to unzip the 2 SQL scripts and the 2 XML files into the directory where sqldiag.exe lives. You can then simply run the same diagnostics as above, but include these extra T-SQL scripts, by running something like this:

    sqldiag /ISQLDiagPerfStats_Trace.XML /Oc:\sqldiag

    You can also use the cmd files he provided for registering and running SQLDiag as a service. Either way you get 3 extra text files in the output.

    Over time you’ll find it useful to build up a selection of your own config files, and then you can just copy them to any server you wish to monitor, based upon the situation you want to look at. For example I keep custom ones to hand which focus on the following subjects

    Memory Usage

    Blocking and general Profiler Trace

    Disk Performance

    Profiler Trace

    Recompilation

    Remember though, if you enable a custom configuration file like the ones above, your collections can be extremely verbose. You should therefore consider the following:

    1. Disk Space for output: potentially many GBs on heavy loaded machines

    2. Disk Drive chosen for output: If you want to monitor disk performance don’t write the output to the same drive, and don’t ever write the output to a network drive

    3. Verbose events such as performance statistics can place a heavy CPU load on a server. I only enable these if I’m certain that I need them

    Finally you might wish to configure the collector to only run against specific machines or instances, and the following XML elements allow you to do this:

    a) Change the machine name element to the name of the server you wish to connect to (the default of “.” means connect to the local machine)

    <Machines>
    <Machine name=".">

    (Or use the /M switch when running from the command line.)

    b) Change the instance name element to the SQL Server instance that you wish to connect to (the default of "*" connects to every instance it can find on the machine specified above).

    <Instance name="*" windowsauth="true" ssver="9" user="">

    I hope this all makes sense. If you have any questions, just post a comment. Over coming posts I'll talk about how to interpret the data you collect.

  • SQL in Stockholm

    Error 29528. The setup has encountered an unexpected error while Setting Internal Properties

    • 0 Comments

    We've just published a change to KB925976 on how to work around the following setup error:

    Error 29528. The setup has encountered an unexpected error while Setting Internal Properties. The error is: Fatal error during installation.

    We've also just published KB961662 which details how to workaround a problem you can encounter if you had used the original version of 925976.

    Msg 7644, Level 17, State 1, Line 2
    Full-text crawl manager has not been initialized. Any crawl started before the crawl manager was fully initialized will need to be restarted. Please restart SQL Server and retry the command. You should also check the error log to fix any failures that might have caused the crawl manager to fail.

    These problems are related to SID management and the corresponding registry keys. In the earlier version of KB925976 there was advice to delete a registry key containing corrupted SIDs. This workaround was successful in fixing the setup error and allowing setup to complete in all scenarios, but it potentially causes problems with FTS as noted above. The new version of 925976 provides a more future-proof solution.

    What I also learnt whilst working on these articles was that there is a public sysinternals tool to get SIDs when you need to do maintenance such as this. You can download it here:

    PsGetSid v1.43 - http://technet.microsoft.com/en-us/sysinternals/bb897417.aspx

    This can be useful for many other scenarios and is worth remembering.

  • SQL in Stockholm

    Changed behaviour of OPTION RECOMPILE syntax in SQL Server 2008 SP1 cumulative update #5

    • 0 Comments

    On Monday we released Cumulative update package 5 for SQL Server 2008 Service Pack 1, under KB975977. Contained in this was a very interesting fix to a long running issue with regards to the option recompile syntax and its changed behaviour in SQL Server 2008. This fix has its own KB article at KB976603.

    In SQL Server 2008 RTM we introduced behaviour that allowed potentially better plans to be created when using the OPTION (RECOMPILE) syntax. The way this worked in principle was that if OPTION (RECOMPILE) was used, SQL Server could assume that the plan for the statement in question would not need to be re-used, since it was always going to recompile. Therefore the query optimizer could safely use the actual parameter values passed to the statement when choosing a plan, as opposed to just using a parameterised template. In certain circumstances this can lead to dramatic performance gains, especially when you have parameter values which vary wildly and need different plans depending upon the values passed.
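
    Purely as an illustration (the table and column names are hypothetical), this is the classic optional-parameter search pattern that benefits from this behaviour:

    create procedure dbo.search_orders
        @customer_id int = null,
        @order_date datetime = null
    as
    select order_id, customer_id, order_date
    from dbo.orders
    where (@customer_id is null or customer_id = @customer_id)
      and (@order_date is null or order_date = @order_date)
    option (recompile)
    go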

    The well known MVP Erland Sommarskog demonstrates these techniques (along with many others) in his excellent article

    Dynamic Search Conditions in T-SQL

    However, this article has currently been edited to remove the reference to this technique (though I believe that Erland will soon add it back). We were forced to withdraw this functionality as part of Cumulative update package 4 for SQL Server 2008. The reason for this was that on rare occasions batches using this syntax could return incorrect results if multiple connections were running the same batch simultaneously. This was due to the fact that the query optimizer associated such plans with the specific thread upon which they were compiled, whereas the actual execution of the batch could subsequently occur on a different thread.

    Therefore if you run SQL Server 2008 RTM CU#4 or above, or SQL Server 2008 SP1 between release and CU#4, the OPTION RECOMPILE syntax works as it did in SQL Server 2005.

    However this latest update releases a full and thorough fix for the original problem. It has re-instated the enhanced functionality of parameter embedding optimization, and resolves the incorrect results problem seen earlier. The reason that this fix took a while to come out in this form was down to the fact that the changes required in the query optimizer were relatively complex. Initially it was not viable to release them quickly due to the amount of testing required, so this was why we were forced to just withdraw the functionality, because the potential for incorrect results is one of the most serious types of problem SQL Server can encounter.

    In summary, here is a matrix showing the different builds and functionality:

    SQL 2005 – no parameter embedding optimization

    SQL 2008 RTM to SQL 2008 CU#3 – new functionality, but potential for incorrect results

    SQL 2008 CU#4 onwards - no parameter embedding optimization

    SQL 2008 SP1 RTM to SQL 2008 SP1 CU#4 - no parameter embedding optimization

    SQL 2008 SP1 CU#5 onwards – new functionality enabled again, no possibility of incorrect results

    Remember that RTM and SP1 are separate branches, so even though the fix is now out in the SP1 branch, it is not in the latest cumulative update on the RTM branch, and it is unlikely to ever be put in there.

  • SQL in Stockholm

    SQL Server Assertions and Dumps – a typical example and some proactive help

    • 0 Comments

    I recently worked with a customer who was seeing the following error in their error log:

    SQL Server Assertion: File: <qxcntxt.cpp>, line=1069 Failed Assertion = 'cref == 0'
    Could not find an entry for table or index with partition ID 491897996509184 in database 2

    As you would expect a mini dump was produced at the same time, which detailed the stored procedure that was being run when the assertion occurred.

    In the case of such errors I would always advise you to contact us here at PSS so that we can analyze the dump and provide an explanation or solution. However this particular case reminded me of an interesting side issue where you can proactively help yourself, before contacting PSS.

    You should consider running SQL Server Best Practice Analyzer on your server. You can download it from here. We regularly see cases where the running of the tool could have identified known issues in advance. When you run the tool initially, it checks for updates to the tool itself, but more importantly it checks for updated configuration files. For your reference (and if you're nosey like me) you can look at the configuration file it has downloaded here:

    C:\Program Files (x86)\Microsoft SQL Server 2005 Best Practices Analyzer\EN\SqlBPA.config.XML

    (remove the x86 from the directory path if you don't have a 64-bit installation of Windows)

    Within PSS nowadays we add rules to this tool when we find interesting new issues or common causes of the problems that our customers experience. If you run the tool regularly you'll pick up our latest updates (the things that we in PSS think are important to check for) and have them checked automatically. A large number of the support cases that I work on involve known configuration issues and the like which could be avoided, or at least highlighted, by this tool.

    The reason that I mention this now is that the tool currently contains a rule for a known issue with a driver from Dell which can cause random data corruptions which show up as assertions and errors similar to the one above.

    For reference the driver in question is LSI_SAS.SYS where the version is less than 1.21.26.01. If you notice that you are running an environment that has the matching configuration options ( >4GB of system memory and on servers with a SAS backplane, Microsoft Windows 2000 or 2003 and Dell OpenManage Server Administrator Storage Management Software) to be susceptible to this problem, you can download the update which resolves it here.

    It actually turned out that this was not the cause in this case, but it’s always good to check these things, and the BPA is a great way of doing it.

    The key to actually finding the cause of this particular error was to note the rest of the SQL Server error logs, which were littered with errors 602, 676, and 677. For example:

    Error: 602, Severity: 21, State: 30.
    Could not find an entry for table or index with partition ID 491897996509184 in database 2. This error can occur if a stored procedure references a dropped table, or metadata is corrupted. Drop and re-create the stored procedure, or execute DBCC CHECKDB. 

    Error 676
    Error occurred while attempting to drop worktable with partition ID 491897996509184

    Error 677
    Unable to drop worktable with partition ID 491897996509184 after repeated attempts. Worktable is marked for deferred drop. This is an informational message only. No user action is required.

    In themselves, these errors are not completely conclusive, as further memory dumps may be produced, which again would also need to be analyzed. However sometimes they are caused by this problem http://support.microsoft.com/default.aspx?scid=kb;EN-US;937343

    If you are getting errors such as these, or others like them, you should consider what the errors themselves recommend. You should run DBCC CHECKDB to check your data integrity (a data corruption can cause all manner of different assertions depending upon its nature) and you should check your hardware for possible problems as well.
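
    For reference, a minimal example of the kind of integrity check to run regularly (the database name is a placeholder):

    dbcc checkdb ('yourdatabase') with no_infomsgs, all_errormsgs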

    I guess what I’m trying to say in conclusion is that there doesn’t have to be one single simple cause when you look at particular errors such as these. The basics still always apply;

    You should run DBCC CHECKDB regularly (this is another BPA rule it will highlight it if you haven’t)
    You should keep an eye on the overall health of your environment, especially the storage
    You should keep an eye on your SQL Server error logs and consider all the events together.

    The answer to this actual problem? Well it was the hotfix noted above, but we had to collect some dumps to prove this definitively, and that’s something that you do need PSS involvement with.

  • SQL in Stockholm

    SQL Server 2008 SDK directory does not contain the opends60.lib file required for the creation of extended stored procedures

    • 0 Comments

    If you are trying to create extended stored procedures in SQL Server 2008, you will probably have read the BOL article here:

    Creating Extended Stored Procedures

    Hopefully you will have noticed that this feature is deprecated, so you should only be using this functionality for legacy requirements. That said, you will require the file opends60.lib to be able to build your extended stored procedure DLL, and you may have noticed that you cannot find it after installing SQL Server 2008 (even if you have installed the SDK).

    The default location for the SDK directory in SQL Server 2008 is here:

    C:\Program Files\Microsoft SQL Server\100\SDK\

    However it does not contain opends60.lib. If you need to acquire a copy of this file, you currently need to obtain it from an alternative source, namely a previous version of the SQL Server SDK or an installation of a previous version of Visual Studio. For example, the x64 version of the file can be found in the SQL Server 2005 SDK directory here:

    C:\Program Files\Microsoft SQL Server\90\SDK\Lib\x64

    You should be able to use one of these other versions instead.
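
    As a reminder, once you have built the DLL against opends60.lib, the extended stored procedure is registered in the master database with sp_addextendedproc. The procedure and DLL names below are just placeholders:

    USE master;
    GO
    -- register the function exported by your extended stored procedure DLL (names are placeholders)
    EXEC sp_addextendedproc 'xp_myproc', 'xp_mydll.dll';
    GO
    -- and to remove it again when it is no longer needed
    EXEC sp_dropextendedproc 'xp_myproc';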

  • SQL in Stockholm

    Sample Application to test database mirroring failover

    • 1 Comments

    Here's the source for a tiny C# app I wrote to test database mirroring failover at the application level, including a sample database mirroring connection string. It assumes that you have already set up a mirroring partnership. It doesn't matter whether the partnership includes a witness or has auto or manual failover.

    I'll let you create your web or winform project as you see fit, and therefore have only included the actual C# code from the button click event, but basically you need a form with 3 controls:

    button1 - a button

    textBox1 - a multiline text box

    textBox2 - a text box

    In addition to this you need to create a stored procedure in your mirrored database called 'dbmtest' and get it to fetch something that can be converted to a string. The code from my procedure is also shown at the end, but it's not exactly rocket science!

    Here's a quick summary of what the event does:

    1. When you click the button, it attempts to connect to the server supplied in the connection string and call a stored procedure to get some data.

    2. If it gets the data, it displays it. If a SqlException is thrown, it displays the error message instead.

    3. It attempts to display, in a separate text box, which server it is connected to, using the SqlConnection.DataSource property.

    4. If you fail over your mirror, you will receive an error the next time you click the button (a sample T-SQL failover command is shown after the stored procedure code below).

    5. Re-try by clicking again and you will connect to the other server automatically.

    6. If any other general errors are caught, it shows them in the text box.

    Obviously in a real world situation you would make the re-try method somewhat more graceful, but this serves my purposes :-) It's nice sometimes to stick breakpoints in and see what is happening at what stages across the components.

    Here's the code:

    // requires "using System.Data.SqlClient;" at the top of the file
    private void button1_Click(object sender, EventArgs e)
            {
                try
                {
                    SqlConnection dbmconn = new SqlConnection();
                    // "failover partner" is what lets the client redirect automatically after a mirroring failover;
                    // windows authentication (trusted_connection) is used, so no user name or password is supplied
                    dbmconn.ConnectionString = "server=server1;initial catalog=mymirrordb;failover partner=server2;trusted_connection=true";
                    SqlCommand cmd = new SqlCommand();
                    cmd.Connection = dbmconn;
                    cmd.CommandType = System.Data.CommandType.StoredProcedure;
                    cmd.CommandText = "dbmtest";

                    try
                    {
                        SqlDataReader dbreader;
                        dbmconn.Open();
                        dbreader = cmd.ExecuteReader();

                        while (dbreader.Read())
                        {
                            textBox1.Text = "Data fetched = " + dbreader[0].ToString();
                            textBox2.Text = dbmconn.DataSource.ToString();
                        }
                    }
                    catch (SqlException sqlex)
                    {
                        textBox1.Text = sqlex.Message.ToString();
                    }
                    finally
                    {
                        dbmconn.Close();
                    }
                }
                catch (Exception ex)
                {
                    textBox1.Text = ex.Message.ToString();
                }
            }

    Stored Procedure Code

    create procedure dbmtest

    as

    select top 1 blog_data
    from blog

     

  • SQL in Stockholm

    Download location for SQLDIAG Perfstats DMV scripts

    • 1 Comments

    I refer to this many times in my other posts, so here’s a direct link to the download location of the perfstats scripts for SQLDIAG, which you can use yourself. I use this every day; you should too!

    http://blogs.msdn.com/psssql/archive/2007/02/21/sql-server-2005-performance-statistics-script.aspx

  • SQL in Stockholm

    SQL Server DMV script for searching the plan cache by query text, plan handle or execution time

    • 0 Comments

    I guess everyone has their own set of DMV scripts to hand when working with SQL Server nowadays, so I thought I’d post a few of mine. Pretty much as the title says: comment the clauses in or out to suit you and then run the query. It returns a result set with an XML column that you can click on, if you want, to view the plan itself in SSMS. Works with 2005 and 2008. (A note on evicting a single plan from the cache follows the query.)

    select
    bucketid,
    a.plan_handle,
    refcounts,
    usecounts,
    size_in_bytes,
    cacheobjtype,
    objtype,
    text,
    query_plan,
    creation_time,
    last_execution_time,
    execution_count,
    total_elapsed_time,
    last_elapsed_time
    from sys.dm_exec_cached_plans a
        inner join sys.dm_exec_query_stats b on a.plan_handle = b.plan_handle
        cross apply sys.dm_exec_sql_text(b.sql_handle) as sql_text
        cross apply sys.dm_exec_query_plan(a.plan_handle) as query_plan
    where 1=1
    and text like '%mybadproc%'
    -- and a.plan_handle = 0x06000B00C96DEC2AB8A16D06000000000000000000000000
    and b.last_execution_time between '2009-09-24 09:00' and '2009-09-25 17:00'
    order by last_execution_time desc
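
    As a side note, once you have tracked down a specific plan with the query above, SQL Server 2008 lets you evict just that plan from the cache rather than flushing everything. A minimal sketch, reusing the sample plan handle from the commented-out clause above:

    -- SQL Server 2008 and later: remove a single plan from the cache by its plan handle
    DBCC FREEPROCCACHE (0x06000B00C96DEC2AB8A16D06000000000000000000000000);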
  • SQL in Stockholm

    Access Denied error when running SQL Server Transfer Objects Task with a Proxy Account

    • 0 Comments

    If you configure an SSIS package which uses the Transfer SQL Server Objects Task and then execute it through SQL Agent under a proxy credential, as opposed to the normal SQL Agent account, you may encounter a failure. The typical full text of the error message from the SQL Agent job history is shown here.

    Executed as user: GRAHAMK20\SSIS_Proxy. Microsoft (R) SQL Server Execute Package Utility  Version 10.0.2531.0 for 64-bit  Copyright (C) Microsoft Corp 1984-2005. All rights reserved.    Started:  4:55:19 PM  Error: 2010-02-04 16:55:19.65     Code: 0xC0016016     Source:       Description: Failed to decrypt protected XML node "PassWord" with error 0x80070002 "The system cannot find the file specified.". You may not be authorized to access this information. This error occurs when there is a cryptographic error. Verify that the correct key is available.  End Error  Error: 2010-02-04 16:55:19.67     Code: 0xC0016016     Source:       Description: Failed to decrypt protected XML node "PassWord" with error 0x80070002 "The system cannot find the file specified.". You may not be authorized to access this information. This error occurs when there is a cryptographic error. Verify that the correct key is available.  End Error  Error: 2010-02-04 16:55:20.00     Code: 0xC002F325     Source: Transfer SQL Server Objects Task Transfer SQL Server Objects Task     Description: Execution failed with the following error: "Access to the path 'C:\Users\Default\AppData\Local\Microsoft\SQL Server\Smo' is denied.".  End Error  DTExec: The package execution returned DTSER_FAILURE (1).  Started:  4:55:19 PM  Finished: 4:55:20 PM  Elapsed:  0.562 seconds.  The package execution failed.  The step failed.

    Depending on how you read this message, you could attribute it to a number of different causes, but the important point to note is the final error: "Access to the path 'C:\Users\Default\AppData\Local\Microsoft\SQL Server\Smo' is denied." This is the lowest-level error and the actual cause of the failure; the other errors are thrown further up the stack. You can confirm this by running Process Monitor on the server where the package is executing. You’ll see failures here:

    - C:\Users\Default\AppData\Local\Microsoft\SQL Server (create file)
    - C:\Users\Default\AppData\Local\Microsoft\SQL Server\Smo (create file)

    However if you run the same package in the context of the SQL Agent account, you will notice that the same create file operation targets a different directory:

    - C:\Users\<sqlagentuserid>\AppData\Local\Microsoft\SQL Server\Smo

    The package requires a temporary location in which to process some files when it prepares to run, and there is a bug in the way the proxy account credential decides where it should place these files. This has been passed to the product group and should be fixed in SQL 11.

    The reason that it is not being fixed immediately is that there is a relatively simple workaround: manually create the directory structure in the default user profile directory and grant modify/write permissions to the user or group which represents the proxy credential account. So in my specific example I logged on to the server with an administrator account and created the following directory

    C:\Users\Default\AppData\Local\Microsoft\SQL Server\Smo

    and then granted the Users group (of which my proxy credential account is a member) the following permissions:

    Modify, Write, Read & Execute

    The package will then execute successfully under the proxy credential. This problem occurs on both SQL 2005 and SQL 2008.

  • SQL in Stockholm

    How to Customise SQLDIAG in SQL Server 2008 to collect DBCC MEMORYSTATUS

    • 1 Comments

    Although Books Online makes some mention of SQLDIAG customizations, it doesn’t give any explicit examples, so here’s one for you. I documented previously here how to customize your XML configuration file to connect to different machines and collect output from the standard collectors, and I also mentioned how you could download the PerfStats scripts and use the XML files they come supplied with. The following example shows how to insert a custom script of your own, which can be anything of your choice. I’ll be using DBCC MEMORYSTATUS output to demonstrate. The thing to remember here is that I’m not saying this is the only way to collect such data, but that, as ever, SQLDIAG allows you to do everything in one place, in a controlled and easy to administer way.

    Firstly we need a script which will collect DBCC MEMORYSTATUS repeatedly. The following one won’t win any design prizes but will do the job just nicely:

    WHILE 1=1
    BEGIN
        select 'start time:', GETDATE();
        DBCC MEMORYSTATUS;
        WAITFOR DELAY '00:00:10';
    END;

    This will run DBCC MEMORYSTATUS every 10 seconds until you kill it.

    Next we need to create an XML configuration to run this script using SQLDIAG. I’ll take the default SQLDIAG.XML which unpacks when you run the sqldiag.exe for the first time. If you need any explanations of these pre-requisites then go back to the SQLDIAG link above and read the background.

    <CustomDiagnostics>
    <CustomGroup name="msinfo" enabled="true" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get MSINFO32" type="Utility" point="Startup" wait="OnlyOnShutdown" cmd="start /B /WAIT MSInfo32.exe /computer %server% /report &quot;%output_path%%server%_MSINFO32.TXT&quot; /categories +SystemSummary+ResourcesConflicts+ResourcesIRQS+ComponentsNetwork+ComponentsStorage+ComponentsProblemDevices+SWEnvEnvVars+SWEnvNetConn+SWEnvServices+SWEnvProgramGroup+SWEnvStartupPrograms" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get default traces" type="Copy_File" point="Startup" wait="OnlyOnShutdown" cmd="&quot;%sspath%log\log*.trc&quot; &quot;%output_path%&quot;" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get SQLDumper log" type="Copy_File" point="Startup" wait="OnlyOnShutdown" cmd="&quot;%sspath%log\SQLDUMPER_ERRORLOG*.*&quot; &quot;%output_path%&quot;" />
    </CustomDiagnostics>

    If you look in the section above, we already have one custom group, called MsInfo, which has 3 sub-tasks. (The format of these is worth noting if you wish to do any file manipulation or run command-line tools.) What we want to do is add another CustomGroup, which will contain a single task: our DBCC script from above. The script itself needs to be saved into a file for the collector to access it, so in this scenario we’ll save the script into the same root directory as the sqldiag.exe. If you’re planning on doing this for many different scripts and scenarios you might be better organising these in appropriately named sub-directories. So I’ve saved my script above as the following:

    C:\Program Files\Microsoft SQL Server\100\Tools\Binn\exec_dbcc_memorystatus.sql

    Then I need to add the following XML to tell SQLDIAG that I have a new custom group of diagnostics (named SQLscripts in the XML below), which currently contains one custom task (our DBCC MEMORYSTATUS script). The new section looks like this:

    <CustomDiagnostics>
    <CustomGroup name="msinfo" enabled="true" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get MSINFO32" type="Utility" point="Startup" wait="OnlyOnShutdown" cmd="start /B /WAIT MSInfo32.exe /computer %server% /report &quot;%output_path%%server%_MSINFO32.TXT&quot; /categories +SystemSummary+ResourcesConflicts+ResourcesIRQS+ComponentsNetwork+ComponentsStorage+ComponentsProblemDevices+SWEnvEnvVars+SWEnvNetConn+SWEnvServices+SWEnvProgramGroup+SWEnvStartupPrograms" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get default traces" type="Copy_File" point="Startup" wait="OnlyOnShutdown" cmd="&quot;%sspath%log\log*.trc&quot; &quot;%output_path%&quot;" />
    <CustomTask enabled="true" groupname="MsInfo" taskname="Get SQLDumper log" type="Copy_File" point="Startup" wait="OnlyOnShutdown" cmd="&quot;%sspath%log\SQLDUMPER_ERRORLOG*.*&quot; &quot;%output_path%&quot;" />
    <CustomGroup name="SQLscripts" enabled="true" />
    <CustomTask enabled="true" groupname="SQLscripts" taskname="dbcc memorystatus" type="TSQL_Script" point="Startup" wait="No" cmd="exec_dbcc_memorystatus.sql"/>
    </CustomDiagnostics>

    Next I save this XML file under a new name, let’s say SQLDIAG_memstatus.XML, and then I just execute the sqldiag tool with an input file switch like this:

    sqldiag /ISQLDIAG_memstatus.XML

    and away we go. If you look in your output directory, you’ll find the following file:

    <yourservername>__exec_dbcc_memorystatus_Startup.OUT

    which contains all the DBCC MEMORYSTATUS output.

    All you need to do now is create custom XML files and scripts for as many scenarios as you can think of! Hopefully examining the entries for the MSINFO task will give you a few ideas as well.

  • SQL in Stockholm

    Interesting things to watch out for when using the autoclose database option

    • 3 Comments

    Here's something which I've been looking at this week, surrounding the autoclose database option.

    When you restore any database to a SQL Server 2005 SP2 or later instance, you'll more than likely get errors of this nature written to the log.

    2008-11-12 17:24:16.58 spid13s     SQL Server has encountered 5 occurrence(s) of cachestore flush for the 'Object Plans' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.
    2008-11-12 17:24:16.58 spid13s     SQL Server has encountered 5 occurrence(s) of cachestore flush for the 'SQL Plans' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.
    2008-11-12 17:24:16.58 spid13s     SQL Server has encountered 5 occurrence(s) of cachestore flush for the 'Bound Trees' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.

    This is expected behavior and is documented in Books Online

    "Restoring a database clears the plan cache for the instance of SQL Server. Clearing the plan cache causes a recompilation of all subsequent execution plans and can cause a sudden, temporary decrease in query performance. In SQL Server 2005 Service Pack 2, for each cleared cachestore in the plan cache, the SQL Server error log contains the following informational message: "SQL Server has encountered %d occurrence(s) of cachestore flush for the '%s' cachestore (part of plan cache) due to some database maintenance or reconfigure operations". This message is logged every five minutes as long as the cache is flushed within that time interval."

    I'd been looking at some problems which at first glance appeared to be related to this, in that after restoring a database there were lots of occurrences of cache flushes, which were causing performance problems.

    The answer in this case was that this database was in autoclose mode. A quick check of sp_dboption and this was confirmed. Another quick check to Books Online confirms that autoclose databases will cause a cache flush event when they shutdown.
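
    For reference, a quick way to make the same check yourself (the database name is just an example; sys.databases works on both 2005 and 2008):

    -- check whether a database has autoclose enabled
    EXEC sp_dboption 'test', 'autoclose';

    -- or via the catalog view
    SELECT name, is_auto_close_on FROM sys.databases WHERE name = 'test';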

    "When the database is set to AUTOCLOSE = ON, an operation that initiates an automatic database shutdown clears the plan cache for the instance of SQL Server. Clearing the plan cache causes a recompilation of all subsequent execution plans and can cause a sudden, temporary decrease in query performance. In SQL Server 2005 Service Pack 2, for each cleared cachestore in the plan cache, the SQL Server error log contains the following informational message: "SQL Server has encountered %d occurrence(s) of cachestore flush for the '%s' cachestore (part of plan cache) due to some database maintenance or reconfigure operations". This message is logged every five minutes as long as the cache is flushed within that time interval."

    I'd forgotten about this behavior, as it's very rare that I deal with autoclose databases; most of the systems I tend to work with don't use it, since autoclose is more designed for desktop systems, as Books Online states.

    The act of the cache flush in itself is not particularly problematic, unless of course you constantly access the database and then disconnect from it, causing numerous instance-wide cache flushes, which severely impact performance on any server.

    Another indication of this type of behavior is that your error log will be full of messages like these:

    2008-11-13 16:10:16.07 spid51      Starting up database 'test'.
    2008-11-13 16:10:16.22 spid51      CHECKDB for database 'test' finished without errors on 2008-11-12 16:11:55.453 (local time). This is an informational message only; no user action is required.
    2008-11-13 16:10:18.46 spid51      Starting up database 'test'.
    2008-11-13 16:10:18.50 spid51      CHECKDB for database 'test' finished without errors on 2008-11-12 16:11:55.453 (local time). This is an informational message only; no user action is required.

    The thing that I didn't realize, and that really surprised me, was that it was possible to cause such flushes just by accessing the database through Management Studio. Basically, if you were the only user of the database and you decided to browse its contents in SSMS, you'd open a connection, and then every time you expanded an object explorer tree control, for example, you'd issue a query to the server to generate the list of objects in that part of the tree (this is easy enough to trace in Profiler). However, when such a query finishes, if you are the only user then the database closes, as it deems all active users to be disconnected, and therefore a cache flush event takes place. This is of course a bad thing, especially if you happen to restore your autoclose database to an instance which is running lots of other production databases maintaining large numbers of cache entries on your server.
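
    If you hit this on a server instance, the simple remedy once the restore is complete is to turn the option off (the database name is just an example):

    -- disable autoclose after restoring the database to a server instance
    ALTER DATABASE test SET AUTO_CLOSE OFF;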

    Just something to think about....
