SQL in Stockholm

SQL Server thoughts from the support team in Stockholm

SQL Server troubleshooting thoughts from Graham Kent, (ex) Senior Support Escalation Engineer at Microsoft Sweden.

  • SQL in Stockholm

    How to configure SQL Server database mirroring between 2 instances on a standalone server


    There are lots of troubleshooting articles about how to diagnose connectivity problems when setting up database mirroring, especially errors such as these:

    Msg 1418, Level 16, State 1, Line 2

    The server network address "tcp://mirrortest.com:5023" can not be reached or does not exist. Check the network address name and that the ports for the local and remote endpoints are operational.

    or

    Msg 1456, Level 16, State 3, Line 1

    The ALTER DATABASE command could not be sent to the remote server instance

    but here's something a little different.

    Imagine that you have a server which is not connected to a domain. It might be a standalone machine, or a virtual machine that you've just built. I've done this a few times in virtual environments to do demos on laptops and the like. It's not a production scenario that you'd really come across, but it works very well in test.

    The question that usually comes up is: how can I configure the partners (and endpoints) without an FQDN, given that I'm not connected to a domain and am just in "WORKGROUP"? SQL Server still wants you to use FQDN formatting for the partner addresses. When trying to configure it, you might even have encountered an error like this when starting mirroring (depending on exactly how you did the setup):

    Database Mirroring login attempt failed with error: 'Connection handshake failed.

    There is no compatible authentication protocol. State 21

    What you need to do to be able to set this up is as follows:

    1. Add "local" as the primary DNS suffix, as detailed in this article (OK, it's about Exchange Server, but it works the same way):

    http://technet.microsoft.com/en-us/library/aa997119.aspx

    (remember to reboot)

    2. You can now specify FQDN-formatted names for the local machine in this format:

    For the principal:

    TCP://MIRRORTEST.LOCAL:5022

    For the mirror:

    TCP://MIRRORTEST.LOCAL:5023

    (Or other port numbers of your choice, but remember that they'll need to be different)

    3. You'll need to add these ports as exceptions in Windows Firewall if you're running it (I needed to, even though it was a local machine!).

    4. My endpoints are set up like this (I just did them through the GUI); there's nothing exceptional here:

    CREATE ENDPOINT [Mirroring]

    AUTHORIZATION [MIRRORTEST\Administrator]

    STATE=STARTED

    AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)

    FOR DATABASE_MIRRORING (ROLE = PARTNER, AUTHENTICATION = WINDOWS NEGOTIATE, ENCRYPTION = DISABLED)
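
    With the endpoints in place on both instances, you point the partners at each other using the FQDN-formatted addresses above. A minimal sketch, assuming a database named MirrorDemo that has already been restored WITH NORECOVERY on the mirror instance:

    --run on the mirror instance first, pointing at the principal's endpoint
    ALTER DATABASE MirrorDemo SET PARTNER = 'TCP://MIRRORTEST.LOCAL:5022';
    --then run on the principal instance, pointing at the mirror's endpoint
    ALTER DATABASE MirrorDemo SET PARTNER = 'TCP://MIRRORTEST.LOCAL:5023';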

    You can now run some cool tests and demos without ever being connected to a network.

  • SQL in Stockholm

    Interesting things to watch out for when using the autoclose database option


    Here's something which I've been looking at this week, surrounding the autoclose database option.

    When you restore any database to a SQL Server 2005 SP2 or later instance, you'll more than likely get messages of this nature written to the log:

    2008-11-12 17:24:16.58 spid13s     SQL Server has encountered 5 occurrence(s) of cachestore flush for the 'Object Plans' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.
    2008-11-12 17:24:16.58 spid13s     SQL Server has encountered 5 occurrence(s) of cachestore flush for the 'SQL Plans' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.
    2008-11-12 17:24:16.58 spid13s     SQL Server has encountered 5 occurrence(s) of cachestore flush for the 'Bound Trees' cachestore (part of plan cache) due to some database maintenance or reconfigure operations.

    This is expected behavior and is documented in Books Online

    "Restoring a database clears the plan cache for the instance of SQL Server. Clearing the plan cache causes a recompilation of all subsequent execution plans and can cause a sudden, temporary decrease in query performance. In SQL Server 2005 Service Pack 2, for each cleared cachestore in the plan cache, the SQL Server error log contains the following informational message: "SQL Server has encountered %d occurrence(s) of cachestore flush for the '%s' cachestore (part of plan cache) due to some database maintenance or reconfigure operations". This message is logged every five minutes as long as the cache is flushed within that time interval."

    I'd been looking at some problems which at first glance appeared to be related to this, in that after restoring a database there were lots of occurrences of cachestore flush, which was causing performance problems.

    The answer in this case was that the database was in autoclose mode. A quick check of sp_dboption confirmed this. Another quick check of Books Online confirms that autoclose databases cause a cache flush event when they shut down.
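
    If you want to check or change the setting yourself, here's a minimal sketch (the database name 'test' matches the example log output further down; sp_dboption works too, but this is the newer syntax):

    --is autoclose on for this database?
    SELECT name, is_auto_close_on FROM sys.databases WHERE name = 'test';
    --turn it off if you don't actually need it
    ALTER DATABASE [test] SET AUTO_CLOSE OFF;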

    "When the database is set to AUTOCLOSE = ON, an operation that initiates an automatic database shutdown clears the plan cache for the instance of SQL Server. Clearing the plan cache causes a recompilation of all subsequent execution plans and can cause a sudden, temporary decrease in query performance. In SQL Server 2005 Service Pack 2, for each cleared cachestore in the plan cache, the SQL Server error log contains the following informational message: "SQL Server has encountered %d occurrence(s) of cachestore flush for the '%s' cachestore (part of plan cache) due to some database maintenance or reconfigure operations". This message is logged every five minutes as long as the cache is flushed within that time interval."

    I'd forgotten about this behavior, as it's very rare that I deal with autoclose databases; most of the systems I tend to work with don't use it, since autoclose is designed more for desktop systems, as Books Online states.

    The act of the cache flush in itself is not particularly problematic, unless of course you constantly access the database and then disconnect from it, causing numerous instance-wide cache flushes, which will severely impact performance on any server.

    Another indication of this type of behavior is that your error log will be full of messages like these:

    2008-11-13 16:10:16.07 spid51      Starting up database 'test'.
    2008-11-13 16:10:16.22 spid51      CHECKDB for database 'test' finished without errors on 2008-11-12 16:11:55.453 (local time). This is an informational message only; no user action is required.
    2008-11-13 16:10:18.46 spid51      Starting up database 'test'.
    2008-11-13 16:10:18.50 spid51      CHECKDB for database 'test' finished without errors on 2008-11-12 16:11:55.453 (local time). This is an informational message only; no user action is required.

    The thing that I didn't realize, and that really surprised me, was that it is possible to cause such flushes just by accessing the database through Management Studio. If you are the only user of the database and you decide to browse its contents in SSMS, you open a connection, and then every time you expand an Object Explorer tree control, for example, you issue a query to the server to generate the list of objects in that part of the tree (this is easy enough to trace in Profiler). However, when such a query finishes, if you are the only user, the database closes because it deems all active users to be disconnected, and therefore a cache flush event takes place. This of course is a bad thing, especially if you happen to restore your autoclose database to an instance which is running lots of other production databases maintaining large numbers of cache entries on your server.

    Just something to think about....

  • SQL in Stockholm

    Want to read some tips and tricks from the SQL Developer support team?


    Well, if you do, you should check out the blog of my colleague and Microsoft pool team partner Spike, who works in the SQL Support team here with me in Stockholm but specialises in the area that we refer to as developer support. This is a very broad area and encompasses many technologies, but to give you a small taster of what he gets up to, he has just posted an article on connectivity issues.

    http://blogs.msdn.com/spike/archive/2008/11/07/sql-server-error-10060-sql-server-error-10061.aspx

    He has many more interesting articles in his archive. This is well worth adding to your RSS reader.

  • SQL in Stockholm

    Database Mirroring Monitor can cause large tempdb growth in SQL Server 2005 - changed behavior in SQL Server 2008


    On extremely rare occasions in SQL Server 2005, when the following conditions are true:

     

    1. You have a database configured for database mirroring

    2. You have replication configured on the same database

    3. You are running the Database Mirroring Monitor

    4. The SQL Agent job created by the Database Mirroring Monitor is enabled

     

    It was possible to encounter a situation where tempdb grew suddenly and significantly, leading to all the complications that such a situation causes, including but not limited to:

     

    Msg 1105, Level 17, State 2, Line 3

    Could not allocate space for object xxxxxxx in database 'tempdb' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

     

    Error: 17053, Severity: 16, State: 1.

    C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\tempdb.mdf: Operating system error 112(There is not enough space on the disk.) encountered.

     

    Error: 1101, Severity: 17, State: 2.

    Could not allocate a new page for database 'tempdb' because of insufficient disk space in filegroup 'PRIMARY'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

     

    If you were able to capture this situation with some monitoring running, you would notice the following events captured in Profiler:

     

    An execution of the following, which is the main entry point stored procedure for the mirroring monitor:

     

    exec sys.sp_dbmmonitorupdate @database_name

     

    followed by an execution of:

     

    exec @retcode = sys.sp_dbmmonitorMSgetthelatestlsn @database_name, @end_of_log_lsn output

     

    which was the last procedure executed by the mirroring monitor before tempdb started to grow. If you were to trace the details of this procedure, you could see the statements it executes. What you will see is that the last statement actually executed before tempdb starts filling is this:

     

    dbcc dbtable(blog_mirror) with tableresults, no_infomsgs

     

    Because this procedure is actually held in the resource database (the others are in msdb, so you can take a look easily if you want to), it is protected from normal viewing, so I won't reproduce it here. It does exactly what its name implies: it gets the latest LSN and then uses this to calculate the amount of time required to synchronize, among other things.

     

    The problem is that in extremely rare occurrences this procedure can cause huge tempdb growth, due to the way it calculates the latest LSN using dbcc dbtable (an undocumented internal DBCC command) and stores its temporary calculations in a temp table. Unfortunately the only way to resolve this problem in SQL Server 2005 is to disable the mirroring monitor job in SQL Agent. You should only consider or need to do this, though, if you are definitely experiencing this specific problem, i.e. you have all the prerequisite conditions and you have a Profiler trace proving that the dbcc command was the command executed before tempdb filled up.
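
    As a sketch, disabling that job could look like this (the job name below is the default one the monitor creates; adjust it if yours differs):

    EXEC msdb.dbo.sp_update_job @job_name = N'Database Mirroring Monitor Job', @enabled = 0;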

     

    However the good news is that in SQL Server 2008 we made a design change to collect this data in a far more efficient way. If you look at the results returned from a query of the sys.database_mirroring catalog view in SQL 2008:

     

    select *

    from sys.database_mirroring

     

    you will note 2 columns have been added to the output:

     

    mirroring_end_of_log_lsn

    and

    mirroring_replication_lsn

     

    This means that the information the mirroring monitor requires is now available in a much simpler form. If you look at the SQL 2008 version of the msdb stored procedure sys.sp_dbmmonitorupdate, you will note that the execution of sys.sp_dbmmonitorMSgetthelatestlsn has been replaced with a simple select of the new columns:

     

    select @role = (mirroring_role - 1),
           @status = mirroring_state,
           @witness_status = mirroring_witness_state,
           @failover_lsn = mirroring_failover_lsn,
           @end_of_log_lsn = mirroring_end_of_log_lsn
    from sys.database_mirroring
    where database_id = @database_id

     

    which is simpler, faster, and avoids any possibility of the tempdb problem occurring.

     

  • SQL in Stockholm

    New SQL Blogger from my team


    My hugely experienced colleague Clive, from the SQL Escalation team in the UK, has just started a blog here:

    http://blogs.msdn.com/sqlstuff/default.aspx

    It should be interesting reading as he's quite a guru on matters of clustering and high availability. He's started off with a post about one of my least favourite subjects, setup in SQL 2000 clustered instances :-)

    http://blogs.msdn.com/sqlstuff/archive/2008/10/23/service-pack-hotfix-setup-failures-with-sql2000-clusters.aspx

     

  • SQL in Stockholm

    Non-yielding Scheduler, IO Warnings and Cluster Failover


    Here's another example of SQL Server producing a dump file which can help you diagnose critical problems. The scenario here is that a clustered SQL Server 2005 instance fails over and we need to investigate why.

    A good place to start with such problems (as with many problems) is the SQL Server error logs, specifically the log covering the period leading up to the failure (this will not be the most recent error log, as the SQL Server service restarted when it failed over to the other node). If we do that we see the following:

    2008-10-10 08:43:34.72 spid2s      SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Z:\DATA\MSSQL.1\MSSQL\DATA\tempdb.mdf] in database [tempdb] (2).  The OS file handle is 0x000006e3.  The offset of the latest long I/O is: 0x00000000b80000
    2008-10-10 08:45:44.72 spid2s      SQL Server has encountered 5 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Z:\DATA\MSSQL.1\MSSQL\DATA\templog.ldf] in database [tempdb] (2).  The OS file handle is 0x000006b4.  The offset of the latest long I/O is: 0x00000000080000
    2008-10-10 08:47:01.63 Server      Using 'dbghelp.dll' version '4.0.5'
    2008-10-10 08:49:14.66 Server      ***Unable to get thread context - no pss
    2008-10-10 08:49:14.66 Server      * *******************************************************************************
    2008-10-10 08:49:14.66 Server      *
    2008-10-10 08:49:14.66 Server      * BEGIN STACK DUMP:
    2008-10-10 08:49:14.66 Server      *   10/10/08 08:49:14 spid 0
    2008-10-10 08:49:14.66 Server      *
    2008-10-10 08:49:14.66 Server      * Non-yielding Scheduler
    2008-10-10 08:49:14.66 Server      *
    2008-10-10 08:49:14.66 Server      * *******************************************************************************
    2008-10-10 08:49:14.66 Server      Stack Signature for the dump is 0x00000319

    2008-10-10 08:49:25.51 spid61      Autogrow of file 'templog' in database 'tempdb' took 238266 milliseconds.  Consider using ALTER DATABASE to set a smaller FILEGROWTH for this file.
    2008-10-10 08:49:26.30 Server      External dump process return code 0x20000001.
    External dump process returned no errors.

    2008-10-10 08:49:26.30 Server      Process 2:0:0 (0x11ec) Worker 0x0433C0E8 appears to be non-yielding on Scheduler 7. Thread creation time: 12865910449631. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 99%. Interval: 70001 ms.


    About one minute after these events, the clustered instance fails over to the other node because, according to the cluster log, the IsAlive check has failed.

    Now in this type of situation you have the option of reviewing the dump file yourself, as per my previous post, checking the stacks and seeing if something is obvious, or alternatively sending the dump file to us here at Microsoft and seeing what we can see with it using our private symbols.

    However in this case, there is more information to be had from the error logs themselves, which can be of use to us in this situation.

    1. We are receiving the classic "15 second IO warning" message that was introduced in SQL Server 2000 SP4. Think about what this is actually saying: certain IO requests are taking at least 1000 times longer to complete than a normal acceptable benchmark for IO.

    (When I say acceptable benchmark we could obviously discuss what is acceptable for what disk system, fixed or network storage or whatever, but the value I'm using is 15ms, because it's a nice round number compared to the warning message!)

    SQL Server is an IO-dependent application (it likes memory as well), and with an IO system responding like that you will often notice performance problems or other types of bad behavior.

    2. We can clearly see that we have a non-yielding scheduler, meaning that somewhere inside the SQL Server OS one of our schedulers has completely stopped working.

    3. We are getting the IO warning whilst doing an autogrow of tempdb.

    When I look at these facts I have the following thoughts:

    a) The IO could be a root cause of our problem, but it could also just be an indicator. If tempdb is suddenly growing, what have we attempted to put into tempdb that requires a grow?

    b) Is the IO system working normally at other times, and do we have any benchmarks of what this IO system normally runs like?

    c) Why haven't we grown tempdb in advance at startup, as per standard best practices? Or have we pre-grown it, but it has just tried to grow massively again because the server is involved in some unexpected huge operation which requires a huge amount of tempdb space?

    d) What was the non-yielding scheduler doing? (I needed to look at the dump file in this regard, but I'm excluding the details here as the findings were inconclusive and there is lots of other public information.)

    These are the typical types of questions that you should always be asking yourself and your database teams when scenarios of this nature are encountered.

    The 15 second IO warning message is one that I come across particularly frequently, and it should not be ignored. It can be reflective of all sorts of problems such as:

    Hardware Fault

    Driver Problems

    Overloading of disk system capacity

    Badly performing SQL batches issuing unreasonable amounts of IO requests

    Mismanagement of data and log files

    and the list goes on like this...

    If you are experiencing this issue a lot on your production servers, you need to work with your storage support people to work out why. It can be harmless, just indicative of a sudden burst of load which the disk cannot respond to, but it can also mean many worse things as well! My point here, I guess, is: ignore it at your peril!

    So back to the example in hand anyway....

    Since we can see some significant problems occurring around the time of the dump and the failover, you can investigate these using standard techniques without needing to debug the dump or send it to Microsoft. In this case I would want to analyze tempdb usage and file sizes in detail, and also monitor or review my disk performance baselines.
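
    A sketch of the kind of query I'd start with: sys.dm_io_virtual_file_stats reports cumulative IO stall times per file, so it's a handy complement to the 15 second warnings (this is illustrative, not the exact script used on the case):

    SELECT mf.name,
           mf.size * 8 / 1024 AS size_mb,
           vfs.num_of_reads, vfs.num_of_writes,
           vfs.io_stall_read_ms, vfs.io_stall_write_ms
    FROM sys.master_files AS mf
    JOIN sys.dm_io_virtual_file_stats(DB_ID('tempdb'), NULL) AS vfs
      ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
    WHERE mf.database_id = DB_ID('tempdb');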

    It is not always possible to attribute a non-yielding scheduler to one single simple cause. Sometimes you do find a very obvious problem on the thread that is hung, or a very obvious IO problem, but it is also possible for it to be a combination of complex problems. When you find yourself in this situation you need to take note of the indicators in your log files and consider implementing some proactive monitoring based upon what you see, to try to catch or preempt the problem next time.

    Or just send us a case at Microsoft of course, and get us to look at the dump files :-)

  • SQL in Stockholm

    Non-yielding IOCP Listener - Stack Dump analysis


     

    Here's something I was looking at yesterday which is worth a quick look. It's quite a common scenario and one that's relatively easy to address. Your SQL Server 2005 instance suddenly throws the following to the error log (apparently without warning):

    2008-09-09 08:30:44.810 Server       Using 'dbghelp.dll' version '4.0.5'
    2008-09-09 08:30:44.960 Server       **Dump thread - spid = 0, PSS = 0x0000000000000000, EC = 0x0000000000000000
    2008-09-09 08:30:45.000 Server       ***Stack Dump being sent to F:\MSSQL.1\MSSQL\LOG\SQLDump0001.txt                 
    2008-09-09 08:30:45.000 Server       * *******************************************************************************
    2008-09-09 08:30:45.010 Server       *                                                                                
    2008-09-09 08:30:45.010 Server       * BEGIN STACK DUMP:                                                              
    2008-09-09 08:30:45.010 Server       *   09/09/08 08:30:45 spid 0                                                     
    2008-09-09 08:30:45.010 Server       *                                                                                
    2008-09-09 08:30:45.010 Server       * Non-yielding IOCP Listener                                                     
    2008-09-09 08:30:45.010 Server       *                                                                                
    2008-09-09 08:30:45.010 Server       * *******************************************************************************
    2008-09-09 08:30:45.010 Server       * -------------------------------------------------------------------------------
    2008-09-09 08:30:45.010 Server       * Short Stack Dump                                                               
    2008-09-09 08:30:45.320 Server       Stack Signature for the dump is 0x000000000000020B                               
    2008-09-09 04:47:06.180 spid3s       LazyWriter: warning, no free buffers found.
                 
                            

    Following this, several hundred lines of output in a format similar to the following are shown:

    2008-09-09 08:31:08.750 spid3s      
    Memory Manager
        VM Reserved = 16963064 KB
        VM Committed = 16942872 KB
        AWE Allocated = 0 KB
        Reserved Memory = 1024 KB
        Reserved Memory In Use = 0 KB 
    2008-09-09 08:31:08.750 spid3s      
    Memory node Id = 0
        VM Reserved = 16957304 KB
        VM Committed = 16937200 KB
        AWE Allocated = 0 KB
        SinglePage Allocator = 2519016 KB
        MultiPage Allocator = 39632 KB 
    2008-09-09 08:31:08.750 spid3s      
    MEMORYCLERK_SQLGENERAL (Total)
        VM Reserved = 0 KB
        VM Committed = 0 KB
        AWE Allocated = 0 KB
        SM Reserved = 0 KB
        SM Committed = 0 KB
        SinglePage Allocator = 48576 KB
        MultiPage Allocator = 5136 KB


    You may well recognize that this latter output is basically the same as you would get if you run:
    DBCC MEMORYSTATUS

    You'll also have noted that a small dump file has been produced in the default error log directory of this SQL instance (the path of which is noted in the error log output above).

    In normal circumstances you can submit this dump file to us in support and, all being well, we can take a look at it and tell you what the problem is. The thing to remember here is that the production of a dump file does not necessarily mean that you have encountered a bug in SQL Server (as some people often believe). In more recent versions of SQL Server there are numerous conditions, well known to us, which the engine detects itself and responds to by producing a dump file automatically.

    This particular scenario was caused by inappropriate configuration of SQL Server memory settings, and should be easily rectified, so I thought it a good example to show why.

    Although there is little that can normally be gleaned from debugging a SQL Server dump in the public domain, this one does contain a few hints. We don't publish private symbols for SQL Server, meaning that the amount you can do yourself with a dump file is quite limited (since you only have access to retail symbols), however you can look at the thread stacks and see if anything of note can be seen. Anyone who has read Ken Henderson's fantastic book about SQL Server internals will have used these techniques before. Sometimes it's just interesting to look a little deeper after all.

    I'll assume that you're already familiar with installing and configuring the debugger, but if you're not, here's a good place to start: Debugging Tools and Symbols: Getting Started. This example just uses the retail symbol server address from that page:

    SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    If you open this dump, you'll find yourself starting with the context set to the thread that generated the dump, in this case:

    0:004> kn
    # Child-SP          RetAddr           Call Site
    00 00000000`0474de58 00000000`77d704bf ntdll!NtWaitForSingleObject+0xa
    01 00000000`0474de60 00000000`016be0f9 kernel32!WaitForSingleObjectEx+0x130
    02 00000000`0474df00 00000000`016bdaeb sqlservr!CDmpDump::DumpInternal+0x4a3
    03 00000000`0474dff0 00000000`01c7c7b3 sqlservr!CDmpDump::Dump+0x3b
    04 00000000`0474e040 00000000`01d3af67 sqlservr!CImageHelper::DoMiniDump+0x413
    05 00000000`0474e1a0 00000000`025f8cf5 sqlservr!stackTrace+0x6e7
    06 00000000`0474f6d0 00000000`01a3b025 sqlservr!SQL_SOSNonYieldIOCPCallback+0x1ef
    07 00000000`0476f7e0 00000000`018662b6 sqlservr!SOS_OS::ExecuteNonYieldIOCPCallbacks+0x44
    08 00000000`0476f830 00000000`010b9628 sqlservr!alloca_probe+0x220815
    09 00000000`0476f990 00000000`029002c9 sqlservr!SchedulerMonitor::Run+0x164
    0a 00000000`0476fac0 00000000`010079fe sqlservr!SchedulerMonitor::EntryPoint+0x9
    0b 00000000`0476faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
    0c 00000000`0476fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
    0d 00000000`0476fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
    0e 00000000`0476fd00 00000000`014789ca sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
    0f 00000000`0476fda0 00000000`01478911 sqlservr!SystemThread::RunWorker+0x59
    10 00000000`0476fde0 00000000`014795d9 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
    11 00000000`0476fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
    12 00000000`0476ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
    13 00000000`0476ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84

    This is the thread that is actually doing the monitoring and generating the dump. You can see in frame 02 the name of the function doing this. If you had the private symbols at this point, you could look through some of the earlier frames and determine which thread caused this condition to occur, but I can't show that here :-)

    The thread in question here was actually thread 8; we can switch to it and take a look:

    0:008> kn 50
    # Child-SP          RetAddr           Call Site
    00 00000000`04f6f008 00000000`01750bd0 sqlservr!NumaNode::RampUp
    01 00000000`04f6f010 00000000`012f1523 sqlservr!alloca_probe+0xbf420
    02 00000000`04f6f110 00000000`0103f136 sqlservr!BPool::Steal+0x3ec
    03 00000000`04f6f1c0 00000000`0103f32e sqlservr!SQLSinglePageAllocator::AllocatePages+0x25
    04 00000000`04f6f200 00000000`0103f1ce sqlservr!MemoryNode::AllocatePagesInternal+0xce
    05 00000000`04f6f4a0 00000000`0103f7ba sqlservr!MemoryClerkInternal::AllocatePages+0x6e
    06 00000000`04f6f500 00000000`010d4bb4 sqlservr!CMemThread::PbGetNewPages+0x6a
    07 00000000`04f6f540 00000000`01068bba sqlservr!CFixedPageMgr::PbAllocate+0xcd
    08 00000000`04f6f5b0 00000000`01004d37 sqlservr!CMemThread::Alloc+0x1d7
    09 00000000`04f6f640 00000000`0147dc3a sqlservr!operator new+0x26
    0a 00000000`04f6f680 00000000`01045589 sqlservr!SOS_Node::EnqueueTaskDirectInternal+0x587
    0b 00000000`04f6f730 00000000`01045b03 sqlservr!SOS_Node::EnqueueTask+0x49
    0c 00000000`04f6f780 00000000`01046581 sqlservr!CNetConnection::EnqueuePacket+0x8f
    0d 00000000`04f6f890 00000000`01045cab sqlservr!TDSSNIClient::ReadHandler+0x171
    0e 00000000`04f6f900 00000000`010459bc sqlservr!SNIReadDone+0xb7
    0f 00000000`04f6f960 00000000`010079fe sqlservr!SOS_Node::ListenOnIOCompletionPort+0x251
    10 00000000`04f6faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
    11 00000000`04f6fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
    12 00000000`04f6fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
    13 00000000`04f6fd00 00000000`014789ca sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
    14 00000000`04f6fda0 00000000`01478911 sqlservr!SystemThread::RunWorker+0x59
    15 00000000`04f6fde0 00000000`014795d9 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
    16 00000000`04f6fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
    17 00000000`04f6ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
    18 00000000`04f6ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84
    19 00000000`04f6ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a

    So again, just by looking at the function names, we can make a guess that some internal memory routines related to the buffer pool are at work here, particularly from frames 02 and 03. If you ever attach a debugger to a running SQL Server instance (obviously never in production!) you'd see functions like this occurring regularly as SQL Server manages the buffer pool pages.

    There's not much more that can be seen from this stack, so we'll take a look through the rest of the threads, in this case by running

    ~* kn

    When browsing through these threads, we notice many that also appear to be involved in memory-related events, just one example of which is:

      53  Id: 554.a6c Suspend: 1 Teb: 000007ff`fffd9000 Unfrozen
    # Child-SP          RetAddr           Call Site
    00 00000000`1521e140 00000000`012f1523 sqlservr!BPool::ReplenishFreeList+0x10d
    01 00000000`1521e240 00000000`0103f136 sqlservr!BPool::Steal+0x3ec
    02 00000000`1521e2f0 00000000`0103f32e sqlservr!SQLSinglePageAllocator::AllocatePages+0x25
    03 00000000`1521e330 00000000`0103f1ce sqlservr!MemoryNode::AllocatePagesInternal+0xce
    04 00000000`1521e5d0 00000000`0103f676 sqlservr!MemoryClerkInternal::AllocatePages+0x6e
    05 00000000`1521e630 00000000`0103f737 sqlservr!CVarPageMgr::PviNewVarPage+0x56
    06 00000000`1521e6a0 00000000`0100ca58 sqlservr!CVarPageMgr::PbAllocate+0x3ce
    07 00000000`1521e710 00000000`01004d37 sqlservr!CMemObj::Alloc+0xcc
    08 00000000`1521e780 00000000`0109b203 sqlservr!operator new+0x26
    09 00000000`1521e7c0 00000000`014607a1 sqlservr!operator new[]+0x23
    0a 00000000`1521e800 00000000`0145d4c8 sqlservr!CErrorReportingManager::CwchFormatToBuffer+0x180
    0b 00000000`1521e970 00000000`0128d202 sqlservr!CErrorReportingManager::CwchCallFormatToBuffer+0x18
    0c 00000000`1521e9b0 00000000`0128d0f8 sqlservr!CLoginClientInfo::CLoginClientInfo+0x71
    0d 00000000`1521ea30 00000000`01285265 sqlservr!IssueLoginSuccessReport+0x67
    0e 00000000`1521eb40 00000000`0128494f sqlservr!login+0x584
    0f 00000000`1521f480 00000000`01285610 sqlservr!process_login+0xee
    10 00000000`1521f840 00000000`010079fe sqlservr!process_commands+0x40d
    11 00000000`1521faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
    12 00000000`1521fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
    13 00000000`1521fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4

    We note that sqlservr!BPool::ReplenishFreeList is common to a lot of the threads, and other BPool functions crop up frequently as well.

    At this point there is not much else to be gleaned from the dump without private symbols, however it's clear that there appears to be a significant number of threads waiting on what appear to be internal memory routines. In this case, we also have the fact that SQL Server itself has detected that it believes that memory allocations were the cause of the problem, which is why it has dumped out the DBCC MEMORYSTATUS to help us. The error log specifically states that it has "no free buffers".

    If you were to look through the whole of the DBCC MEMORYSTATUS and decode some of the latter items using this article as a guide

    http://support.microsoft.com/kb/907877

    then you would note that the buffer pool has recently been grown, and that it is issuing a notification that it wishes to grow further.

    So what next, is the question. You've noted that SQL Server appears to be having memory management problems, so maybe you think you should add some more memory to alleviate this? Well, in a word: no.

    The answer to this problem is that we have to look at the configuration of the hardware, and the configuration of SQL Server first (which in any troubleshooting scenario is always a good idea).

    What we saw here at the time of the dump was:

    MemoryLoad = 95%                    
    Total Physical = 16382 MB           
    Available Physical = 792 MB    

    And in fact over the minutes preceding the dump we saw even lower memory availability than this, with physical utilization up to 98%. (This is something that we can extract from the dump with private symbols.)

    So we do appear short of physical memory, but when we run sp_configure on the server, we see this:

    name                                minimum     maximum     config_value run_value
    ----------------------------------- ----------- ----------- ------------ -----------
    max server memory (MB)              16          2147483647  2147483647   2147483647

    Therefore the max server memory setting has not been set. Although SQL Server will dynamically attempt to manage its memory, on large-memory installations an uncapped buffer pool can end up fighting with other processes or the OS for memory, and situations like this one can occur. This is why we recommend setting "max server memory" in this type of scenario, to limit the size of the buffer pool to an amount lower than the total memory on the server itself. The most referenced article on this is probably here:

    http://blogs.msdn.com/slavao/archive/2006/11/13/q-a-does-sql-server-always-respond-to-memory-pressure.aspx

    We see numerous cases revolving around this scenario and derivatives of it, and many can simply be avoided by limiting the memory of the SQL Server buffer pool with this setting, so that you never get into this memory-bound situation. So the bottom line here is to change the max server memory setting to something around the 13000 MB mark, and then monitor your memory usage to see if you're still encountering this type of problem, making further adjustments if necessary.
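
    For reference, a minimal sketch of making that change (13000 is just the figure discussed above; pick a value appropriate to your server):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 13000;
    RECONFIGURE;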

    If you have any questions surrounding this topic, feel free to post them.

  • SQL in Stockholm

    Kayaking and tempdb file placement


    So, no posts for 10 weeks, but that's because sometimes there are other things going on in Stockholm as opposed to SQL Server support :-) Here's what I've been doing over the summer....

    (Photo: kayaking on Brunnsviken)

    I've been fortunate enough to have several weeks paternity leave off to spend with the family and enjoy the Swedish summer. But I'm back to the technical stuff this week and of course to the release of SQL 2008.

    This week I've been working on some proactive advisory cases with Swedish customers, reviewing their installations for best practices and providing some tips and tricks. We do this quite frequently for Premier customers and it's good for all involved. Customers get some of their servers checked out in advance of any potential technical problems, and I get to stop things happening in advance, as opposed to my normal "fix it urgently" role :-)

    I looked back over my past few reports of this nature today, and I thought I'd mention the one point that has come up in every review I've done this year. This may just be coincidence, but maybe the word hasn't spread as well as it could have.

    In SQL 2005 there were a significant number of enhancements to tempdb, from both an implementation and a supportability standpoint. We now recommend that by default you create one tempdb data file per visible scheduler within SQLOS, or in other words (for most people) one per logical processor core that you expose or affinitize to SQL Server. Then test and benchmark the results and do some fine tuning if need be.

    Remember, you can look at the schedulers that have been created by querying the following DMV:

    select *

    from sys.dm_os_schedulers
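
    As a sketch, you can count just the visible ones, which is the number that matters for the tempdb file recommendation above:

    SELECT COUNT(*) AS visible_schedulers
    FROM sys.dm_os_schedulers
    WHERE status = 'VISIBLE ONLINE';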

    Here are the 2 main articles that I always point customers to on this tempdb issue:

    http://www.microsoft.com/technet/prodtechnol/sql/2005/workingwithtempdb.mspx

    http://msdn.microsoft.com/en-us/library/ms175527(SQL.90).aspx

    I'll be back in coming weeks with "proper troubleshooting".

  • SQL in Stockholm

    SQL Server Performance Assessment – What are my top 20 expensive batches


    Previously here, I discussed the nature of collecting data when you are experiencing a SQL Server performance problem. A performance problem sometimes intimidates people, as they are not sure where to start. This post details one of my preferred ways to take a quick overview of the situation.

    Let's assume that, using one of the techniques described, you have some SQL Profiler data. One of the first things that I like to do is aggregate the data into a database so that I can then simply query it with TSQL, both for an overview of aggregated problems and to fetch details of individual batches quickly and easily.

    My current preferred method of doing this is to use the readtrace tool that installs as part of RML Utilities. You might choose to run readtrace.exe from the command line (as I do), or you might prefer to also install SQLNexus and set the options there to have it run readtrace for you. Either way at the end of it you’ll be left with a database that contains the readtrace schema.

    What I like to discover first off is the answer to the question:

    “Are there actually SQL batches that are taking a long time to run?”

    This may sound obvious, but considering the number of different types of performance problems, and the different sub sets of data you might want to analyze, this provides an immediate split of the problem into separate areas. With this data you can follow a line of questioning like this:

    1. Do I have long running SQL batches?
    2. If yes – is it the same batches that are long running all the time?
    3. Is it a small number or even a single batch that is long running?
    4. Can I recognize a pattern to this behavior? Is it the result of a recent application change?
    5. Is this a crazy query being issued manually by a user?

    And so on… the point being that if the answer to the first question is "no", then maybe you should be looking elsewhere within your application architecture for the problem, or maybe you should be collecting additional or different diagnostics. There's no hard and fast rule here, but it's always nice to have some sort of idea of what direction you might wish to proceed in.

    Consider the following example. You make a small change to a previously healthy application and suddenly all your users start complaining of unbearably slow performance. You collect the profiler data and run it through readtrace. When you query the readtrace data you find that a single SQL batch is always at the top of every list of the most expensive batches, under every category. This is a stored procedure which acts as a common entry point for all application functionality. Knowing this you can now work out that every user is experiencing application slowness because this common stored procedure always runs slowly. You can now focus on why this specific piece of code is going slow and deal with it as necessary.

    So anyway, the script is attached to this post. It’s based upon one of the old scripts which we used to use internally before the advent of SQLNexus, and has been amended by me to suit my requirements. I often use it to send customers summaries of their badly performing code by email, and so it’s designed to have results output to text. This is also why it only prints the initial 100 characters of each batch, but you can change this easily enough if you have long SQL batches.
    You might note that it only does batches, and not statements. This is because, when I have a random performance problem, I don't tend to collect statement level data in Profiler in the first instance, as it's much more expensive to collect and can sometimes exaggerate the situation. I work on the basis of taking a summary collection and then going back for more targeted data if need be. readtrace does know how to analyze statement level data if you collect it (in fact you'll note that if you don't collect it, it warns you), so you can always write your own version for statement level data by following this principle and looking at the readtrace schema which comes with the documentation.
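
    To give a feel for the idea, here is a rough sketch of the kind of aggregation the attached script performs (this is not the attached script itself, and the table and column names are assumed from the ReadTrace schema documentation, so check them against your version):

    SELECT TOP 20
           LEFT(ub.NormText, 100) AS BatchText,
           COUNT(*)               AS Executions,
           SUM(b.Duration)        AS TotalDuration,
           SUM(b.Reads)           AS TotalReads,
           SUM(b.CPU)             AS TotalCPU
    FROM ReadTrace.tblBatches AS b
    JOIN ReadTrace.tblUniqueBatches AS ub ON ub.HashID = b.HashID
    GROUP BY LEFT(ub.NormText, 100)
    ORDER BY TotalDuration DESC;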

    Finally, you might ask: why not do this with DMVs and look at the data in real time? Well, you can, and sometimes I would, but that's not what this post is about! There are both advantages and disadvantages to this method, and I do have an archive of DMV queries which I'm ready to run should I ever do online troubleshooting of this nature. These are the type of queries which you can also see if you download the performance dashboard reports. But as I say, I really like readtrace because once you have the data in this format, you can run your stock queries like the one attached here, and if you find something out of the ordinary, you can then write something custom to analyze the data in more detail.

  • SQL in Stockholm

    SQL Server Assertions and Dumps – a typical example and some proactive help


    I recently worked with a customer who was seeing the following error in their error log:

    SQL Server Assertion: File: <qxcntxt.cpp>, line=1069 Failed Assertion = 'cref == 0'
    Could not find an entry for table or index with partition ID 491897996509184 in database 2

    As you would expect a mini dump was produced at the same time, which detailed the stored procedure that was being run when the assertion occurred.

    In the case of such errors I would always advise you to contact us here at PSS so that we can analyze the dump and provide an explanation or solution. However this particular case reminded me of an interesting side issue where you can proactively help yourself, before contacting PSS.

    You should consider running SQL Server Best Practices Analyzer on your server. You can download it from here. We regularly see cases where running the tool could have identified known issues in advance. When you run the tool initially, it checks for updates to the tool itself, but more importantly it checks for updated configuration files. For your reference (and if you're nosey like me) you can look at the configuration file it has downloaded here:

    C:\Program Files (x86)\Microsoft SQL Server 2005 Best Practices Analyzer\EN\SqlBPA.config.XML

    (remove the (x86) from the directory path if you don't have a 64-bit installation of Windows)

    Within PSS nowadays we add rules to this tool when we find interesting new issues or common causes of the problems that our customers experience. If you run the tool regularly you'll pick up our latest updates automatically, the things that we in PSS think are important for you to check. A large number of the support cases that I work on involve known configuration issues and the like which could be avoided, or at least highlighted, by this tool.

    The reason that I mention this now is that the tool currently contains a rule for a known issue with a driver from Dell which can cause random data corruptions which show up as assertions and errors similar to the one above.

    For reference, the driver in question is LSI_SAS.SYS with a version lower than 1.21.26.01. If you notice that you are running an environment with the matching configuration (more than 4GB of system memory, a server with a SAS backplane, Microsoft Windows 2000 or 2003, and Dell OpenManage Server Administrator Storage Management Software) and are therefore susceptible to this problem, you can download the update which resolves it here.

    It actually turned out that this was not the cause in this case, but it’s always good to check these things, and the BPA is a great way of doing it.

    The key to actually finding the cause of this particular error was to note the rest of the SQL Server error logs, which were littered with errors 602, 676, and 677. For example:

    Error: 602, Severity: 21, State: 30.
    Could not find an entry for table or index with partition ID 491897996509184 in database 2. This error can occur if a stored procedure references a dropped table, or metadata is corrupted. Drop and re-create the stored procedure, or execute DBCC CHECKDB. 

    Error 676
    Error occurred while attempting to drop worktable with partition ID 491897996509184

    Error 677
    Unable to drop worktable with partition ID 491897996509184 after repeated attempts. Worktable is marked for deferred drop. This is an informational message only. No user action is required.

    In themselves, these errors are not completely conclusive, as further memory dumps may be produced, which again would also need to be analyzed. However, sometimes they are caused by this problem: http://support.microsoft.com/default.aspx?scid=kb;EN-US;937343

    If you are getting errors such as these, or others like them, you should consider what the errors themselves recommend. You should run DBCC CHECKDB to check your data integrity (a data corruption can cause all manner of different assertions depending upon its nature) and you should check your hardware for possible problems as well.
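
    As a minimal example of that check (the database name is a placeholder; WITH NO_INFOMSGS just keeps the output manageable on larger databases):

    DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS;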

    I guess what I'm trying to say in conclusion is that there doesn't have to be one single simple cause behind particular errors such as these. The basics still always apply:

    You should run DBCC CHECKDB regularly (this is another BPA rule; it will highlight it if you haven't)
    You should keep an eye on the overall health of your environment, especially the storage
    You should keep an eye on your SQL Server error logs and consider all the events together.

    The answer to this actual problem? Well it was the hotfix noted above, but we had to collect some dumps to prove this definitively, and that’s something that you do need PSS involvement with.

  • SQL in Stockholm

    Slow recovery times and slow performance due to large numbers of Virtual Log Files


    In my previous post about database mirroring configuration problems caused by large numbers of virtual log files, I mentioned that such a scenario can have other unwanted side effects, with or without database mirroring being enabled. Here's an example that demonstrates one of these scenarios. It shows how recovery times after a disaster can be significantly extended by allowing your log file to grow in small increments, thereby creating thousands of VLFs. It's worth noting as well that this scenario also demonstrates the performance impact of doing online file growth during periods of heavy activity. This scenario can affect you with or without mirroring enabled; it's just quicker to demonstrate without having to configure mirroring.

    I mentioned before that to fully understand this topic you need to be familiar with Transaction Log Physical Architecture and also Factors That Can Delay Log Truncation.

    There is one simple lesson to be learnt here, so I’ll write it early on in the article:

    Pre-grow your log files (and your data files) in advance. Avoid repeated online auto-growth if at all possible.
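
    If you already have a log with thousands of VLFs, a common remedy (a sketch, not part of the original demo; the logical file name vlf_log matches the scripts below) is to shrink the log and then grow it back in one large step rather than in many small increments:

    USE vlf;
    --the log must be largely inactive for the shrink to reclaim space
    DBCC SHRINKFILE (vlf_log, 1);
    --then grow it back to the target size in a single step
    ALTER DATABASE vlf MODIFY FILE (NAME = vlf_log, SIZE = 5GB);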

    Consider the following example scripts which are based upon the same routines as used in the previous post, but with some small changes.

    These scripts do the following actions, one for a good scenario and one for the bad scenario.

    1. Create a database named VLF

    2. Create a starting backup point

    3. Open a transaction in this database to ensure that the transaction cannot be truncated

    4. Fill up the log with thousands of transactions until the log reaches the space limit set for it – in this case we test with a 5GB log. The test of the log running out of space is used just to guarantee log files of equivalent sizes for the recovery time comparison. This situation holds true for logs that haven’t run out of space.

    5. Restart the server without taking a backup or dealing with the log space. This forces SQL to recover the entire contents of the active portion of the log, which is in effect, all of it.

    We then just start the SQL Server service again and monitor the recovery times and look in the SQL Server error logs.

    We notice a number of interesting points here:

    1. The time required to fill up the 5GB log file is significantly shorter when using the pre-grown log file: less than 2 minutes for the pre-grown log file compared to over 9 minutes for the incrementally auto-grown file. This demonstrates why auto-growths can be bad: they are expensive to perform and can harm your performance. The fact that the log fills more quickly indicates that you are pushing transactions through faster.

    2. There is a delay of about 2 minutes during which nothing is logged about the recovery phases of the VLF database in the bad scenario. This is the period between the database starting up and the analysis phase beginning. In reality there is a large amount of work going on here: SQL Server is processing the 20480 VLFs that are present in the log file. This delay grows in proportion to the number of VLFs that you allow your log file to hold. I have seen real world examples where hundreds of thousands of VLFs have significantly delayed a recovery time.

    3. The actual recovery times are similar (about 2 minutes) once recovery actually begins. The recovery takes this long because we created the old open transaction in the test, meaning that SQL Server is forced to recover the entire log.

    4. As well as filling the log significantly faster (in terms of runtime), which indicates that transactions are being processed faster, when we look at the log in detail we see that we processed about 6500 more transactions in the shorter period. Further evidence of how much better performance the pre-grown log file gives you: we've managed to process 25% more transactions in a quarter of the time!

    The scripts and error log output are shown below. The scripts use the undocumented DBCC LOGINFO command to examine the number of virtual log files in the physical log. Remember, as with all undocumented commands, it is not supported and you use it at your own risk. The number of rows that this command outputs represents the number of VLFs currently in the physical log.

    SCRIPT 1 – Bad scenario

    --CONNECT to the test server instance

     

    use master

    go

     

    if exists (select name from sys.databases where name = 'vlf')

    begin

    drop database vlf --(if you need to)

    end

     

     

    --create a database on the root of C with a small starting log size

    --which then grows in small increments

    create database vlf

    on (name=vlf, filename='c:\vlf.mdf')

    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1MB,maxsize=5GB,filegrowth=1MB)

    go

     

    use vlf

    go

     

    --create a dummy table to fill up

    create table vlf_test

    (

    a int identity(1,1) not null,

    b char(8000) not null,

    c int not null

    )

    go

     

    insert vlf_test values ('vlftest',0)

    go

     

    --create a starting point of backups

    backup database vlf to disk = 'c:\vlf.bak' with stats = 1

    go

    backup log vlf to disk = 'c:\vlf1.bak'

    go

     

    --open a 2nd new connection and open an explicit uncommitted transaction

    --to ensure that the transaction log cannot truncate

     

    use vlf

    go

     

    begin tran

    create table opentran

    (

    opentran int

    )

    Go

     

    --switch back to the first connection

    --now run the following to grow the log to create lots of VLFs

    --run until the log runs out of space - a convenient stopping point

     

    use vlf

    go

    set nocount on

    go

     

    declare @x int

    declare @a char(8000)

    declare @b char(8000)

    set @a = replicate('a',8000)

    set @b = replicate('b',8000)

    set @x = 1

    while 1=1

    begin

    update vlf_test set b = @b, c = @x where a = 1

    update vlf_test set b = @a where a = 1

    set @x = @x + 2

    end

    go

     

    --let this run until the log runs out of space

    --on the 2nd connection, use dbcc loginfo to check how many VLFs you have

     

    DBCC LOGINFO (vlf)

     

    --when you have reached 5GB of log file with several thousands VLFs

    --simulate a server disaster by cutting the power to the principal server

    --alternatively you could just power off your server ungracefully

     

    shutdown with nowait

     

    --now restart the SQL service or power on the server

    I get the following results from the error log when I run this on my test server.

    Runtime of filling the log: 9 minutes 21 seconds

    Number of VLFs when log full: 20480

    Time for VLF to recover fully: 3 minutes 40 seconds

    Excerpts from the SQL Server log:

    Server starts:

    2008-05-16 13:22:40.85 Server Microsoft SQL Server 2005 - 9.00.3186.00 (X64)

    VLF db starts up

    2008-05-16 13:22:47.97 spid18s Starting up database 'vlf'.

    A 2 minute pause follows during which nothing is logged about the vlf database, then analysis starts. It is during this period that SQL Server is processing the large number of VLFs.

    2008-05-16 13:24:34.28 spid18s Analysis of database 'vlf' (19) is 3% complete (approximately 8 seconds remain). This is an informational message only. No user action is required.

    Redo has started

    2008-05-16 13:24:45.80 spid18s Recovery of database 'vlf' (19) is 3% complete (approximately 108 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

    Note how many transactions have been rolled forward – this is how many we did in the test

    2008-05-16 13:26:21.36 spid18s 26533 transactions rolled forward in database 'vlf' (19). This is an informational message only. No user action is required.

    Undo has started

    2008-05-16 13:26:21.39 spid18s Recovery of database 'vlf' (19) is 99% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

    Recovery complete

    2008-05-16 13:26:21.49 spid5s Recovery is complete. This is an informational message only. No user action is required.

    SCRIPT 2 – Good scenario

    The script is identical to the bad scenario shown above, apart from the fact that we create the log at a pre-expanded size of 5GB, using this command:

    create database vlf

    on (name=vlf, filename='c:\vlf.mdf')

    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=5GB,maxsize=5GB,filegrowth=1MB)

    This gives the following results on the identical server.

    Runtime of filling the log: 1 minutes 54 seconds

    Number of VLFs when log full: 16

    Time for the vlf database to recover fully: 2 minutes 2 seconds

    Excerpts from the SQL Server log:

    Server starts:

    2008-05-16 13:40:16.10 Server Microsoft SQL Server 2005 - 9.00.3186.00 (X64)

    VLF db starts up

    2008-05-16 13:40:22.64 spid18s Starting up database 'vlf'.

    A 6 second pause follows during which nothing is logged about the vlf database, then analysis starts. It is during this period that SQL Server is processing the VLFs, but in this case, because there are only 16, it completes much more quickly. This is the fundamental difference between the examples.

    2008-05-16 13:40:28.99 spid18s Analysis of database 'vlf' (19) is 2% complete (approximately 157 seconds remain). This is an informational message only. No user action is required.

    Redo has started

    2008-05-16 13:40:44.22 spid18s Recovery of database 'vlf' (19) is 4% complete (approximately 106 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

    Note how many transactions have been rolled forward – this is how many we did in the test

    2008-05-16 13:42:18.69 spid18s 33216 transactions rolled forward in database 'vlf' (19). This is an informational message only. No user action is required.

    Undo has started

    2008-05-16 13:42:18.69 spid18s Recovery of database 'vlf' (19) is 99% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

    Recovery complete

    2008-05-16 13:42:18.89 spid5s Recovery is complete. This is an informational message only. No user action is required.
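    As an aside, if you want to pre-grow the log of an existing database rather than specifying the size at create time, a minimal sketch (assuming the demo database above, and that the target size is larger than the current log size) would be:

    --sketch only: pre-grow an existing log in a single step rather than relying on autogrow
    alter database vlf
    modify file (name = vlf_log, size = 5GB)
    go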

    I hope this demo gives further weight to the case for pre-growing your SQL Server files in advance. Feel free to post a comment if you have a question.

  • SQL in Stockholm

    1413 error when starting Database Mirroring – How many virtual log files is too many?

    • 1 Comments

    Recently in Cumulative update package 6 for SQL Server 2005 Service Pack 2 we released a fix for a problem where you receive a 1413 error when starting database mirroring. The error that you get looks like this:

    Error: 1413, Severity: 16, State: 1.

    Communications to the remote server instance 'TCP://ComputerName:PortNumber' failed before database mirroring was fully started. The ALTER DATABASE command failed. Retry the command when the remote database is started.

    In addition to this, if you look in the error logs themselves you will normally see the following 2 errors reported as well:

    Error: 1443, Severity: 16, State: 2.

    Database mirroring has been terminated for database 'xxx'. This is an informational message only. No user action is required.

    And you will also see:

    Error: 1479, Severity: 16, State: 1.

    The mirroring connection to "TCP://ComputerName:PortNumber" has timed out for database "xxx" after 10 seconds without a response. Check the service and network connections.

    Although you can install the hotfix, I think it’s worth considering why this problem actually occurs and how you can potentially avoid it happening at all through good database management practices, which have benefits far beyond the configuration of database mirroring.

    Hopefully you are familiar with the architecture of a SQL Server transaction log file. In principle, the primary reason that this 1413 error and its associated timeout errors occur is that when you start database mirroring, SQL Server needs to read through the transaction log file to bring the 2 databases into a synchronized state, following the restore of the last transaction log backup to the mirror partner. The time it takes to do this is dependent upon the number of virtual log files contained within the physical log file. This sometimes leads to confusion, as a relatively small physical log file can contain thousands of virtual log files if it’s been badly managed. This in turn means that a relatively small physical log file (say 1GB) can be the cause of this 1413 error, while the exact same database configured for mirroring on the exact same hardware will set up successfully when it has an identically sized physical log that has been better managed and doesn’t contain so many virtual log files.

    To demonstrate this, try the following reproduction examples. To run these you’ll need 2 SQL Server instances with mirroring endpoints already set up (I’ll assume if you’re reading this far that you already understand this). A good place to try this repro is a test server environment where you already have a mirroring partnership configured. The repros use the undocumented DBCC LOGINFO command to examine the number of virtual log files in the physical log. Remember as with all undocumented commands, use of it is not supported, and you use it at your own risk.
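    If you’d rather see the VLF count as a single number than count the DBCC LOGINFO rows by eye, a minimal sketch like the following works; the column layout is an assumption based on SQL Server 2005, so treat it as illustrative only:

    --sketch only: capture DBCC LOGINFO output and count the rows (one row per VLF)
    --column layout assumed for SQL Server 2005; later versions add columns
    create table #vlfinfo
    (FileId int, FileSize bigint, StartOffset bigint, FSeqNo int, Status int, Parity tinyint, CreateLSN numeric(25,0))
    insert #vlfinfo exec ('dbcc loginfo (''vlf'')')
    select count(*) as vlf_count from #vlfinfo
    drop table #vlfinfo
    go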

    The example uses a common scenario where a log has been left to grow and runs out of space, but actually there are many different ways that you can cause your log to grow. Some are documented here. This is just a quick way to run the demo.

    Script 1: produces the 1413 failure – note the thousands of VLFs

    --CONNECT to the primary server instance

    use master

    go

     

    drop database vlf --(if you need to)

     

    --create a database on the root of C with a small starting log size

    --which then grows in small increments

    create database vlf

    on (name=vlf, filename='c:\vlf.mdf')

    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1MB,maxsize=1GB,filegrowth=1MB)

    go

     

    use vlf

    go

     

    --create a dummy table to fill up

    create table vlf_test

    (

    a int identity(1,1) not null,

    b char(8000) not null,

    c int not null

    )

    go

     

    insert vlf_test values ('vlftest',0)

    go

     

    --create a starting point of backups

    backup database vlf to disk = 'c:\vlf.bak' with stats = 1

    go

    backup log vlf to disk = 'c:\vlf1.bak'

    go

     

    set nocount on

    go

     

    --grow the log until it runs out of space

    declare @x int

    declare @a char(8000)

    declare @b char(8000)

    set @a = replicate('a',8000)

    set @b = replicate('b',8000)

    set @x = 1

    while 1=1

    begin

    update vlf_test set b = @b, c = @x where a = 1

    update vlf_test set b = @a where a = 1

    set @x = @x + 2

    end

    go

     

    --now alter the log size and backup the log

    alter database vlf modify file (name=vlf_log, maxsize=2GB)

    go

    backup log [vlf] to disk = 'c:\vlf2.bak' with stats=1

    go

     

    --use the undocumented DBCC LOGINFO command to check the number of VLFs in our log

    dbcc loginfo (vlf)

    go

    --note the number of VLFs, it should be several thousand

    --for me it was 4100

     

    --connect to the other server which you wish to be the mirror

    --copy the 3 backup files across to its C drive and run the following

    use master

    go

     

    restore database [vlf] from disk = 'c:\vlf.bak' with norecovery, replace, file=1,

    move 'vlf' to 'c:\vlf.mdf',

    move 'vlf_log' to 'c:\vlf_log.LDF'

    go

    restore log [vlf] from disk = 'c:\vlf1.bak' with norecovery

    go

    restore log [vlf] from disk = 'c:\vlf2.bak' with norecovery, stats=1

    go

     

    --you now have a pair of databases ready to have mirroring started

    --try to start database mirroring using the SSMS GUI and you will receive the 1413 error

    --the only exception to this might be if you run it on particularly fast hardware

    Script 2: works fine - identical except that the log has been pre-grown to the same size. This is good practice for all database files, whatever their type.

    --CONNECT to the primary server instance

    --remember to delete any old backup files of the same name from previous tests

    use master

    go

     

    drop database vlf --(if you need to)

     

    create database vlf

    on (name=vlf, filename='c:\vlf.mdf')

    log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1GB,maxsize=1GB,filegrowth=250MB)

    go

     

    use vlf

    go

     

    create table vlf_test

    (

    a int identity(1,1) not null,

    b char(8000) not null,

    c int not null

    )

    go

     

    insert vlf_test values ('vlftest',0)

    go

    backup database vlf to disk = 'c:\vlf.bak' with stats = 1

    go

    backup log vlf to disk = 'c:\vlf1.bak'

    go

    set nocount on

    go

     

    --this will grow the log until it runs out of space

    declare @x int

    declare @a char(8000)

    declare @b char(8000)

    set @a = replicate('a',8000)

    set @b = replicate('b',8000)

    set @x = 1

    while 1=1

    begin

    update vlf_test set b = @b, c = @x where a = 1

    update vlf_test set b = @a where a = 1

    set @x = @x + 2

    end

    go

     

    --now alter the log size and backup the log

    alter database vlf modify file (name=vlf_log, maxsize=2GB)

    go

    backup log [vlf] to disk = 'c:\vlf2.bak' with stats=1

    go

     

    --use the undocumented DBCC LOGINFO command to check the number of VLFs in our log

    dbcc loginfo (vlf)

    go

    --note the number of VLFs, it should be only about 16 or so

     

    --connect to the other server which you wish to be the mirror

    --copy the 3 backup files across to its C drive and run the following

     

    restore database [vlf] from disk = 'c:\vlf.bak' with norecovery, replace, file=1,

    move 'vlf' to 'c:\vlf.mdf',

    move 'vlf_log' to 'c:\vlf_log.LDF'

    go

    restore log [vlf] from disk = 'c:\vlf1.bak' with norecovery

    go

    restore log [vlf] from disk = 'c:\vlf2.bak' with norecovery, stats=1

    go

     

    --you now have a pair of databases ready to have mirroring started

    --try to start database mirroring using the SSMS GUI and it should work

    --the only exception to this might be if you run it on particularly slow hardware

    I ran my test on 2 small single processor servers. You might get slightly different results as the scripts note, depending upon your exact configuration. I used SQL Server 2005 build 3186 (CU#3) for this test.

    My main point here is that you should look to control your transaction log proactively. There are obviously times when this is not possible, or other mitigating circumstances arise, which is why we have released a hotfix to improve the speed with which the database mirroring configuration reads the log files. This improvement is also included in SQL Server 2008. However there are other issues that you can encounter if you let your virtual log files grow unchecked, such as excessive recovery times, so as with all SQL Server file management tasks, proactive is best. We always recommend that our customers analyze their expected usage and pre-grow their database files.
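    If you have inherited a database whose log has already fragmented into thousands of VLFs, one commonly used approach is to back up the log, shrink the log file right down, and then grow it back to its target size in a single step. The sketch below assumes the demo database from above, the full recovery model, a hypothetical backup path and a 5GB target size; adjust all of these to your own environment:

    --sketch only: reduce the VLF count of an existing log, then re-grow it in one operation
    --the backup path and the 5GB target size are illustrative values
    backup log vlf to disk = 'c:\vlf_maintenance.bak'
    go
    use vlf
    go
    dbcc shrinkfile (vlf_log, 1)
    go
    alter database vlf
    modify file (name = vlf_log, size = 5GB)
    go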

    I hope this helps to explain the database mirroring error in more depth, but feel free to post a comment if you have a question.

  • SQL in Stockholm

    Performance Problems in SQL Server - What data should I collect?

    • 5 Comments

    If you ever talk to me or any of my colleagues in the SQL Server PSS support team, and you have a suspected SQL Server performance problem, you’ll undoubtedly be asked to run the PSSDIAG tool which usually collects the data that we need. This tool acts as a one stop shop for us to pick up a variety of data sets, including SQL Profiler, Performance Monitor, SQL Server configuration information, SQL Server logs, blocking scripts and many more custom scripts of our choosing. We published a version of the tool for SQL Server 7 and 2000 under http://support.microsoft.com/kb/830232 . Although we still maintain a version for SQL Server 2005, the tool was such a success that it began to be incorporated into the core product itself thereby enabling people to analyze their own performance problems more easily, or collect the data that PSS would need to help them.

    Nowadays the public version of the tool is named SQLDIAG and installs by default with all SQL Server 2005 installations. You can read about its options in Books Online http://msdn2.microsoft.com/en-us/library/ms162833.aspx.

    It works very simply in default mode; try it now on a test box: simply open a command prompt and switch to this directory

    C:\Program Files\Microsoft SQL Server\90\Tools\binn

    And run the following:

    sqldiag.exe

    You’ll see something like the following

    [Screenshot: sqldiag.exe console output while the collector is running]

    By default it connects to the local machine name, and every SQL 2005 instance it can find. There are many switches and options and a detailed configuration file that can be used to tailor the process (more on this later) but if you’re experiencing a problem and you want to grab some data super quickly without thinking too much, you can just do this. The collector is now writing out data to a sub-directory named sqldiag, and as it says, when you’re ready to stop, simply press Ctrl+C. The collector might take a few minutes to shut down depending upon the speed of your server. When it’s finished, take a look in the sqldiag sub-directory and you’ll see a collection of files. By default you’ll have collected the following:

    Default Trace files (if server option enabled)

    sqldiag output text file (contains sql error logs, configuration information, internal performance counters, the output from several key DMVs and much more)

    msinfo32.txt

    sqldumper error log

    log and control files for the sqldiag process.

    I’ll talk about interpreting this data at another time, but you’ve already collected some interesting data about your SQL Server instance and what it was up to whilst the collector was running.
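    To give you a flavour of the kind of DMV data that ends up in that output file, here is a hedged sketch of the sort of snapshot you could take yourself; these are illustrative queries only, not the exact ones the collector runs:

    --sketch only: illustrative DMV queries, similar in spirit to the data in the sqldiag output
    select top 10 wait_type, waiting_tasks_count, wait_time_ms
    from sys.dm_os_wait_stats
    order by wait_time_ms desc
    go
    select session_id, status, command, wait_type, blocking_session_id
    from sys.dm_exec_requests
    where session_id > 50
    go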

    What you should also notice is that when you ran the tool in this mode for the first time, it unpackaged several .XML files which it uses for configuration purposes. These are as follows:

    SQLDiag.XML

    SD_Detailed.XML

    SD_General.XML

    By default if you don’t use the /I switch (to specify an input file) the tool uses the SQLDiag.XML config file and you get the results as above. The true power of SQLDiag in my opinion though is in amending these config files and collecting data sets of your choice. The extra config files automatically created have the following collections turned on:

    SD_Detailed.XML

    Windows Event Logs

    Perfmon

    SQL Profiler (including several verbose events such as performance statistics, SQL:BatchStarted and SQL:BatchCompleted)

    SD_General.XML

    Windows Event Logs

    Perfmon

    SQL Profiler (with less verbose events)

    Within both the perfmon and SQL Profiler events, you have the ability to configure every potential sub event on or off, based upon the situation that you wish to monitor. All you do is edit a copy of the XML file and change individual elements like these from “true” to “false” and vice versa.

    <PerfmonObject name="\MSSQL$%s:Buffer Manager" enabled="true">
    <PerfmonCounter name="\*" enabled="true" />
    <PerfmonCounter name="\Buffer cache hit ratio" enabled="true" />
    
    <EventType name="Locks">
    <Event id="148" name="Deadlock Graph" enabled="true" description="Occurs when an attempt to acquire a lock is ......" />

    In many cases though, one of the default templates will contain enough info for you to make an initial assessment of the situation. To run sqldiag with one of the other config files (or one you create yourself) use the following syntax:

    sqldiag.exe /ISD_General.XML

    and you might also want to add a new output directory:

    sqldiag.exe /ISD_General.XML /Oc:\sqldiagoutput

    One thing that the default configs don’t enable though is the blocking collector, which I do think is particularly useful. To enable this collector you need to create a copy of the SQLDiag.XML and edit the following section:

    <BlockingCollector enabled="false" pollinginterval="5" maxfilesize="350"/>

    Changing the BlockingCollector enabled value to “true” will turn on the blocking collector, and provide a blocked process report for any blocking experienced whilst the collector is running. This is a particularly common cause of application performance issues and well worth enabling.
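    If you just want a quick manual look at current blocking without setting up a collector at all, a minimal sketch against the SQL Server 2005 DMVs (illustrative only, and not what the blocking collector itself runs) might look like this:

    --sketch only: a quick manual check for sessions that are currently blocked
    select r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, t.text
    from sys.dm_exec_requests r
    cross apply sys.dm_exec_sql_text(r.sql_handle) t
    where r.blocking_session_id <> 0
    go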

    In addition to these configurations, one of my colleagues in the US has posted the internal performance statistic T-SQL script which we use to capture all sorts of data from the SQL Server DMVs. This is the principal script that I use every day to troubleshoot performance problems. This script doesn’t install by default with a standard SQL Server setup like the rest of SQLDIAG, but you can obtain the script and download it from the PSS SQL Server Engineers blog here. He also helpfully provided an XML config file with all the required entries to make it run. I found the easiest way to deploy it in a default install of SQL Server was to unzip the 2 SQL scripts and the 2 XML files into the directory where sqldiag.exe lives. You can then simply run the same diagnostics as above, but include these extra T-SQL scripts, by running something like this:

    sqldiag /ISQLDiagPerfStats_Trace.XML /Oc:\sqldiag

    Alternatively, you can use the cmd files he provided to register and run SQLDiag as a service. Either way you get 3 extra text files in the output.

    Over time you’ll find it useful to build up a selection of your own config files, and then you can just copy them to any server you wish to monitor, based upon the situation you want to look at. For example I keep custom ones to hand which focus on the following subjects

    Memory Usage

    Blocking and general Profiler Trace

    Disk Performance

    Profiler Trace

    Recompilation

    Remember though, if you enable a custom configuration file like the ones above, your collections can be extremely verbose. You should therefore consider the following:

    1. Disk Space for output: potentially many GBs on heavily loaded machines

    2. Disk Drive chosen for output: If you want to monitor disk performance, don’t write the output to the same drive, and never write the output to a network drive

    3. Verbose events such as performance statistics can place a heavy CPU load on a server. I only enable these if I’m certain that I need them

    Finally you might wish to configure the collector to only run against specific machines or instances, and the following XML elements allow you to do this:

    a) Change the machine name element to the name of the server you wish to connect to (the default of “.” means connect to the local machine)

    <Machines>
    <Machine name=".">

    (Or use the /M switch when running from the command line.)

    b) Change the instance name element to the SQL Server instance that you wish to connect to (the default of “*” connects to every instance it can on the machine specified above).

    <Instance name="*" windowsauth="true" ssver="9" user="">

    I hope this all makes sense. If you have any questions, just post a comment. Over coming posts I'll talk about how to interpret the data you collect.

  • SQL in Stockholm

    Somewhere to start

    • 0 Comments
    Following my guest posts with Tess, here and here, I thought it would be interesting to expand some of the topics further, so here we are. First off I'll start with a revision of the "what data to collect" post, with a couple of small additions.