SQL in Stockholm

SQL Server thoughts from the support team in Stockholm

SQL Server troubleshooting thoughts from Graham Kent, (ex) Senior Support Escalation Engineer at Microsoft Sweden.

  • SQL in Stockholm

    New SQL Blogger from my team

    • 0 Comments

    My hugely experienced colleague Clive, from the SQL Escalation team in the UK, has just started a blog here:

    http://blogs.msdn.com/sqlstuff/default.aspx

    It should be interesting reading as he's quite a guru on matters of clustering and high availability. He's started off with a post about one my least favourite subjects, setup in SQL 2000 Clustered instances :-)

    http://blogs.msdn.com/sqlstuff/archive/2008/10/23/service-pack-hotfix-setup-failures-with-sql2000-clusters.aspx

     

  • SQL in Stockholm

    Non-yielding Scheduler, IO Warnings and Cluster Failover

    • 0 Comments

    Here's another example of SQL Server producing a dump file which can help you diagnose critical problems. The scenario here is a a clustered SQL Server 2005 instance fails over and we need to investigate why.

    A good place to start in such problems (as it is with many problems) is to look into the SQL Server error logs. The best place to start is the log file leading up to the failure (this will not be the most recent error log as the SQL Server service has re-started when it failed over to the other node). If we do that we see the following

    2008-10-10 08:43:34.72 spid2s      SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Z:\DATA\MSSQL.1\MSSQL\DATA\tempdb.mdf] in database [tempdb] (2).  The OS file handle is 0x000006e3.  The offset of the latest long I/O is: 0x00000000b80000
    2008-10-10 08:45:44.72 spid2s      SQL Server has encountered 5 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Z:\DATA\MSSQL.1\MSSQL\DATA\templog.ldf] in database [tempdb] (2).  The OS file handle is 0x000006b4.  The offset of the latest long I/O is: 0x00000000080000
    2008-10-10 08:47:01.63 Server      Using 'dbghelp.dll' version '4.0.5'
    2008-10-10 08:49:14.66 Server      ***Unable to get thread context - no pss
    2008-10-10 08:49:14.66 Server      * *******************************************************************************
    2008-10-10 08:49:14.66 Server      *
    2008-10-10 08:49:14.66 Server      * BEGIN STACK DUMP:
    2008-10-10 08:49:14.66 Server      *   10/10/08 08:49:14 spid 0
    2008-10-10 08:49:14.66 Server      *
    2008-10-10 08:49:14.66 Server      * Non-yielding Scheduler
    2008-10-10 08:49:14.66 Server      *
    2008-10-10 08:49:14.66 Server      * *******************************************************************************
    2008-10-10 08:49:14.66 Server      Stack Signature for the dump is 0x00000319

    2008-10-10 08:49:25.51 spid61      Autogrow of file 'templog' in database 'tempdb' took 238266 milliseconds.  Consider using ALTER DATABASE to set a smaller FILEGROWTH for this file.
    2008-10-10 08:49:26.30 Server      External dump process return code 0x20000001.
    External dump process returned no errors.

    2008-10-10 08:49:26.30 Server      Process 2:0:0 (0x11ec) Worker 0x0433C0E8 appears to be non-yielding on Scheduler 7. Thread creation time: 12865910449631. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 99%. Interval: 70001 ms.


    About one minute after these events, the clustered instance is failed over to the other node, because according the cluster log the isalive check has failed.

    Now in this type of situation you have the option of reviewing the dump file yourself, as per my previous post, checking the stacks and seeing if something is obvious, or alternatively sending the dump file to use here as Microsoft and seeing what we can see with it using our private symbols.

    However in this case, there is more information to be had from the error logs themselves, which can be of use to use in this situation.

    1. We are receiving the classic "15 second IO warning" error that was introduced in SQL Server 2000 SP4. Think about what this is actually saying, it is staying that certain IO requests are taking at least 1000 times longer to run than a normal acceptable benchmark for IO.

    (When I say acceptable benchmark we could obviously discuss what is acceptable for what disk system, fixed or network storage or whatever, but the value I'm using is 15ms, because it's a nice round number compared to the warning message!)

    SQL Server is an IO dependent application (it likes memory as well) but with an IO system responding like that, you will often notice performance or other types of bad problems.

    2. We can clearly see that we have a non yielding scheduler, meaning that somewhere inside the SQL Server OS, we have one of our schedulers that has completely stopped working.

    3. We are getting the IO warning whilst doing an autogrow of tempdb.

    When I look at these facts I have the following thoughts:

    a) The IO could be a root cause of out problem, but it could also be an indicator. If tempdb is suddenly growing what have we attempted to put into tempdb that requires a grow?

    b) Is the IO system working normally at other times, and do we have any benchmarks on this IO system of what it normally runs like

    c) Why haven't we grown tempdb in advance at startup, as per standard best practices? Or have we pre grown it but it has just tried to grow massively again because the server is involved in some unexpected huge operation which requires innumerable amounts of tempdb space?

    d) What was the non-yielding scheduler doing. (I needed to look at the dump file in this regard, but I'm excluding the details here as the findings were inconclusive and there is lots of other public information).

    These are the typical types of questions that you should always be asking yourself and your database teams when scenarios of this nature are encountered.

    The 15 second IO warning message is one that I come across particularly frequently, and it should not be ignored. It can be reflective of all sorts of problems such as:

    Hardware Fault

    Driver Problems

    Overloading of disk system capacity

    Badly performing SQL batches issuing unreasonable amount of IO requests

    Mismanagement of data and log files

    and the list goes on like this...

    If you are experiencing this issue a lot in your production servers, you need to work with your storage support people to work out why. It can be harmless and just indicative of sudden burst of load which the disk cannot respond to, but it can also mean many worse things as well! My point here is I guess, ignore it at your peril!

    So back to the example in hand anyway....

    Since we can see some significant problems occurring around the time of the dump and the failover, you can easily investigate these using standard techniques without needing to debug the time or send it to Microsoft. In this case I would want to analyze tempdb usage and file sizes in detail and also monitor or review my disk performance baselines.

    It is not always possible to attribute a non-yielding scheduler to one single simple cause. Sometimes you do find a very obvious problem on the thread that is hung, or sometimes you do find a very obvious IO problem, but it also possible for it to be a combination of complex problems. When finding yourself in this situation you need to take note of the indicators which your log files show and consider implementing some proactive monitoring based upon what you see, to endeavor to catch or preempt the problem next time.

    Or just send us a case at Microsoft of course, and get us to look at the dump files :-)

Page 1 of 1 (2 items)