Welcome to MSDN Blogs Sign in | Join | Help

When in doubt, Reboot!

I tend to get quite a bit of Kerberos related cases.  These are related across the box, from the Engine, to Reporting Services to just straight connectivity with custom applications.  I had one given to me yesterday because the engineer had gone through everything we normally go through and wasn’t getting anywhere.

The situation was an 8 node cluster with multiple instances across the nodes.  These were running Windows 2008 with SQL 2008.  One node in particular was having an issue when they were issuing a Linked Server Query from a remote client.

image

When trying to hit the linked server from within Management Studio on the client machine, we received the following message:

Msg 18456, Level 14, State 1, Line 1
Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'

Kerberos Configuration:

When we see this type of error, it is typically Kerberos related as the Service we are using (ServerA) is unable to delegate the client’s credentials to the backend server (ServerB – Linked Server).  The first thing we do is go through our regular kerberos checklist – SPN’s and Delegation settings. Both SQL Servers were using the same Domain User Service Account (SNEAKERNET\SQLSvc).  We can use SetSPN to check what SPN’s are on that account.  NOTE:  There are numerous ways to look for SPN’s but SetSPN is one of the easier command line tools available.  You could also use LDIFDE (http://support.microsoft.com/kb/237677), ADSIEdit (http://technet.microsoft.com/en-us/library/cc773354(WS.10).aspx) and other tools.  You will see us use an in house tool called DHDiag to collect SPN’s.  This is just a wrapper that calls LDIFDE to output the results. 

So, here are the SetSPN results:

C:\Users\Administrator>setspn -l sqlsvc
Registered ServicePrincipalNames for CN=SQL Service,OU=Service Account,DC=sneakernet,DC=local:
        MSSQLSvc/SQL02:26445
        MSSQLSvc/SQL02.sneakernet.local:26445
        MSSQLSvc/SQL01.sneakernet.local:14556
        MSSQLSvc/SQL01:14556

Why do we see SQL01 and SQL02 when our machine names are ServerA and ServerB?  This is because SQL01 and SQL02 are the virtual names for the cluster.  This name will move to whatever the active node is for that given instance.  Where as ServerA and ServerB are the physical machine names and may or may not be actually hosting that instance.  We can also see from this that we have two distinct instances because of the ports (14556 & 26445).  If you look at some of our documentation (i.e. http://msdn.microsoft.com/en-us/library/ms189585(SQL.90).aspx), it indicates that for clusters, you need to also add a SQL SPN that does not include the port number.  I have yet to see where this is actually needed.  Every cluster I’ve seen has never had one.  Typically if it is needed, you will receive a KRB_ERR_S_PRINCIPAL_UNKNOWN error if you enable Kerberos Event Logging.  If you do see that and it lists that SPN, then go ahead and add it.  But, from my experience, you won’t see it.

Ok, our SPNs look good. Lets look at our Delegation Settings.  In this case we really care about the SQL Service Account, because that is the context that will be performing the delegation.

image 

We can do this by going to the properties for that account within Active Directory Users and Computers.  You will see a Delegation tab on the account.  If you don’t see the delegation tab, then the account does not have an SPN attached to it.  In this case we have “Trust this user for delegation to any service (Kerberos only)”.  This is what I call Full or Open Delegation as opposed to Constrained Delegation (which is more secure).  We are good to go here.  Nine times out of ten, the SPN or Delegation setting is going to be the cause of your issue.  In this case it isn’t.  What can we do now?

Kerberos Event Logging and Network Traces:

We can enable Kerberos Event Logging (http://support.microsoft.com/default.aspx?scid=kb;EN-US;262177) which will give us errors within the System Log for Kerberos.  This can sometimes be very helpful in diagnosing what may or may not be happening.  This produced the following results on ServerA:

Error Code: 0x1b Unknown Error
Error Code: 0x19 KDC_ERR_PREAUTH_REQUIRED
And KDC_ERR_BADOPTION

These are not uncommon and when we looked at these, they didn’t really relate to our issue.  Which means we had nothing here.  Of note, doing a linked server query from ServerB to ServerA worked, and it also produced the same events listed above.  So, nothing to gain here.

The next thing we can look at is getting a network trace as this will show us the communication between Service in question and the Domain Controller.  I usually end up at this level if the SPN’s and Delegation settings check out.  This is really where some customers can have issues, because typically these are hard to interpret and will require a call to CSS.  We grabbed a trace in the failing and working condition to see what was different.  We saw the following:

Failing:
525355 2009-06-30 15:55:39.468865 10.0.0.90 10.0.0.10 KRB5 TGS-REQ
KDC_REQ_BODY
KDCOptions: 40810000 (Forwardable, Renewable, Canonicalize)
Realm: SNEAKERNET.LOCAL
Server Name (Enterprise Name): ServerA$@SNEAKERNET.LOCAL

Working:
353115 23.437037 10.0.0.20 10.0.0.11 KRB5 TGS-REQ
KDC_REQ_BODY
KDCOptions: 40810000 (Forwardable, Renewable, Canonicalize)
Realm: SNEAKERNET.LOCAL
Server Name (Service and Instance): MSSQLSvc/SQL02.sneakernet.local:26445

You’ll notice that we are hitting different DC’s here, but that wasn’t the issue as we also saw the failing one hitting different DC’s as we continued.  The other item that is different is the working one requested the right SPN, where as the failing one is requesting the physical machine account context.  This is what was forcing us into NTLM and causing the Login failed error.  But why was that happening?  So far we have zero information to indicate what could be causing it.

SSPIClient:

We then used an internal tool called SSPIClient which makes direct calls to the InitializeSecurityContext API call which is how we do impersonation.  This tool allowed us to take SQL Server out of the picture and focus on the Kerberos issue directly.  We could see that we were failing back to NTLM which really confirmed what we saw in the network trace.

2009-07-01 16:34:24.577 ENTER InitializeSecurityContextA
2009-07-01 16:34:24.577 phCredential              = 0x0090936c
2009-07-01 16:34:24.577 phContext                 = 0x00000000
2009-07-01 16:34:24.577 pszTargetName             = 'MSSQLSvc/SQL02.sneakernet.local:26445'
2009-07-01 16:34:24.577 fContextReq               = 0x00000003 ISC_REQ_DELEGATE|ISC_REQ_MUTUAL_AUTH
2009-07-01 16:34:24.577 TargetDataRep             = 16
2009-07-01 16:34:24.577 pInput                    = 0x00000000
2009-07-01 16:34:24.577 phNewContext              = 0x0090937c
2009-07-01 16:34:24.577 pOutput                   = 0x0017d468
2009-07-01 16:34:24.577 pOutput->ulVersion        = 0
2009-07-01 16:34:24.577 pOutput->cBuffers         = 1
2009-07-01 16:34:24.577 pBuffers[00].cbBuffer   = 52
2009-07-01 16:34:24.577 pBuffers[00].BufferType = 2 SECBUFFER_TOKEN
2009-07-01 16:34:24.577 pBuffers[00].pvBuffer   = 0x02c99f90
2009-07-01 16:34:24.578 02c99f90  4e 54 4c 4d 53 53 50 00 01 00 00 00 97 b2 08 e2   NTLMSSP.........
2009-07-01 16:34:24.578 02c99fa0  03 00 03 00 31 00 00 00 09 00 09 00 28 00 00 00   ....1.......(...        
2009-07-01 16:34:24.578 pfContextAttr             = 0x00001000 ISC_RET_INTERMEDIATE_RETURN
2009-07-01 16:34:24.578 ptsExpiry                 = 0x0017d43c -> 2009-07-01 10:39:24 *** EXPIRED *** (05:55:00 diff)
2009-07-01 16:34:24.578 EXIT  InitializeSecurityContextA returned 0x00090312 SEC_I_CONTINUE_NEEDED (The function completed successfully, but must be called again to complete the context)

NOTE:  We purged all of the Kerberos Tickets before we did this to make sure we would request the ticket from the KDC.  This was done using KerbTray which is part of the Windows Resource Kit.

This tells us that we were requesting a given SPN for the Target, but the buffer shows NTLMSSP.  This means we fell down to NTLM instead of getting Kerberos.  This still doesn’t explain why.

End Result:

Unfortunately, this was one of those issues that just escaped us.  This tends to happen with odd Kerberos cases.  We had the Directory Services team engaged as well and they did not know what else we could do in terms of data collection outside of a Kernel Dump to see what may be going on.  We noticed that the nodes had not been rebooted since April 5th which is a while.  The SQL Service was recycled on June 25th.  We decided to fail over to another node and reboot ServerA. After we rebooted, we tried SSPIClient again and we saw a proper response come back which also didn’t list EXPIRED.  The issue at this point it was resolved.  We don’t have hard data to indicate what exactly the issue was, but the thought is that something was cached and invalid causing the issue.  Rebooting cleared that out and allowed us to work as expected.

Which leads me to my motto:  When in doubt, Reboot!

Adam W. Saxton | Microsoft SQL Server Escalation Services

Installing .NET 3.5 Framework for SQL Server 2008 on a Windows 7 / Windows 2008 RC2 Builds

The SQL Server 2008 install requires the .NET 3.5 framework.   Changes in Windows 7 can prevent SQL Server 2008 from installing the .NET 3.5 framework pre-requisite.   Instead you may need to install the framework separately.

In Windows 7, the .NET Framework 3.5 is included as a Windows component. To avoid this installation failure, you must turn on the .NET Framework 3.5 from the Windows Features before you run the SQL Server 2008 RTM installation.

Alternate Method

1. Download the .NET 3.5 Framework and save the dotnetfx35setup.exe on the local drive.  (http://download.microsoft.com)

2. Using Explorer access the File Properties | Compatibility tab and enable compatibility mode for Vista SP1.

3. Run donnetfx36setup.exe as ‘ADMINISTRATOR’ to complete the install.

4. Execute SQL Server setup again to install SQL Server.

image

Bob Dorr – Principle SQL Server Escalation Engineer

Zero Byte Attachments in Email Subscriptions

I had an issue a while back with RS 2000 where we would get a Zero Byte attachment when hitting a Lotus Notes server.  This ended up being corrected in KB 872774. So, when I was assigned a case with a similar description, but this time with RS 2005 SP3, my first question was “Are we hitting a Lotus Notes Email Server?”.  The answer was Yes.  It also turned out to be affecting every attachment type except for MHTML.

When we hit issues with emails, and we suspect the issue to be with the Email Server itself, we can enable the SMTP Drop Folder to see what the actual generated email looks like before we send it onto the email server.  This involves the SMTPServerPickupDirectory and SendUsing keys within the rsreportserver.config file.

Configuring a Report Server for E-Mail Delivery
http://msdn.microsoft.com/en-us/library/ms159155.aspx

image

We want to set the SMTPServerPickupDirectory to a local path to the Report Server.  (i.e. C:\Temp\Emails).  We also want to change the SendUsing key from 2 to 1.  This will generate the EML files to the directory specified.  These can be opened in Outlook Express.

image

image

Well that’s not good.  The attachment is actually zero’d out in the raw email.  That rules out the email server.  So, lets go back to the RS Log and bump it up to Verbose output.

ReportingServicesService!chunks!e!6/17/2009-16:53:09:: v VERBOSE: ### ChunkWriteStream - Closed! id=6b235fc9-5263-43df-8fc9-434c066c32c5, name='RenderingInfo_PDF'
ReportingServicesService!chunks!e!6/17/2009-16:53:09:: v VERBOSE: ### ID=6b235fc9-5263-43df-8fc9-434c066c32c5, Length=96284, CompressedLength=70420, TimeCompressing=162, TimeUncompressing=0, Ratio=0.731378, Buffering=1, Permanent=False
ReportingServicesService!session!e!6/17/2009-16:53:09:: v VERBOSE: Finished serializing report
ReportingServicesService!library!e!6/17/2009-16:53:09:: v VERBOSE: Transaction commit.
ReportingServicesService!session!e!6/17/2009-16:53:09:: v VERBOSE: Saved report snapshot to session in a background thread. Key=MyDomain\Adam;thnsonmxaghj14v5al2zfpih;/Reports/MyFolder/MyReport;<Parameters />
ReportingServicesService!library!e!6/17/2009-16:53:09:: v VERBOSE: Transaction rollback.
ReportingServicesService!emailextension!4!06/17/2009-16:53:09:: Email successfully sent to "adam@myserver.com"
adam@myserver.com

Based on this, we know that the attachment was generated and the email was sent.  The EML file was generated.  From that I can make an assumption that the issue is in the process of generating the EML file.  We use CDO to generate the email.  Typically this is going to come down to a content /encoding issue.

If we look at the Process Monitor output at the time the email was being generated we see the following:

image

We see “HKLM\System\CurrentControlSet\Control\Nls\CodePage” and for 20127 the data is blank.  Here is what this key looks like on my machine:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"10000"="c_10000.nls"

"20127"="c_20127.nls"

Here is what it looked like for the server in question:

"10000"="c_10000.nls"
"10001"=""

"20108"=""
"20127"=""

On the bad server, there were actually a ton of blank entries in this key.  That doesn’t look good.  We exported this key from a working machine, and then imported it to the failing machine.  Restarted the Reporting Services Windows Service, and then tested the email again.  This time the attachments were correct.

NLS stands for National Language Support.  These files are used to create consistent, globalized Windows User experiences.  The fact that we had one missing that was trying to referenced caused the attachment to not be MIME encoded properly due to the lack of the codepage.

I did a few quick searches and found that there are some tools out there that will go through and either clear or blank out these entries to try and speed windows up.  I’m not sure that that is what happened in this case as we came into the picture too late.  If you run into this type of issue, you can restore the values.  If it continues to recur, I would recommend using Process Monitor to determine what is clearing these values out.  Although you would need to know about when it is occurring.

 

Adam W. Saxton | Microsoft SQL Server Escalation Services

SQL Server on Windows 7 / Windows 2008 R2 with more than 64 CPUs

The release of the Windows 7/Windows 2008 R2 will support more than 64 CPUs.   It is also documented that the SQL Server 10.5 (Kilimanjaro - http://news.cnet.com/8301-13860_3-10236936-56.html) will extend the SQL Server Engine to support more than 64 CPUs.

Question: What if I am running a version of SQL Server that is not greater than 64 CPU aware?

Windows 7 introduces processor groups of up to 64 CPUs per group.   It is not common, but it is possible, to have configurations with unbalanced group CPU counts.   For example, on a machine with 96 CPUs you could have two groups 1 == 64 CPUS and 2 == 32 CPUs.  If you have SQL Server 2008 Enterprise installed it can use up to 64 CPUs.   You don’t want it associated with group 2 and only 32 CPUs.

Windows provides capability to assign a service to a preferred node using SC.exe.  You may need to use SC.exe to establish a preferred node for the SQL Server based service.

//To reset a preferred node

C:\>sc preferrednode spooler -1

//To set a preferred node

C:\>sc preferrednode spooler 2

//To query a preferred node

C:\>sc qpreferrednode spooler

Group Reference: http://download.microsoft.com/download/a/d/f/adf1347d-08dc-41a4-9084-623b1194d4b2/MoreThan64proc.docx

Group, Process, and Thread Affinity

In earlier versions of Windows, a process or thread could specify an affinity for a particular processor, so that the thread or process was guaranteed to run on that processor. Windows 7 expands this notion of affinity to apply to groups and to the processors in a group.

Windows 7 uses the following defaults for affinity:

·         Windows 7 initially assigns each process to a single group in a round-robin manner across the groups in the system. A process starts its execution assigned to exactly one group.

·         The first thread of a process initially runs in the group to which Windows assigns the process. However, an application can override this default as described in “Setting Process Affinity” later in this paper.

·         Each newly created thread is by default assigned to the same group as the thread that created it. However, at thread creation, an application can specify the group to which the thread is assigned.

·         Only the system process is assigned a multigroup affinity at startup time. All other processes must explicitly assign threads to a different group to use the full set of processors in the system.

Over time, a process can expand to contain threads that are running on all groups in a machine, but a single thread can never be assigned to more than one group at any time. However, a thread can change the group to which it is assigned.

Bob Dorr – Principle SQL Server Escalation Engineer

SQL Server 2005 Setup Failure (Multi-Core, Tri-Core, …)

SQL Server 2005 has a startup check that fails on some of the newer CPU socket designs (triple core, tri core, 12 core, …) as outlined in the knowledge base article: http://support.microsoft.com/kb/954835

Continued testing shows that the NUMPROCS workaround documented in the article does not work for all combinations.  Today I submitted a change request to add an alternate workaround.

1. Copy the RTM setup to a local drive.  (This includes the Server and Tools or install will fail.)

  \Servers
  \Tools
  Autorun.inf

2. Download the SP2 or SP3 package and expand it on the local drive using the /X parameter.

  SQLServer2005SP2-KB921896-x64-ENU.exe /X

3. Determine the full path to the expanded SP sqlrun_sql.msp patch package.

  Example: SP2\hotfixsql\files\sqlrun_sql.msp

4. Execute the following command with the proper paths to involve the .MSP in the setup.  This will use the patch package to update the SQL Engine components to the service pack level and allow install.
  
.....\Servers\setup.exe HOTFIXPATCH="D:\temp\SQL2005_12Procs\SP2\hotfixsql\files\sqlrun_sql.msp"

NOTE:  You are in a configuration that is UNSUPPORTTED

5. Apply the service pack package by executing the hotfix.exe in the expanded location.   This applies all service pack components and returns the system to a supported state.

 

Bob Dorr – Principle SQL Server Escalation Engineer

Posted by psssql | 0 Comments
Filed under: , ,

SQL Server and Large Pages Explained….

I gave a presentation on “debugging memory” at the recent Europe PASS Summit in April. In the talk, I mentioned that Large Pages would be used by SQL Server if you used trace flag 834. At the conference, Christian Bolton, a well-known MVP from the UK, mentioned to me that he thought he saw messages in the ERRORLOG that referenced “large pages” but he didn’t have the trace flag turned on. At the time, I told him I didn’t see how that was possible. Well, Christian, you were not imagining things.

The subject came up again just recently as I was helping someone within the CSS team on the same subject. So I thought it might be a good time to really dig in and find how this all works.

First, what is a “Large Page”?. Basically, the concept is to use a larger page size for memory as organized by the kernel. This makes the process of virtual address translation typically faster. Read the section titled “Large and Small Pages” from the book Microsoft Windows Internals by Mark Russinovich and David Solomon for more information. But just to give you an example, the normal page size for Windows memory is 4Kb on x64 systems. But with large pages, the size is 2Mb.

SQL Server supports the concept of Large Pages when allocating memory for some internal structures and the buffer pool. To achieve this we use the Large Page Support provided by Windows via VirtualAlloc(). As you will find out, the use of large pages for the buffer pool is not for everyone and must be carefully tested and thought out.

So, two terms to define when discussing large pages and SQL Server:

LargePagesAllocator – SQLOS decides whether it can use the Large Page Support from VirtualAlloc.

Large Pages Used by Buffer Pool – Buffer Pool decides to use the LargePageAllocator from SQLOS to allocate buffer pool memory.

Let’s first talk about the LargePageAllocator. When SQLOS is “booted” (when SQL Server is first started), it makes a decision about whether it can use large page support from Windows. This decision is based on the following three conditions, which all must be true:

  • SQL Server Enterprise Edition
  • The computer must have 8Gb or more of physical RAM
  • The “Lock Pages in Memory” privilege is set for the service account.

This is a requirement by Windows to use the MEM_LARGE_PAGES option with VirtualAlloc(). Important note here. This check has nothing to do with the “Locked Pages” functionality for the buffer pool when using the AWE APIs. It just so happens that large page memory is not part of the working set and cannot be paged. Therefore, just like memory allocated with AWE APIs, the Lock Pages in Memory privilege must be set.

If these conditions are true, SQLOS will “initialize” the LargePageAllocator for each memory node on the computer. Your ERRORLOG will show messages like the following:

2009-06-04 12:21:08.16 Server      Large Page Extensions enabled.
2009-06-04 12:21:08.16 Server      Large Page Granularity: 2097152
2009-06-04 12:21:08.21 Server      Large Page Allocated: 32MB

The Large Page Granularity is the minimum size of a “large page” on the given Windows Platform. The SQL engine simply calls the Windows API GetLargePageMinimum() to get this information. So on my x64 server, the min size is 2Mb. The next message indicates something the LargePageAllocator for SQLOS does when it is initialized. It allocates 32Mb of large page memory to prime the system for any component that needs large pages memory. You will see one of these messages for each memory node created by SQL Server.

This all takes place even if trace flag 834 is not enabled. This is why I suspect Christian saw these messages in an ERRORLOG. But you would only see them when the above conditions are true (EE SKU, 8Gb RAM+, and lock pages privilege). As it turns out this memory for the LargePageAllocator is not wasted. As of SQL Server 2008, internal structures for lock management and buffer hash can use large pages.

When you see these messages, you can also see that large pages are being used by examining the DMV sys.dm_os_process_memory (this DMV only exists in SQL Server 2008). So a query on my system without trace flag 834 enabled, showed this:

select large_page_allocations_kb, locked_page_allocations_kb from sys.dm_os_process_memory

large_page_allocations_kb         locked_page_allocations_kb
-------------------------                    --------------------------
61440                                         49140

Notice here I have allocation for large and locked pages. The Large pages are for the LargePageAllocator (notice more than the original 32Mb was allocated). The Locked pages are for the use of AWE APIs since this is 64bit, EE SKU, and the lock pages privilege is set. The Locked pages are for Buffer Pool Memory and represent what is commonly known as “Locked Pages”.

So LargePageSupport is enabled and used by the engine even if you don’t enable trace flag 834. But not much memory is used for this and buffer pool memory is not used unless trace flag 834 is enabled. So let’s talk about that.

The second scenario is when I enable trace flag 834 as documented at:

Tuning options for SQL Server 2005 that is running in high performance workloads

When you enable trace flag 834 on 64bit systems, you tell the SQL Server Engine to use the LargePageAllocator for SQLOS to allocate all SQL Server buffer pool memory. So first the LargePageAllocator must be enabled per the conditions I outlined earlier in this post.

Then, if the engine detects trace flag 834 is enabled at startup, it will allocate memory for the buffer pool using the LargePageAllocator. Unlike normal startup, the server will allocate all of the buffer pool memory at startup. This is because the allocation of large page with VirtualAlloc() can be very slow. If the buffer pool grew dynamically, it could cost the performance of standard queries. So we allocate memory all at one time.  This is where things get very interesting regarding the usage of large pages. The algorithm for this startup allocation is as follows:

  • We attempt to allocate the size equal to the minimum of ‘max server memory’ and total physical memory on the computer. So if ‘max server memory’ is set to 0, basically we try to allocate all of total physical RAM. For this reason, you should find a suitable value for ‘max server memory’ on your computer. Suitable is likely somewhat less than total physical RAM to give room for the operating system and system cache.
  • This also means you should only use large pages on servers where SQL is the only application that uses any significant memory (i.e. a dedicated SQL Server)
  • In either case, if the engine cannot allocate ‘max server memory’ or total physical memory, it may try to allocate some value lower than this. If it can’t allocate a “lower” value, an error will be raised and the server will not start. The “lower” value depends on the ‘max server memory’ setting or total physical memory. I’ll show you an example of this below.

Here are some example ERRORLOG entries for these situations (in all cases I had ‘max server memory’ set to 0 on a computer with 16Gb of physical RAM):

This machine actually has some virtual machines running so there are other competing resources for memory. In one case when I started the server it failed to allocate a “lower” level of memory for large pages so the server failed to start. (I didn’t include all of the output in the ERRORLOG):

2009-06-04 12:18:32.41 Server      Large Page Extensions enabled.
2009-06-04 12:18:32.41 Server      Large Page Granularity: 2097152
2009-06-04 12:18:32.41 Server      Large Page Allocated: 32MB
2009-06-04 12:18:32.46 Server      Using large pages for buffer pool.
2009-06-04 12:18:32.88 Server      0 MB of large page memory allocated.
2009-06-04 12:18:39.21 Server       Failed allocate pages: FAIL_PAGE_ALLOCATION 1
2009-06-04 12:18:39.22 Server     
Memory Manager                                   KB
---------------------------------------- ----------
VM Reserved                                   54848
VM Committed                                  54572
Locked Pages Allocated                            0
Reserved Memory                                1024
Reserved Memory In Use                            0
2009-06-04 12:18:39.22 Server     
Memory node Id = 0                               KB

.

.

2009-06-04 12:18:39.22 Server      Error: 17138, Severity: 16, State: 1.
2009-06-04 12:18:39.22 Server      Unable to allocate enough memory to start 'SQL OS Boot'. Reduce non-essential memory load or increase system memory.

In another instance, the server was able to allocate memory but only 2Gb of RAM as you can see from this ERRORLOG entry.

2009-06-04 14:20:31.13 Server      Large Page Extensions enabled.
2009-06-04 14:20:31.13 Server      Large Page Granularity: 2097152
2009-06-04 14:20:31.14 Server      Large Page Allocated: 32MB
2009-06-04 14:20:40.03 Server      Using large pages for buffer pool.
2009-06-04 14:27:56.98 Server      2048 MB of large page memory allocated.

This shows one of the problems with large pages: the memory size requested must be contiguous. This is called out very nicely at the MSDN article on Large Pages

These memory regions may be difficult to obtain after the system has been running for a long time because the space for each large page must be contiguous, but the memory may have become fragmented. This is an expensive operation; therefore, applications should avoid making repeated large page allocations and allocate them all one time at startup instead.

In this case above, even if ‘max server memory’ was set to say 8Gb, the server could only allocate 2Gb and that now becomes a maximum allocation for the buffer pool. Remember we don’t grow the buffer pool when using large pages so whatever memory we allocate at startup is the max you get.

The other interesting thing you will find out with large pages is a possible slowdown in server startup time. Notice in the ERRORLOG entry above the gap of 7 minutes between the server discovering trace flag 834 was on (the "Using large pages..” message) and the message about how much large memory was allocated for the buffer pool. Not only does it take a long time to call VirtualAlloc() but in the case where we cannot allocate total physical memory or ‘max server memory” we attempt to allocate lower values several times before either finding one that works or failing to start. We have had some customers report the time to start the server when using trace flag 834 was over 30 minutes.

So in summary:

  • Large page support is enabled on Enterprise Edition systems when physical RAM is >= 8Gb (and lock pages in memory privilege set)
  • SQL Server will allocate buffer pool memory using Large Pages on 64bit systems if Large Page Support is enabled and trace flag 834 is enabled
  • Large page for the buffer pool is definitely not for everyone. You should only do this for a machine dedicated to SQL Server (and I mean dedicated) and only with careful consideration of settings like ‘‘max server memory’. Furthermore, you should test out the usage of this functionality to see if you get any measureable performance gains before using it in production.
  • SQL Server startup time can be significantly delayed when using trace flag 834.

So I hope this post provides some inside information about how large support works and some of the issues surrounding its usage. Plus I wanted to make sure Christian knows he was right all long about seeing the messages in the ERRORLOG when trace flag 834 was not enabled.

 

Bob Ward
Microsoft

Posted by psssql | 3 Comments
Filed under: ,

RDTSCTest CPU Speed Output Clarified

Starting with SQL Server 2005 SP3 and SQL Server 2008 the RDTSC counter is not used for timing in SQL Server but you may still have occasion to use RDTSCTest. 

From: Robert Dorr
Sent: Thursday, June 04, 2009 4:14 PM
Subject: RE: SQL Troubleshooting - RDTSC
 

In the registry the speed is stored but we have found machines where the registry values are not correct.  In order to check the current speed you have to do some timing on the CPU and if speed step or such technology is involved it is possible the CPU is running at a level below max and until it gets resource pressure it does not need to run at the max level.  

CPU Speed: 1595Mhz  (1595000 ticks per ms)     <============ From registry  

CPU 00 Current Speed: 399Mhz *Speed Variance   <============ Base on actual RDTSC movement  

       What we do here is set the thread affinity to the CPU.

       Grab the current RDTSC

       Sleep(500)

       Grab the RDTSC now

       Calculate the difference.

       So this system is likely throttling down to 399Mhz during the sleep period.   We would have to spin the CPU to get it to wake up and increase to the max.  

-----Original Message-----

Sent: Thursday, June 04, 2009 2:11 PM

To: Robert Dorr

Subject: SQL Troubleshooting - RDTSC  

I used the RDTSC utility on a customer's 4 dual core Itamium 2 system recently.  Is the utility accurate on IA64 systems?  

Actual CPU speed was being reported at 399MHz across all 8 CPU's, which are 1.6GHz Itanium 2 procs.  I had run across this in pssdiag data I collected, and asked them to verify their power settings were set to not throttle down the CPU.  They sent back the report below which still indicates the CPU's at 399MHz.

I asked them to verify with their hardware vendor CPU's are set to operate at full speed, but thought I would also double check this is a valid utility for Itanium 2 procs. 

======================= Processor =======================

       Processor Architecture: IA64

                    Page Size: 8192

               Processor Type: Intel IA64

               Processor Type: Family 32, Model 1, Stepping 1

                    CPU Speed: 1595Mhz  (1595000 ticks per ms)

        Active Processor Mask: 255 (0x00000000000000FF)

Warning: CPUID Leaf 4 is not supported or disabled in BIOS. May be unable to determine core/processor mapping.  

              Hyper-Threading: Not capable

                   Multi-core: No

Physical Package/Socket Count: 1

         Processor Core Count: 1 (1 per package)

      Logical Processor Count: 8 (1 per core)

 

Logical Processor/Core/Socket Mapping

CPU    APIC ID     Physical ID Core ID     HT/SMT ID 

------ ----------- ----------- ----------- -----------

     0           0           0           0           0

     1           0           0           0           0

     2           0           0           0           0

     3           0           0           0           0

     4           0           0           0           0

     5           0           0           0           0

     6           0           0           0           0

     7           0           0           0           0

 

======================= NUMA ============================

                Max NUMA Node: 0 (no NUMA)

        Node 0 Processor Mask: 255 (0x00000000000000FF)  

======================= Memory ==========================

        Total Physical Memory: 16352 MB

        Avail Physical Memory: 13776 MB

              % Memory In Use: 15

              Total Page File: 32498 MB

              Avail Page File: 29697 MB

     NUMA Node 0 Avail Memory: 13776 MB

======================= Actual CPU Speed ================

         CPU 00 Current Speed: 399Mhz *Speed Variance

         CPU 01 Current Speed: 399Mhz *Speed Variance

         CPU 02 Current Speed: 399Mhz *Speed Variance

         CPU 03 Current Speed: 399Mhz *Speed Variance

         CPU 04 Current Speed: 399Mhz *Speed Variance

         CPU 05 Current Speed: 399Mhz *Speed Variance

         CPU 06 Current Speed: 399Mhz *Speed Variance

         CPU 07 Current Speed: 399Mhz *Speed Variance  

======================= RDTSC Deltas ====================

CPU 00 Ticks: 0x00006817EAADE77B, Drift: 0.0ms (0x00000000 ticks)

CPU 01 Ticks: 0x00006817EAAE128A, Drift: 0.0ms (0x00002B0F ticks)

CPU 02 Ticks: 0x00006817EAAE39A6, Drift: 0.0ms (0x0000522B ticks)

CPU 03 Ticks: 0x00006817EAAE63A4, Drift: 0.0ms (0x00007C29 ticks)

CPU 04 Ticks: 0x00006817EAAE8D6E, Drift: 0.0ms (0x0000A5F3 ticks)

CPU 05 Ticks: 0x00006817EAAEB639, Drift: 0.0ms (0x0000CEBE ticks)

CPU 06 Ticks: 0x00006817EAAEDEC2, Drift: 0.0ms (0x0000F747 ticks)

CPU 07 Ticks: 0x00006817EAAF0B75, Drift: 0.0ms (0x000123FA ticks)

 

Bob Dorr – Principle SQL Server Escalation Engineer

Posted by psssql | 0 Comments
Filed under: , , ,

Trying to keep up with Version Numbers….

Updated. See information at the bottom of the post

I can’t begin to tell you how many times I find myself seeing a version number for SQL Server and not knowing exactly what service pack or cumulative update the build belongs to without quite a bit of pain. I have to admit years ago (ok how about a decade ago) we just had to worry about service packs so I practically had these numbers memorized, but when we introduced the cumulative update model, there were too many build numbers to remember.

If you happen to have an ERRORLOG file or can query the SQL Server Engine, you do have some information you can use to determine if a build is RTM or Service Pack.

On my laptop I have SQL Server 2008 Service Pack 1. How do I know this?

First at the top of the ERRORLOG file is version information:

2009-06-02 15:39:54.73 Server      Microsoft SQL Server 2008 (SP1) - 10.0.2531.0 (X64)
    Mar 29 2009 10:22:31 2007.0100.2531.00
    Copyright (c) 1988-2008 Microsoft Corporation
    Developer Edition (64-bit) on Windows NT 6.0 <X64> (Build 6001: Service Pack 1)

Notice right after the major version of SQL Server it has (SP1) and right before the version (or build) number.

This same information is available by running the following query:

select @@version

You can also use the T-SQL function SERVERPROPERTY() to get this information. For example, on this machine, I ran these queries:

select SERVERPROPERTY('productversion'), SERVERPROPERTY('productlevel')

and got these results

10.0.2531.0    SP1

This is fine, but what happens when I install Cumulative Update 1 for SQL Server 2008 SP1?

The same SERVERPROPERTY() query yield this result:

10.0.2710.0    SP1

So you cannot tell from any T-SQL query whether you are running a cumulative update. Furthermore, there may be other times where you don’t have the ERRORLOG or can’t run a T-SQL query to see this data. For me, many of these situations are when I’m evaluation a problem by looking at a user dump file. I can see the version by examining the properties of SQLSERVR.EXE in the dump, but I don’t know if this belongs RTM, a service, pack, or cumulative update.

I thought there might be a need to build a table that shows all of these release versions. But after searching the web a bit, I found that my friend from PASS Bill Graziano already has such a site posted at:

http://www.sqlteam.com/article/sql-server-versions

This link has a table with major SQL Server versions, a link to the fix article to download the fix, and a date when the version was released.

So now I have a reference point myself for those situations where I know a build number but am not quite sure what cumulative update it belongs to.

Since the time of the original post, one of my colleagues Mark House pointed out to me another great resource for version numbers at:

http://www.sqlsecurity.com/FAQs/SQLServerVersionDatabase/tabid/63/Default.aspx

 

Bob Ward
Microsoft

Posted by psssql | 1 Comments
Filed under:

SQL Server TempDB – Number of Files – The Raw Truth

I continue to answer questions about the number of TEMPDB files and trace flag –T1118.   It seems there are plenty of advice blogs, wikis, articles and other resources.   To help clarify this I am going to post my latest e-mail exchange.

From: Robert Dorr
Sent: Thursday, June 04, 2009 9:08 AM
Subject: RE: TempDB / data files per processor
 

It is not just SGAM in SQL 2000 that encountered contention. SQL Server can still heat up the PFS and other allocation pages on SQL 2005 and 2008 under very heavy use conditions.   The temp table caching added to SQL 2005 has helped but on larger systems we can still get issues when the limits are pushed.  

-T1118 forces uniform extent allocations so the answer is yes, it does help.   Each data files is tracked with a file control block (FCB).   The FCB contains members for the last uniform extent allocation and last mixed page allocation.    So what happens during a new allocation is a call to the file group manager.   The file group manager determines the next file to allocate from and from there the file  control block reports the targeted starting point for free space searching based on the request type (mixed or uniform).  

By adding a file per CPU what you are doing is allowing the file group manager to pick the next file (there are some optimizations here for tempdb to avoid contention and skip to the next file when under contention) and then use the target location to start locating free space.    

-T1118 avoids all the logic to determine if a mixed extent is available or needs to be allocated and avoids the SGAM allocation patterns.  So when the same file does have a collision a uniform extent is used and any mixed extent allocation contention on the SGAM associated allocation pages can be avoided.  

Sent: Thursday, June 04, 2009 8:57 AM
To: Robert Dorr
Subject: RE: TempDB / data files per processor

Hello Bob,

I understand that this recommendation comes from the SGAM contention in SQL 2000.

·         Does multiple file help the performance even without TF1118?

·         Since SQL2005 has had improvements towards page allocation caching, do you believe it is still a valid recommendation to use TF1118?  

From: Robert Dorr
Sent: Thursday, June 04, 2009 10:52 AM
Subject: RE: TempDB / data files per processor
 

One file per CPU that SQL Server sees.   SQL Server creates a logical scheduler for each of the CPUs presented to it (logical or physical).   The reason for the multi-file TEMPDB recommendation is to allow each of the logical schedulers to loosely align with a file.   Since you can only have 1 active worker per scheduler this allows each worker to have its own tempdb file (at that instant) and avoid allocation contention.  

8 is not a limit for TEMPDB we have many deployments with more than 8 TEMPDB files.   8 is the general cap for MAXDOP setting.  

Sent: Thursday, June 04, 2009 8:48 AM
Subject: TempDB / data files per processor
 

Hi ,  

I have had a conversation with a colleague.  We were talking about optimization of TempDB for SQL Server 2005.  

The general rule I know is:  1 data file per logical or physical processor  (i.e. a SQL Server 2005 running on 4 quad-core cpus -> 16 cores should have 16 data files for TempDB)  

The colleague says there is a limit to this rule. We should not use more than 8 data files for TempDB.  This information shall be from the product group.  

My problem. I do not see any information in the world wide web where this limit is mentioned. Maybe someone can give me a hint???  

Bob Dorr – Principle SQL Server Escalation Engineer

Posted by psssql | 2 Comments
Filed under: , , ,

SQL Server, Lock Manager, and “relaxed” FIFO….

An interesting question came in from a customer a few weeks ago. Did SQL Server change its traditional FIFO method for granting locks in SQL Server 2005 and 2008?

Although there is no official documentation (not yet, I’ve put in a request to get this changed), the answer to the question is yes. SQL Server 2005 and future versions use a “relaxed” FIFO method.

You can find a few bits of information about this topic on the web:

http://blogs.msdn.com/weix/archive/2005/11/22/496000.aspx – This is a post from from weix, a former developer in the engine team

http://www.informit.com/articles/article.aspx?p=686168&seqNum=5 – This is an excerpt from the book SQL Server 2005 Practical Troubleshooting. The chapter on Locking was written by Santeri Voutilainen (he goes by Santtu), a developer in the engine team who owned the Lock Manager in SQL Server 2005. In fact, if you look at the previous blog post, Wei is simply quoting Santtu on this behavior.

Rather than me trying to paraphrase what Santtu said, let me just quote him as I think it is an excellent description of the change in behavior from SQL Server 2000 to SQL Server 2005 and 2008.

In the book Santtu states the following:

Locks are granted in a relaxed first-in, first-out (FIFO) fashion. Although the order is not strict FIFO, it preserves desirable properties such as avoiding starvation and works to reduce unnecessary deadlocks and blocking. New lock requests where the requestor does not yet own a lock on the resource become blocked if the requested mode is incompatible with the union of granted requests and the modes of pending requests. A conversion request becomes blocked only if the requested mode is incompatible with the union of all granted modes, excluding the mode in which the conversion request itself was originally granted. A couple exceptions apply to these rules; these exceptions involve internal transactions that are marked as compatible with some other transaction. Requests by transactions that are compatible with another transaction exclude the modes held by the transactions with which they are compatible from the unions just described. The exclusion for compatible transactions means that it is possible to see what look like conflicting locks on the same resource (for example, two X locks held by different transactions).

The FIFO grant algorithm was significantly relaxed in SQL Server 2005 compared to SQL Server 2000. This relaxation affected requests that are compatible with all held modes and all pending modes. In these cases, the new lock could be granted immediately by passing any pending requests. Because it is compatible with all pending requests, the newly requested mode would not result in starvation. In SQL Server 2000, the new request would not be granted, because, under its stricter FIFO implementation, new requests could not be granted until all previously made requests had been granted. In the following example, connections 1 and 3 would be granted when run against SQL Server 2005 in the specified order. In SQL Server 2000, only connection 1 would be granted:

/* Conn 1 */
begin tran
exec sp_getapplock 'amalgam-demo', 'IntentExclusive'
/* Conn 2 */
begin tran
exec sp_getapplock 'amalgam-demo', 'Shared'
/* Conn 3 */
begin tran
exec sp_getapplock 'amalgam-demo', 'IntentShare

So I’ve requested that we add information into our product documentation to show this example and talk about the change in this behavior for SQL Server 2005, 2008, and future releases

 

Bob Ward
Microsoft

Posted by psssql | 1 Comments
Filed under:

How It Works: SQL Server Timings and Timer Output (GetTickCount, timeGetTime, QueryPerformanceCounter, RDTSC, …)

A series of questions related to start time, duration, end time, T-SQL waitfor delay command and others crossed my path again this week.    As I was trying to explain it to a fellow engineer and I realized it has become a bit complicated.

The problem stems from a lack of a reliable, light-weight, high resolution timer that does not require additional CPU cycles leading to unwanted power consumption.   For example, on a laptop running SQL Express you are often more concerned about battery life than you are high resolution timing of trace events and query execution. 

Across the broad spectrum of hardware there are a couple of timer choices.  Several timers of differing accuracy are offered by the operating system: http://support.microsoft.com/kb/172338 and Microsoft is currently working with hardware manufactures to accommodate a lightweight, high resolution timer.

Function                 Units                      Resolution
---------------------------------------------------------------------------
Now, Time, Timer         seconds                    1 second
GetTickCount             milliseconds               approx. 10 ms
TimeGetTime              milliseconds               approx. 10 ms
QueryPerformanceCounter  QueryPerformanceFrequency  same             

Interrupt Ticks

Interrupt ticks are accurate but not high resolution – I.E. GetTickCount.   The common interrupt clock activity yields ~12ms granularity but can vary based on hardware and the BIOS.    This is the central timer mechanism used by Windows.

 

For example, SQL Server uses this as the basis for all DATETIME, DATE, TIME, DATETIME2, … and other time based types and calculations.   It also uses it for all timeout and delay calculations.   This would include but not be limited to query timeout, lock timeout, and T-SQL waitfor delay command.

 

RDTSC Ticks

The RDTSC ticks are the raw CPU ticks.   The CPU ticks are incremented by the CPU at the frequency the CPU is running.   This can vary as CPU frequencies are altered to save power.    It is difficult to use this raw counter as a basis for timing when the CPU frequencies are altered but various implementations are available.

 

QueryPerformanceCounter (http://msdn.microsoft.com/en-us/library/ms644904.aspx)

The QPC is a high resolution counter that attempts to determine the best high resolution counter available to it.  It attempts to tick an exposed counter at a constant frequency so external applications can use the counter for high resolution timings.   However, to keep the counter a constant frequency can be an involved process and not always accurate.

 

On some hardware the boot.ini must contain the /USEPMTIMER (http://support.microsoft.com/kb/895980) switch to improve the QPC accuracy by forcing use of the ACPI clock instead of the TSC.   This is usually reported by customers as strange performance counter behaviors.

 

http://blogs.technet.com/perfguru/archive/2008/02/18/explanation-for-the-usepmtimer-switch-in-the-boot-ini.aspx

 

SQL and Timers

Now that I have recapped the basic timers I will attempt to explain the different uses of the timers by SQL Server.

 

SQL 2000

Trace events used the interrupt timer producing ~12ms granularity for trace events and other timing outputs.

 

This was accurate but the granularity made it hard to track down issues for queries that took less than ~12ms.    If a query ran in 4ms it would commonly show a duration of 0 because the start and end tick count was the same.    If the query executed 1 million times a day it could be your largest CPU consumer but difficult to determine.

 

Support also struggled with this because if the query was running at 4ms but something changed, making it execute in 8ms it was still reported with 0 millisecond duration.  Support and the DBA could tell a specific set of queries are taking twice as long but it was difficult to pinpoint the source of the problem.

 

Other timing outputs such as statistics time used the RDTSC counter.   This is prone to errors because of the CPU frequency changes.

SQL 2005 until SP3

Trace events and other use the RDTSC timer to achieve high resolution timings.   This is prone to errors because of the CPU frequency changes but SQL Server attempted to accommodate this when possible or warn the user about frequency drift in the SQL Server error log.

SQL 2005 SP3

Ported some of the SQL 2008 changes into the SQL 2005 SP3 release.

 

Instead of the RDTSC timer and adjusted interrupt timer is used.  The API timeBeginPeriod (http://msdn.microsoft.com/en-us/library/ms713413.aspx) can be called to lower the granularity from ~12ms to 1ms.    During startup SQL Server attempts to lower the granularity to 1ms.

 

The outputs allow much better troubleshooting granularity at 1ms but the system may consume additional power.

SQL 2008

Trace events and some XEvents use QueryPerformanceCounter, returning microsecond timing to these outputs.   This is a heavier timer and can require the boot.ini switch of /USEPMTIMER to retain accuracy.

 

Most of the other timing outputs use the adjusted interrupt timer (timeBeginPeriod) as discussed in the SQL Server 2005 SP3 section above.

 

Bad BIOS - timeBeginPeriod

We have seen a couple of issues with BIOS bugs where setting the interrupt timer frequency to 1ms results in operating system warnings.

Event Source:    W32Time  Event ID:              50

The time service detected a time difference of greater than 5000 milliseconds  for 900 seconds. The time difference might be caused by synchronization with  low-accuracy time sources or by suboptimal network conditions. The time service is no longer synchronized and cannot provide the time to other clients or update  the system clock. When a valid time stamp is received from a time service  provider, the time service will correct itself.  

 

The workaround for this is to start SQL Server with trace flag –T8038.   The trace flag disables the call to timeBeginPeriod retaining the default interrupt timing granularity ~12ms.    The SQL Server timings return to the lower granularity setting but power consumption is reduced and the BIOS bug avoided.

 

Power Consumption

On laptops battery life is a priority.  When running SQL Server builds SQL Server 2005 SP3, SQL 2008 and newer adding the trace flag –T8038 to avoid the timeBeginPeriod call retains the lower granularity setting and power consumption is reduced.   Be aware that use of the trace flag reduces accuracy of the SQL Server trace events and other SQL Server timing outputs.

 

Duration and End Time Invalid

Customer reported that the duration and end time for a trace event on SQL Server 2008 was invalid.   It was showing ~5min of duration but the query timeout was 30 seconds.   Looking at this closer the QueryPerformanceCounter (QPC) was not accurate and required the boot.ini switch /USEPMTIMER.

 

Idle SQL Server

The SQL Server Express SKU is designed to go into an idle state.   Around 15 minutes if SQL Server has detected no query activity it will suspend all internal tasks (Checkpoint, Lazy writer, Lock Monitor, ….) to reduce CPU consumption.   It is woken when a new request arrives or Windows indicates a significant memory change event  When the SQL Server is declared idle it returns the interrupt timer (multi-media timer) to the original default to reduce power consumption.   When SQL Server is woken it will return to the 1ms timing granularity unless trace flag –T8038 is enabled.

 

Note: Microsoft is currently evaluating the need to make –T8038 on by default for the SQL Express variety of SKUs.

 

Additional References

http://support.microsoft.com/kb/931279

http://blogs.msdn.com/psssql/archive/2007/08/19/sql-server-2005-rdtsc-truths-and-myths-discussed.aspx

http://blogs.msdn.com/psssql/Default.aspx?p=5

http://developer.amd.com/assets/TSC_Dual-Core_Utility.pdf#search=%22TSC_Dual-Core_Utility.pdf%22  

Bob Dorr – Principle SQL Server Escalation Engineer

Posted by psssql | 1 Comments
Filed under: , , ,

How It Works: SQL Server Engine Exception Handling

I was recently asked how SQL Server handles exceptions in the core code of the engine. I looked around at various references and did not find a concise document.

 

When an exception is encountered in SQL Server error details such as the following are inserted into the SQL Server error log, a mini-dump is captured and saved in the \LOG directory along with a .TXT file containing additional information. Then the connection is terminated which initiates a rollback for the active transaction.

 

2009-03-27 11:39:46.76 spid52 ***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.txt

2009-03-27 11:39:46.76 spid52 * *******************************************************************************

2009-03-27 11:39:46.76 spid52 *

2009-03-27 11:39:46.76 spid52 * BEGIN STACK DUMP:

2009-03-27 11:39:46.76 spid52 * 03/27/09 11:39:46 spid 1

2009-03-27 11:39:46.76 spid52 *

2009-03-27 11:39:46.76 spid52 * StackDump ()

2009-03-27 11:39:46.76 spid52 *

2009-03-27 11:39:46.78 spid52 *

2009-03-27 11:39:46.78 spid52 *

2009-03-27 11:39:46.78 spid52 * MODULE BASE END SIZE

2009-03-27 11:39:46.78 spid52 * sqlservr 0000000001000000 0000000003651FFF 02652000

2009-03-27 11:39:46.78 spid52 * ntdll 0000000077EC0000 0000000077FF8FFF 00139000

2009-03-27 11:39:46.78 spid52 * kernel32 0000000077D40000 0000000077EB2FFF 00173000

2009-03-27 11:39:46.78 spid52 * MSVCR80 0000000078130000 00000000781F8FFF 000c9000

2009-03-27 11:39:46.78 spid52 * msvcrt 000007FF7FC00000 000007FF7FC85FFF 00086000

2009-03-27 11:39:46.78 spid52 * MSVCP80 000000007C420000 000000007C528FFF 00109000

2009-03-27 11:39:46.78 spid52 * ADVAPI32 000007FF7FEE0000 000007FF7FFE4FFF 00105000

….. << Lines Removed >> …

2009-03-27 11:39:46.78 spid52 * sqlevn70 000000000F4F0000 000000000F698FFF 001a9000

2009-03-27 11:39:46.78 spid52 * dbghelp 00000000093B0000 000000000950CFFF 0015d000

2009-03-27 11:39:46.78 spid52 *

2009-03-27 11:39:46.78 spid52 * P1Home: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * P2Home: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * P3Home: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * P4Home: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * P5Home: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * P6Home: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * ContextFlags: 000000000010000B:

2009-03-27 11:39:46.78 spid52 * MxCsr: 0000000000001F80:

2009-03-27 11:39:46.78 spid52 * SegCs: 0000000000000033:

2009-03-27 11:39:46.78 spid52 * SegDs: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * SegEs: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * SegFs: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * SegGs: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * SegSs: 000000000000002B:

2009-03-27 11:39:46.78 spid52 * EFlags: 0000000000000246:

2009-03-27 11:39:46.78 spid52 * Rax: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * Rcx: FFFFFFFFFFFFFFFF:

2009-03-27 11:39:46.78 spid52 * Rdx: 0000000000000003:

2009-03-27 11:39:46.78 spid52 * Rbx: 00000000048BF068: FFFFFFFFFCFEF500 0000000000000048 0000000000000001 0000000000000000 0000000000000000 0000000000000000

2009-03-27 11:39:46.78 spid52 * Rsp: 00000000048BF018: 0000000077D6CFFB 0000000000000019 00000000015956C1 0000000000000000 00000000801051E0 00000000048BF068

2009-03-27 11:39:46.78 spid52 * Rbp: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * Rsi: 00000000048BF0C0: 0000000000000084 0000000000000070 0000000000000078 0000000000000003 0000000000080000 000007FFFFEE0000

2009-03-27 11:39:46.78 spid52 * Rdi: 0000000000000003:

2009-03-27 11:39:46.78 spid52 * R8: 00000000048BEFC0: 0000000210155000 000000020EF8C000 FFFFFFFF00007CD2 0000000006FE3000 0000000006FD3000 0000000000049AE0

2009-03-27 11:39:46.78 spid52 * R9: 0000000000000060:

2009-03-27 11:39:46.78 spid52 * R10: FFFFFFFFFFFFFFFF:

2009-03-27 11:39:46.78 spid52 * R11: 0000000000000246:

2009-03-27 11:39:46.78 spid52 * R12: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * R13: 0000000000000000:

2009-03-27 11:39:46.78 spid52 * R14: 0000000000000003:

2009-03-27 11:39:46.78 spid52 * R15: 00000000000013B0:

2009-03-27 11:39:46.78 spid52 * Rip: 0000000077EF0F9A: 8B4C9066906666C3 050F00000059B8D1 8B4C9066906666C3 050F0000005AB8D1 8B4C9066906666C3 050F0000005BB8D1

2009-03-27 11:39:46.78 spid52 * *******************************************************************************

2009-03-27 11:39:46.78 spid52 * -------------------------------------------------------------------------------

2009-03-27 11:39:46.78 spid52 * Short Stack Dump

2009-03-27 11:39:46.82 spid52 0000000077EF0F9A Module(ntdll+0000000000030F9A)

2009-03-27 11:39:46.82 spid52 0000000077D6CFFB Module(kernel32+000000000002CFFB)

2009-03-27 11:39:46.82 spid52 0000000077D6BB01 Module(kernel32+000000000002BB01)

2009-03-27 11:39:46.82 spid52 0000000001595AB4 Module(sqlservr+0000000000595AB4)

2009-03-27 11:39:46.82 spid52 0000000002B43967 Module(sqlservr+0000000001B43967)

2009-03-27 11:39:46.82 spid52 0000000002B489E1 Module(sqlservr+0000000001B489E1)

2009-03-27 11:39:46.82 spid52 00000000016BC32E Module(sqlservr+00000000006BC32E)

2009-03-27 11:39:46.82 spid52 00000000016BC5C9 Module(sqlservr+00000000006BC5C9)

2009-03-27 11:39:46.82 spid52 00000000016B5CC4 Module(sqlservr+00000000006B5CC4)

2009-03-27 11:39:46.82 spid52 000000000155E837 Module(sqlservr+000000000055E837)

2009-03-27 11:39:46.82 spid52 0000000001522F59 Module(sqlservr+0000000000522F59)

2009-03-27 11:39:46.82 spid52 00000000015372B0 Module(sqlservr+00000000005372B0)

2009-03-27 11:39:46.83 spid52 00000000014F72F8 Module(sqlservr+00000000004F72F8)

2009-03-27 11:39:46.83 spid52 00000000781337D7 Module(MSVCR80+00000000000037D7)

2009-03-27 11:39:46.83 spid52 0000000078133894 Module(MSVCR80+0000000000003894)

2009-03-27 11:39:46.83 spid52 0000000077D6B6DA Module(kernel32+000000000002B6DA)

2009-03-27 11:39:46.85 spid52 * -------------------------------------------------------------------------------

2009-03-27 11:39:46.85 spid52 Stack Signature for the dump is 0x00000000E8A23173

 

In C/C++ you may have seen exception logic for such exceptions as Access Violations (AV), Divide By Zero, Floating Point Overflow and others listed in the windows headers.

 

try
{

    Executing common code path

}

catch(……..) -- could be the _except handler

{
    Do something with the exception
}

 

SQL Server uses this logical approach, catching the exception and capturing the details in the error log, event log, mini-dump and .txt file.

The majority of SQL Server exceptions are considered fatal to the connection but not the process. SQL Server does install a default exception handler so if an exception is encountered that is NOT handled the handler can capture detailed information. An unhandled exception can result in process termination. SQL Server also installs handlers for issues such as runtime errors. Again, SQL Server attempts to capture information and terminate the process gracefully under these conditions.

Background processes such as lock monitor, lazy writer, checkpoint and others catch exceptions, handle them and continue. This is usually done by creating a new worker to perform the task and terminating the worker that encountered the exception.

Bob Dorr – Principle SQL Server Escalation Engineer

Posted by psssql | 1 Comments
Filed under:

How a log file structure can affect database recovery time

We frequently come across situations where databases take a long time to recover. A common scenario is where the recovery process has to roll forward or back several transactions for a database after a SQL Server restart. However, you might also see one of the following symptoms which can result in databases take long time to recover (even if there are not many transactions in the log file).

1. Startup of database from restart of SQL Server or Backup/Restore or Attach/Detach or from auto close.

2. Log reader latency when a database is used in transactional replication: The reader thread of the log reader process is responsible for scanning the transaction log to identify which transactions need to be replicated.

     

CAUSE:

This can happen if the log file has grown several times (usually as a result of very small auto-growth setting). As a result, the log will have several 1000 to several million VLFs (you can read more about VLFs in Transaction Log Physical Architecture). The first phase of recovering a database is called discovery where all the VLFs are scanned (in serial and single threaded fashion) before actual recovery starts. Since this happens much before the analysis phase, there are no messages indicating the progress in the SQL Server error log. Depending on the number of VLFs this initial discovery phase can take several hours even if there are no transactions in the log that need to be processed.

   

COLLECT THE FOLLOWING DATA TO DETERMINE IF YOU ARE RUNNING INTO THIS PROBLEM:

1. Error log from the time where a database is taking a long time to recover.

2. From Management Studio, execute DBCC LOGINFO(dbname)--for the database in question. This will have to be collected when the database is online. Note: This query can take a long time to run depending on number of VLFs. Use caution when running this on a live production server.

   

WHAT TO LOOK FOR:

1. To confirm that you are running into this problem, check the error log for the database in recovery. The last message you will see for that database(while in recovery) in the log:

2008-06-26 10:29:20.48 spid58      Starting up database 'pubs'.

Once the pre-recovery has completed, you will see the following message which indicates that recovery has actually started reading the transactions. In this example, you can see that it took almost 9 minutes before the following message appeared.

2008-06-26 10:38:23.25 spid58      Analysis of database 'pubs' (12) is 37% complete (approximately 0 seconds remain). This is an informational message only. No user action is required.

2. DBCC LOGINFO(dbname) -- Since you will not be able to collect this during recovery of the database in question, run this as soon as recovery finishes to give you an idea of how many VLFs the log file has. For one of the databases that I recently came across, the log file had 1.9 million VLFs. (Hint: The number of rows returned by the DBCC command correlates to number of VLFs). The FileSize column indicates size of each VLF in bytes. Here is a sample output:

   

   

SOLUTION:

1. Run DBCC SHRINKFILE to reduce the ldf(s) to a small size, thereby reducing the number of VLFs.

2. Run DBCC LOGINFO(dbname) and make sure the number of VLFs is < 500-1000.

3. Expand(resize) the log file to the desired size using a single growth operation. This can be achieved by setting a new Initial Size for the ldf(s) in the Database Properties->Files->Database files section.

4. Run DBCC LOGINFO(dbname)and verify that the number of VLFs is still a small number. NOTE: Step 3 may create multiple VLFs even if you specify a large size. However, each VLF will be large.

5. Turn-off auto close option for the database. This can have adverse effects every time a database has to be started.

6. Backup the database.

   

Here are some references regarding the topics discussed above:

949523 The latency of a transactional replication is high in SQL Server 2005 when the value of the "Initial Size" property and the value of the Autogrowth property are small

907511 How to use the DBCC SHRINKFILE statement to shrink the transaction log file in SQL Server 2005

317375            A transaction log grows unexpectedly or becomes full on a computer that is running SQL Server

On a related topic, our recommendation to customers is to not rely on auto-grow for sizing a data/log file.  Auto-grow should be used a fallback mechanism and a good database design takes into account capacity planning needs so that they are sized appropriately upfront according to application needs. Physical operations like growth/shrink during OLTP operations can be expensive and cause performance issues.  The articles below give guidelines on using these settings as well as some diagnostics/known issues as a result of auto-grow operations.

315512  Considerations for the "autogrow" and "autoshrink" settings in SQL Server

305635  PRB: A Timeout Occurs When a Database Is Automatically Expanding

822641  Additional diagnostics added to diagnose long-running or canceled database autogrow operations in SQL Server

   

Ajay Jagannathan | Microsoft SQL Server Escalation Services

Posted by psssql | 3 Comments
Filed under: ,

An update for Standard SKU Support for Locked Pages….

Note this article has been updated to include an update for SQL Server 2005

I posted in April that we would be releasing cumulative updates for SQL Server 2008 and 2005 to support the concept of Locked Pages for SQL Server Standard.

SQL Server 2008 Cumulative Update 2 for SP1 was released yesterday to provide this support. You can download the CU at:

http://support.microsoft.com/kb/970315

SQL Server 2005 Cumulative Update 4 for SP3 was also recently released to provide this support. You can download this CU at:

http://support.microsoft.com/kb/970279

We have published a KB article that talks about the change in the CU to support Locked Pages for SQL Server Standard:

Support for Locked Pages on SQL Server 2005 Standard Edition 64-bit systems and on SQL Server 2008 Standard Edition 64-bit systems

As you can read in the article, you must do the following to enable this support:

1) Install the Cumulative Update (which requires you first install SQL 2008 SP1. Remember if you have not installed SQL Server 2008 yet, you can use slipstream support to install RTM, SP1, and CU 2 all in one setup execution).

2) Enable trace flag 845 as a startup parameter

3) Ensure the service account for SQL Server has the Locked Pages in Memory privilege set. This is not required if the service account is LOCAL SYSTEM.

As part of this change, we have updated the following KB article (note the article is updated to indicate it applies to both SQL Server 2005 and 2008)  that has many more details about Locked Pages, Working Set Trim, and the possible ramifications of enabling this for 64bit systems.

How to reduce paging of buffer pool memory in the 64-bit version of SQL Server.

 

Bob Ward
Microsoft

SQL Server Support Policy for Failover Clustering and Virtualization gets an update…

Last October I posted on our updated policy for support of SQL Server in a virtualization environment. This policy is based on a KB article that summarizes our policy. You recall this article is the following:

Support policy for Microsoft SQL Server products that are running in a hardware virtualization environment

One fairly controversial aspect to this policy was our support (actually non-support is a better word) for “guest” failover clustering. We didn’t support installing SQL Server failover clustering in a virtual machine. Well this policy is now changed effective immediately as updated in the article.

The article now contains the following new wording on this topic:

  • Guest Failover Clustering is supported for SQL Server 2005 and SQL Server 2008 in a virtual machine for Windows Server 2008 with Hyper-V, Microsoft Hyper-V Server 2008, and SVVP certified configurations provided both of the following requirements are met:
    • The Operating System running in the virtual machine (the “Guest Operating System”) is Windows Server 2008 or higher
    • The virtualization environment meets the requirements of Windows 2008 Failover Clustering as documented at The Microsoft Support Policy for Windows Server 2008 Failover Clusters.
  • Guest Failover Clustering is when you create a SQL Server failover cluster inside a virtual machine where the nodes are running as a virtual machine. A non-SVVP configuration that meets these requirements will receive support from Microsoft CSS per the policies documented in

    897615 (http://support.microsoft.com/kb/897615/ ) Support policy for Microsoft software running in non-Microsoft hardware virtualization software

    It is important to read through some of these details:

    • We support both Hyper-V and SVVP configurations
    • SQL Server 2005 and SQL Server 2008 are supported.
    • We only support Windows Server 2008 or higher as the operating system running inside the virtual machine (known as the “guest” operating system). Because of this requirement, SQL Server 2000 and earlier SQL Server releases are not supported.
    • The virtualization vendor must meet the requirements for failover clustering as documented in the following KB article: The Microsoft Support Policy for Windows Server 2008 Failover Clusters
    • Non-SVVP configurations that meet all of these requirements still can get some support from Microsoft per http://support.microsoft.com/kb/897615.

    The article on Windows 2008 Failover Clustering has 2 requirements:

    • All hardware and software components must be meet “Certified for Windows Server 2008” logo requirements.
    • The configuration must pass the Validate test in the Failover Clusters Management snap-in. This is run inside the virtual machine.

    Logo certification information is found at http://www.windowsservercatalog.com/default.aspx

    The Windows 2008 Failover Clustering article discusses the requirements for the validation test that must pass through the Failover Cluster Management snap-in.

    There are likely to be some other questions on this topic and I’ll post updates to this blog as they come in. Stay tuned for my next blog post as i walk you through the steps to install SQL Server in a cluster with Hyper-V including what the clustering validation test looks like.

     

    Bob Ward
    Microsoft

    More Posts Next page »
     
    Page view tracker