CSS SQL Server Engineers

This is the official team Web Log for Microsoft Customer Service and Support (CSS) SQL Support. Posts are provided by the CSS SQL Escalation Services

SQL Server 2008/2008 R2 on Newer Machines with More Than 8 CPUs Presented per NUMA Node May Need Trace Flag 8048


Applies To:  SQL Server 2008, 2008 R2, and Denali builds

The SQL Server developer can elect to partition memory allocations at different levels based on what the memory is used for.   The developer may choose a global, CPU, node, or even worker partitioning scheme.   Several of the allocation activities within SQL Server use the CMemPartitioned allocator, which partitions the memory by CPU or NUMA node to increase concurrency and performance.

You can picture CMemPartitioned like a standard heap (it is not created with HeapCreate, but the concept is the same).  When you create a heap you can specify whether you want synchronized access, a default size, and other attributes.   When the SQL Server developer creates a memory object, they indicate whether they want things like thread-safe access, the partitioning scheme, and other options.

The developer creates the object so that this behavior is upheld on each new allocation.  On the left of the diagram is a request from a worker against a NODE-based memory object.  This uses a synchronization object (usually of the CMEMTHREAD or SOS_SUSPEND_QUEUE type) at the NODE level to allocate memory local to the worker's assigned NUMA node.   On the right is an allocation against a CPU-based memory object.  This uses a synchronization object at the CPU level to allocate memory local to the worker's CPU.

In most cases the CPU-based design reduces synchronization collisions the most because of the way SQL OS handles logical scheduling.  Preemptive and background tasks still make collisions possible, but CPU-level partitioning greatly reduces their frequency.  However, going to CPU-based partitioning means more overhead to maintain the individual CPU access paths and associated memory lists.

The NODE-based scheme reduces the overhead to the number of nodes but can slightly increase the collision possibilities and may impact ultimate performance results for very specific scenarios.  I want to caution you that the scenarios encountered by Microsoft CSS have been limited to very specific scopes and query patterns.

[Image: NODE-based vs. CPU-based memory object allocation]

 

Newer hardware with multi-core CPUs can present more than 8 CPUs within a single NUMA node.  Microsoft has observed that as you approach and exceed 8 CPUs per node, the NODE-based partitioning may not scale as well for specific query patterns.   However, with trace flag 8048 (a startup parameter only, requiring a restart of the SQL Server process) all NODE-based partitioning is upgraded to CPU-based partitioning.   Remember that this requires more memory overhead but can provide performance increases on these systems.
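As a rough check (a sketch, not part of the original article), you can count the visible online schedulers per NUMA node to see whether a node presents more than 8 CPUs, and confirm the trace flag is active after the restart:

```sql
-- Count online schedulers (visible CPUs) per NUMA node.
-- A parent_node_id with more than 8 schedulers is the situation described above.
SELECT parent_node_id,
       COUNT(*) AS online_schedulers
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'
GROUP BY parent_node_id;

-- Verify whether trace flag 8048 is active globally after the restart.
DBCC TRACESTATUS(8048, -1);
```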

HOW DO I KNOW IF I NEED THE TRACE FLAG?

The issue is commonly identified by looking at the DMVs dm_os_wait_stats and dm_os_spin_stats for the CMEMTHREAD and SOS_SUSPEND_QUEUE types.   Microsoft CSS usually sees the spins jump into the trillions and the waits become a hot spot.
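A minimal sketch of such a check (using sys.dm_os_spinlock_stats, the documented name of the spinlock DMV):

```sql
-- Waits on the CMEMTHREAD and SOS_SUSPEND_QUEUE synchronization objects.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('CMEMTHREAD', 'SOS_SUSPEND_QUEUE');

-- Spin activity; spins climbing into the trillions over a short
-- window is the symptom described above.
SELECT name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
WHERE name IN ('CMEMTHREAD', 'SOS_SUSPEND_QUEUE');
```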

Bob Dorr - Principal SQL Server Escalation Engineer

Comments
  • Do you mean the number of CPU sockets within a single NUMA node, not the number of cores, don't you?

    Thank you for the great information. :)

  • I believe the article meant to reference sys.dm_os_spinlock_stats instead of dm_os_spin_stats

  • Answering the question(s).

    I meant the number of CPUs presented for a given node.   This includes cores and HT CPUs represented in the same NUMA node.

    Yes dm_os_spinlock_stats - good catch.

  • At what rate of spins per hour or day would this be considered a bottleneck? Wait stats can be skewed by other wait types far exceeding the ones mentioned in this article.

  • For these issues we would see a big jump (trillions of spins) in a small window (5 to 10 minutes) that was really unusual for the normal running of the system.
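One way to capture such a delta over a window (a hypothetical sketch: snapshot the DMV, wait, then diff):

```sql
-- Snapshot the spinlock counters of interest into a temp table.
SELECT name, collisions, spins
INTO #spin_before
FROM sys.dm_os_spinlock_stats
WHERE name IN ('CMEMTHREAD', 'SOS_SUSPEND_QUEUE');

-- Wait out the sampling window (10 minutes here).
WAITFOR DELAY '00:10:00';

-- Report the growth over the window.
SELECT s.name,
       s.collisions - b.collisions AS collisions_delta,
       s.spins - b.spins           AS spins_delta
FROM sys.dm_os_spinlock_stats AS s
JOIN #spin_before AS b ON b.name = s.name;

DROP TABLE #spin_before;
```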

  • "Remember this requires more memory overhead"...

    What can be captured before and after implementing trace flag 8048 to indicate the overhead on a given system?  I've implemented trace flag 8048 on a number of DSS systems, and it's been beneficial on most of them.  But I haven't figured out where the overhead of per-core memory allocation vs. per-NUMA-node memory allocation would show up.  Somewhere in memory clerk stats, or purely CPU consumption?  Cross-posting the question to Rohit Nayak's recent post on Diagnosing CMEMTHREAD Waits...

    blogs.msdn.com/.../how-it-works-cmemthread-and-debugging-them.aspx

  • " I want to caution you the scenarios encountered by Microsoft CSS have been limited to very specific scopes and query patterns."

    In my experience the pattern I've seen this with is a batch reporting model, where SAP or SSRS sends hundreds (or thousands) of queries to SQL Server in a very short period of time.

  • 1. In SQL Server 2008 R2 and SQL Server 2012, trace flag 8048 no longer seems to be a startup only trace flag. I'm not sure it's safe to change memory serialization strategy while a workload is in place, but I've seen it take place in a spinlock convoy situation and it had the desired performance impact.

    2. Memory serialization spinlock contention at the NUMA node or 'scheduler group' level can occur with less than eight cores per socket, and can occur even in a six vcpu vm.  The necessary ingredients are enough queries/threads competing for the spinlock resource while stealing memory against query memory grant.  Each allocation against the grant is a potential for spinlock contention.

    3. VMs consolidated on a physical server can be even more susceptible than physical servers to this spinlock, if competing for physical CPU time with other guests.  Time waiting for physical CPU can exaggerate the hold time of spinlock resources, increasing the time window for contention in each possession.

  • We have a server with 1,500 user databases and an awful ad hoc workload. We are on 2008 and have 10 logical cores per socket/NUMA node. We see a large percentage of CMEMTHREAD waits (75%). BUT as far as spinlock stats go, we're seeing SOS_TLIST as the largest spinner. This is a delta on a 10 minute sample:

    Spinlock           Collisions  Spins       SpinsPerCollision  SleepTime  Backoffs
    SOS_TLIST          4673403     2274201787  486                2877       36028
    LOCK_HASH          7391        6823378     923                0          893
    BUF_FREE_LIST      5409        2324733     429                7          154
    SOS_OBJECT_STORE   3036        66379       21                 0          3
    SOS_SUSPEND_QUEUE  1801        588115      326                1          42

    How do I determine if this trace flag is appropriate?
