SQL Server 2005 has been designed to take advantage of NUMA features exposed by both hardware and the Windows operating system. There are several caveats that you need to be aware when attempting to run SQL Server on NUMA platforms.
In this post I would like to go over both Windows's and SQL Server's NUMA support, describe their possible configuration, and give some advises on troubleshooting some of the issues.
The latest SQL Server's CTP release has most of the NUMA support already built in so that you can play with it and see for yourself actual NUMA in action. If you are more interested in troubleshooting without understanding the actual issues you could jump to troubleshooting section right a way.
Background: I expect you understand classic cc-NUMA configuration so that I don't have to go into details of explaining them.
Windows NUMA Support
NUMA hardware platforms could be configured differently. The two major configurations are pure NUMA or Interleaved NUMA. Pure NUMA will appear to operating system as a set of CPU nodes, sometimes called pods, with local memory. Depending on a hardware manufacture there will be a different penalty to access memory residing on other nodes - remote memory. In this configuration for an application to perform well it needs to minimize a number of remote accesses.
When system is configured to use interleaved memory, to operating system it appears as a large SMP box where memory for cache lines is interleaved from different NUMA's nodes. This configuration is suitable for applications that don't have NUMA optimizations and might behave badly in pure NUMA configuration. For example if you don't have SQL2000 post SP3 QFE that has NUMA optimizations you shouldn't be running SQL Server on the pure NUMA configuration. In general SQL2000 doesn't have many NUMA optimization so that you might want always to configure system as interleaved.
When Windows starts up on pure NUMA hardware it recognizes system multi node configuration and boots accordingly. From our experience we have noticed that during the start OS will allocate memory mostly from a single node - keep in mind that this does depend on memory availability on the nodes as well as a number of applications OS has to start. This allocation pattern during OS startup could be problematic for a NUMA aware application - since memory is not distributed evenly across the nodes and there is no way to find out memory distributions across the nodes. As a system continue running the problem becomes more severe - there is less and less free memory available on the nodes. It could be due to memory hungry applications or to System File Cache, SFC. Depending on the system configuration it is possible for System File Cache to allocate significant amount of memory. Moreover on NUMA configurations it is quite possible for SFC to unevenly use memory across nodes. If a NUMA aware application only attempts to allocate local memory it might become a victim of memory starvation on the nodes in which it collided with SFC or with other memory hungry applications.
Windows exposes a set of APIs that allows applications to take advantage of NUMA:
There are several API's "gotchas" that you have to be aware of:
The consequences of A & C is that NUMA aware applications should have more of special handling for remote memory. A & C might cause OS to return memory in random order such as remote, local, remote, and etc. Keep in mind if applications decides not to cache remote memory it might get in trouble because when memory is freed it gets put on the free list and will be given away on the next allocation request.
The consequences of B is that applications can't reliably decide whether they can or can't allocate memory from the node. Case B usually occurs after system runs for some time especially when large applications are present.
SQL Server 2005 NUMA Support
SQL Server 2005 attempts to take full advantage of NUMA architecture by leveraging SQLOS for more info see http://blogs.msdn.com/slavao/articles/441058.aspx. During startup SQL Server will configure itself base on underlying OS and hardware configuration. It will create the environment inside of itself to mimic the actual hardware - SQL Server will create a software abstraction, for purpose of our discussion we will call it a Node, around every NUMA node and its memory.
Every SQL Server's Node has a memory manager, set of schedulers each of which maps to a underlying CPU; I/O port and other components. SQL Server's Node could be considered as a separate SQL Server instance with its own I/O port that it listens on. Clients or separate applications can be configured to connect to a specific Node only. A Node can be online - accepting new work or offline, accepting neither new work or connections. State of a Node could be changed on the fly using affinity settings - see setting affinity below. A connection could be made to a given Node or set of Nodes - a network port could be associated with multiple Nodes. If no extra configuration specified a new connection is assigned to a next Node in a round robin fashion. If connection is made to a set of Nodes it will be assigned to a next Node in a round robin fashion across that set of Nodes. BOL has plenty of information on how to configure SQL Server's ports, nodes and clients to connect. Once connection is created it is bound to a Node until disconnected. If connection's Node is moved offline its work will be scheduled on other Nodes. All the memory that is required to serve the connection's needs is allocated locally from the memory that is attached to the Node. Database pages are always allocated locally from the Node's memory they accessed on unless they have been brought into the memory by connection assigned to another Node. Please keep in mind that in the latest CTP SQL Server attempts not to allocate memory from different Node. This behavior might cause premature OOM or significant query slow down.
To support different NUMA features SQL Server can be configured both on server and client site. On the server site DBAs can configure amount of memory, active Nodes, a number of threads, locked pages, network configuration - what NIC is bound to what Node. On the client side DBA can configure particular clients/applications to connect to specific Nodes.
Below are two settings that affect SQL Server behavior on NUMA
Troubleshooting problems related to SQL Server running on NUMA
A query runs sporadically slow even if plan doesn't change or clients observes other significant perf degradations - is the most common problem. The major cause of the problem is memory exhaustion on the Node query is running on. The memory exhaustion could happen due to multiple reasons:
Possible solution for this problem is to change SQL Server affinity to move Node 0 offline.
Possible solution for this problem is to grab all the memory for SQL Server before starting other applications. You can achieve it by utilizing max server memory, dynamic scheduling and clients affinity. The idea is to start one Node at a time, connect to the node, apply load to allocate memory on the Node, start next Node, apply load, force memory allocation and so on. At the end set min and max memory to be the same so that SQL Server doesn't shrink its memory usage under the load. Here are the steps you need to make to achieve it:
Your comments are welcome!