by Bob Golding

The introduction of NUMA (Non-Uniform Memory Access) required changes in memory management.  Because accessing memory that is not local to a node can mean going over an interconnect such as fibre, the memory manager tries to allocate memory locally to avoid the performance cost of remote access.  This is how that mechanism works in Windows Server 2008.

What is NUMA?

A NUMA architecture is essentially a set of small groups of processors, each group having its own memory and possibly its own I/O channels.  Each group of CPUs is called a 'node'.  A group can access another group's memory without having to worry about coherency.  Memory that is on the same node as a group of processors is called local or near memory - although we do support configurations where some memory nodes may not have any local CPUs.  Memory outside of a node is called foreign or far memory; it is local to some other node, and reaching it may require going over an interconnect such as fibre.  Because that is more expensive, the OS tracks which node each physical page resides on and uses this information to allocate memory optimally.
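
To see how this topology is exposed to software, here is a minimal user-mode sketch that lists each node's processors and currently available memory.  The APIs used (GetNumaHighestNodeNumber, GetNumaNodeProcessorMask and GetNumaAvailableMemoryNode) are documented Win32 calls; the program itself is only an illustration, not code from the case discussed later.

/* numa_nodes.c - illustrative sketch: enumerate NUMA nodes with Win32 APIs. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG     highestNode = 0;
    ULONG     node;
    ULONGLONG processorMask;
    ULONGLONG availableBytes;

    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }

    for (node = 0; node <= highestNode; node++) {
        processorMask  = 0;
        availableBytes = 0;

        /* Which processors belong to this node, and how much of its local
           memory is currently available. */
        GetNumaNodeProcessorMask((UCHAR)node, &processorMask);
        GetNumaAvailableMemoryNode((UCHAR)node, &availableBytes);

        printf("Node %lu: processor mask 0x%016I64x, available %I64u MB\n",
               node, processorMask, availableBytes / (1024 * 1024));
    }

    return 0;
}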

How is memory tracked?

In versions prior to Windows XP, each memory page had a 'color' that described its cache locality.  When a memory page was on the Free or Zeroed list, it was also on a color list.  This mechanism was enhanced so that the color now incorporates the processor node number, with each node having a number of colors of its own.  When the system is initialized, the memory manager calls a function named HalpNumaQueryPageToNode to get the node number for a physical address.

How is the memory organized?

There are two lists: the Zeroed Memory list and the Free Memory list.  In the debugger:

nt!MmFreePagesByColor = struct _MMCOLOR_TABLES *[2]

This symbol is an array of two pointers to the color tables: the first entry points to the Zeroed color table and the second points to the Free color table.  Each entry in a table looks like this:

nt!_MMCOLOR_TABLES

   +0x000 Flink            : Uint8B      <<-- Page #

   +0x008 Blink            : Ptr64 Void  <<-- PFN address

   +0x010 Count            : Uint8B
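
To make the layout easier to follow, here is a rough C rendering of these structures.  The names are made up for illustration and the comments come from the annotations in the dump above; this is not the kernel's actual definition.

/* Illustrative sketch only - mirrors the dt output above, not the real kernel types. */
#include <stdint.h>

typedef struct _MMCOLOR_TABLES_SKETCH {
    uint64_t Flink;   /* +0x000: page frame number ("Page #" above) - the head of this color list */
    void    *Blink;   /* +0x008: a PFN database entry address ("PFN address" above) */
    uint64_t Count;   /* +0x010: number of pages currently on this color list */
} MMCOLOR_TABLES_SKETCH;  /* sizeof == 0x18, the entry size used in the offset math later */

/* Mirrors nt!MmFreePagesByColor: an array of two pointers, the first to the
   Zeroed color table and the second to the Free color table.  Each table has
   one entry per color per node. */
static MMCOLOR_TABLES_SKETCH *FreePagesByColorSketch[2];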

To work out how many color lists there are and how they are indexed, you need to look at a couple of other locations.  The example below is from a system with 4 nodes:

nt!MmSecondaryColorNodeShift = 0x3

nt!MmSecondaryColorMask = 7

MmSecondaryColorNodeShift is the number of bits the node number is shifted left to reach the first color for that node.  MmSecondaryColorMask is the mask used to obtain the color within the node: the mask is applied to the page number, and the result is OR'd with the shifted node number to get the index into the color table (multiplying the index by the 0x18-byte entry size gives the byte offset).
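
To make the arithmetic concrete, here is a small sketch of that computation using this system's values (a shift of 3, a mask of 7 and the 0x18-byte entry size from the dt output above).  The function and constant names are made up for illustration; they are not kernel routines.

/* color_offset.c - illustrative sketch of the color table offset computation. */
#include <stdint.h>
#include <stdio.h>

#define SECONDARY_COLOR_NODE_SHIFT 0x3   /* nt!MmSecondaryColorNodeShift on this system */
#define SECONDARY_COLOR_MASK       0x7   /* nt!MmSecondaryColorMask on this system */
#define COLOR_TABLE_ENTRY_SIZE     0x18  /* size of one _MMCOLOR_TABLES entry */

static uint64_t ColorTableOffset(uint64_t node, uint64_t pfn)
{
    /* Shift the node number up to the first color for that node, then OR in
       the color within the node (the low bits of the page frame number). */
    uint64_t index = (node << SECONDARY_COLOR_NODE_SHIFT) |
                     (pfn & SECONDARY_COLOR_MASK);

    /* Multiply by the entry size to turn the index into a byte offset. */
    return index * COLOR_TABLE_ENTRY_SIZE;
}

int main(void)
{
    /* The page used in the walkthrough below: PFN 0x86152D on node 1. */
    printf("offset = 0x%llx\n",
           (unsigned long long)ColorTableOffset(1, 0x86152D));  /* prints offset = 0x138 */
    return 0;
}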

Can I have an example?

OK, as an example, let's take page 86152d, which has been assigned to node 1:

14: kd> !pfn 86152d

    PFN 0086152D at address FFFFFA801923F870

    flink       0086152C  blink / share count 0086152E  pteaddress FFFFF6FC0430A968

    reference count 0000    used entry count  0000      Cached    color 1   Priority 0

    restore pte 00861525  containing page        FFFFFFFFFFFFF  Zeroed

While the page is on a color list, the PFN entry's fields double as the list links: the restore PTE field holds the forward link (861525) and the containing page field holds the back link (-1).

So, to get the offset into the color table, compute ((1 << 3) | (0x86152D & 0x7)) * 0x18 = 0xD * 0x18 = 0x138 (0x18 is the size of each color table entry):

14: kd> dq fffffa80`317fffd0+(18*d)      <<-- fffffa80`317fffd0 is the start of the Zeroed color table

fffffa80`31800108  00000000`0086152d fffffa80`18bd5670

fffffa80`31800118  00000000`0000445c 

Are there any debugger extensions that will help with this?

To display which memory belongs to which node, use !numa_hal:

14: kd> !numa_hal

HAL NUMA Summary

----------------

    Node Count      : 4

    Processor Count : 16

    Node   ProximityId

    ------------------

    0x00   0x00000000

    0x01   0x00000001

    0x02   0x00000002

    0x03   0x00000003

    Proc   Domain       APIC Id

    ---------------------------

    0x00   0x00000000   0x00000000

    0x01   0x00000000   0x00000001

    0x02   0x00000000   0x00000002

    0x03   0x00000000   0x00000003

    0x04   0x00000001   0x00000004

    0x05   0x00000001   0x00000005

    0x06   0x00000001   0x00000006

    0x07   0x00000001   0x00000007

    0x08   0x00000002   0x00000008

    0x09   0x00000002   0x00000009

    0x0A   0x00000002   0x0000000A

    0x0B   0x00000002   0x0000000B

    0x0C   0x00000003   0x0000000C

    0x0D   0x00000003   0x0000000D

    0x0E   0x00000003   0x0000000E

    0x0F   0x00000003   0x0000000F

    Domain      Range

    -----------------

    0x00000000  0x0000000000000000 -> 0x0000000480000000

    0x00000001  0x0000000480000000 -> 0x0000000880000000

    0x00000002  0x0000000880000000 -> 0x0000000C80000000

    0x00000003  0x0000000C80000000 -> 0xFFFFFFFFFFFFFFFF

As you can see from the above, the memory is assigned linearly by node.  What kind of problem do you think MmAllocatePagesForMdlEx would cause if the highest acceptable address were fffff000 and the call ran on CPU 9?  What if a number of such requests ran on every node except node 0?
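
As a rough sketch of what such a request might look like from a driver - the function name here is made up for illustration, not the actual code from the incident:

/* Illustrative sketch: ask for pages below 4 GB regardless of which node the
   calling CPU belongs to.  Error handling and the surrounding driver code are
   omitted. */
#include <ntddk.h>

static PMDL AllocateLowPages(SIZE_T totalBytes)
{
    PHYSICAL_ADDRESS lowAddress;
    PHYSICAL_ADDRESS highAddress;
    PHYSICAL_ADDRESS skipBytes;

    lowAddress.QuadPart  = 0;
    highAddress.QuadPart = 0xFFFFF000;   /* below 4 GB, as in the question above */
    skipBytes.QuadPart   = 0;

    /* On the system above, CPU 9 belongs to node 2, whose local physical
       memory starts at 0x880000000 - entirely above the requested limit. */
    return MmAllocatePagesForMdlEx(lowAddress,
                                   highAddress,
                                   skipBytes,
                                   totalBytes,
                                   MmCached,
                                   0);
}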

Epilog

The question asked above actually happened.  The answer is that the machine would 'pause' for a period of time while it futilely searched node 2's memory for pages that could satisfy the request, before eventually searching the other nodes.  That is the issue we worked on, and it is what prompted this research.  I hope this gives you a better understanding of NUMA and memory management.

Bob Golding has been with Microsoft since 1997. He is a Senior Escalation Engineer on the Global Escalation Services team where he supports Microsoft's largest customers with their most critical issues. Bob can be reached at rgolding@microsoft.com.  For more information about debugging Windows, visit http://blogs.msdn.com/ntdebugging.