I ran into this behavior working on a 1 trillion row, spatial index build but the behavior can apply to any Gather Streams operator, retaining the sort order as rows pass though it. I was just surprised a bit by the behavior until I dug deeper to understand.
The index was taking just short of 2 hours to build on my 64 way, 128 GB RAM test system. The behavior I observed was the drop in CPU usage and parallelism. The first ~40 minutes all 64 CPUs are using 100% of the CPU but the last 1 hour and 10 minutes of the index build, only the controlling worker consumes 100% CPU on a single processor.
Digging into this I found the top of the plan looked like the following. Node 1 being the final, SORTED insert, from the gather streams activity.
During the index build the lower part of the plan was executing the parallel, nested loop behavior on all 64 CPUs and building the large sort groupings, in parallel, on each of the child workers (first 40 minutes.) Once all the rows are sorted, per worker, the gather streams activity has to merge the 64 individual sorts into the overall sort order as it performs the final inserts.
This is deemed an order preserving gather (merge) operation. The consumer pulls a row from each worker and keeps an in-memory tree. In the case of the 64 parallel workers the tree would have 64 entries. If MAX DOP is 16 the tree would contain 16 entries. The tree is maintained in sorted order so the pattern of execution will look like the following on a 4 processor system.
- Get Row From Worker/Partition #1 – Insert into tree
- Get Row From Worker/Partition #2 – Insert into tree
- Get Row From Worker/Partition #3 – Insert into tree
- Get Row From Worker/Partition #4 – Insert into tree
- While (entries in tree)
Output lowest, sorted value from tree
Get Row from Worker/Partition you just removed from tree as lowest value
This design keeps the tree pruned equal to or just above the maximum, sub-process workers. The consumer performs the merging of the individual, parallel, sort operations and inserts the results into the index as requested. The serialization of the final sort order is what I am seeing during the final phase of my index build.
Since my table and index all fit into memory on this large system the activity is taking place in memory and leveraging the CPU fully. Looking at the details in sys.dm_exec_requests and sys.dm_os_waiting_tasks I can see the gather streams activity, associated with Node 1 in the plan, is driving the last ~01:10:00 on a single CPU. In fact, setting processor affinity you will observe the controlling workers’ CPU light up for the final 01:10:00 of the index build.
In observing the behavior the sys.dm_os_wait_stats shows a sharp increase in CXPacket waits and wait times. This is expected as the final thread is going to be pulling the data from 64 workers that ran at 100% CPU already, so it is unlikely a single CPU can process the data as fast and the CXPacket exchange activity will encounter waits.
The description for the wait indicates a wait at node 1 asking for a new row from the producers.
exchangeEvent id=Pipe307f0be530 WaitType=e_waitPipeNewRow nodeId=1
There are lots of documents and posts about the best setting to optimize parallel plans. Some state MAX DOP = 8, others MAX DOP = 32 and both are correct depending on the type of plan (actual query patterns) as the performance may vary. Since this was a create index I decided to do some experiments with the DOP level.
- 64 CPUs = 01:50:00
- 32 CPUs = 02:17:00
- 16 CPUs = 03:16:00
What I observed is that for this specific create index (spatial) the lower the MAX DOP the larger the change in the original 40 minute part of the plan. This is what I expected to see. The first part of the plan is already CPU bound so adding more CPU resources lets that part of the plan execute faster.
The CXPacket waits on the final portion of the plan don’t change significantly with different DOP levels. The time remains generally steady at 01:10:00.
This was a unique pattern because the controlling worker was not showing additional waits (usually the I/O will show up but because everything fit into memory the CPU was the dominant resource.) The controlling worker only showed normal scheduler yield activities.
What I found was it takes 01:10:00 on my 1.88Ghz CPU to sort (merge) 1 trillion rows onto the index pages. If I want to reduce the final portion of the plan I would need to move to a faster CPU. SQL Server did use parallel resources as much as possible to build the index.
Bob Dorr - Principal SQL Server Escalation Engineer