SQL Server uses one of two different strategies to parallelize a hash join. The more common strategy uses hash partitioning. In some cases, we use broadcast partitioning; this strategy is often called a “broadcast hash join.”
The more common strategy for parallelizing a hash join distributes the build rows (i.e., the rows from the first input) and the probe rows (i.e., the rows from the second input) among the individual hash join threads using hash partitioning. If a build row and a probe row share the same key value (i.e., they will join), they are guaranteed to hash to the same hash join thread. After the data has been hash partitioned among the threads, the hash join instances all run completely independently on their respective data sets. The absence of any inter-thread dependencies ensures that this strategy scales extremely well as we increase the degree of parallelism (i.e., the number of threads).
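The co-location guarantee can be illustrated with a short sketch. This is conceptual Python, not SQL Server internals; the tables, `DOP`, and `partition` helper are all stand-ins for illustration:

```python
# Conceptual sketch (not SQL Server internals): hash partitioning routes each
# row to a thread by hashing its join key, so any build and probe rows that
# share a key value always land on the same thread.

DOP = 4  # degree of parallelism (number of hash join threads)

def partition(rows, key_index):
    """Distribute rows among DOP threads by hashing the join key column."""
    threads = [[] for _ in range(DOP)]
    for row in rows:
        threads[hash(row[key_index]) % DOP].append(row)
    return threads

# Stand-ins for T1 (join key T1.b at index 1) and T2 (join key T2.a at index 0).
build_rows = [(i, i) for i in range(8)]
probe_rows = [(i, i) for i in range(4, 12)]

build_parts = partition(build_rows, 1)
probe_parts = partition(probe_rows, 0)

# Matching rows are guaranteed to be co-located, so each thread can run a
# completely independent hash join on its own partitions.
for b in build_rows:
    for p in probe_rows:
        if b[1] == p[0]:
            assert hash(b[1]) % DOP == hash(p[0]) % DOP
```

Because the same hash function is applied to both inputs, no thread ever needs to look at another thread's partition.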
As with all of my parallelism examples, I am using a large table to induce the optimizer to choose a parallel plan. If you try these examples, it may take a few minutes to create these tables.
create table T1 (a int, b int, x char(200))

set nocount on
declare @i int
set @i = 0
while @i < 1000000
  begin
    insert T1 values(@i, @i, @i)
    set @i = @i + 1
  end

select * into T2 from T1
select * into T3 from T1
select * from T1 join T2 on T1.b = T2.a
|--Parallelism(Gather Streams)
     |--Hash Match(Inner Join, HASH:([T1].[b])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[b]))
          |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([T1].[b]))
          |    |--Table Scan(OBJECT:([T1]))
          |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([T2].[a]))
               |--Table Scan(OBJECT:([T2]))
Note that unlike the parallel nested loops join, we have exchanges (i.e., parallelism operators) between both table scans (both the build and the probe inputs) and the hash join. These exchanges hash partition the data among the hash join threads.
Consider what will happen if we try to parallelize a hash join using hash partitioning, but we have only a small number of rows on the build side of the hash join. If we have fewer rows than hash join threads, some threads might receive no rows at all. In this case, those threads would have no work to do during the probe phase of the join and would remain idle. Even if we have more rows than threads, due to the presence of duplicate key values and/or skew in the hash function, some threads might get many more rows than others.
To eliminate the risk of skew, when the optimizer estimates that the number of build rows is relatively small, it may choose to broadcast these rows to all of the hash join threads. Since all build rows are broadcast to all hash join threads, in a broadcast hash join, it does not matter where we send the probe rows. Each probe row can be sent to any thread and, if it can join with any build rows, it will.
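The difference from hash partitioning can be sketched as follows. Again this is conceptual Python under assumed inputs, not the engine's implementation; the round robin assignment of probe rows is just one possible placement, since with broadcast any placement works:

```python
# Conceptual sketch (not SQL Server internals): with broadcast partitioning,
# every thread receives a copy of the (small) build input, so each probe row
# can be sent to any thread -- here, round robin.

DOP = 4

# Small build side, e.g. T1 after a very selective filter; join key at index 1.
build_rows = [(0, 7)]
probe_rows = [(k, k) for k in range(10)]  # stand-in for T2; join key at index 0

# Broadcast: each thread builds its own hash table over ALL build rows.
hash_tables = [{b: a for (a, b) in build_rows} for _ in range(DOP)]

# Probe rows are distributed round robin; placement no longer matters because
# every thread can match any key.
results = []
for i, (a, _) in enumerate(probe_rows):
    table = hash_tables[i % DOP]
    if a in table:
        results.append((table[a], a))

assert results == [(0, 7)]  # only the probe row with key 7 joins the build row
```

No thread can be starved of build rows, and no probe row can be routed to the "wrong" thread, which is exactly why skew on the build side stops being a concern.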
Here is an example:
select * from T1 join T2 on T1.b = T2.a where T1.a = 0
|--Parallelism(Gather Streams)
     |--Hash Match(Inner Join, HASH:([T1].[b])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[b]))
          |--Parallelism(Distribute Streams, Broadcast Partitioning)
          |    |--Table Scan(OBJECT:([T1]), WHERE:([T1].[a]=(0)))
          |--Table Scan(OBJECT:([T2]))
Note that the exchange above the scan of T1 is now a broadcast exchange while we have completely eliminated the exchange above the scan of T2. We do not need an exchange above T2 because the parallel scan automatically distributes the pages and rows of T2 among the hash join threads. This is similar to how the parallel scan distributed rows among nested loops join threads for the parallel nested loops join. Similar to the parallel nested loops join, if we have a serial zone on the probe input of a broadcast hash join (e.g., due to a top operator), we may need a round robin exchange to redistribute the rows.
So, if broadcast hash joins are so great (they do reduce the risk of skew problems), why don’t we use them in all cases? The answer is that broadcast hash joins use more memory than their hash partitioned counterparts. Since we send every build row to every hash join thread, if we double the number of threads, we double the amount of memory that we need. With a hash partitioned parallel hash join, we need the same amount of memory regardless of the degree of parallelism.
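The memory trade-off amounts to simple arithmetic. A back-of-the-envelope sketch, assuming a hypothetical 10 MB build side:

```python
# Rough memory accounting under an assumed 10 MB build side: hash partitioning
# splits the build rows across threads (the total is constant), while
# broadcasting copies them to every thread (the total grows with DOP).

BUILD_BYTES = 10 * 1024 * 1024  # hypothetical size of the build input

def hash_partitioned_memory(dop):
    return BUILD_BYTES            # divided among dop threads; total unchanged

def broadcast_memory(dop):
    return BUILD_BYTES * dop      # every thread holds a full copy

assert hash_partitioned_memory(8) == hash_partitioned_memory(4)
assert broadcast_memory(8) == 2 * broadcast_memory(4)  # doubling DOP doubles memory
```

This is why broadcasting only pays off when the optimizer expects the build side to be small.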
Finally, suppose we have a fairly selective filter on the build input to a hash join:
select * from T1 join T2 on T1.b = T2.a where T1.a < 100000
|--Parallelism(Gather Streams)
     |--Hash Match(Inner Join, HASH:([T1].[b])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[b]))
          |--Bitmap(HASH:([T1].[b]), DEFINE:([Bitmap1008]))
          |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([T1].[b]))
          |         |--Table Scan(OBJECT:([T1]), WHERE:([T1].[a]<(100000)))
          |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([T2].[a]), WHERE:(PROBE([Bitmap1008])=TRUE))
               |--Table Scan(OBJECT:([T2]))
What’s the bitmap operator? The predicate “T1.a < 100000” eliminates 90% of the build rows from T1. It also indirectly eliminates 90% of the rows from T2 because they no longer join with rows from T1. The bitmap operator provides an efficient way to apply the T1 filter directly to T2 without passing the rows all the way through the exchange to the join. As its name suggests, it builds a bitmap. Just like a hash join, we hash each row of T1 on the join key T1.b and set the corresponding bit in the bitmap. Once the scan of T1 and the hash join build is complete, we transfer the bitmap to the exchange above the scan of T2 where we use it as a filter. This time we hash each row of T2 on the join key T2.a and test the corresponding bit in the bitmap. If the bit is set, the row may join and we pass it along to the hash join. If the bit is not set, the row cannot join and we discard it. For more information on bitmaps, see this post from the SQL Server Query Processing Team blog.
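The build-then-filter flow can be sketched in a few lines. This is a conceptual Python illustration under assumed inputs, not the actual bitmap implementation:

```python
# Conceptual sketch of the bitmap filter (not the actual SQL Server
# implementation): the build side sets a bit per join key; probe rows whose
# bit is unset provably cannot join and are discarded early.

BITMAP_SIZE = 1024
bitmap = [False] * BITMAP_SIZE

# Stand-ins: T1 rows surviving the selective filter (join key T1.b at index 1)
# and T2 rows (join key T2.a at index 0).
t1_rows = [(i, i) for i in range(10)]
t2_rows = [(i, i) for i in range(5, 20)]

# Build phase: hash each T1.b and set the corresponding bit.
for (_, b) in t1_rows:
    bitmap[hash(b) % BITMAP_SIZE] = True

# Probe phase: test the bit for T2.a. A set bit means "may join" (hash
# collisions can cause false positives); an unset bit means "cannot join".
candidates = [row for row in t2_rows if bitmap[hash(row[0]) % BITMAP_SIZE]]

# No false negatives: every T2 row that really joins survives the filter.
t1_keys = {b for (_, b) in t1_rows}
assert all(row in candidates for row in t2_rows if row[0] in t1_keys)
```

Note that the bitmap is a safe pre-filter, not an exact one: collisions can let a non-joining row through (the hash join's residual predicate still catches it), but a row rejected by the bitmap can never have joined.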
Why does broadcast hash join use more memory? Instead of copying the build rows to each worker thread, can we just let them read data from shared memory?
Theoretically, we could build a single shared hash table for a broadcast hash join. Of course, this approach would require extra synchronization in the hash join itself. (Although we should only use a broadcast hash join if the hash table fits in memory, the required synchronization would be especially complex if the join runs out of memory and needs to spill to disk.) By broadcasting the data, we can use the same hash join implementation for both hash partitioned and broadcast hash joins. In addition, there are other potential benefits to broadcasting the data. For example, on a NUMA machine, each hash join thread can use local memory for its hash table. HTH, Craig
What are the pros and cons of forcing a hash join?
I have two tables of 500M rows each, and I inner join them. The execution plan shows that the tables are joined with a merge join. I want to optimize the query's performance and force a hash join instead.
You may want to review the table in this post: blogs.msdn.com/.../702828.aspx
The primary disadvantages of hash joins are that they consume memory and block. The main advantages are better scalability and parallelism for large joins. Whether a merge or hash join is better in your case depends on the details of your scenario. If your merge join is one to many and uses indexes to obtain the sort order, it may well be the better choice.
Craig, is there a way to force broadcast hashing? If I have enough memory, I would not mind burning DRAM and avoid the exchange operator.
AFAIK, there is no way to force broadcast partitioning for hash join. ;(