Did you know that hash join is most efficient when one of the tables is significantly different in size than another one? Want to know why? Then read on…
SQL Server uses different kinds of join operations – e.g. nested loop join, merge join, hash join… So, when you say “INNER JOIN”, SQL Server has to decide exactly on how to implement this…
In the worst cast situation, when you have no indexes on the joined columns, SQL may choose to use a hash join. Or you as a developer can use a hash join hint, e.g.
SELECT a.au_id FROM authors a
JOIN titleauthor b ON a.au_id = b.au_id OPTION (HASH JOIN)
INNER HASH JOIN titleauthor b ON a.au_id = b.au_id
(assuming there is no index on the au_id column)
A hash join, uses what’s known as a hash table (you may want to check out my previous post, SYSK 89, where I talked about what is hashing and hash tables).
The column name that joins the table is called a hash key. In the example above, it’ll be au_id. SQL Server examines the two tables being joined, chooses the smaller table (so called build input), and builds a hash table applying a hash algorithm to the values of a hash key. Each row is inserted into a hash bucket depending on the hash value computed for the hash key. If build input is done completely in-memory, the hash join is called an “in-memory hash join”. If SQL Server doesn’t have enough memory to hold the entire build input, the process will be done in chunks, and is called “grace hash join”.
The other (larger) table (referred to as probe input) is then is scanned one row a time, the hash key value is computed and compared against the build input, but only in the matching hash bucket, resulting in better performance.
As you can see, a hash join is a memory intensive operation. The optimizer will avoid generating access plans using a hash join if it detects that a low memory situation may occur during query execution.
Another interesting tit bit of information is that hash join locks rows in its inputs before they are copied to memory.