Conor Cunningham’s blog on SQL Server, data-driven applications, and pretty much whatever other random stuff he decides to post.
In my previous post, I reviewed the basic logic behind why the Optimizer chooses one join algorithm over another. In this post, I’ll go one step further and discuss what happens when the basic logic for joins is combined with the logic for how UDFs work.
(Apologies – I need to find the code plugin for my blog writer)
Here is the code snippet from the customer:
Drop function Func go create function Func(@id integer) returns varchar(510) as begin return cast(@id as varchar(510)) end go drop table #x1 go Create table #x1 ( Id integer ) go insert into #x1(Id) select top(67) ROW_NUMBER() over (order by (select null)) from sys.columns a cross join sys.columns b go select * from #x1 a join #x1 b on dbo.Func(a.id) = dbo.Func(b.id) option(recompile) go insert into #x1(Id) select 68 go select * from #x1 a join #x1 b on dbo.Func(a.id) = dbo.Func(b.id) option(recompile)
The question is why the first query does a loops join and the second does a hash join for these UDFs. As you might guess, the answer is actually mostly independent of the UDFs and really has more to do with the number of rows. However, there is a bit more to the story, so I will post a bit more for you about UDFs.
This particular UDF wraps a single piece of scalar logic. However, UDFs are opaque to the Optimizer in SQL Server. So, it doesn’t expand these functions and in-line them. (it could, but the code does not do this today). Generally, most general scalars are not costed in the current SQL Server code since they are relatively cheap compared to the cost of IO. Relational operations, such as a hash join, do have a cost function that has a CPU component, so the cost of processing the scalars are typically wrapped up there. There is no real scalar costing support for your own scalar functions, however. So, these are not really impacting the plan choice in this example.
Interestingly, if you were to rewrite the query without the UDFs and just inline the scalar logic manually, there does appear to be a cost to that and the plan choice tips to hash join earlier.
I generally recommend that customers NOT use UDFs in cases where plan choice is important to their system (ie big queries) and to specifically not put subqueries inside of UDFs. Using a procedural programming paradigm when writing SQL will usually lead to horrible performance because it prevents the Optimizer from making reasoned choices about the join order/algorithm (they too are not costed, even though they do queries in them).
So, I hope that gives you a bit of insight as to how the costing framework works.
Can you explain the join logic on table-valued UDFs? We utilize TVFs for large chunks of data in joins and it seems that performance is hit or miss.
In my experience (and you can find a talk I did on this at SQLBits by searching for "Designing for Simplification"), inline TVFs (ones that don't use BEGIN/END) will be treated in the same way as subqueries, and simplified out. Multi-line TVFs will be treated as procedural, without stats, much like UDFs.
But I'd love to hear Conor's thoughts too.
Single line query statements are inlined today and these are treated like subqueries in SQL Server's model.
Multi-line TVFs are spooled and this spool is not populated at the time that the containing query is compiled, generally.
I tell customers to be very careful using TVFs unless they are very sure of the plan choices or simply don't need to care (performance is not a concern). Perhaps in the future we will improve this story.