Recently I had to troubleshoot a query that performed much slower than expected. The solution to that seemed sufficiently interesting to warrant a write-up, and the root cause of the problem reinforced a well-known best practice of avoiding scalar user-defined functions in set-based queries.
The original query looked like this:
SELECT 'Literal1', u.Col2
FROM dbo.Table1 AS u
INNER JOIN dbo.Table2 AS l
    ON u.Col2 = l.Col2
WHERE l.Col3 = 'Literal2'
    AND NOT EXISTS (
        SELECT 1
        FROM dbo.Table2 AS l2
        WHERE u.Col2 = l2.Col2
            AND l2.Col3 = 'Literal1'
    )
    AND dbo.UDF1(u.Col1) IS NOT NULL;
Table1 had about 3 million rows, and Table2 had about 7 million rows. In its original form, the query was running for more than one hour, while the requirement was that it would complete within minutes.
The scalar user-defined function (UDF) in the WHERE clause immediately stood out as a warning sign. Using scalar UDFs in set-based queries is generally not recommended, for two reasons:
The function is invoked once for each row in the set, and each invocation carries a fixed, not insignificant overhead, on top of the cost of the statements in the function body.
The optimizer cannot accurately estimate the cost of the function (which can vary greatly), and may therefore come up with less-than-optimal query plans.
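The per-row invocation cost (reason #1) is easy to observe in any engine that supports scalar UDFs. Here is a minimal sketch using Python's sqlite3 module rather than SQL Server (the table, the UDF body, and the row count are all stand-ins): register a scalar function and count how many times the engine calls it during a single set-based query.

```python
import sqlite3

calls = 0

def udf1(value):
    """Trivial stand-in for a scalar UDF; the counter records each invocation."""
    global calls
    calls += 1
    return value if value and value[0].isalpha() else None

conn = sqlite3.connect(":memory:")
conn.create_function("UDF1", 1, udf1)
conn.execute("CREATE TABLE Table1 (Col1 TEXT, Col2 INTEGER)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?)",
                 [(f"row{i}", i) for i in range(10_000)])

# Even though the query is a single set-based statement, the UDF is
# evaluated once per scanned row.
conn.execute("SELECT Col2 FROM Table1 WHERE UDF1(Col1) IS NOT NULL").fetchall()
print(calls)  # → 10000
```

The same per-row evaluation happens in SQL Server's Filter operator, which is why a UDF applied to millions of rows dominates the run time.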
For the query above, removing the UDF from the WHERE clause brought down the execution time to 12 seconds. Looking at the query plan, it was clear why it took so long to execute the original statement – a full clustered index scan of Table1 (all 3 million rows of it) was followed by a filter operator, where the UDF was applied to the Col1 column of Table1. In other words, rather than first reduce the set via joins with Table2 and then apply the UDF to that much smaller set, the optimizer chose to first filter millions of rows with the UDF. This is a fairly typical example of a sub-optimal plan caused by an inaccurate estimate (reason #2 above), which was in turn caused by the usage of an "opaque" scalar UDF.
The obvious workaround for this problem would be to remove the UDF-based filter from the WHERE clause, insert the resulting intermediate result set into a temporary table, and then select from that temporary table, filtering the much smaller intermediate result set with the UDF. However, I wanted to avoid a procedural approach, which would be more complex and less robust than a single set-based statement.
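For reference, the procedural workaround rejected above would look roughly like this. This is a Python/sqlite3 sketch, not the original T-SQL; the schema and literals follow the article, the UDF body is a stand-in, and the row counts are made up to keep the example small. The point it demonstrates is that the UDF runs only against the materialized intermediate set.

```python
import sqlite3

calls = 0

def udf1(value):
    """Stand-in for the real UDF1; counts its invocations."""
    global calls
    calls += 1
    return value

conn = sqlite3.connect(":memory:")
conn.create_function("UDF1", 1, udf1)
conn.executescript("""
    CREATE TABLE Table1 (Col1 TEXT, Col2 INTEGER);
    CREATE TABLE Table2 (Col2 INTEGER, Col3 TEXT);
""")
conn.executemany("INSERT INTO Table1 VALUES (?, ?)",
                 [(f"v{i}", i) for i in range(1000)])
# Only ten Table2 rows carry 'Literal2', and none carry 'Literal1'.
conn.executemany("INSERT INTO Table2 VALUES (?, ?)",
                 [(i, "Literal2") for i in range(10)])

# Step 1: materialize the join result WITHOUT the UDF filter.
conn.executescript("""
    CREATE TEMP TABLE Intermediate AS
    SELECT u.Col1, u.Col2
    FROM Table1 AS u
    INNER JOIN Table2 AS l ON u.Col2 = l.Col2
    WHERE l.Col3 = 'Literal2'
      AND NOT EXISTS (SELECT 1 FROM Table2 AS l2
                      WHERE u.Col2 = l2.Col2 AND l2.Col3 = 'Literal1');
""")

# Step 2: the UDF now sees only the small intermediate set.
rows = conn.execute(
    "SELECT 'Literal1', Col2 FROM Intermediate WHERE UDF1(Col1) IS NOT NULL"
).fetchall()
print(len(rows), calls)  # → 10 10 (ten UDF calls instead of a thousand)
```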
The first attempt to come up with a single statement that avoids the original problem looked like this:
SELECT 'Literal1', up.Col2
FROM (
    SELECT u.Col1, u.Col2
    FROM dbo.Table1 AS u
    INNER JOIN dbo.Table2 AS l
        ON u.Col2 = l.Col2
    WHERE l.Col3 = 'Literal2'
        AND NOT EXISTS (
            SELECT 1
            FROM dbo.Table2 AS l2
            WHERE u.Col2 = l2.Col2
                AND l2.Col3 = 'Literal1'
        )
) AS up
WHERE dbo.UDF1(up.Col1) IS NOT NULL;
Here we have a subquery in the FROM clause, and apply the filter with the UDF "on top" of the subquery, in the WHERE clause of the outer query. The intent is to evaluate the subquery first, before applying the filter. Unfortunately, this does not work – this query is logically equivalent to the original query, and SQL Server is at liberty to evaluate the predicates in the order that it deems optimal, which in this case still means filtering millions of rows in Table1 with the UDF-based filter, before joining with Table2.
Here is the query that did work (compared to the previous attempt, it adds a CROSS JOIN with a single-row derived table and passes an expression, rather than a bare column, to the UDF):
SELECT 'Literal1', up.Col2
FROM (
    SELECT u.Col1, u.Col2
    FROM dbo.Table1 AS u
    INNER JOIN dbo.Table2 AS l
        ON u.Col2 = l.Col2
    WHERE l.Col3 = 'Literal2'
        AND NOT EXISTS (
            SELECT 1
            FROM dbo.Table2 AS l2
            WHERE u.Col2 = l2.Col2
                AND l2.Col3 = 'Literal1'
        )
) AS up
CROSS JOIN ( SELECT '' AS EmptyString ) AS e
WHERE dbo.UDF1(up.Col1 + e.EmptyString) IS NOT NULL;
The reason this succeeds in forcing the optimizer to evaluate the UDF filter after the joins have reduced the row set is that we pass an expression, rather than a base column, to the function. The expression is logically equivalent to the base column, but because it references a column from a seemingly pointless outer query, the optimizer cannot push the UDF-based filter down into the inner subquery. In the plan generated for this query, the UDF executes only against the few hundred rows produced by the inner subquery, and the statement completes in 30 seconds – well within the required time. Note that it wouldn't be sufficient to use a literal empty string as the second part of the expression – to avoid predicate pushdown, it has to be a column from an outer query.
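SQL Server's predicate pushdown can't be reproduced in a few lines here, but the underlying idea – that a bare column is transparent to the optimizer while a logically equivalent expression over it is opaque – shows up in other engines too. As a small analogy (SQLite via Python, using index selection instead of predicate pushdown; the table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (Col1 TEXT, Col2 INTEGER)")
conn.execute("CREATE INDEX IX_T_Col2 ON T (Col2)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# A bare column lets the optimizer use the index...
bare = plan("SELECT * FROM T WHERE Col2 = 5")
# ...but wrapping it in a logically equivalent expression makes it opaque,
# forcing a full scan.
wrapped = plan("SELECT * FROM T WHERE Col2 + 0 = 5")
print(bare)     # a SEARCH using IX_T_Col2
print(wrapped)  # a plain SCAN of T
```

In both engines, the trick works because the optimizer reasons about base columns, not about arbitrary expressions built from them.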
In conclusion, this can be a useful query-tuning technique for the specific case where the optimizer pushes down a filtering predicate (not necessarily UDF-based) and the resulting plan is less optimal than one where the filter is evaluated later in the plan.
I would be curious how long the query below would take. This way, UDF1 should be evaluated last.
And I am curious what UDF1 looks like – after all, I am sure it is possible to get rid of it without writing any procedural code (but of course we'd pay the price in the complexity of the WHERE clause...).
SELECT 'Literal1', u.Col2
FROM dbo.Table1 AS u
INNER JOIN dbo.Table2 AS l
    ON u.Col2 = l.Col2 AND l.Col3 = 'Literal2'
LEFT OUTER JOIN dbo.Table2 AS l2
    ON u.Col2 = l2.Col2 AND l2.Col3 = 'Literal1'
WHERE l2.Col2 IS NULL
    AND dbo.UDF1(u.Col1) IS NOT NULL;
I compared the plan of the rewrite you suggested with the plan of the original query. While they are slightly different as a result of replacing the EXISTS clause with the LEFT JOIN, the filter operator calling the UDF is still executed immediately after the clustered index scan of Table1, so this will result in similar performance.
The UDF in this case performs character-by-character processing of the input string in a loop. I do not see any straightforward way to rewrite that in a non-procedural way (perhaps something could be done using a table of numbers). On SQL Server 2005 and later, it should be possible to create a persisted column in Table1 based on the function, but the original problem was on a SQL Server 2000 instance.
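The "table of numbers" idea mentioned above can be sketched as follows. Since the real UDF1's logic isn't shown, this uses a hypothetical rule (keep rows whose Col1 is all digits) purely to illustrate the technique: the character-by-character loop becomes a join against a numbers table. On SQL 2000 that would be a permanent table of numbers; here, for brevity, a SQLite recursive CTE generates it, driven from Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Col1 TEXT)")
conn.executemany("INSERT INTO Table1 VALUES (?)",
                 [("12345",), ("12a45",), ("777",)])

# Hypothetical stand-in for UDF1's rule: a row qualifies when no character
# position holds a non-digit. Each position is probed via the numbers table.
rows = conn.execute("""
    WITH RECURSIVE Numbers(n) AS (
        SELECT 1 UNION ALL SELECT n + 1 FROM Numbers LIMIT 100
    )
    SELECT t.Col1
    FROM Table1 AS t
    WHERE NOT EXISTS (
        SELECT 1
        FROM Numbers AS num
        WHERE num.n <= length(t.Col1)
          AND substr(t.Col1, num.n, 1) NOT BETWEEN '0' AND '9'
    )
""").fetchall()
print(rows)  # → [('12345',), ('777',)]
```

Whether a rewrite like this beats the procedural UDF depends on the actual per-character logic; some string rules don't decompose this cleanly.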
Very clever - I like it. If you were, say, checking for the existence of ASCII value ranges, you might find a set-based solution, but it would probably end up equivalent to (or worse than) what you've done here, query-plan-wise.
I like your solution, but I don't see how this is really much different from creating a temp table. If you used a temp table to get the first result set, you could do a distinct select on that for the column the function uses and run your function against a smaller set. Once you have all the results from the function, join that table back into the first temp table. This would only really make sense if there were a lot of duplicate data in that column. I like your solution, but I think this could give better performance depending on the data.
Yes, using a temp table for the intermediate result set would work, and could provide a performance increase if there were many duplicates in the column passed to the function (it was actually unique in this case).
That said, I think that the simplicity and robustness of a single-statement solution are generally preferable, even if it results in a slight performance hit.
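The dedupe-and-join-back idea from the comment can be sketched like this (again a Python/sqlite3 stand-in with a trivial UDF body and invented row counts; on SQL Server the same shape would use temp tables). With 1,000 rows but only ten distinct values, the UDF runs ten times instead of a thousand.

```python
import sqlite3

calls = 0

def udf1(value):
    """Stand-in for the real UDF1; counts its invocations."""
    global calls
    calls += 1
    return value

conn = sqlite3.connect(":memory:")
conn.create_function("UDF1", 1, udf1)
conn.execute("CREATE TABLE Intermediate (Col1 TEXT, Col2 INTEGER)")
# 1,000 rows, but only 10 distinct Col1 values.
conn.executemany("INSERT INTO Intermediate VALUES (?, ?)",
                 [(f"v{i % 10}", i) for i in range(1000)])

# Run the UDF once per distinct value, then join the results back.
conn.executescript("""
    CREATE TEMP TABLE DistinctCol1 AS
    SELECT DISTINCT Col1 FROM Intermediate;

    CREATE TEMP TABLE UdfResults AS
    SELECT Col1, UDF1(Col1) AS UdfValue FROM DistinctCol1;
""")
rows = conn.execute("""
    SELECT i.Col2
    FROM Intermediate AS i
    INNER JOIN UdfResults AS r ON i.Col1 = r.Col1
    WHERE r.UdfValue IS NOT NULL
""").fetchall()
print(len(rows), calls)  # → 1000 10
```

As noted above, the column was actually unique in the original problem, so this optimization would not have helped there.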
"The intent is to evaluate the subquery first, before applying the filter." It looks to me like using group by will force stream aggregate to come before filter and accomplish what you were intending.
SELECT 'Literal1', up.Col2
FROM (
    SELECT MAX(u.Col1) AS Col1, u.Col2
    FROM dbo.Table1 AS u
    INNER JOIN dbo.Table2 AS l
        ON u.Col2 = l.Col2
    WHERE l.Col3 = 'Literal2'
        AND NOT EXISTS (
            SELECT 1
            FROM dbo.Table2 AS l2
            WHERE u.Col2 = l2.Col2
                AND l2.Col3 = 'Literal1'
        )
    GROUP BY u.Col1, u.Col2
) AS up
WHERE dbo.UDF1(up.Col1) IS NOT NULL;
Yes, this does move the filter to the top of the plan, so it is a valid approach when the aggregation in the subquery does not change the result set (as in this case where Col1 is unique). Thanks for mentioning an alternative solution!
I'm with Eric. A temporary table is the better solution, as it retains clarity of expression without polluting the query with bizarre CROSS JOINs, aggregates, etc. These might be clever, but the end result is less readable code.
The problem with the GROUP BY solution is that it forces a tempdb hit and sort, right?
Seems like a good use of a CTE here, with the CTE being the join between Table1 and Table2, and the SELECT over the CTE using the UDF. Did you try that?
@Dave: I very much agree with you that "clever tricks" should not be used when they hurt clarity of expression. That said, the clarity and readability of a particular solution is a subjective measure. To me, a step-by-step procedural solution is often harder to read than one using a single statement, even if that statement seems somewhat "bizarre". Of course, if someone preferred a procedural solution in this case because it is more readable to them, I'd have no objections whatsoever. My main goal in writing this post was to demonstrate a particular technique for avoiding predicate pushdown in cases where it hurts performance.
@TheSQLGuru: In this case, using GROUP BY does not cause a tempdb hit – the aggregation is done using two Stream Aggregate operators.
@Jonathan: The original problem was on a SQL 2000 instance, so I couldn't use a CTE. Even on a later version, this is not guaranteed to work, as the optimizer is at liberty to expand the CTE as if it were a subquery in the FROM clause.