Author: Kun Cheng
Reviewers: Shaun Tinline-Jones, Silvano Coriani, Steve Howard, Thomas Kejser, Sanjay Mishra
A computed column is computed from an expression that can use other columns in the same table. The expression can be a noncomputed column name, constant, function, and any combination of these connected by one or more operators, but the expression cannot be a subquery. A simple example of a computed column is:
A benefit of using computed columns is that they can save developers from having to write calculation logic at the application layer; at the table level, the application can simply reference the computed column. In addition, performance can be improved by persisting a computed column and building an index on it (see “Creating Indexes on Computed Columns” at http://msdn.microsoft.com/en-us/library/ms189292.aspx). In real-world scenarios (unlike the simple example above), a common way to define a computed column is by using a user-defined function (UDF) to encapsulate the calculation logic. For example, the script below uses an XQuery expression to extract a summary description as a computed column out of the XML data, which is stored in another column of the same table. Once the computed column is persisted to the physical storage of the table, we can build an index on top of it to speed up queries that access the computed column.
CREATE FUNCTION udf_compcol(@contents XML) RETURNS nvarchar(255) WITH SCHEMABINDING BEGIN RETURN @contents.value('(/item/description/text())', 'nvarchar(255)') END GoALTER TABLE Books ADD summary_compcol AS udf_compcol(Contents) PERSISTEDGoCREATE INDEX ix_Books_summary ON Books(summary_compcol)Go
In this example, we are promoting the “description” element out of the XML data to a relational column for performance gain, because querying against XML semi-structured data is slower than querying against structured data. However, the computed column might behave in an unexpected way that can impact query performance. One example of this unexpected behavior is discussed in the blog post “Query Plan for Persisted Computed Column Query” at http://sqlcat.com/sqlcat/b/msdnmirror/archive/2011/09/01/quoted-identifier-causes-unexpected-query-plan-for-persisted-computed-column-query.aspx.
In this blog post, I will describe another unexpected behavior of computed column that occurs in a different scenario. On a multi-core server, which is typical for a SQL Server environment, the following XQuery generates a parallel plan as it scans the XML data.
SELECT title FROM Books WHERE Books.Contents.exist(N'/item/isbn[.=7646309]') = 1
With the addition of the computed column defined by the UDF udf_compcol, the same query generates a sequential plan, which is identical to the original plan except it is not parallel.
The reason that the query generates a sequential plan in this case is that when SQL Server compiles a query plan, it loads all referenced column expressions and determines if there is a UDF associated with any computed column(s). If there is, SQL Server disables the parallel plan and generates only a sequential plan. This is what happened with our query, which referenced a table containing a computed column defined with a UDF. Note the behavior is the same regardless of whether computed column is persisted or not.
In this case, we were able to tune the query by promoting the “/item/isbn” element out of the XML data to another relational column so the query didn’t need to scan the table in the first place.
Here are some general recommendations:
Short summary: don't use user-defined scalar functions in queries that involve tables, they're pure unadulterated evil unless performance is of no concern.
I'm exaggerating for effect, but only slightly. Even if you have a function that returns a scalar value, rewriting it as an inline table-valued function will yield better performance because such queries *are* parallelizable.
This is a shame because obviously scalar-valued functions are a really obvious tool if you're coming from a programming environment (modular approach and that). The engine team ought to consider the importance of letting people fall into the "pit of success" by making trivial UDFs (and they often are trivial) have equivalent performance to TVFs.
I've seen this behavior also when using an udf in a select statement. Will that mean that the usage of udf disable parallel query plans?
I know that when using udf with cross apply (or outer apply) you don't loose the ability to act in parallel, guess that's yet another reason to consider to use cross apply when working with udf.
In a case where a column (computed in a specific way) is expected to be used across views, then would'nt be wiser to use them as functions for ease of maintenance?
The computed column that references a T-SQL scalar function does not even need to be referenced to prevent a parallel plan being selected:
CREATE FUNCTION dbo.F (@i integer)
CREATE TABLE T1 (c1 integer NULL, c2 AS c1);
UPDATE STATISTICS T1 WITH ROWCOUNT = 1000000, PAGECOUNT = 100000;
-- Parallel plan
SELECT COUNT_BIG(c1) FROM T1;
-- Add a computed column that references the T-SQL scalar function
ALTER TABLE T1 ADD c3 AS dbo.F(c1);
-- Serial plan, even though computed column not referenced in the query
Thank you for this to both the CAT team for this article and to Paul for his comment... this is a problem I couldn't understand recently. I was attempting to tune a query and I know that a scalar function would likely remove a parallel plan and therefore I had removed that column from my query to see the impact it was having, but was still getting a sequential plan and couldn't see why. I tried every trick in the book... but now I'll be going back and trying again with a copy of the computed column table... minus the computed column... in order to get my benchmarks.