(I didn’t write this, so no Conor vs. today, folks :)
http://msdn.microsoft.com/en-us/library/dd425070.aspx
This is a good read on the various ways to load lots of data into SQL Server. Enjoy!
Conor
I received this question from an internal (as in within Microsoft but not in the SQL Server team) user of SQL Server. The basic question has to do with how you set up a data warehouse and whether one should or should not create foreign keys in a data warehouse.
OK, so what’s a data warehouse? It’s a database that has a particular schema layout/pattern and a particular query pattern. I am making up this example, but let’s say that I want to track data about every newspaper issue article ever written.
| Paper | Date | Editor | Category | Pages |
| The Daily Prophet | Jan 25 2010 | Bob Jones | News | 23 |
| Dallas Cowboys Weekly | Jan 25 2010 | Conor Cuningham | Sports | 10 |
| Bug Collectors Tribune | Jan 25 2010 | Sarah Smith | Reference | 5 |
| The Daily Prophet | Jan 26 2010 | Bob Jones | News | 21 |
(I will apologize for the names and such, but it’s Friday and my brain does what it wants sometimes :))
With 3 or 4 rows, you can store this data in a text file and stop reading now. However, you can also put it in a database and forget about it as well – as long as the data is small, you are optimizing for your coding time instead of the specific performance of your queries.
However, Let’s say that you wanted to store LOTS of rows… For example, a few billion rows. All of a sudden, you really need to think about things because the storage for these fields can really start to add up quickly. If I have a row for every issue of a newspaper, each row could take hundreds of bytes. All of a sudden I am paying a lot of money for storage space. So, a data warehouse deals with this by normalizing pretty much everything it can from this table. So, instead of storing strings for each row, you create additional tables and have ID fields to link them all together. So, you have a new table like this:
| PaperID | Paper |
| 1 | The Daily Prophet |
| 2 | Dallas Cowboys Weekly |
| 3 | Bug Collectors Tribune |
(We repeat this for all data of any size in this table)
Your original table now looks like:
| PaperID | Date | EditorID | CategoryID | Pages |
| 1 | Jan 25 2010 | 1 | 1 | 23 |
| 2 | Jan 25 2010 | 2 | 2 | 10 |
| 3 | Jan 25 2010 | 3 | 3 | 5 |
| 1 | Jan 26 2010 | 1 | 1 | 21 |
This representation has far less data per row. So, if I want to have a few billion rows, my storage costs have been reduced because I’ve avoided unnecessary data duplication. The main table is called the “fact” table (each row contains a one or more facts, such as the number of pages), and the other tables are called “dimension” tables. This is a standard star schema (there are more complex warehouse schemas called snowflake schemas).
OK, now I can answer the original question. The connection between the fact and dimension tables are classic foreign key relationships. So, you can create a foreign key and get the database system to enforce the relationship for you. (The system can also leverage that information to sometimes better optimize queries, as I have described in an earlier post on foreign key join elimination). Now, why would a customer want to get rid of the foreign keys? Well, in SQL Server they are implemented using indexes. Indexes take space. If you build your data warehouse properly, you can guarantee that the relationship between the two tables is correct yourself (especially if the data is read-only, which many data warehouses can be). So, given that each index takes space, you might want to save some money on disks by just not defining those foreign keys. The queries still work, right? So, index space savings is the primary reason.
The particular optimization I referenced in the prior post dealt with existence checks against the foreign key – ie the system does the join to make sure that there is a valid row on the other side (example: when doing an insert to the fact table). This isn’t really the primary purpose of a data warehouse once it is built, so the specific optimization may not be that important to you. SQL Server does contain logic to understand that the rows in the fact table generally reference rows in the dimension table. Additionally, your average warehouse query is an aggregate over the fact table with filters on the dimension tables:
SELECT SUM(pages) FROM Fact INNER JOIN PaperDimension ON (Fact.PaperID = PaperDimension.PaperID) WHERE Paper = ‘The Daily Prophet’;
This kind of pattern can’t eliminate the join.
The FK is still very useful to the system when optimizing, but I hope this gives you insight into why a warehouse might choose to do without its benefits :)
Happy Querying!
Conor Cunningham
I apologize for those of you who have outstanding questions in my inbox. I have been quite busy at work these days working on what I think are awesome features for a future release of SQL Server. I resolve to be better in answering your mails on the blog :). I did finally get through my 700-message inbox and now am working at a more manageable level.
I will answer some of them now, but please don’t get alarmed if I haven’t gotten to yours yet.
I got this question/claim again this week, so I’ll write a blog post on it while I wait for my build to complete:
It usually goes something like “When I rewrite my query, SQL Server gives me a different plan. Therefore, I should rewrite all my queries until I find the best plan for each of them”. Essentially, the question is whether writing the query differently should impact the plan choice.
I’ll give two answers – a short one and a longer one containing some of the reasons.
Short answer:
No, it should not matter for SQL Server (any version since 7.0). We do a lot of work so that you don’t have to care about this. Sometimes you might get a different plan, but that may or may not be related to the fact that you wrote the query in two equivalent ways.
Longer answer:
Some other database engines used to include rule-based optimizers. So, these would run a set of heuristic rules over the internal query tree and then generate a plan. I believe there even were (are? I haven’t checked recently) 3rd party products you could buy which would try lots of rewrites to see which one performed the best. Additionally, procedural programmers who use SQL are accustomed to finding tweaks in procedural code to make it run faster, and this is a natural behavior of this species of SQL user.
Most modern optimizers are cost-based. Now, this doesn’t mean that the input tree does not matter at all. Philosophically, however, the goal of these optimizers is to find a good plan for the user instead of allow you treat SQL as a procedural language. So, SQL Server (I can’t speak for the other vendors) performs normalization steps for query trees that should usually map equivalent trees into a single starting point before the optimization process really begins. So, A JOIN B and B JOIN A would both be mapped to the same starting point (let’s say we did A JOIN B as the normalized form). This normalization step usually works well and gives good results in terms of plan quality and the customer experience we desire – we want you to have time to think of all of the other parts of the application instead of worrying about this.
So, I can hear you now – the next statement I usually hear is “But this one time, I rewrote this query and I got a different plan and it was faster!” I’m not doubting you, actually. It can happen. However, you are mistaking correlation for causality. It does not follow that rewriting the plan was the specific reason for this different plan choice, nor does it imply that it is a stable plan choice for the lifetime of your application as your data changes, nor does it imply that a future build of SQL Server will do things in such a way so that continues to hold true when you install the next service pack of the product.
I call this condition “luck”. You got lucky... for now. Don’t assume that it means you will be lucky tomorrow.
So, let’s go through the reasons why this is luck and not design:
1. Normalization – As mentioned, we have a series of early-phase rewrites in the Optimizer that will tend to collapse equivalent forms of a query into a single starting form for optimization to use. This logic can have some limitations – SQL Server does not perform every possible rewrite, and it also doesn’t perform every possible normalization step either – you would not like the compilation time. We do the ones most likely to be hit by our customers. So, we will take a series of inner joins and normalize them. We also will try to make it so that the order of predicates in a WHERE clause is handled automatically by the system. We have logic that can decorrelate APPLY (rewrite an APPLY to be a JOIN), combine UNIONs, and otherwise fix all sorts of funny SQL usages that we see in customer- or computer-generated queries. We don’t do every pattern or combination of these in any order, and there are other patterns that we don’t rewrite where perhaps we could. So, in these cases, the starting point of the optimization process may differ from your other written form. The problem with all of this is that you don’t really know – we don’t publish the internal trees. Most of the normal cases DO NOT matter as a result of this.
2. Optimization goals. One interesting part of optimizers is that they trade space and time for optimality on a query plan selection. So, SQL Server’s QP is actually not set up to give you the best plan. Instead, it is set up to give you a good enough plan quickly (where good enough is very close to “best”). In practical terms, if I can spend 1 second optimizing a query that takes 2 minutes to execute, that’s a good deal. If I spend 4 hours optimizing that same query to find a plan that executes in 1 minute 50 seconds, our average customer would be very angry. Now, for most OLTP queries the optimal plan choice is known, obvious, and we generate it all of the time. This characteristic tends to impact things more as you get into queries with longer runtimes.
To complete this thought, even if SQL Server started with two different (but equivalent) starting point trees, the searching process may actually investigate the same set of plans if it thinks it is worthwhile to do so. So, if you get two plans out of your experimentation that the optimizer considers to be roughly the same in cost, it doesn’t matter to the optimizer that they are different. Furthermore, if the tables get bigger, the optimizer may decide to search longer and eventually find the same plan.
The search process is dynamic and pruned based on estimated cost improvement.
3. Non-equivalent rewrites – I know this can’t be you, since all of you are above average. Let’s just say that some people occasionally rewrite their queries and they are actually changing the semantics of their query in doing so. That new query may perform better than the old one, but sometimes that is because the new one is asking a far easier question of the system. Even I screw this up occasionally and then have to go work backwards to figure out what I did wrong. Not all customers have the expertise to recite the nuances of the ANSI SQL specification or the details of SQL Server’s tree semantics. I can only tell you that this is harder than it looks.
4. Optimizer “missing feature”. There are some cases where I see that the Optimizer could have performed a rewrite but didn’t, causing two "semantically equivalent” trees to generate different plans that perform differently. I saw one last week where we didn’t convert WHERE 2009 = datepart(year, datecol) into an index seek, while writing the query as datecol BETWEEN ‘2009-01-01 00:00:00’ and ‘2009-12-31 11:59:59.997’. These are cases where a rewrite could potentially yield a better plan. In this case, the two forms are conceptually equivalent (at least for the datetime type). However, not all rewrites are provably equivalent in the optimizer. So, when you start doing rewrites that involve changing the question you ask to the optimizer, you have to know that what you are doing is actually valid (or else you get into case 3 – asking a different question of the data).
I usually see the most advanced SQL Server users (SQL Server MVPs, etc) able to perform this kind of rewrite. In order to do stage 4 reliably, you have to have a good understanding of what the set of supported transformations is so that you can understand why a query plan is generated (or not generated).
I hope this gives you a bit of insight into how SQL Server views equivalent SQL statements.
Happy Querying!
Conor Cunningham
I received a question from a friend in Brazil related to foreign keys and join elimination in SQL Server. Yes, SQL Server can detect that some joins are not needed to return results to the user and skip them. What is this evil magic, you ask? Well, let’s walk through it before we get out our torches and pitchforks…
What’s a Foreign Key? To most users, it is something that throws errors and blocks you from inserting data where you want it. However, to SQL Server, it is a glorious thing! It lets the Query Optimizer know that the data in one table absolutely correlates to the data in another table. (So, if I have a Customers table and an Orders table, a Foreign Key from Orders to Customers tells me that every Order has a Customer). Now, when one queries the join of Orders and Customers, it might be that you only really need columns from the Orders table. Now we’re in business – the Optimizer can detect this and see that the result of the query is the same if we just skip all of the work of doing random IO lookups into the Primary Key index on Customers to make sure that they are there.
So, you should think of Foreign Keys as a great way to give the Optimizer hints to speed up your query, in most cases. The only cases where I recommend that customers NOT define Foreign Keys is in very large data warehouses (which I will save for another blog post).
So, the question from my Brazillian friend is why we DO the elimination for single-column Foreign Keys and not for multi-column cases.
(I will post the mail I got from him here – so, his example + comments)
1:
2: IF OBJECT_ID('Tab2') IS NOT NULL
3: BEGIN
4: DROP TABLE Tab2
5: END
6: IF OBJECT_ID('Tab1') IS NOT NULL
7: BEGIN
8: DROP TABLE Tab1
9: END
10: GO
11: CREATE TABLE Tab1 (Tab1_Col1 Integer NOT NULL PRIMARY KEY, Tab1_Col2 Char(200))
12: CREATE TABLE Tab2 (Tab2_Col1 Integer NOT NULL PRIMARY KEY, Tab1_Col1 Integer NOT NULL, Tab2_Col2 Char(200))
13: ALTER TABLE Tab2 ADD CONSTRAINT fk FOREIGN KEY (Tab1_Col1) REFERENCES Tab1(Tab1_Col1)
14: GO
15:
16: -- Fine, the execution plan not use Tab1
17: SELECT Tab2.*
18: FROM Tab2
19: INNER JOIN Tab1
20: ON Tab1.Tab1_Col1 = Tab2.Tab1_Col1
21: --|--Clustered Index Scan(OBJECT:([dbo].[Tab2].[PK__Tab2__993AF6027FB886E3]))
22:
23: GO
24:
25: IF OBJECT_ID('Tab2') IS NOT NULL
26: BEGIN
27: DROP TABLE Tab2
28: END
29: IF OBJECT_ID('Tab1') IS NOT NULL
30: BEGIN
31: DROP TABLE Tab1
32: END
33: GO
34: CREATE TABLE Tab1 (Tab1_Col1 Integer NOT NULL, Tab1_Col2 Integer NOT NULL, Tab1_Col3 Char(200), PRIMARY KEY(Tab1_Col1, Tab1_Col2))
35: CREATE TABLE Tab2 (Tab2_Col1 Integer NOT NULL PRIMARY KEY, Tab1_Col1 Integer NOT NULL, Tab1_Col2 Integer NOT NULL, Tab2_Col2 Char(200))
36: ALTER TABLE Tab2 ADD CONSTRAINT fk FOREIGN KEY (Tab1_Col1, Tab1_Col2) REFERENCES Tab1(Tab1_Col1, Tab1_Col2)
37: CREATE INDEX ix ON Tab2(Tab1_Col1, Tab1_Col2)
38: GO
39:
40: -- Why not use the same behavior above ? Just because the multi column foreign key?
41: SELECT Tab2.*
42: FROM Tab2
43: INNER JOIN Tab1
44: ON Tab1.Tab1_Col1 = Tab2.Tab1_Col1
45: AND Tab1.Tab1_Col2 = Tab2.Tab1_Col2
46: --|--Nested Loops(Inner Join, OUTER REFERENCES:([dbo].[Tab2].[Tab1_Col1], [dbo].[Tab2].[Tab1_Col2]))
47: -- |--Clustered Index Scan(OBJECT:([dbo].[Tab2].[PK__Tab2__993AF602084DCCE4]))
48: -- |--Clustered Index Seek(OBJECT:([dbo].[Tab1].[PK__Tab1__6D1AC6A2047D3C00]), SEEK:([dbo].[Tab1].[Tab1_Col1]=[dbo].[Tab2].[Tab1_Col1] AND [dbo].[Tab1].[Tab1_Col2]=[dbo].[Tab2].[Tab1_Col2]) ORDERED FORWARD)
49:
50: -- Why not apply the predicate at Tab1.Tab1.Col2 too? We have the trusted foreing key to be able to do that.
51: -- If we do it, in that case we could use the PK of Tab1 and the Index ix of Tab2 just filtering by Tab1_Col1(JOIN) and Tab1_Col2(WHERE)
52: SELECT Tab2.*
53: FROM Tab2
54: INNER JOIN Tab1
55: ON Tab1.Tab1_Col1 = Tab2.Tab1_Col1
56: WHERE Tab2.Tab1_Col2 = 10
57: --|--Nested Loops(Inner Join, OUTER REFERENCES:([dbo].[Tab2].[Tab1_Col1]))
58: -- |--Clustered Index Scan(OBJECT:([dbo].[Tab2].[PK__Tab2__993AF602084DCCE4]), WHERE:([dbo].[Tab2].[Tab1_Col2]=(10)))
59: -- |--Clustered Index Seek(OBJECT:([dbo].[Tab1].[PK__Tab1__6D1AC6A2047D3C00]), SEEK:([dbo].[Tab1].[Tab1_Col1]=[dbo].[Tab2].[Tab1_Col1]) ORDERED FORWARD)
Before I answer, let me just say that I love getting questions like this – it shows that people really do like to deeply understand the engine and use what we put in it to its fullest.
So, the answer is that we did the single-column case many years ago since most FKs are single-column. We haven’t seen enough cases where the multi-column FK is used to make it more important than all of the other things that we’ve been requested to add to the product. I wish the answer had more intrigue, but it’s just a case where we have lots and lots of things to code and we haven’t had time to do this one yet. The technical side of me obviously wishes we had all of these things coded into the product. The other way to look at it is that people are upset if query compilation takes too long, so we have to find a balance between what is mathematically possible and what is useful to the broadest set of customers.
If we do find that this kind of construct becomes more common, we would likely go and add this support for a future release of the product.
For the last example, predicate duplication across multi-column joins is another one of these cases where it is possible to do but does not show up as being the dominant factor in queries often enough to enable by default.
One last note – we also do not do FK join elimination on tempdb, even if you have a FK defined.
Happy Querying!
Conor Cunningham
(Not a SQL Server post)
I have been meaning to install Windows 7, and I got around to doing it last night. I usually run Windows 2003/8 Server for work, so my interest in the consumer side is perhaps not as high as some others. Nevertheless, I like Windows and I wanted to try it out.
I did not try to upgrade from Vista x64. I just bought a new hard drive and figured it was time to do a clean install – funnily, a SQL Server setup issue has prevented me from installing SQL Server on my personal machine, so I wanted that fixed. An upgrade would copy the broken registry files. (I’ve knocked on a few doors about this while up in Redmond – this is an area where we need to do better in SQL).
Impressions:
Setup: I had one hiccup last night during installation. It stopped about 9-10% through Expanding files and returned some error about not being able to read something or other and was aborting the install. So, I tried again and it seemed to make it further the second time. Within about 30 minutes or so and after a few reboots, I had a Windows 7 logon screen.
Aero Shake: Very cool. I was opening up windows left and right and just love this. It is neat.
Aero Peek: This one is interesting, but I am not sure yet whether I like it. If you hover your mouse in the lower right part of the screen (to the right of the clock in the taskbar), it will make all of your windows translucent and you can see your background. While nice for looking at the background, it doesn’t accomplish as much organizationally as I want. There are a few very cool tools that help you find the right window (Switcher), and I can see myself using something like that more often than Aero Peek.
I guess these two features replace the “Show Desktop” taskbar button from Vista, which minimized all of the active windows and gave you a clean screen. I think that the Shake is a nice addition (close all but this one) but I often want to minimize everything and then go to the start menu to start a new program. This use case isn’t handled, so I will be looking for a Windows 7 Show Desktop button for my workflow.
Start Menu: This is a sore topic for me. I am pretty old school in my use of the start menu – I keep the “Classic” start menu with the smallest text/icons possible. I really like it to be small, fast, and efficient. The newer start menu leaves me staring at the thing trying to figure out where everything is and what designers were thinking by putting this thing in this place. Unfortunately, the Start Menu in Windows 7 doesn’t let me configure things like my Windows 2008 machines, and I don’t like having to switch back and forth. (Note to Windows designers: That does NOT mean I want Win2k8 to look like Win7). Now, if I accept that the “new” start menu is good and give it a try, there are a few things that are reasonable about it:
1. open it up and type “cmd” – immediately opens a command window. good. It is the most important thing I have to do, and it seems that each release people try to move it further and further down the tree of commands
2. Find a program to open. I actually had to sit there for a minute or two looking. The right side has no All Programs. I eventually found it on the left side. So, I have to click yet another thing to get to my programs. Conor not happy.
Taskbar: This is interesting. I think that the prior model of “everyone needs a little flag down near the clock” was getting to be pretty horrible. I actually never want to see most of them. This new model seems to force everything into the taskbar and makes the taskbar more powerful with icons and stacking and such. I think that this may be a big long-term win. When combined with the window thumbnail support (an overlooked gem in Vista, IMO), I think that this gets me a mechanism to navigate through my open windows fairly effectively. I am going to have to play with this more to see if I can master it.
I got as far as installing Office, Windows Live, and Visual Studio 2008 last night. They all installed just fine, fwiw.
I am enjoying watching the Apple commercials get more and more desperate each round.
I’m a PC!
Conor
I gave a talk to the Austin-area PASS group last night on B-Tree indexing in SQL Server, and I received a question about whether to make clustered indexes UNIQUE or not (assuming the data is unique on the key columns). The expectation was that this should perform better when the index was UNIQUE.
Let’s back up a minute and see things from the perspective of the Query Processor. Someone comes in and creates an index on col1 of table1. They don’t make it UNIQUE but they put data in the column that is unique. Then you compile the query “SELECT DISTINCT col1 from Table1”. The QP generates a plan that remains valid even if other data is inserted while that plan is in the procedure cache. So, it can’t assume anything about the uniqueness of that column because you could come in and insert a value that is non-unique. If, on the other hand, you created a UNIQUE constraint/index, then things are very different! Since the generated plan is valid as long as the schema definition for the table remains the same, the QP knows that the definition of the table precludes any duplicate values from being inserted. So, it can generate a plan that doesn’t bother to do any grouping (DISTINCT is implemented using the a group by operator in SQL). Go ahead and try it – you’ll see the plan difference.
In my view, this is really more of a question about your data model – if you know something is distinct, you can get the system to enforce this rule for you and make sure that your application behaves correctly. So, when defining your tables, you should have a holistic view of how the tables relate to each other so that you can identify where you should or should not have constraints (or indexes – however you want to define them).
One case where I see customers sometimes explicitly avoiding the definition of constraints is in data warehouses. If your data is read-only and you already “know” that your data is unique, you may want to skip the space overhead or time overhead of having indexes to enforce these constraints. Now, you still have the issue with the Optimizer not being able to do fancy query rewrites based on uniqueness, but your average data warehouse query is often expensive and rarely do these kinds of rewrites apply in any meaningful way that would impact query performance (other factors dominate).
Happy Querying!
Conor
As most of you have figured out by now, I work mostly on queries :). One of the things that I often need to do is to get a statement out of a batch so that I can go look more closely at a query plan (and, in my case, the code that generates it).
I use the a query like this to read through a few DMVs and emit the text for the plan I want:
select statement_start_offset, statement_end_offset, statement_text from (
SELECT
SUBSTRING(st.text, (qs.statement_start_offset/2)+1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset)/2) + 1)statement_text,
*
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st) a
ORDER BY statement_text, statement_start_offset
A few notes:
1. SQL 2005 and SQL 2008 sometimes show –1 for the end offset instead of computing the real offset.
2. the offsets are actually byte offsets within a UTF-16 string (so, since SQL Server doesn’t do anything special for the fancy > 2 byte characters, this just means that you divide the offsets by 2 to find the character offsets)
Now, the DMVs were initially intended as debugging tools. While they are very useful, they have become quite a bit more popular than I had originally intended (I built much of the internal framework for what became DMVs, but I didn’t build these specific DMVs). I was talking with a few of the other developers, and we were curious how people thought about this –1 end offset business. The original intent was that these DMVs should be able to change each release, but I’d expect that people could have dependencies on nuances like this in the DMVs. If you have thoughts on this, please feel free to mail them to me at (conorc (at) microsoft (dot) com).
Happy Querying!
Conor
It’s been a busy summer. I got back from vacation and have been hard at work on new features for a future version of SQL Server. You can always still send me questions about the existing product, of course, and I will answer them here (conorc (at) microsoft (dot) com).
Today is a question that one of the MVPs asked – How do I find out if a statistics object was created with the “don’t recompute” flag? Initially, I tried dbcc show_statistics, but it isn’t stored within the statistics blob itself and doesn’t show up there. Instead, you need to look at sys.stats (2005 and above):
1: use tempdb
2: go
3: create table t1(col1 int)
4: insert into t1(col1) values (1)
5: insert into t1(col1) values (2)
6: go
7: create statistics s1 on t1(col1)
8: create statistics s2 on t1(col1) with norecompute
9: go
Which yields:
name no_recompute
------------------------
s1 0
s2 1
(2 row(s) affected)
Funnily enough, the internal build on which I first tried this had a bug and did not ever return 1… It had already been fixed, but I had not installed that build on my machine. I spent an hour trying to grok the metadata code and figure out what had gone wrong :).
Now, the auto-stats logic in SQL Server is generally good for the vast majority of our customers. It updates statistics when the data has changed, causing plans referencing those statistics to recompile when they are next used. This works great – so, why would I not want that???
Well, there could be a few cases where you want to override the default behavior of the system. Perhaps you find that your application is heavily dependent on avoiding too many recompiles in the middle of the day (recompiles take CPU resources). If you wanted all statistics to not recompute, you can just change the database-level flag for auto-update stats. However, if you want to change only specific statistics, this norecompute flag is the way to do so. Usually, I would only recommend doing something like this if you know what you are doing and have a specific reason to change this, as you would want to pair something like norecompute with your own custome stats update job that runs at night, for example.
I hope that gives you some insight into why this flag is there.
Happy Querying!
Conor
I received a question from a customer today about query plan choice for a parameterized query - specifically that different plans get picked based on where the plan is located, and sometimes one might be better than another. Apparently, for their application, a particular query plan was “better” when compiled in dynamic sql instead of a stored procedure. Others say that query plans are just as fine when compiled within a stored procedure. Which is it?
(If I were still consulting, the answer to this is “it depends” ;))
There are a lot of things necessary to answer this question, as it really isn’t about the exact location, at least not as such. It has more to do with the way different batches/procedures are compiled and also about how parameters are sniffed during plan compilation.
Let’s start at the beginning: The model for T-SQL (the procedural part of the programming surface area) is compiled in batches in SQL Server. This means that if you submit a batch of 5 queries, SQL Server will generally try to compile them all before executing any of them. (There are exceptions)
Now, let’s mix things like "set @p=5” in that query batch, and let’s have a later query reference that parameter. In this example, the whole batch is “compiled” before it is “executed”. That means that the “set @p=5” does not actually set @p=5 until that statement is executed. That means that the later query that references that parameter will sniff a NULL instead of a 5.
I wrote about the basics of this issue a few years ago in the QO team blog:
I Smell a Parameter!
So, back to our question – which one is “better”. The answer is that this problem is about scopes and not procedures vs. dynamic sql. Basically, putting the parameterized query within a nested scope (dynamic sql or another procedure) happens to impact whether the parameter gets sniffed or not. It’s the sniffing that impacts the plan choice, not from where it is hosted in T-SQL.
Now, why did the plan become “good” or “bad”:
* Sometimes when you sniff a NULL, you get a perfectly reasonable plan (if NULL is represented in similar proportions to other possible values).
* Sometimes if you sniff a NULL, you can get an absolutely horrible plan.
Generally, you want to get the Optimizer to see a good average value for your parameterized queries so that it will put a plan that works well, on average, for the average parameter case. If you never call the procedure with NULL for a parameter AND it generates a different plan than if you had compiled it with an actual value you would use, then you probably want to look at that and see how to get it to sniff a more common value – you want to get that Optimizer working for _you_ and not for that dreaded NULL :).
So, we’re done, right? Unfortunately no – it gets more complex again. There are some cases where one plan may not perform great for all parameter values (so, there is either not a good average value or there is otherwise some large performance difference based on the parameter value for the “best” plan). Let’s say that you are querying some big table for the “USA” table during the day and the “Togo” (a small island nation) data at night. These plans may be very different because their filter’s selectivities will be different. We refer to this as a plan with parameter-sensitivity.
For these parameter sensitive plans, you can also see “good” vs. ”bad” performance. By that, the performance differs, perhaps even by a lot, based on the parameter value. If you have cases like this, you have a few options.
So, there are a few different ways you can see “good” vs. “bad” plans for parameterized queries. Let’s talk about some practical guidance for how to avoid these issues.
First, you can move things to a new scope – that’s what happens when you move to dynamic sql. This likely gets the parameter sniffed and almost always will give you a good plan (at least for that parameter value). Another option is to just use an OPTIMIZE FOR hint to pass a common value to be sniffed – this is a great way to not have to worry about whether the batch compiles before it executes, etc. Another option is to use OPTION(RECOMPILE). If you have a parameter-sensitive case where sniffing the value does not work well, then you will want to consider OPTION(RECOMPILE) – when the compilation cost is smaller than the performance difference from using a suboptimal plan for the wrong parameter value, then it is a very good idea.
I’ve given you the short version :) There are some additional details around “deferred compilation” where a batch is only partially compiled before parts of it gets executed. SQL 2005+ also has statement-level recompile, an this can change the sniffing logic as well. I am not trying to give you a complete answer other than to give you the tools to ask the right questions – this stuff is tricky!
Happy Querying!
Conor Cunningham
Architect, SQL Server QP
Thanks to those of you who replied to my previous post. I spent the last week or so talking with customers, coworkers, and looking at various applications to see how recompiles are modeled today in their applications. At a high level, SQL Server doesn’t really “require” you to recompile at all. The main set of people who seemed to think about this were people having trouble with plan selection. For example, if statistics were not updated frequently enough and users were querying new data on a time-series column, this could cause the automatic algorithm to not work well until statistics were updated on the queried column. So, this is usually a more advanced topic for DBAs to consider.
As you might expect, customers who do this usually manage a somewhat larger database or set of databases. Sometimes it is an application where there are very precise performance requirements and the queries are relatively small (for example, a query that is usually a “seek” against a table instead of a scan). When the query plan selected is a scan, the query performs much more slowly than the seek plan.
In terms of frequency of mechanism, this seemed to be all over the place – people use sp_recompile, freeproccache, and update statistics for different kinds of problems. I think update statistics has an edge since it usually addresses the problems that can be fixed with a recompile, but others would do freeproccache or sp_recompile as well.
I will convey a small fact to you about sp_recompile that you may not know. Internally, SQL Server has a timestamp that is used to determine whether plans in the procedure cache are still valid – as long as the timestamp on the referenced object is the same the current timestamp in the metadata for that object, the plan is “valid”. sp_recompile works by updating that metadata timestamp for the object passed as an argument to sp_recompile. So, you are effectively making a metadata change by doing an sp_recompile. DBCC FREEPROCCACHE will drop objects from the cache without a metadata change. update statistics works on a different, secondary metadata version with slightly different semantics.
So, if I had to make a recommendation to you, I’d say to just consider whether you need to use sp_recompile or not – it is somewhat more heavyweight than the other approaches. None of them are invalid, of course, but you might find that it is a bigger hammer than you need if you want to fix one query with a plan that is not what you want. Note that DBCC FREEPROCCACHE takes arguments, and you can limit what is evicted so that you can reduce recompile load to the subset you desire.
Happy Querying (and Memorial Day, for those of you in the USA)!
Conor
I have a question for you guys about how do you force a recompile for a single query today?
There are a lots of different ways that one could cause the system to recompile a query:
a) sp_recompile <object used in a query>
b)DBCC FREEPROCCACHE (with optional arguments to restrict the set of impacted queries)
c) detach/attach db (2008+), restart server, etc.
d) use a lot of memory to force plans out of memory
e) update statistics
f) change the text of the query (add a space), which works mostly except for trivial plans (which you may not care about recompiling anyways)
I am curious what other ways people use (or what you use most frequently). If you have any interesting scenarios about how/why you recompile a plan, I’d like to hear about them, so please send them my way (conorc @ microsoft.com)
If you never have occasion to recompile queries or think it’s not useful, that’s fine too – I want to hear about those cases as well.
If there are some interesting cases, I’ll post them up anonymously if you don’t mind.
Thanks,
Conor
I received a question this week about whether filtered statistics update as frequently as regular statistics. The right way to ask the question is “Do filtered statistics become invalid as frequently as regular statistics?”, as stats are recomputed based on the queries that consume them, while stats are invalidated based on changes to the underlying data.
The basic mechanism for stats invalidation is a per-column counter in the storage engine. It’s not transacted, and it is incremented on each INSERT/UPDATE/DELETE/MERGE. When a normal statistics object is created, the current “modcounter” for that column is stored in the statistics object. As changes are made to the table, the difference between the current modcounter and the one stored in the statistics object becomes greater. Ultimately, when the difference is large enough (I will slightly oversimplify and say that it is 20% of the table changed for this blog post), the statistics object is invalidated. When a query consuming the statistic is executed again, it will determine that the statistic is out-of-date and recompile it. Once done, the query is recompiled using the new statistics object. Unfortunately, many of the details of this logic are not exposed, and this can make it somewhat difficult to understand the mechanism.
Filtered statistics do not change the fundamental mechanism – there is still one set of counters in the storage engine. The thresholds for when filtered statistics become invalid, however, are modified from the normal case based on the estimated selectivity of the filter. So, a filter that only covers 20% of the domain is updated 1/5th as frequently. The calculation is done without knowledge to whether the changes being made to the table would actually cause the statistic object to be changed.
I’ve written up an example to show that filtered stats do indeed get updated even when all of the changes to the data are outside of the range of the filter on the statistics object. In this example, the filter selectivity is very high (~90%), and the “data change” is effectively not really changing the data, but I know that it is impacting the modcounter for the column.
1: use tempdb
2:
3: --drop table testfilteredstatsrefresh
4: -- create a table for our example with values 0...999 in col2
5: create table testfilteredstatsrefresh(col1 int identity, col2 int)
6: go
7: set nocount on
8: begin transaction
9: declare @a int=0
10: while @a < 20000
11: begin
12: insert into testfilteredstatsrefresh(col2) values (@a % 1000)
13: set @a+=1
14: end
15: commit transaction
16: go
17: -- create filtered stats over a portion of the domain
18: create statistics i1 on testfilteredstatsrefresh(col2) WHERE col2 > 100
19: go
20: select * from testfilteredstatsrefresh where col2 <> 55 option(recompile)
21: go
22: set nocount off
23: -- update the table with a series of no-ops on the section of the range
24: -- not covered by the filtered stats
25: update testfilteredstatsrefresh set col2=col2 %1000 where col2<100
26: update testfilteredstatsrefresh set col2=col2 %1000 where col2<100
27: update testfilteredstatsrefresh set col2=col2 %1000 where col2<100
28: update testfilteredstatsrefresh set col2=col2 %1000 where col2<100
29: update testfilteredstatsrefresh set col2=col2 %1000 where col2<100
30: go
31: -- run the query again - auto-stats fires again
32: -- so,
33: select * from testfilteredstatsrefresh where col2 <> 55 option(recompile)
34:
I turned on sql profiler and captured the “auto stats” event to watch when it is recompiled. You will notice that this happens on the last query (after the changes), even though there have been no changes in the part of the domain covered by the filtered statistics object.
Happy Querying!
Conor
For those of you waiting for the new "Inside SQL Server" book, it is actually out now. Kalen decided to give it a name change, and the actual title is "SQL Server 2008 Internals". I've written a chapter on how the Query Optimizer works. It is currently #1 in the 3 related database/sql server categories on Amazon, so I'll say "thank you" to those who have already ordered a copy ;). Like my blog postings, there are a lot of sample scripts you can try so that you can "see" what the optimizer is doing - I find this technique to be a very good way to learn what is happening, and I hope it works for you.
http://www.amazon.com/Microsoft%C2%AE-SQL-Server%C2%AE-2008-Internals/dp/0735626243/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1237753244&sr=8-1
Enjoy!
Conor Cunningham