CSS SQL Server Engineers

This is the official team Web Log for Microsoft Customer Service and Support (CSS) SQL Support. Posts are provided by the CSS SQL Escalation Services

Case of using filtered statistics



SQL Server 2008 introduces a new feature called filtered statistics.  When used properly, it can dramatically improve cardinality estimates.  Let's use the example below to illustrate how a cardinality estimate can be incorrect and how filtered statistics can improve the situation.

We have two tables.  Region has only 2 rows.  The Sales table has 1,001 rows, but only 1 row has an id of 0; the rest have an id of 1.

Table Region

id name
0 Dallas
1 New York

Table Sales

id detail
0 0
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
1 11
1 12
1 13
1 14
1 15
1 16
...
1 1000

 

Now let's look at the query "select detail from Region join Sales on Region.id = Sales.id where name='Dallas'".  To the human eye, it is immediately clear that only one row will qualify: for Dallas there is only one row in the Region table and only one matching row in the Sales table.  But the optimizer does not know that when the query is compiled, before the query is executed.  To know that, SQL Server would essentially have to execute the query halfway, filter out the row for Dallas, take its id of 0, and then evaluate how many rows with that id are in the Sales table.  In other words, it would require incremental execution.

If you execute the query, you will get a plan like the one below.  Note that the nested loop join estimated 500.5 rows, but only 1 row was actually retrieved.

[Execution plan: nested loop join, estimated 500.5 rows, 1 actual row]
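
Where does the 500.5 estimate come from?  Roughly speaking, without a statistics object that ties name='Dallas' to a particular id, the optimizer falls back on density: Region.id has two distinct values, so about half of the 1,001 Sales rows (500.5) are assumed to join.  If you want to look at the statistics the optimizer had available at this point, DBCC SHOW_STATISTICS shows the density and histogram (the statistics names below are the ones created by the demo script at the end of this post):

-- inspect the density and histogram behind the 500.5 estimate
DBCC SHOW_STATISTICS ('Region', ix_Region_id_name)   -- column statistics on Region(id, name)
DBCC SHOW_STATISTICS ('Sales', ix_Sales_id_detail)   -- statistics from the clustered index on Sales(id, detail)
go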

Now let's see what happens if we create statistics on Region.id but put a filter on name ('Dallas').  Here is the statement: "create statistics Region_stats_id on Region (id) where name = 'Dallas'".

Now if you execute the same select statement (select detail from Region join Sales on Region.id = Sales.id where name='Dallas'), the cardinality estimate is correct as shown below for the nested loop join.

[Execution plan: nested loop join, estimated 1 row, 1 actual row]

What happened here is that the filtered statistics object (create statistics Region_stats_id on Region (id) where name = 'Dallas') is used for optimization.  When SQL Server optimizes the query, it sees that there is a statistics object whose filter matches the where clause.  It then discovers that there is only one row, with an id of 0, and is thus able to produce a correct estimate.
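
You can confirm which statistics objects on the table carry a filter: sys.stats exposes has_filter and filter_definition columns starting with SQL Server 2008, and DBCC SHOW_STATISTICS displays the filtered histogram itself.  A small sketch (not part of the original demo):

-- list the statistics objects on Region along with any filter predicates
select name, has_filter, filter_definition
from sys.stats
where object_id = object_id('Region')
go
-- the filtered histogram only covers the rows where name = 'Dallas'
DBCC SHOW_STATISTICS ('Region', Region_stats_id)
go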

Correct cardinality estimates are very important for complex joins, as they dramatically affect join order and join types.

Here is a complete demo:

drop table Region
go
drop table Sales
go

create table Region(id int, name nvarchar(100))
go
create table Sales(id int, detail int)
go
create clustered index d1 on Region(id)
go
create index ix_Region_name on Region(name)
go
create statistics ix_Region_id_name on Region(id, name)
go
create clustered index ix_Sales_id_detail on Sales(id, detail)
go

-- only two values in this table as lookup or dim table
insert Region values(0, 'Dallas')
insert Region values(1, 'New York')
go

set nocount on
-- Sales is skewed
insert Sales values(0, 0)
declare @i int
set @i = 1
while @i <= 1000 begin
insert Sales  values (1, @i)
set @i = @i + 1
end
go

update statistics Region with fullscan
update statistics Sales with fullscan
go

set statistics profile on
go
--note that this query will overestimate
-- it estimates 500.5 rows but only 1 row is actually returned
select detail from Region join Sales on Region.id = Sales.id where name='Dallas' option (recompile)
--this query will underestimate
-- it also estimates 500.5 rows but in fact 1000 rows are returned
select detail from Region join Sales on Region.id = Sales.id where name='New York' option (recompile)
go

set statistics profile off
go

create statistics Region_stats_id on Region (id)
where name = 'Dallas'
go
create statistics  Region_stats_id2 on Region (id)
where name = 'New York'
go

set statistics profile on
go
--now the estimate becomes accurate (1 row) because stats Region_stats_id is used to evaluate
select detail from Region join Sales on Region.id = Sales.id where name='Dallas' option (recompile)

--the estimate becomes accurate (1000 rows) because stats Region_stats_id2 is used to evaluate
select detail from Region join Sales on Region.id = Sales.id where name='New York' option (recompile)
go

set statistics profile off
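
If you want to rerun the first two queries and see the 500.5 estimates again, you can drop the filtered statistics afterwards (this cleanup step is not part of the original demo):

-- optional cleanup so the demo can be rerun from the top
drop statistics Region.Region_stats_id
drop statistics Region.Region_stats_id2
go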

Comments
  • "Now let’s look at query “select detail from Region join Sales on Region.id = Sales.id where name='Dallas'”.    From human eye perspective, we immediately know that only one row would qualify.  "

    My human eyes tell me zero rows qualify.

    However if you said

    "Now let’s look at query “select detail from Region join Sales on Region.id = Sales.id where name='Texas'”.

    I would agree with you!    

  • Paul,

    Thanks for pointing that out. "Texas" should have been "Dallas"; it's corrected.

    [Initially the Region table contained the value "Texas". It was a typo; it should have been "Dallas".]

  • I was expecting the engine to perform fewer logical reads after the filtered stat. Am I missing something?

  • What I've found useful is to look at filtered statistics with respect to partitions in a database.  First of all, most people partition incorrectly by date, when you should really look at the composite of all the common where clauses.  If you see two, three, or more columns (which may include date) regularly appearing in your where clauses, you should consider creating a surrogate key to combine them and use that surrogate key as your partition key.  Don't fall into the trap of creating multiple data files per filegroup; try to keep them on a 1-for-1 ratio.  This reduces the need to split the I/O across several workers and increase CXPacket waits.  Note that you may need 15,000-partition support, so you'll need at least SQL 2008 SP2.  It will increase partition management overhead, and, from what I've found, you will probably have to create a pool of empty partitions during your maintenance cycle.

    Second, make sure all your indexes are aligned to each partition; I have rarely found a justifiable reason to have non-partition-aligned indexes.  Watch your degree of parallelism: many times having MAXDOP set too high leads to increased threading and marshalling.  Also watch your cost threshold for parallelism.  The default is 5, and I have yet to see a situation where it was useful set any lower than 20-30 (you may need to test, blah blah blah).

    Lastly, but certainly not least, create filtered statistics to align with each of the ranges within a partition.  Interestingly enough, partitioned indexes do not do this (and for the life of me I don't understand why it's not an option in SQL Server when creating an index).  A rough sketch of this idea follows.
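
    Below is a minimal sketch of that last suggestion, using hypothetical names (an Orders table partitioned on a PartKey column; the table, columns, statistics names, and boundary values are illustrative only, not from the post): one filtered statistics object is created per partition range so the optimizer has a histogram scoped to that range.

    -- hypothetical sketch: one filtered statistics object per partition range
    create statistics st_Orders_part1 on Orders (OrderDate, CustomerId)
    where PartKey >= 1 and PartKey < 100
    go
    create statistics st_Orders_part2 on Orders (OrderDate, CustomerId)
    where PartKey >= 100 and PartKey < 200
    go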
