Problem:

When I analyzed a performance problem in an ISV data warehouse solution running on SQL Server 2000, I noticed that the query plan of a star join (a join between a huge fact table and a bunch of small dimension tables) didn't look right. Further investigation showed that certain optimizer estimates didn't match the actual values at runtime (e.g. the number of rows in the result of a join). So I wondered whether the statistics information was out of date. It looked fine for the big fact table and some of the bigger dimension tables, but it was missing on the very small dimension tables - the statistics info was just not there at all. The automatic statistics setting was turned on, though, and the customer told me it always had been.
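A quick way to check whether a table has any statistics at all is the sp_helpstats procedure; an empty result set means none exist. A small sketch - the table name below is a hypothetical dimension table, not one from the actual customer system:

```sql
-- List all statistics (auto-created and manually created) on one table.
-- An empty result set means no statistics exist for the table at all.
-- 'dbo.DimProduct' is just a placeholder for one of the small dimension tables.
EXEC sp_helpstats 'dbo.DimProduct', 'ALL';
```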

Then I got the explanation from the Optimizer dev team: SQL Server 2000 ignores very small tables (normal tables, not temp tables) for automatic statistics. The threshold is about 500 rows. Unfortunately this can happen quite often in relational data warehouse solutions that use star schemas. The effect of joining a fact table of a few hundred million rows with some small dimensions the wrong way can be dramatic - in a negative sense.
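To get an idea which tables in a database might fall under that threshold, you can look at the approximate row counts in sysindexes. A rough sketch for SQL Server 2000 (the rowcnt column is only an approximation):

```sql
-- Approximate row counts of all user tables in the current database.
-- indid 0 = heap, indid 1 = clustered index; one of the two rows
-- per table carries the table's row count.
SELECT o.name  AS table_name,
       i.rowcnt AS approx_rows
FROM   sysobjects o
JOIN   sysindexes i ON i.id = o.id
WHERE  o.xtype = 'U'
  AND  i.indid IN (0, 1)
  AND  i.rowcnt < 500
ORDER BY i.rowcnt;
```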

Solution:

Run UPDATE STATISTICS manually on the small tables. Depending on how static these tables are (the frequency of updates), it might be enough to do it only once, so that the statistics are at least not completely missing. Another option is to upgrade to SQL Server 2005, where this issue is fixed as far as the initial creation of statistics information is concerned.
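On tables of a few hundred rows a full scan is practically free, so you can afford exact statistics. A sketch of the manual update - the dimension table and column names are hypothetical:

```sql
-- Create or refresh statistics on the small dimension tables.
-- FULLSCAN reads every row, which costs next to nothing on tiny tables
-- and yields exact (not sampled) distribution statistics.
UPDATE STATISTICS dbo.DimRegion   WITH FULLSCAN;
UPDATE STATISTICS dbo.DimCurrency WITH FULLSCAN;

-- Alternatively, statistics on a specific join column can be created explicitly:
CREATE STATISTICS st_DimRegion_RegionKey
ON dbo.DimRegion (RegionKey) WITH FULLSCAN;
```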

You will find some additional useful information about automatic statistics in the following article:

http://support.microsoft.com/kb/195565/en-us

What neither SQL Server 2000 nor SQL Server 2005 solves yet is managing outdated or obsolete statistics on tiny tables (in terms of the number of rows).

This is much easier to handle with huge tables. If you add one row to a table of 1 million rows, it will very likely not matter as far as the data distribution is concerned. But if you add one row to a one-row table, you double the data. The key question is: how often should the statistics be updated? If it happens too often (e.g. with every single insert, update or delete), it might thrash the whole system.
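A pragmatic middle ground in a data warehouse is to refresh the statistics once per load cycle rather than per modification, e.g. as the last step of the ETL batch. A sketch, again with hypothetical table names:

```sql
-- Run once at the end of each nightly load instead of after every DML statement.
UPDATE STATISTICS dbo.DimRegion   WITH FULLSCAN;
UPDATE STATISTICS dbo.DimCurrency WITH FULLSCAN;

-- Or, more coarsely, refresh statistics across the whole database:
EXEC sp_updatestats;
```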

Results:

Depending on the query and the amount of data, the difference in response time can be almost anything from a few percent up to a factor of 100 or even more. In my specific customer case, the query response time went down from 30 minutes to three minutes after I ran UPDATE STATISTICS manually on the small dimension tables.