Understanding SharePoint - Part 3 - Just because your database is larger doesn't necessarily mean that your index size will be

This is the third post in a series in regards to "Not being mislead by what your seeing :)"

DISCLAIMER: This post shows using Query Analyzer to issue statements against your SharePoint content databases.  By no means does this mean that you should change any thing in the database.   This is simply for "READING" values. And even this should be done during Off-Peak hours.


Uh, but my site has some really large documents in it, and the database is 3 times the size as a smaller site with thousands of files!  What do you mean my index isn't going to be larger??

This discussion came up a few days ago, and I felt it was worth posting externally as well.  I hope this posting helps you understand how to properly estimate your capacity planning as well as hardware requirements for your SharePoint environments.


The Scenario

You have a content database that has grown very large in size, or a content database with thousands of documents spread throughout, will the size of the full text indexes be equally larger compared to a small database and a small number of documents?

The answer is, no...Not necessarily.  You cannot determine the size of your full text indexes based on content database size nor document items alone, and I'll show you why.

The Test

Consider the following example data:

Note: Your results if you perform this test, might be slightly different.

  • I've created a second Windows SharePoint Services site called and located at simply http://www.tailspintoys.com/sites/smalfilesite

    • This site was placed into a content database simply named small_site_DB.

    • I created a unique index in Portal server, and a content source that crawls this site and places the crawled content in said index.  (We'll call it SmallFileFile_INDEX)

    • As a control value, I crawl this site now, and notice that it is ~1 MB in size, and indexed roughly 6 items (These are the default folders, pages for the site)

    • I uploaded exactly 50 word documents that have a size of roughly 1 MB each to a document library.  This equates to ~ 50 MB of content

    • From SQL Enterprise Manager, I can see that the total size of data in the database is ~80 MB (This is due to data+SQL index sizes)

    • I reset, and did another full crawl on this site and it jumps to roughly ~2MB in size, and shows 57 documents in the index. 

    • I also have SQL full text indexing enabled for this database, as they were enabled via the Windows SharePoint Services "Configure full-text search" in Windows SharePoint Services central Admin.

    • By viewing the the Full-Text properties for this database in SQL Server, I see that there is 204 items and the catalog size itself is  ~2 MB, with a unique key count of 1387.  The index size to database size ratio is 1:40

  • I've created 1 Windows SharePoint Services site called and located at simply http://www.tailspintoys.com/sites/largefilesite

    • This site was placed into a content database simply named large_site_DB.

    • I created a unique index in Portal server, and a content source that crawls this site and places the crawled content in said index.  (We'll call it LargeFile_INDEX)

    • As a control value, I crawl this site now, and notice that it is ~1 MB in size, and indexed roughly 6 items (These are the default folders, pages for the site)

    • I uploaded exactly 10 word documents that have a size of roughly 20 MB each to a document library.  This equates to ~200 MB of content

    • From SQL Enterprise Manager, I can see that the total size of data in the database is ~286 MB (This is due to data+SQL index sizes)

    • I reset, and did another full crawl on this site and stays at ~1MB in size, and only shows 7 documents in the index.

    • I also have SQL full text indexing enabled for this database, as they were enabled via the Windows SharePoint Services "Configure full-text search" in Windows SharePoint Services central Admin.

    • By viewing the the Full-Text properties for this database in SQL Server, I see that there is 124 items and the catalog size itself is  ~6 MB, with a unique key count of 1307. The index size to database size ratio is 1:47, roughly the same as the small database in size comparison.

The Portal full text index for the larger content database with large items, didn't grow at all, while the full text index for the smaller sized database with many smaller items, did.

The SQL full text index for the larger content database, grew to 6MB for only 124 items, while the SQL full text index for the smaller content database grew to 2MB with 204 items.

The Reason

For the portal full text, indexes, Portal Search products, two very important settings that determine how content is indexed, based on size.   These settings are called MaxDownloadSize and MaxGrowFactor, and are best explained by this article:

Managing Search Settings for the Portal Site

http://office.microsoft.com/en-us/assistance/HA011648411033.aspx

By default, SharePoint Portal Server will only crawl and filter a file with a size of up to 16 megabytes (MB) because of MaxDownloadSize. In our test above, only the files in http://www.tailspintoys.com/sites/smallfilesite are indexed, because their initial size does not exceed 16MB.

Another factor to consider from the above article is MaxGrowFactor.   If the file is downloaded for indexing because it does not exceed MaxDownloadSize, yet when extracted and expanded into raw text for filtering, it exceeds  MaxDownloadSizexMaxGrowFactor, only the first 64K of the file is actually indexed (Assuming the values are at their default) 

Thus only the files in http://www.tailspintoys.com/sites/smallfilesite are indexed.  So now consider where you have a very large content database with hundreds of sites, webs, document libraries, and many varying document sizes, it's quite possibly that your total index size will actually be smaller than a small content database with an equal number of documents, but all of a size below this threshold.

In Conclusion

Don't rely on database size to have any predictability on index size.   A follow up post will contain the reasons for the SQL full text indexes.  As well, we'll delve into the effect of SQL table indexes on the data based on size, counts, etc.

Hope this helps!!

 - Keith


Previous Posts on this series:

Understanding SharePoint - Part 1 - Understanding the SharePoint Portal Server Indexer
http://blogs.msdn.com/krichie/archive/2006/07/20/672755.aspx

Understanding SharePoint - Part 2 - The Infamous Query Plan Bug and The Origins of SPSiteManager
http://blogs.msdn.com/krichie/archive/2006/07/20/672755.aspx

Additional References:

Managing Search Settings for the Portal Site

http://office.microsoft.com/en-us/assistance/HA011648411033.aspx