This is the first in a series of posts about "Not being misled by what you're seeing :)"
In the next post in this series, I'll talk about The Infamous Query Plan Bug and the origins of SPSiteManager.
Be careful not to misinterpret what you're seeing from the SharePoint Portal Server Indexer.
I recently worked with a customer who was experiencing (well, appeared to be experiencing) poor crawl performance from SharePoint Portal Server 2003. I'll follow up on this post with results from the MOSS 2007 implementation of Search, but I'm sure the results will be pretty much the same.
The key here is to make sure you are not misinterpreting the data you are being presented with. Keep the following two items in mind when examining your own portal:
Item 1: Certain IFilters can make it look as though your crawler is not working effectively, and some really can have a dramatic impact on performance.
Crawl rates can easily be misinterpreted when you have third-party IFilters installed. This section covers some of the guidance we give on the impact that certain IFilters can have on your environment. I am "NOT" saying that these IFilters are "BAD", and please don't read it that way :) I'm just noting an impact on your environment that you need to be aware of :).
Tests were run on a standalone Dell Precision 470 server with an Intel Xeon CPU running at 3.0 GHz and 3.0 GB of physical memory. SharePoint Portal Server and Windows SharePoint Services were both at Service Pack 2. This is a single-server deployment of SharePoint, with SQL Server 2000 SP4 installed on the same server.
Performance Counters Used
SearchGathererProjects\Processed Documents Rate (referred to as PDR from here on)
This counter reports the number of documents processed per second.
Performance Conclusion
Item 2: Don't misinterpret your gatherer logs.
You may find many "Delete" entries in your gatherer log for URLs that no longer exist. The delete entries show up crawl after crawl, and it looks as though the entry is never actually removed from the index.
By default, SharePoint will not remove an index entry until 3 consecutive crawls (full or incremental) have failed to reach it. The reason is that we don't want to remove an entry just because of a temporary network issue. If we removed it on the first failed crawl, it could mislead users. For instance, if you had an alert on a document library, a flaky network connection would cause the user to keep seeing "New content found" alerts for the same document between crawls. Thus, only when we can't connect to the target source after 3 attempts do we consider it a dead link.
What you may actually be seeing is hits to sites from the portal's "Sites" content source. Regardless of whether there is an entry in the index, if you have a site listed in the site directory, it will "ALWAYS" be checked by the crawler. In this case the site is gone, so every attempt is going to fail. Thus, after 3 failures, the crawler sends a Delete transaction to remove it from the index. The same is true of a content source whose target URL points to a site that no longer exists. Until you get rid of that site reference, or the content source, you will continue to see those entries in the gatherer log.
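To make the behavior above a bit more concrete, here is a rough Python sketch of the "3 failed crawls, then Delete" rule. This is only a toy model and not SharePoint's actual code: the class name, the log format, and the URLs are all made up for illustration, and I'm assuming the failure counter starts over after a Delete is issued.

```python
from collections import defaultdict

DELETE_THRESHOLD = 3  # consecutive failed crawls before a Delete is issued


class CrawlSimulator:
    """Toy model of the retry-before-delete rule (hypothetical, not SharePoint code)."""

    def __init__(self, start_urls):
        # Targets that are always visited, e.g. site directory entries
        # or content source start addresses.
        self.start_urls = list(start_urls)
        self.index = set(start_urls)       # URLs currently in the index
        self.failures = defaultdict(int)   # consecutive failures per URL
        self.log = []                      # simplified gatherer log

    def crawl(self, reachable):
        """Run one crawl; `reachable` is the set of URLs that respond."""
        for url in self.start_urls:
            if url in reachable:
                self.failures[url] = 0     # any success resets the count
                self.index.add(url)
                self.log.append(("Crawled", url))
                continue
            self.failures[url] += 1
            if self.failures[url] >= DELETE_THRESHOLD:
                # The Delete is sent even if the URL already left the index.
                self.index.discard(url)
                self.log.append(("Delete", url))
                self.failures[url] = 0     # assumption: counting starts over
            else:
                self.log.append(("Error", url))


# A site that is gone but still listed keeps producing Delete entries.
sim = CrawlSimulator(["http://portal/sites/gone"])
for _ in range(6):
    sim.crawl(reachable=set())
deletes = [entry for entry in sim.log if entry[0] == "Delete"]
print(len(deletes), "Delete entries; still indexed:", sorted(sim.index))
```

The thing to notice is that the crawler walks its list of start addresses, not the index. As long as the dead site stays in that list, the Delete entries keep coming back, even though the index entry was already removed the first time.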
I've added some new features to SPSiteManager for just this reason. They let you clean up dead entries from your site directory and from the list of sites to be crawled.
Hope this helps!
- Keith