Where to Start

I prefer to work with real-time data, but the same thing can be done with perf logs. In part 1 (see SharePoint Portal Server 2003 Crawl performance part 1), I listed each of the perf counters that should be logged; the ones we are focusing on for this part are a subset of those same counters. In this discussion, I am assuming that the environment (network, disk drives, etc.) is working without issue and that the suspicion is with the crawler.

I like to start with this subset and then move forward from there based on what is presented in these counters.

 

What are the counters trying to convey?

In the Search Gatherer object there are a couple of key counters that I always review.

  • Search Gatherer\Document Entries
    • This counter represents the current number of items (documents, lists, events, etc.) that are in the queue to be crawled.
    • When this counter is > 0, the daemon(s) recognize that there is work to be done and crawling starts.
  • Search Gatherer\Delayed Documents
    • This counter represents the number of Search Gatherer\Document Entries that exceed the Site Hit Frequency rule and need to wait their turn to be crawled. This is a normal function of crawling.
    • This is more of an informational counter, and initially I don't really take any action on it, but it is good to know the number during the crawl. In a future post we will dive into this counter a little deeper to really understand what is represented here.
  • Search Gatherer\Documents Filtered and \Documents Filtered Rate
    • These two counters work together. The Documents Filtered counter indicates the number of documents that were filtered by a daemon. This does not indicate success; it only indicates that the document was marked as completed.
    • When an item is crawled, it is typically processed by some sort of IFilter. Not all items have an IFilter; those that don't are processed by something that works like one. For example, when you import users from Active Directory, there is no people IFilter, so another piece of code handles this specific item. The bottom line is that when an item is completely crawled, this counter is incremented.
    • The Documents Filtered Rate is the number of Documents Filtered per second. This number is interesting when estimating the duration of a crawl. For example, one could estimate the remaining crawl runtime by doing the following:
      • Estimated Seconds Remaining = Search Gatherer:Document Entries / Search Gatherer:Documents Filtered Rate
      • Estimated Minutes Remaining = Estimated Seconds Remaining / 60
    • Using this estimation technique, one could get a feel for the remaining time. For example, using the data in the above graphic, you would get the following:
      • 3249 (Document Entries) / 4 (Documents Filtered Rate) = 812.25 (est. seconds remaining)
      • 812.25 (est. seconds remaining) / 60 (seconds in a minute) ≈ 13.54 (est. minutes remaining)
    • I can't stress enough that this is an estimation. The reason is that the number of documents that the crawler can filter per second can vary wildly. I have seen some customers get 140-150 per second and then drop down to 2 per second. The estimate can also change if the crawler encounters a document that has many links in it, because this will cause the Search Gatherer\Document Entries counter to increase. So it is more of a guesstimation of the time it will take than a reliable time.
  • Search Gatherer\Documents Successfully Filtered and \Documents Successfully Filtered Rate
    • These two counters also work together. The Documents Successfully Filtered counter indicates the number of documents that were filtered by a daemon without an error being returned from the IFilter.
    • The bottom line here is that when an item is completely crawled and no error was returned, the Documents Successfully Filtered counter is incremented.
    • The Search Gatherer\Documents Successfully Filtered Rate is the number of Documents Successfully Filtered per second.
  • Search Gatherer\Heartbeats
    • This counter is the basic heartbeat of the search process. It is incremented every 10 seconds. I normally use this counter to determine how long search has been up and running, as all of the counters are cumulative since the last time search was started.
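
The arithmetic behind these counters can be sketched in a few lines of Python. This is only an illustration of the estimates described above, not a tool that reads the counters for you: the function names are hypothetical, the counter values are assumed to have been copied by hand from Performance Monitor or a perf log, and treating the gap between Documents Filtered and Documents Successfully Filtered as "items that returned an error" is my reading of the two counter descriptions.

```python
from typing import Optional

# Per the description above, Search Gatherer\Heartbeats increments every 10 seconds.
HEARTBEAT_INTERVAL_SECONDS = 10


def estimate_minutes_remaining(document_entries: int,
                               documents_filtered_rate: float) -> Optional[float]:
    """Rough remaining crawl time in minutes.

    Returns None when the rate is zero or negative, because no
    meaningful estimate exists while nothing is being filtered.
    """
    if documents_filtered_rate <= 0:
        return None
    seconds_remaining = document_entries / documents_filtered_rate
    return seconds_remaining / 60


def search_uptime_minutes(heartbeats: int) -> float:
    """Approximate time since search was last started, in minutes."""
    return heartbeats * HEARTBEAT_INTERVAL_SECONDS / 60


def items_with_errors(documents_filtered: int,
                      documents_successfully_filtered: int) -> int:
    """Assumption: completed items minus error-free items = items with errors."""
    return documents_filtered - documents_successfully_filtered


if __name__ == "__main__":
    # Values from the worked example: 3249 Document Entries at 4 per second.
    minutes = estimate_minutes_remaining(3249, 4)
    print(f"Estimated minutes remaining: {minutes:.1f}")  # about 13.5
```

Because the Documents Filtered Rate can swing widely during a crawl, it is worth re-running the estimate with fresh counter values from time to time rather than trusting a single reading.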

In a later post we will discuss additional counters and what they indicate.