IFilters
IFilters do a lot of the heavy lifting for the daemons during a crawl. The IFilter specification documents how to write an IFilter. Once an IFilter is written and deployed, it can be used by various crawling engines, including SharePoint.
In this post we will discuss how to determine if an IFilter is marked for use with a single-threaded daemon or a multi-threaded daemon.
First we need to understand why single-threaded vs. multi-threaded is important. It all comes down to how many documents of a certain type can be crawled by the gatherer at one time. When the gatherer first starts up daemons to crawl items, it starts what is called a multi-threaded (MT) daemon. This type of daemon can crawl up to 32 items at the same time, so many documents can be processed in a short amount of time. The other type of daemon is the single-threaded (ST) daemon. This type of daemon can crawl only 1 document at a time.
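To make the difference concrete, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every document takes the same time to filter and that only one daemon of each type is running. The function name crawl_rounds is made up for this post and is not part of SharePoint.

```python
import math

def crawl_rounds(document_count, concurrent_items):
    """Number of sequential 'rounds' needed to filter all documents.

    Assumption: every document takes one round, and the daemon keeps
    all of its slots busy until it runs out of documents.
    """
    if concurrent_items < 1:
        raise ValueError("concurrent_items must be at least 1")
    return math.ceil(document_count / concurrent_items)

MT_SLOTS = 32  # items an MT daemon can crawl at once (from the text above)
ST_SLOTS = 1   # an ST daemon crawls one document at a time

mt_rounds = crawl_rounds(320, MT_SLOTS)
st_rounds = crawl_rounds(320, ST_SLOTS)
```

Under those assumptions, 320 documents take 10 rounds in the MT daemon and 320 rounds in the ST daemon. Real crawls will not be this tidy, but the ratio shows why the daemon type matters.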
When dealing with files, the extension determines the specific IFilter that will be used. The IFilter's registration determines the daemon that is chosen to execute it. If an IFilter is thread safe, can operate in an MT daemon, and is registered appropriately, then it will be loaded in an MT daemon. Otherwise it will need to be loaded in the ST daemon.
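The decision described above can be sketched as a small Python function. This is only an illustration of the rule, not how the gatherer is implemented. It assumes that a registered threading model of "Both" means the IFilter is eligible for the MT daemon and that any other value (such as "Apartment", or a missing value) sends it to the ST daemon. The name choose_daemon is hypothetical.

```python
def choose_daemon(threading_model):
    """Return "MT" or "ST" for a registered threading model string.

    Assumption: only "Both" marks an IFilter as safe for the
    multi-threaded daemon; everything else falls back to the
    single-threaded daemon.
    """
    model = (threading_model or "").strip().lower()
    return "MT" if model == "both" else "ST"
```

For example, choose_daemon("Both") gives "MT", while choose_daemon("Apartment") and choose_daemon(None) both give "ST".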
How to determine the threading model of an IFilter
When an IFilter is registered on a system, its threading model is specified as part of the registration. This tells every consumer whether the IFilter is thread safe or not. You can inspect your own registry to find the IFilter for a specific file type and see whether it is thread safe.
Example using the .DOC extension
For this example, I will look at the Microsoft Word document type (.DOC), but this method will work for all document types.
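The registry walk can also be scripted. The Python sketch below follows the standard Windows registration chain under HKEY_CLASSES_ROOT: the extension's PersistentHandler, then the handler's PersistentAddinsRegistered entry for the IFilter interface, then the ThreadingModel value under the IFilter's InprocServer32 key. It is a minimal sketch with some assumptions: the extension registers its PersistentHandler directly under the extension key (some file types register it through their class instead, which is not handled here), and your search product does not override the filter with its own registration. The function names are made up for this post.

```python
# Interface ID of IFilter, used as a key name under PersistentAddinsRegistered.
IID_IFILTER = "{89BCB740-6119-101A-BCB7-00DD010655AF}"

def registry_reader(key_path, value_name=""):
    """Read a value from HKEY_CLASSES_ROOT (empty name = default value)."""
    import winreg  # Windows only, so it is imported lazily
    with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, value_name)
        return value

def join_key(*parts):
    """Build a registry path from its parts."""
    return "\\".join(parts)

def threading_model_for(extension, read=registry_reader):
    """Return (IFilter CLSID, ThreadingModel) for an extension like ".doc".

    'read' is any callable taking (key_path, value_name=""), which makes
    it possible to try the logic without touching a real registry.
    """
    # Step 1: the extension points at a persistent handler GUID.
    handler = read(join_key(extension, "PersistentHandler"))
    # Step 2: the persistent handler names the IFilter class.
    filter_clsid = read(join_key(
        "CLSID", handler, "PersistentAddinsRegistered", IID_IFILTER))
    # Step 3: the IFilter's in-process server carries the threading model.
    model = read(join_key("CLSID", filter_clsid, "InprocServer32"),
                 "ThreadingModel")
    return filter_clsid, model
```

On a Windows machine you can call threading_model_for(".doc") from a Python prompt. If any key in the chain is missing, winreg raises an OSError, which usually means the extension is registered differently or has no IFilter at all.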
Real life
You will find that there are some IFilter DLLs that are marked as Both but should be marked as Apartment. The result of this incorrect assignment is unpredictable. In SharePoint, you may see the MT daemons crash frequently, producing an event in the Application Event log with an ID of 1000. This is the first sign that there is a thread safety issue. I have seen other IFilters that simply produce different results each time a crawl runs. This typically shows up when a user has an alert on a document and receives a change notification even though no change was made to the document. This symptom does not always point to a thread safety issue with an IFilter, but it can.
You should contact your IFilter vendor to determine whether their IFilter is thread safe. You really want IFilters that are truly thread safe so that they can be used by the MT daemons. This in itself will help with your overall crawl performance. But if an IFilter is incorrectly marked, it can cause many performance problems due to crashes.
In a later post we will cover more items on tuning.