SharePoint Portal Server 2003 Crawl Performance Part 8

 

IFilters

IFilters do a lot of the heavy lifting for the daemons during a crawl. The IFilter specification documents how to write an IFilter. Once an IFilter is written and deployed then it can be used by various crawling engines, including SharePoint.

In this post we will discuss how to determine if an IFilter is marked for use with a single-threaded daemon or a multi-threaded daemon.

First we need to understand why single-threaded vs. multi-threaded is important. It all comes down to how many documents of a given type the gatherer can crawl at one time. When the gatherer first starts up daemons to crawl items, it starts what is called a multi-threaded (MT) daemon. This type of daemon can crawl up to 32 items at the same time, which means that many documents can be processed in a short amount of time. The other type of daemon is the single-threaded (ST) daemon. This type of daemon can crawl only 1 document at a time.

When dealing with files, the extension determines the specific IFilter that will be used. The IFilter's registration determines the daemon that is chosen to execute it. If an IFilter is thread safe, can operate in an MT daemon, and is registered appropriately, then it will be loaded in an MT daemon. Otherwise it will need to be loaded in the ST daemon.

How to determine the threading model of an IFilter

When an IFilter is registered on a system, its threading model is specified so that every consumer can tell whether the IFilter is thread safe. You can inspect your own registry to determine the IFilter for a specific file type and whether it is thread safe.

  1. Start out on the Indexer machine and open the registry. Select HKEY_CLASSES_ROOT (HKCR) to begin. Since we are working with .DOC files, navigate to the HKCR\.doc\PersistentHandler path. This path shows a (default) value of type REG_SZ. There is a GUID listed here. For the purpose of this discussion, I will simply refer to this as GUID1. You will need to write down this GUID for the next step.
  2. Next, navigate to the HKCR\CLSID\{guid1} path. Once at this location you will see a node below this level called PersistentAddinsRegistered. Beneath that level you will see another node that is a GUID. We will call this GUID2.
  3. Click on GUID2. This path shows a (default) value of type REG_SZ. There is yet another GUID listed here. We will call this GUID3. You will need to write this GUID down for the next step.
  4. Armed with GUID3, navigate to the path HKCR\CLSID\{guid3}\InprocServer32. Here you can see the following:
    1. (default) which is the DLL that contains the IFilter implementation.
    2. ThreadingModel which is the thread safety designation.
      1. Both indicates that this DLL can be used by both the ST and MT daemons
      2. Apartment indicates that this DLL can only be used by the ST daemon
      3. Free indicates that this DLL can be used by both the ST and MT daemons
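
As a rough sketch of the rule above, the ThreadingModel value can be mapped to a daemon type in a few lines of Python. The function name daemon_for is my own and is not something SharePoint exposes. Treating a missing or unrecognized value as ST, and ignoring case, are assumptions on my part based on the "otherwise it goes to the ST daemon" rule described earlier.

```python
def daemon_for(threading_model):
    """Return "MT" or "ST" for a ThreadingModel registry value.

    Hypothetical helper for illustration only; it mirrors the list above.
    """
    # Normalize the value; a missing value (None) becomes an empty string.
    model = (threading_model or "").strip().lower()

    # Both and Free are the two values the list above marks as usable
    # by the multi-threaded daemon.
    if model in ("both", "free"):
        return "MT"

    # Apartment can only be used by the single-threaded daemon.
    # Assumption: anything missing or unrecognized is also treated as ST.
    return "ST"
```

For example, daemon_for("Both") and daemon_for("Free") give "MT", while daemon_for("Apartment") gives "ST".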

 

Example using the .DOC extension

For this example, I will look at the Microsoft Word document type (.DOC), but this method will work for all document types.

  1. I navigated to HKCR\.doc\PersistentHandler and found that the (default) value was {98DE59A0-D175-11CD-A7BD-00006B827D94}. This is GUID1.
  2. I navigated to HKCR\CLSID\{98DE59A0-D175-11CD-A7BD-00006B827D94}. I saw the PersistentAddinsRegistered node and the GUID beneath it. The GUID beneath was {89BCB740-6119-101A-BCB7-00DD010655AF} and this is GUID2.
  3. I navigated to HKCR\CLSID\{98DE59A0-D175-11CD-A7BD-00006B827D94}\PersistentAddinsRegistered\{89BCB740-6119-101A-BCB7-00DD010655AF} and found that the (default) value was {F07F3920-7B8C-11CF-9BE8-00AA004B9986}. This is GUID3.
  4. I navigated to HKCR\CLSID\{F07F3920-7B8C-11CF-9BE8-00AA004B9986}\InprocServer32 and found two entries at this path. The (default) entry showed that offfilt.dll is the DLL that has the IFilter implementation in it. The ThreadingModel entry was Both.
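
If you need to repeat this walk for many extensions, it can be scripted. Below is a minimal sketch in Python that follows the same four steps using the standard winreg module, which is only available on Windows. The function names (read_hkcr, first_subkey_hkcr, find_ifilter) are my own. The sketch also assumes that the first node under PersistentAddinsRegistered is the one we want, which matches what I saw for .DOC but may not hold for every persistent handler.

```python
try:
    import winreg  # standard library, available on Windows only
except ImportError:
    winreg = None  # lets the module load elsewhere, e.g. for offline testing


def read_hkcr(path, name=""):
    """Read a value under HKEY_CLASSES_ROOT; name "" means the (default) value."""
    with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, path) as key:
        value, _value_type = winreg.QueryValueEx(key, name)
        return value


def first_subkey_hkcr(path):
    """Return the name of the first node beneath an HKEY_CLASSES_ROOT path."""
    with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, path) as key:
        return winreg.EnumKey(key, 0)


def find_ifilter(extension, read=read_hkcr, first_subkey=first_subkey_hkcr):
    """Follow the GUID1 -> GUID2 -> GUID3 chain for an extension such as ".doc".

    The read and first_subkey parameters are hypothetical hooks so the walk
    can be tried against sample data instead of the live registry.
    """
    # Step 1: the persistent handler for the extension (GUID1).
    guid1 = read(extension + r"\PersistentHandler")

    # Step 2: the node beneath PersistentAddinsRegistered (GUID2).
    # Assumption: the first node found is the one of interest.
    addins = "CLSID\\" + guid1 + r"\PersistentAddinsRegistered"
    guid2 = first_subkey(addins)

    # Step 3: the (default) value of that node is the IFilter class (GUID3).
    guid3 = read(addins + "\\" + guid2)

    # Step 4: the DLL and the thread safety designation.
    inproc = "CLSID\\" + guid3 + r"\InprocServer32"
    return {
        "dll": read(inproc),
        "threading_model": read(inproc, "ThreadingModel"),
    }
```

On the indexer, calling find_ifilter(".doc") walks the live registry. On the machine used for the example above, I would expect it to report offfilt.dll and Both, although the (default) entry may hold a full path rather than just the file name.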

Real life

You will find that there are some IFilter DLLs that are marked as Both but should be marked as Apartment. The result of this incorrect assignment is unpredictable. In SharePoint, you may see the MT daemons crash frequently, producing an event in the Application Event log with an ID of 1000. This is the first sign that there is a thread safety issue. I have seen other IFilters that simply produce different results each time a crawl runs. This typically presents itself when a user receives an alert for a document change even though no change was made to the document. This symptom does not always point to a thread safety issue with an IFilter, but it can.

You should contact your IFilter vendor to determine whether their IFilter is thread safe. You really want to have IFilters that are truly thread safe so that they can be used by the MT daemons. This in itself will help with your overall crawl performance. But if an IFilter is incorrectly marked, it can cause many performance problems due to crashes.

 

In a later post we will cover more items on tuning.

Published Monday, May 07, 2007 2:34 PM by tonymcin
