
SharePoint Brew

Russmax [MSFT] weblog
How to determine the number of changes an incremental crawl will process prior to initiating the crawl

Customers have asked me this question several times because they want a good understanding of how long an incremental crawl is going to take.  The number of changes an incremental crawl must process has a lot to do with why one incremental crawl takes 10 minutes and another takes several hours.  It isn't the only reason an incremental crawl may take longer than expected, but it gives you some insight beforehand into how many changes you are dealing with.

Understanding the Basics:

The incremental process depends on the protocol handler being used.  This blog is solely focused on detecting changes against SharePoint sites.  When detecting changes against SharePoint sites, we use the "SharePoint 3.0 (sts3) protocol handler".

We first attempt to get the changes made since the last crawl.  We do this through the mssdmn.exe process, which calls the Site Data web service (sitedata.asmx).  The URL is:

http://servername/_vti_bin/sitedata.asmx

In my case, the URL is http://russmaxwfe/_vti_bin/sitedata.asmx

For incremental crawls we use the GetChanges method.  It is called over SOAP and is passed the last change ID that we received during the previous crawl, which enables the web service to return a list of all changes made since then.

 

How to detect changes before starting incremental crawl:

All of this can be accomplished with a series of SQL queries.  I want to remind readers that making updates or edits directly to the database is 100% not supported.  The first table to check is the MSSChangeLogCookies table in the Search database.  This table keeps track of the last change that the crawler processed for each content database.  Look at the ChangeLogCookie_new column; you'll see several rows, and the value in each will look something like this:

 

1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579660402500000;2386

 

The GUID, 888eef75-d584-4edf-b242-f5161d4c3c44, identifies the content database we're crawling against.  The last value, 2386, is the latest change ID that was processed.  So first, we need to find which content database this row is referencing.  To do this, we take the GUID, 888eef75-d584-4edf-b242-f5161d4c3c44, and run the following query against the objects table of the configuration database:

 

select * from objects with (NOLOCK) where ID = '888eef75-d584-4edf-b242-f5161d4c3c44'

 

This will output the name of the content DB.  In my case, it's MOSS_ContentDB.  So at this point, we know that the last change that was processed against MOSS_ContentDB is 2386.
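To make the cookie layout concrete, here is a minimal Python sketch that pulls the two fields used so far out of a ChangeLogCookie value. The helper name `parse_cookie` and the `ChangeLogCookie` type are hypothetical, and the five-field check is an assumption based only on the sample row above; the three fields not discussed in this post are ignored.

```python
from typing import NamedTuple


class ChangeLogCookie(NamedTuple):
    database_id: str      # GUID of the content database
    last_change_id: int   # last change ID the crawler processed


def parse_cookie(cookie: str) -> ChangeLogCookie:
    """Split a cookie value on ';' and keep the GUID and the change ID."""
    fields = cookie.strip().split(";")
    # Assumption: the cookie always has five fields, as in the sample row.
    if len(fields) != 5:
        raise ValueError(f"expected 5 ';'-separated fields, got {len(fields)}")
    # The third field is the database GUID; the last field is the change ID.
    return ChangeLogCookie(database_id=fields[2], last_change_id=int(fields[4]))


cookie = parse_cookie(
    "1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579660402500000;2386"
)
print(cookie.database_id)     # 888eef75-d584-4edf-b242-f5161d4c3c44
print(cookie.last_change_id)  # 2386
```

The GUID is what goes into the objects table lookup, and the change ID is the starting point for the next step.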

Now we need to determine all of the changes from 2386 to the latest in MOSS_ContentDB.  The eventcache table within the content database contains all of the changes up to the most recent.  So in our example above, we need to know all of the changes greater than 2386, and we run the following query:

 

select * from eventcache with (NOLOCK) where ID > 2386

 

The ID column shows all changes after 2386.  The last row is the latest change, which in my case is 2396.  So before starting the incremental crawl, I know that the crawler will process 10 changes against this content database.  After running an incremental crawl, if I check the MSSChangeLogCookies table in the Search DB, I'll see the following:

ChangeLogCookie_old column will contain:

1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579660402500000;2386 

ChangeLogCookie_New column will now contain:

1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579795185130000;2396

And the process repeats itself...
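The arithmetic in this walkthrough can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and `event_ids` stands in for the ID values returned by the eventcache query, assumed here to run from 2387 through 2396, which matches the 10 changes counted above.

```python
from typing import List, Tuple


def change_id(cookie: str) -> int:
    """Return the trailing change ID from a ChangeLogCookie value."""
    return int(cookie.strip().split(";")[-1])


def pending_changes(last_change_id: int, event_ids: List[int]) -> Tuple[int, int]:
    """Return (count, latest_id) for eventcache IDs newer than last_change_id."""
    newer = [i for i in event_ids if i > last_change_id]
    # With nothing new, the latest ID is simply the one already processed.
    latest = max(newer) if newer else last_change_id
    return len(newer), latest


old_cookie = "1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579660402500000;2386"
new_cookie = "1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579795185130000;2396"

# Assumed query result: one eventcache row per ID from 2387 to 2396.
event_ids = list(range(2387, 2397))

count, latest = pending_changes(change_id(old_cookie), event_ids)
print(count)                            # 10
print(latest)                           # 2396
print(change_id(new_cookie) == latest)  # True
```

Counting the rows rather than subtracting the two IDs means the count still holds if the ID values returned by the query are not contiguous.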

Posted: Monday, November 17, 2008 6:09 PM by Russmax

Comments

Ron Grzywacz's Blog said:

I've had the question from customers come up, about why there's differences in their crawl times. It

# November 17, 2008 1:33 PM

Bob Quinn said:

great article and very straightforward way of calculating the number of changes!  are there any other components of an incremental crawl that will add measurable effort/time to the process?  sometimes we see an incremental crawl run for hours with only one change being picked up - we're thinking it has something to do with security changes within the content / groups / users...but haven't had success predicting the impact ahead of time yet.

# June 11, 2009 10:24 AM

rose said:

This is a good explanation for SharePoint sites.  How about if we are crawling non-SharePoint sites?  How does the crawler determine the changes?

# November 17, 2009 7:20 PM