Anatomy of Indexing

Those of you who follow my blog know I'm a big fan and instigator of local crawling (WFE + Index server roles on the same machine), also known as target boxes. Often the target is separate from the index server; my recommendation is to make them the same. Although it takes some thought to set up and doesn't feel very default, it really is the right thing to do in most situations where you have a medium farm or larger.

In a recent exchange on why WFEs are used in relation to indexing, we have, from the mouth of Sid Shah, some great insight into how crawling works with WSS and MOSS SharePoint content sources.

"The SPS3 protocol is used for crawling people profiles in a MOSS farm. For crawling SharePoint content, the indexer uses the STS3 protocol. The way the indexer crawls is that it communicates with a Web Service (sitedata.asmx) that runs on any server configured to be a WFE.

During the first full crawl, this web service enumerates the content in the Content Database and returns URLs and metadata to the indexer. The indexer subsequently issue http GETs to retrieve the page content and index it. During subsequent incremental crawls, this web service reads the change log through WSS OM and does the same enumeration and returns changed/added/deleted URLs and metadata to the indexer and the indexer issues http GETs to index the content.

I hope this sheds some more light into what happens.

Thanks, Sid"

First, let me put this into steps for clarity.

Full or initial crawl:

1. The indexer communicates with the sitedata.asmx web service on a WFE.

2. The web service enumerates the content in the content database and returns URLs and metadata to the indexer.

3. With the URLs, the indexer issues HTTP GETs to retrieve the page content and, subsequently, documents and lists.

Incremental crawl:

1. The indexer communicates with the sitedata.asmx web service on a WFE, which reads the change log through the WSS OM.

2. The web service does the same enumeration and returns changed/added/deleted URLs and metadata to the indexer.

3. The indexer issues HTTP GETs to index the relevant content.

Note: for WSS, only the STS3 protocol is used. When people profiles are present (unique to MOSS), the SPS3 protocol is used to crawl them.
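To make the two flows concrete, here is a minimal Python sketch of them. It is a toy model and not SharePoint code: FakeWfe, Indexer, and Change are hypothetical names, the change log is just a list, and the "token" is an assumed stand-in for whatever position the real crawler keeps in the change log.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """One change-log entry; kind is 'add', 'update', or 'delete'."""
    kind: str
    url: str


@dataclass
class FakeWfe:
    """Hypothetical in-memory stand-in for a WFE that hosts sitedata.asmx."""
    pages: dict = field(default_factory=dict)       # url -> page content
    change_log: list = field(default_factory=list)  # ordered Change entries

    def enumerate_all(self):
        """Full crawl: return every URL currently in the content store."""
        return sorted(self.pages)

    def changes_since(self, token):
        """Incremental crawl: return entries after `token` and the new token."""
        return self.change_log[token:], len(self.change_log)

    def http_get(self, url):
        """Stand-in for the HTTP GET; returns None if the page is gone."""
        return self.pages.get(url)


class Indexer:
    def __init__(self, wfe):
        self.wfe = wfe
        self.index = {}  # url -> indexed content
        self.token = 0   # assumed change-log position after the last crawl

    def full_crawl(self):
        self.index.clear()
        for url in self.wfe.enumerate_all():          # steps 1 and 2
            self.index[url] = self.wfe.http_get(url)  # step 3
        self.token = len(self.wfe.change_log)

    def incremental_crawl(self):
        changes, self.token = self.wfe.changes_since(self.token)  # steps 1 and 2
        for change in changes:
            content = None
            if change.kind != "delete":
                content = self.wfe.http_get(change.url)  # step 3
            if content is None:
                # Deleted, or no longer retrievable: drop it from the index.
                self.index.pop(change.url, None)
            else:
                self.index[change.url] = content
        return len(changes)
```

The point of the sketch is the difference in work: a full crawl touches every URL, while an incremental crawl only issues GETs for URLs that show up in the change log since the last crawl.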

Let me pick out a line and elaborate.  He says, "The way the indexer crawls is that it communicates with a Web Service (sitedata.asmx) that runs on any server configured to be a WFE." 

So what this means is that, by default, if your index server does NOT have the WFE role (services), it cannot hit the web service on itself. It will do its GETs against your WFEs. Those of you who've had experience with the indexer know that it loves to use multiple threads and, even in large environments, often makes up 50% of the traffic you see to your farm. By ensuring the index server is indexing itself as a WFE, meaning it is hitting the sitedata.asmx web service on itself, you move that crawl load off the WFEs that serve your users. Isn't that what you were trying to do when you offloaded the index server in the first place?

I saw the note as very informative and look forward to seeing it incorporated.

Also in that same thread, Neil, another Search expert, reinforces what I've been saying:

"The indexer does access the change log to get a list of content that has changed. It then access that content via the front end.

If you have a dedicated indexer in the farm you could enable the WSS service on that indexer and set it as the dedicated crawl target. This is still using a WFE as the crawl source but because it is a WFE that is not part of the load balanced set you will not suffer any WFE performance degradation during crawling."
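A quick back-of-the-envelope sketch of why this helps, in Python. The function name and the request counts are made up for illustration, and it assumes the load balancer spreads requests evenly; the only figure taken from this post is that crawl traffic can be about half of what the farm sees.

```python
def requests_per_balanced_wfe(user_requests, crawl_requests,
                              balanced_wfes, dedicated_target):
    """Return the number of requests each load-balanced WFE has to serve.

    Assumes an even spread across the load-balanced WFEs. With a dedicated
    crawl target, crawl requests never reach the load-balanced set.
    """
    if balanced_wfes < 1:
        raise ValueError("need at least one load-balanced WFE")
    crawl_on_pool = 0 if dedicated_target else crawl_requests
    return (user_requests + crawl_on_pool) / balanced_wfes


# Hypothetical numbers: crawl traffic equal to user traffic (50% of total).
without_target = requests_per_balanced_wfe(10_000, 10_000, 2, dedicated_target=False)
with_target = requests_per_balanced_wfe(10_000, 10_000, 2, dedicated_target=True)
```

With these made-up numbers, each load-balanced WFE goes from 10,000 requests to 5,000 once the index server crawls itself. The crawl work does not disappear; it moves to the box that is not serving users.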

 

Published Monday, January 28, 2008 5:49 PM by joelo
Comments

Tuesday, January 29, 2008 4:21 AM by Henrik Kim

# re: Anatomy of Indexing

We have tried to implement this in a "Medium Farm" configuration but had issues with the hosts file getting deleted. I found a few discussions on this topic but no solution. One stated that MOSS is actually writing to the hosts file but fails and in the process deletes the file. Can you confirm this behavior and do you by any chance have a fix?

Cheers, Henrik Kim, ProActive A/S

Tuesday, January 29, 2008 10:13 AM by smc750

# re: Anatomy of Indexing

Good post. I am curious which method on SiteData.asmx the crawler is calling. Also, Microsoft has always recommended making the index server the dedicated web front end for crawling. See the TechNet article from March 2007:

http://technet2.microsoft.com/Office/en-us/library/0cf7b3cd-090a-4c5e-b2c1-6272584ba2b21033.mspx

Unfortunately, not a lot of people are reading this article.

Tuesday, January 29, 2008 2:05 PM by Russ Houberg's SharePoint Blog

# Joel Oleson and the Anatomy of Indexing

Ok. So the real reason that I chose today to on-ramp this blog. I wanted to add a little something to

Wednesday, January 30, 2008 4:06 PM by David

# re: Anatomy of Indexing

(about the post)

Sounds great, but...

I understand (and please correct me if I'm wrong) that if I use one web server to index as well, all search queries will have to go through that server, which means that this server is now very critical.

Is that true?

Wednesday, January 30, 2008 4:51 PM by Sanjeev

# re: Anatomy of Indexing

I just learned that there is a bug (not publicly documented) where the GetSite method of the sitedata web service returns null webs if the number of sites in the site collection is more than 1000! So does that mean SharePoint will not crawl a site collection that has more than 1000 sites? Joel, do you know about this issue?

Wednesday, January 30, 2008 11:08 PM by Greg

# re: Anatomy of Indexing

I have learned that the GetSite method of SiteData only returns results when there are fewer than 1000 total sites. If this method is used in the indexing process, how can more than 1000 sites be handled?

Wednesday, January 30, 2008 11:28 PM by Adrian

# re: Anatomy of Indexing

Joel,

Is there a way to programmatically add an entry to an index, instead of waiting for the crawler to crawl over that piece of SharePoint content? We need to make entries available to search results instantly (or almost, anyway).

Cheers,

Adrian.

Thursday, January 31, 2008 2:05 PM by joelo

# re: Anatomy of Indexing

Henrik,

If you're losing the hosts file config, or if it isn't correct, my recommendation would be to manage the hosts file entries yourself and not use the feature that configures it.

Also SMC750, thanks for pointing out that article.  You'll see more content on TechNet on topics similar to this in the future as this content is incorporated.

Joel

Friday, February 01, 2008 12:55 PM by joelo

# re: Anatomy of Indexing

Adrian, because crawling of SharePoint content sources is change log based, whatever has changed will be the first thing indexed. As far as forcing a certain site to be indexed, I'm not sure what the best method is for that, but it is easier now to remove entries from crawl results.

Friday, February 01, 2008 4:46 PM by joelo

# re: Anatomy of Indexing

From Sid,

This doesn’t impact WSS/MOSS search.

Search doesn’t use the GetSite() method to get anything from the WSS web service. We use GetChanges(), which in turn enumerates Virtual Servers and Sites through WSS OM.

----

This also answers the question about what method is used.

Monday, February 18, 2008 7:51 AM by Sizar

# re: Anatomy of Indexing

Joel, I'm unable to find answers about where the data is stored.

So the index server crawls the content and then stores it where? (database?)

The query server has a copy of the indexed content, but where is this stored? (file system?)

Monday, March 31, 2008 2:02 AM by Prashanthspark

# re: Anatomy of Indexing

I have run into a crawling error. I tried creating a new database and assigning the web application.

The error still continues.

I hope someone who has faced this issue can reply.

Message:

The start address <sts3://systemname:29153/contentdbid={0977e461-48ca-4701-b975-c4d1a02aca01}> cannot be crawled.

Context: Application 'Search index file on the search server', Catalog 'Search'

Wednesday, June 18, 2008 2:01 PM by Luis Du Solier G. - SharePoint en Español

# Some interesting links about SharePoint search and how the Index works

Reviewing some additional material and resources for a course I'm teaching. In the case of

Saturday, May 02, 2009 8:05 AM by Russ' SharePoint Blog

# Joel Oleson and the Anatomy of Indexing

Ok. So the real reason that I chose today to on-ramp this blog. I wanted to add a little something to Joel Oleson's post yesterday regarding the Anatomy of Indexing. First of all, I'm sure anyone interested in my take on SharePoint is probably well aware
