Anatomy of Indexing
Those of you who follow my blog know I'm a big fan and instigator of local crawling, where the index server also runs the WFE role and crawls itself. The crawl target, also known as a target box, is often a separate machine from the index server; my recommendation is to make them the same. Although it takes some thought to set this up and it doesn't feel very default, it is the right thing to do in most situations where you have a medium farm or larger.
In a recent exchange on why WFEs are used in relation to indexing, Sid Shah offered some great insight into how crawling works with WSS and MOSS SharePoint content sources:
"The SPS3 protocol is used for crawling people profiles in a MOSS farm. For crawling SharePoint content, the indexer uses the STS3 protocol. The way the indexer crawls is that it communicates with a Web Service (sitedata.asmx) that runs on any server configured to be a WFE.
During the first full crawl, this web service enumerates the content in the Content Database and returns URLs and metadata to the indexer. The indexer subsequently issue http GETs to retrieve the page content and index it. During subsequent incremental crawls, this web service reads the change log through WSS OM and does the same enumeration and returns changed/added/deleted URLs and metadata to the indexer and the indexer issues http GETs to index the content.
I hope this sheds some more light into what happens.
Thanks, Sid"
First let me put this into steps for clarity:
Full or initial crawl:
1. The indexer calls the sitedata.asmx web service on a WFE
2. The web service enumerates the content in the content database and returns URLs and metadata to the indexer
3. The indexer issues HTTP GETs for those URLs to retrieve the page content, and subsequently the documents and lists, and indexes it
Incremental crawl:
1. The indexer calls the sitedata.asmx web service on a WFE, which reads the change log through the WSS OM
2. The web service enumerates the changes and returns changed/added/deleted URLs and metadata to the indexer
3. The indexer issues HTTP GETs to index the relevant content
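To make the two step lists above concrete, here is a minimal Python sketch of the flow. It is a toy simulation only: `FakeSiteData`, `Indexer`, and every method name are hypothetical stand-ins I made up, not the real sitedata.asmx interface, and the "content database" is just a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSiteData:
    """Hypothetical stand-in for the sitedata.asmx web service on a WFE."""
    content: dict                                   # url -> page body (the "content database")
    change_log: list = field(default_factory=list)  # entries of (change_id, kind, url)

    def enumerate_all(self):
        # Full crawl: enumerate every URL in the content database.
        return sorted(self.content)

    def changes_since(self, last_change_id):
        # Incremental crawl: return only change log entries newer than the token.
        return [c for c in self.change_log if c[0] > last_change_id]

    def http_get(self, url):
        # Stands in for the HTTP GET the indexer issues for each URL.
        return self.content[url]


class Indexer:
    def __init__(self, wfe):
        self.wfe = wfe
        self.index = {}
        self.last_change_id = 0

    def full_crawl(self):
        # Enumerate everything, then GET each URL and index it.
        self.index = {url: self.wfe.http_get(url) for url in self.wfe.enumerate_all()}
        if self.wfe.change_log:
            self.last_change_id = self.wfe.change_log[-1][0]

    def incremental_crawl(self):
        # Only touch URLs reported as changed, added, or deleted.
        fetched = []
        for change_id, kind, url in self.wfe.changes_since(self.last_change_id):
            if kind == "deleted":
                self.index.pop(url, None)
            else:  # "added" or "changed"
                self.index[url] = self.wfe.http_get(url)
                fetched.append(url)
            self.last_change_id = change_id
        return fetched
```

The point of the sketch is that in both crawl types the enumeration and the GETs land on whichever WFE the indexer talks to; the incremental crawl simply fetches fewer URLs.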
Note that for WSS only the STS3 protocol is used. When people profiles are present (they are unique to MOSS), the SPS3 protocol is used to crawl them.
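The protocol rule in that note can be written down as a tiny Python function. This is only an illustration of the rule as Sid states it; the function name `crawl_protocol` and its parameters are my own invention, not anything in the product.

```python
def crawl_protocol(product, crawling_profiles=False):
    """Return the protocol name used for a crawl, per the rule in the note above."""
    product = product.upper()
    if product not in ("WSS", "MOSS"):
        raise ValueError(f"unknown product: {product}")
    if crawling_profiles:
        # People profiles exist only in MOSS, so WSS can never use SPS3.
        if product != "MOSS":
            raise ValueError("people profiles exist only in MOSS")
        return "SPS3"
    # Regular SharePoint content is crawled over STS3 in both products.
    return "STS3"
```

For example, `crawl_protocol("MOSS", crawling_profiles=True)` gives `"SPS3"`, while `crawl_protocol("WSS")` and `crawl_protocol("MOSS")` both give `"STS3"`.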
Let me pick out a line and elaborate. He says, "The way the indexer crawls is that it communicates with a Web Service (sitedata.asmx) that runs on any server configured to be a WFE."
What this means is that, by default, if your index server does NOT have the WFE role (services), it cannot hit the web service on itself. Instead, it will call sitedata.asmx and issue its GETs against your WFEs. Those of you who've had experience with the indexer know that it loves to use multiple threads, and even in large environments it often makes up 50% of the traffic you see on your farm. By ensuring the index server indexes itself as a WFE, meaning it hits the sitedata.asmx web service on itself, you take that crawl load off the WFEs that serve your users. Isn't that what you were trying to do when you offloaded the index server in the first place?
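A quick back-of-the-envelope calculation in Python shows why this matters. The request counts below are illustrative numbers picked to match the 50% figure, not measurements, and `requests_per_wfe` is a hypothetical helper that assumes the load balancer spreads requests evenly.

```python
def requests_per_wfe(user_requests, crawl_requests, wfe_count, crawl_on_indexer):
    """Average requests each load-balanced WFE serves in some time window."""
    # If the index server crawls itself, the crawl requests never reach the WFEs.
    total = user_requests if crawl_on_indexer else user_requests + crawl_requests
    return total / wfe_count


# Illustrative only: crawler traffic equal to user traffic (a 50% share), two WFEs.
before = requests_per_wfe(10_000, 10_000, wfe_count=2, crawl_on_indexer=False)
after = requests_per_wfe(10_000, 10_000, wfe_count=2, crawl_on_indexer=True)
print(before, after)
```

Under those assumed numbers each WFE goes from 10,000 requests to 5,000 once the index server crawls itself.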
I found the note very informative and look forward to seeing it incorporated.
Also in that same thread, Neil, another Search expert, reinforces what I've been saying:
"The indexer does access the change log to get a list of content that has changed. It then access that content via the front end.
If you have a dedicated indexer in the farm you could enable the WSS service on that indexer and set it as the dedicated crawl target. This is still using a WFE as the crawl source but because it is a WFE that is not part of the load balanced set you will not suffer any WFE performance degradation during crawling."
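Neil's setup can be pictured as a routing decision. The Python sketch below is a toy model, not how SharePoint implements it: the `Farm` class, its method names, the server names, and the round-robin load balancing are all assumptions for illustration.

```python
from itertools import cycle


class Farm:
    """Toy model of a farm with a load-balanced WFE set and an optional crawl target."""

    def __init__(self, load_balanced_wfes, dedicated_crawl_target=None):
        self.load_balanced_wfes = list(load_balanced_wfes)
        self.dedicated_crawl_target = dedicated_crawl_target
        self._next = cycle(self.load_balanced_wfes)  # naive round-robin (assumption)

    def route_user(self):
        # User traffic always goes through the load-balanced set.
        return next(self._next)

    def route_crawler(self):
        # With a dedicated target, crawl traffic bypasses the load-balanced set.
        if self.dedicated_crawl_target is not None:
            return self.dedicated_crawl_target
        return next(self._next)
```

With `Farm(["WFE1", "WFE2"], dedicated_crawl_target="INDEX1")`, every `route_crawler()` call returns `"INDEX1"` and users only ever land on `WFE1` or `WFE2`. Without the dedicated target, crawler requests compete with users for the same two servers.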