
SharePoint Portal Server 2003 Crawl Performance Part 7

 

Site Hit Frequency Rules

The Site Hit Frequency (SHF) rule is a very powerful tool for performance. As I have come to learn, many people do not really understand it or how to use it to help with performance tuning.

When tuning the gatherer it is important to understand that it is not always about speed. We would all like the crawl to complete in less than 1 hour, but depending on the size of your corpus this is not always achievable.

The SHF rule is something that will help with underpowered target servers. For example, one of the things that we crawl internally is a very underpowered machine that physically resides in Germany. This is a web server that is under someone's desk and it's probably running a 500-600 MHz processor. It does fine for what it is doing, but it can't withstand the full impact of the gatherer crawling it. So an SHF rule was created to ease the impact on this machine.

The Request Frequency has three choices on the screen and each has its own specific use. The three choices are:

  1. Request documents simultaneously
  2. Limit the number of documents requested simultaneously
  3. Wait a specified amount of time after each request

We will talk about these in reverse order. The option to Wait a specified amount of time after each request does just that. If you put 30 seconds in there, the gatherer will pull one document and then, after it completes, wait for 30 seconds before getting the next document. This is the SHF rule that was created for the machine in Germany. It works out well because the machine eventually gets all of its content crawled and is not overtaxed.
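To get a feel for what a wait-based rule costs in crawl time, here is a rough back-of-the-envelope sketch in Python. The function name and the 1,000-document corpus are made up for illustration, and the model assumes the gatherer fetches strictly one document at a time and does not pause after the last one.

```python
def min_crawl_seconds(doc_count: int, wait_seconds: float,
                      fetch_seconds: float = 0.0) -> float:
    """Rough lower bound on crawl time for a 'wait after each request' rule.

    Assumption: documents are fetched strictly one at a time, and no pause
    is needed after the final document.
    """
    if doc_count <= 0:
        return 0.0
    # Each document costs its own fetch time; a pause sits between
    # consecutive requests, so there are doc_count - 1 pauses.
    return doc_count * fetch_seconds + (doc_count - 1) * wait_seconds


# Hypothetical corpus of 1,000 documents with the 30 second wait from the text.
pause_only = min_crawl_seconds(1000, 30)
print(f"{pause_only:.0f} seconds, about {pause_only / 3600:.1f} hours")
```

Even before counting the time to fetch each document, the pauses alone add up to a bit over 8 hours for that hypothetical corpus, which is why this option fits a small, slow server rather than a large one.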

The option to Limit the number of documents requested simultaneously, again, does just that :> . If you set the number of documents to 3, then the gatherer will request 3 documents at the same time and no more. This limits the concurrent connections to this specific server. This is an important concept, especially as we move to the 3rd and final option. You could use this to throttle down the number of connections to a server, or raise the number high enough that the gatherer hits the server as hard and fast as possible. This is where the power starts to come into play, as you will see shortly.
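The idea of capping simultaneous requests can be illustrated with a small Python sketch. This is not how the gatherer is implemented; it is only a generic demonstration of the same throttling concept using a semaphore, and `crawl_with_limit` and the fake `fetch` callable are names invented for the example.

```python
import threading
import time


def crawl_with_limit(doc_ids, max_simultaneous, fetch):
    """Fetch every document, never running more than max_simultaneous
    fetches at once. Returns the peak concurrency that was observed."""
    gate = threading.BoundedSemaphore(max_simultaneous)
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker(doc_id):
        nonlocal active, peak
        with gate:  # blocks while max_simultaneous fetches are in flight
            with lock:
                active += 1
                peak = max(peak, active)
            try:
                fetch(doc_id)
            finally:
                with lock:
                    active -= 1

    threads = [threading.Thread(target=worker, args=(d,)) for d in doc_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


# Twelve pretend documents, a limit of 3, and a fetch that just sleeps.
peak = crawl_with_limit(range(12), 3, lambda doc_id: time.sleep(0.01))
print("peak simultaneous requests:", peak)
```

However many worker threads exist, the target never sees more than the configured number of requests in flight, which is the behavior the rule gives you against a specific server.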

The option to Request documents simultaneously is the default and it can overpower some servers so use it sparingly. This item will pull as many items from the target server as it possibly can at any one time. Using this option will instantly increase the number of open connections to the server depending on the available filtering threads.

If you read part 4, then you remember the discussion on the Search Gatherer\Server Objects. The Server object is the current governor. When a server object is created it will set a default maximum number of connections to the server itself. This number is calculated as 4 + the number of procs on your indexer. So in my case, I have a dual proc machine, so if I am crawling my machine it will be 4+2=6. The server object will limit the number of concurrent connections to 6 at any one time. This was done for very valid reasons. The crawler wants to be a good corporate citizen and not overpower target machines during a crawl. This is a reasonable default to get the data quickly and not to overpower the target.

If you think about this, it can cause some interesting behaviors in the perfmon counters. For example, suppose I am crawling only my one server and I have 32 filtering threads. Assuming all of my content is self-contained on that machine, I would see 6 threads that were active and the other 26 would be idle. As with all performance tuning, there is always a bottleneck; this time it is the default maximum connections in the server object.
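The arithmetic above is easy to reproduce. The following Python sketch uses function names I made up for illustration, and it assumes a crawl that targets a single server, as in the example.

```python
def default_connection_limit(indexer_procs: int) -> int:
    """Default per-server connection cap described in the text: 4 + procs."""
    return 4 + indexer_procs


def thread_usage(filter_threads: int, connection_limit: int):
    """Return (active, idle) filtering threads when one server is crawled.

    Assumption: each open connection keeps exactly one filtering thread busy.
    """
    active = min(filter_threads, connection_limit)
    return active, filter_threads - active


limit = default_connection_limit(2)      # dual proc indexer
active, idle = thread_usage(32, limit)   # 32 filtering threads
print(limit, active, idle)
```

For the dual proc indexer with 32 filtering threads this gives a limit of 6, with 6 threads active and 26 idle.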

The SHF rule can be used to override this maximum connection limit. Here is where the tuning becomes crucial. If your indexer is already peaking out on CPU or disk utilization, using the Request documents simultaneously option would be bad for you, because setting this option will increase the stress on your indexer machine. This may or may not be what you want to do. I would suggest that you experiment and document what you do to understand what is best in your environment.
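One way to keep the options straight is to model how many connections each one allows to a single server. This Python sketch is only my reading of the behavior described in this post; `ShfRule`, the mode strings, and `max_open_connections` are hypothetical names, not anything exposed by the product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ShfRule:
    mode: str        # "simultaneous", "limit", or "wait" (labels invented here)
    limit: int = 0   # only meaningful when mode == "limit"


def max_open_connections(filter_threads: int, indexer_procs: int,
                         rule: Optional[ShfRule] = None) -> int:
    """Upper bound on concurrent connections to one target server."""
    if rule is None:
        # No SHF rule: the server object's default cap of 4 + procs applies.
        return min(filter_threads, 4 + indexer_procs)
    if rule.mode == "wait":
        # One document at a time, with a pause after each request.
        return 1
    if rule.mode == "limit":
        return min(filter_threads, rule.limit)
    if rule.mode == "simultaneous":
        # Bounded only by the filtering threads that are available.
        return filter_threads
    raise ValueError(f"unknown mode: {rule.mode!r}")


for rule in (None, ShfRule("wait"), ShfRule("limit", 10), ShfRule("simultaneous")):
    print(rule, max_open_connections(32, 2, rule))
```

The jump from the default to the unthrottled case is the extra load that both the indexer and the target have to absorb, which is why it pays to check CPU and disk on the indexer first.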

The way that I tested this was with the Limit the number of documents requested simultaneously option. I chose to put a 10 in that SHF rule. Then I tested. I noticed that my machine could handle it just fine. So I increased it slowly up to about 32 without any difficulty. Once I hit that number, I just converted it to the Request documents simultaneously option. When crawling a SharePoint site that is on a larger "beefier" machine it is ok to do this.
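The step-by-step approach described above can be written down as a simple loop. This is a sketch of the procedure, not an automated tool: `ramp_up_limit` is an invented name, the step size of 4 is arbitrary, and `is_healthy` stands in for a person running a crawl at that limit and checking the perfmon counters on the indexer and the target.

```python
def ramp_up_limit(start, ceiling, step, is_healthy):
    """Raise the simultaneous-request limit in steps while the machines cope.

    is_healthy(limit) is a hypothetical check: run a crawl at that limit and
    report whether the indexer and target handled it. Returns the highest
    limit that passed, or None if even the starting value failed.
    """
    best = None
    limit = start
    while limit <= ceiling:
        if not is_healthy(limit):
            break
        best = limit
        if limit == ceiling:
            break
        limit = min(limit + step, ceiling)
    return best


# Pretend every step up to the number of filtering threads (32) looks fine.
print(ramp_up_limit(10, 32, 4, lambda limit: True))
# Pretend the target starts to struggle above 20 simultaneous requests.
print(ramp_up_limit(10, 32, 4, lambda limit: limit <= 20))
```

If the loop reaches the number of filtering threads without trouble, there is nothing left to throttle and switching to Request documents simultaneously is a reasonable next step. If it stops early, keep the last value that passed.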

If you set the Request documents simultaneously option for a file server, you will notice that the impact on the file server and the indexer will be very noticeable. Again, you have to tune these in your environment. I was working with one customer that had very large machines that could handle the load, but they were restricted by the default maximum number of connections that is imposed by the server object. They had recently migrated several different SharePoint farms into one farm. This meant that the connection limit imposed by the server object was too restrictive for their environment. We added an SHF rule and found that the gatherer started putting the idle filter threads to work. We also found that the gatherer was picking up the speed at which it crawled items. The machine was not greatly impacted at all, but we needed to ensure that the other machines were still doing ok.

When the gatherer is crawling a SharePoint site, it is using web services on the Web Front End machine, so you will need to watch that machine to ensure that the additional traffic is not too much for it to handle. Also, you are increasing the work that is being pushed to SQL Server, so you will need to monitor that as well.

If used properly, the Site Hit Frequency rule can be a great asset in performance tuning, but it can also be a great enemy, so be aware of the impact to other machines when using this rule.

 

In a later post we will discuss additional items for use with tuning.

Published Monday, May 07, 2007 1:33 PM by tonymcin
