Over the last couple of months I’ve been helping my customer work through MOSS crawling performance issues. The sad part is that there isn’t an easy way to do it, because there are so many moving parts and variables. A couple of months ago, some good posts went up on the Enterprise Search Blog under the tag “Perf and Scale”. They gave good insight into what specifically needs to be monitored, but still required a lot of good old-fashioned knowledge of tools like Windows Performance Monitor (Perfmon). A fellow PFE, Clint Huffman, has developed a tool that automates the analysis of Perfmon log files and generates a nice "results" report. That tool, Performance Analysis of Logs (PAL), is a free download on CodePlex. Seeing the gap in analyzing crawl performance data, and knowing what PAL can do, I decided to put the two together.
The result is a PAL threshold file and a Perfmon template file that let you capture and analyze MOSS crawling performance. The next release of PAL should include these two files, but for those of you wanting a head start, I’ve attached them here. The threshold file (xml) goes in the PAL directory, and the Perfmon template (htm) goes in the templates folder inside the PAL directory. Once the files are in place, fire up Perfmon and start capturing data on your index server. When the capture is done, fire up PAL. On the counter log tab, point to the Perfmon data file you want to analyze. Go to the Threshold file tab and select the “Microsoft Office SharePoint Server – Search” option, then answer each of the questions. This is important, because each answer factors into the analysis calculations. Once you’re done with the questions, proceed to the next tab, where you’ll find an option for the analysis interval. Choosing the right analysis interval will help you better understand the data and the opportunities for improvement you have. Let me explain.
In an ideal scenario, you’ll have the crawler running constantly to keep your search index as fresh as possible. This is the typical scenario for customers with large amounts of data. With large amounts of data, there are usually multiple content sources. Each of these content sources plays into the crawl schedule, usually with only a limited number of content sources being crawled at once. Now, how does this play into our analysis interval? Ideally we’ll have a sampling of Perfmon data that covers the whole crawl schedule, so that each content source is crawled at least once. We’ll need to know the crawl schedule as well. For this example, we’ll say that we have 20 content sources, and every hour a new content source crawl kicks off. We’d then want to set our analysis interval to 1 hour, so that the analysis intervals line up with the crawl schedule. We can then correlate the suggestions in the PAL report with the good and bad time periods of the crawl schedule, perform the suggested actions, and begin the analysis task again.

Demo Screenshots. Please note, these may vary slightly from the release names and titles, as they were taken during development.
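Going back to the analysis interval example above, here is a rough Python sketch of why matching the interval to the crawl schedule helps. It is not how PAL works internally. The counter values, the start time, and the `bucket_by_interval` helper are all made up for illustration, and it assumes crawls kick off one per hour in a known order.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_by_interval(samples, schedule_start, interval=timedelta(hours=1)):
    """Group (timestamp, value) samples into analysis intervals and average each one."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Index of the interval this sample falls into, counted from the schedule start.
        index = (timestamp - schedule_start) // interval
        buckets[index].append(value)
    return {index: sum(values) / len(values) for index, values in sorted(buckets.items())}


# Hypothetical counter data: one sample every 15 minutes over three hours.
start = datetime(2009, 1, 5, 0, 0)
values = [10, 12, 11, 13, 40, 42, 41, 45, 9, 8, 10, 9]
samples = [(start + timedelta(minutes=15 * i), v) for i, v in enumerate(values)]

averages = bucket_by_interval(samples, start)
for index, average in averages.items():
    # Assumption: one crawl starts per hour, so interval N lines up with content source N + 1.
    print(f"interval {index} (content source {index + 1}): average {average:.1f}")
```

With a 1 hour interval, the spike in the second bucket points at whichever content source started crawling in that hour. With a much longer interval, the same spike would be averaged in with the quieter hours around it and be harder to pin on one content source.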