Crawl Time Factors

Crawl time and crawl performance depend on a number of factors. Here are some of the more important ones:

  • Number of indexing/crawl threads
  • Size of documents, type of documents (the mix), and IFilters (single-threaded or multi-threaded)
  • WAN: end-to-end network bandwidth, latency, utilization, and packet loss
  • Memory, CPU, and NIC utilization on the source and destination servers
  • Destination server software (WSS 2.0, WSS 3.0, SPS 2003, MOSS 2007, file shares, web sites, etc.)
  • For example, indexing WSS 3.0 is more efficient than indexing WSS 2.0, since WSS 3.0 uses the change log
  • BDC for structured data uses dedicated crawl time, so this should be factored in

So, in a nutshell, what are some broad estimates?

  • If it’s tens to hundreds of MB, measure it in minutes
  • If it’s tens to hundreds of GB, measure it in hours
  • If it’s 1-10 TB+, measure it in days to a week
  • If it’s 10-100 TB, measure it in weeks
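
As a rough illustration, the tiers above can be written as a small lookup. This is only a sketch under stated assumptions: the function name rough_crawl_estimate is hypothetical, the exact boundaries (1 GB, 1 TB, 10 TB, 100 TB) are one reading of the ranges in the list, anything under 1 GB is treated as "minutes", and sizes beyond 100 TB are reported as outside the listed ranges rather than guessed at.

```python
# Size units expressed in MB. The tier boundaries are an assumption,
# chosen as one reading of the broad ranges in the list above.
MB = 1
GB = 1024 * MB
TB = 1024 * GB

# (upper bound in MB, broad estimate), checked in ascending order.
_TIERS = [
    (GB, "minutes"),              # tens to hundreds of MB
    (TB, "hours"),                # tens to hundreds of GB
    (10 * TB, "days to a week"),  # 1-10 TB
    (100 * TB, "weeks"),          # 10-100 TB
]


def rough_crawl_estimate(corpus_mb: float) -> str:
    """Return the broad time unit for a corpus of the given size in MB."""
    if corpus_mb < 0:
        raise ValueError("corpus size cannot be negative")
    for upper_bound, label in _TIERS:
        if corpus_mb <= upper_bound:
            return label
    return "outside the listed ranges"


if __name__ == "__main__":
    for size in (500 * MB, 50 * GB, 5 * TB, 50 * TB):
        print(f"{size:>12} MB -> {rough_crawl_estimate(size)}")
```

These are order-of-magnitude buckets only; the factors listed earlier (threads, IFilters, network, destination software) can move a real crawl well away from them.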

As for docs/sec, Sam says he has seen about 20 docs/sec on local crawls when performance is good and about 5 docs/sec when it is not optimal.  As noted above, even averages vary from source to source.  Curious how many items IT is up to?  23+ million on a single index box in one SSP.
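
To put those docs/sec figures in perspective, here is a back-of-the-envelope calculation. It is a sketch, not a measurement: the helper name crawl_duration_days is hypothetical, and it assumes a full crawl running at a constant rate with no pauses. Applying the 20 and 5 docs/sec rates to a 23 million item corpus is purely illustrative; nothing above says those rates were observed on that index.

```python
SECONDS_PER_DAY = 86_400


def crawl_duration_days(item_count: int, docs_per_sec: float) -> float:
    """Days needed to crawl item_count items at a constant docs_per_sec."""
    if docs_per_sec <= 0:
        raise ValueError("docs_per_sec must be positive")
    return item_count / docs_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    items = 23_000_000  # item count quoted above
    for rate in (20, 5):  # good vs. not-optimal rates quoted above
        days = crawl_duration_days(items, rate)
        print(f"{rate} docs/sec: {days:.1f} days")
    # prints:
    # 20 docs/sec: 13.3 days
    # 5 docs/sec: 53.2 days
```

The fourfold gap between the two rates is the point: the same corpus takes roughly two weeks or nearly two months depending on throughput, which is why the factors listed at the top matter.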

The Search Performance and Capacity Planning document is currently in draft and changing daily; we plan to publish it within the next month.

Additional References:

Plan performance and availability of search queries - not a lot here yet.  The performance data for Search on TechNet should be published in the next couple of weeks.


 

  • I am relaying a post by Joel Oleson on a topic close to my heart: enterprise search. He

  • Joel, can you confirm whether any of these documents will cover the performance of BDC indexing?

    I have posted an article on the MSDN Forums http://forums.microsoft.com/TechNet/ShowPost.aspx?PostID=1235666&SiteID=17 detailing some of the performance issues with crawling the BDC.  Based on these, we would be looking at 10s to 100s of MB measured in hours!

  • The file reached the maximum download limit. Check that the full text of the document can be meaningfully crawled
