New Search Indexing Whitepaper from MS IT includes Best Practices and Optimization

You may have seen this already, since it was recently featured on TechNet, but if not, you have to check it out. My good friends in MS IT land have shared details of their own search deployment. The first part is a bit slow, so if you're looking for the meat, jump to the introduction. I realize a lot of the paper is quoted here; this is done with permission. Here are some examples of the good stuff you can find...

Deploying and Supporting Enterprise Search

Here's a snippet on the environment and servers:

Table 1. Characteristics of the Enterprise Search Solution in Redmond
Characteristic: Description

Index content
  • Approximately 27 million items are indexed.
  • Indexed items include content stored on Office SharePoint Server 2007 sites, SharePoint Portal Server 2003 sites, Windows SharePoint Services sites, file shares, Microsoft Exchange public folders, custom Web sites, and structured data sources.
  • Indexed content is gathered from more than 25 content sources, some with hundreds of individual start addresses (six of which are scheduled for daily incremental indexing).

User profiles
  • Imported from the Active Directory® directory service.
  • Imported from custom data sources through the Business Data Catalog feature in Office SharePoint Server 2007.
  • Profile data supplied by end users is replicated between each of three regions.

Integration
  • Additional Business Data Catalog connections to other line-of-business applications (like the Microsoft customer relationship management [CRM] system and the in-house library).

Database and index sizes
  • SSP search database is approximately 340 gigabytes (GB).
  • SSP profiles database is approximately 70 GB.
  • Search index is approximately 300 GB.

Volume of queries
  • Approximately 500,000 queries per month.
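
To put the Table 1 figures in perspective, here is a quick back-of-the-envelope sketch of my own in Python (it is not from the whitepaper). It assumes a 30-day month and decimal gigabytes (1 GB = 10^9 bytes), neither of which the paper states, and the function names are just illustrative.

```python
# Figures quoted in Table 1 (all described as approximate in the whitepaper).
ITEMS_INDEXED = 27_000_000
QUERIES_PER_MONTH = 500_000
SEARCH_DB_GB = 340
SEARCH_INDEX_GB = 300

# Assumptions made for this sketch only (not stated in the whitepaper).
DAYS_PER_MONTH = 30
BYTES_PER_GB = 10**9


def average_queries_per_second(queries_per_month: int, days: int = DAYS_PER_MONTH) -> float:
    """Average query rate if queries were spread evenly over the month."""
    return queries_per_month / (days * 24 * 60 * 60)


def kilobytes_per_item(size_gb: float, items: int) -> float:
    """Average storage per indexed item, in kilobytes (1 kB = 1000 bytes)."""
    return size_gb * BYTES_PER_GB / items / 1000


if __name__ == "__main__":
    rate = average_queries_per_second(QUERIES_PER_MONTH)
    print(f"Average query rate: {rate:.2f} queries/second")
    print(f"Search database per item: {kilobytes_per_item(SEARCH_DB_GB, ITEMS_INDEXED):.1f} kB")
    print(f"Search index per item: {kilobytes_per_item(SEARCH_INDEX_GB, ITEMS_INDEXED):.1f} kB")
```

Under those assumptions the average works out to roughly 0.19 queries per second, with about 12.6 kB of search database and 11.1 kB of index per item. These are averages only; the paper gives no peak-load figures, so real query bursts will be higher.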

 

 

Table 4 describes the server infrastructure at Redmond for the Office SharePoint Server 2007–based solution.

Table 4. Redmond Office SharePoint Server 2007 Infrastructure
Server (number): Description and server configuration

Web front end (number varies)
  • Computers that host the Web sites belong to server farms that are separate from the SSP farm.
  • For example, MSW has two active computers as Web front-end servers. A third computer is a dedicated crawl target server that can be added to the Web server farm if required.

Query server (3)
  • Two processors (one 64-bit processor with two cores in each processor)
  • 8 GB of memory
  • Disk configuration:
      • Operating system, 50 GB (redundant array of independent disks [RAID] 1)
      • Program files, 229 GB (RAID 1)
      • Index, 558 GB (RAID 1+0)

Index server (1)
  • Eight processors (four 64-bit processors with two cores in each processor)
  • 16 GB of memory
  • Disk configuration:
      • Operating system, 50 GB (RAID 1)
      • Program files, 18 GB (RAID 1)
      • Index, 300 GB (RAID 1+0)
      • SSP Dump, 600 GB (RAID 1+0) (used for backups)

Database server (2)
  • Clustered through Windows Clustering
  • Eight processors (two 64-bit processors with four cores in each processor)
  • 10 GB of memory
  • Disk configuration:
      • Operating system, 16 GB (RAID 1)
      • Program files, 9 GB (RAID 1)
      • Data, 300 GB (RAID 1+0)
      • Logs, 100 GB (RAID 1+0)
      • Temporary database (TempDB), 26 GB (RAID 1+0)

In addition to the production farm environment, Microsoft maintains a pre-production farm environment that has two query servers, one index server, and one database server. All servers have similar configurations to their production environment counterparts.
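
If you want to compare your own farm against Table 4, it can help to write the inventory down as data. The sketch below is my own and not part of the whitepaper. It reads the table's "processors" figure as the total core count per server (so a query server counts as two cores), which is my interpretation, and it leaves out the Web front ends because their number varies. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ServerRole:
    name: str
    count: int       # number of servers in this role (Table 4)
    cores: int       # cores per server, reading "processors" as total cores
    memory_gb: int   # memory per server


# Production roles with fixed counts from Table 4.
PRODUCTION_SSP_FARM: List[ServerRole] = [
    ServerRole("Query server", count=3, cores=2, memory_gb=8),
    ServerRole("Index server", count=1, cores=8, memory_gb=16),
    ServerRole("Database server", count=2, cores=8, memory_gb=10),
]


def totals(roles: List[ServerRole]) -> Tuple[int, int, int]:
    """Return (servers, cores, memory in GB) summed across all roles."""
    servers = sum(role.count for role in roles)
    cores = sum(role.count * role.cores for role in roles)
    memory_gb = sum(role.count * role.memory_gb for role in roles)
    return servers, cores, memory_gb


if __name__ == "__main__":
    servers, cores, memory_gb = totals(PRODUCTION_SSP_FARM)
    print(f"{servers} servers, {cores} cores, {memory_gb} GB of memory")
```

With that reading of the table, the production SSP farm comes to 6 servers, 30 cores, and 60 GB of memory, not counting the Web front ends.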

 

This section is GOOD stuff!

Optimizing Crawl Performance

Because of the scale and complexity of the content indexed in the Redmond SSP, Microsoft IT adjusted several crawl-related configuration settings. The settings in Table 7 helped improve the performance of content crawling at Microsoft.

Note: This information is shared as a reference on the internal deployment at Microsoft and may not be suitable for all deployments.

Table 7. Changes in Configuration Settings to Improve Crawl Performance
Configuration setting: Description

Performance level
  • Configuration setting in the Shared Service Provider Central Administrative section that increases the responsiveness of the index service.
  • Microsoft IT changed this value to 5.

Connection time time-out
  • Configuration setting in the Shared Service Provider Central Administrative section that determines the number of seconds that the computer performing the crawl should wait before timing out when contacting the computer hosting the content being crawled.
  • Microsoft IT changed this value to 120 seconds.

Request acknowledgement time-out
  • Configuration setting in the Shared Service Provider Central Administrative section that determines the number of seconds that the computer performing the crawl should wait for a request acknowledgement from the computer hosting the content being crawled.
  • Microsoft IT changed this value to 120 seconds.
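
The two time-outs in Table 7 cover different waits: one for reaching the content host at all, the other for hearing back after a request is sent. The Python sketch below is my own illustration of that distinction using plain sockets. It is not how the SharePoint crawler is implemented, and the class, function, and parameter names are hypothetical; only the 120-second values come from the table.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlTimeouts:
    # Both defaults are the values Microsoft IT reports in Table 7.
    connection_seconds: float = 120.0
    acknowledgement_seconds: float = 120.0


def fetch_status_line(host: str, port: int, path: str, timeouts: CrawlTimeouts) -> str:
    """Send an HTTP HEAD request and return the first line of the response.

    Raises an OSError subclass (for example a time-out) if either wait
    exceeds its limit.
    """
    # Wait 1: how long to wait for the TCP connection to the content host.
    with socket.create_connection((host, port), timeout=timeouts.connection_seconds) as sock:
        # Wait 2: how long to wait for the host to answer the request.
        sock.settimeout(timeouts.acknowledgement_seconds)
        request = f"HEAD {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        first_bytes = sock.recv(1024)
    return first_bytes.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
```

A call such as fetch_status_line("intranet.example.com", 80, "/", CrawlTimeouts()) (the host name is a placeholder) would give a slow server up to two minutes for each wait before giving up. Longer time-outs mean fewer items skipped on slow hosts, at the cost of a crawler that can sit waiting longer on dead ones.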

 
And finally, their best practices, straight from Lauri Ellis and Sam Crewdson... This is the best part, so don't miss it:

Best Practices

Microsoft IT has gained practical, real-world experience with designing, deploying, administering, and operating enterprise search by using Office SharePoint Server 2007 and SQL Server 2005. Because of this experience, Microsoft IT recommends the following best practices in the areas of deployment and architecture:

Implement fault tolerance for SQL Server by using Windows Clustering. Office SharePoint Server stores the search index in a database that SQL Server hosts. Ensure that there are no service outages by configuring two or more computers running SQL Server 2005 as nodes in a cluster. An administrator can use Windows Clustering to create a cluster.

Place search and TempDB databases on dedicated high-speed disk drives. Indexing and search queries create a large amount of disk activity on the disk drives where the search and TempDB databases are stored. To improve performance, place each database on a separate, dedicated high-speed drive.

Run index services, search services, and SQL Server on high-performance computers. Indexing and search create high processor and memory utilization on the computers that run them. To reduce the amount of time for indexing and search response, run index and search services in Office SharePoint Server 2007 and SQL Server 2005 on dedicated computers with sufficient processor and memory system resources.

Note: Although Office SharePoint Server 2007 can be deployed on 32-bit servers, we recommend employing 64-bit servers in Office SharePoint Server 2007 farm deployments. For more information, refer to "Estimate Performance and Capacity Requirements for Search Environments" at http://technet2.microsoft.com/Office/en-us/library/5465aa2b-aec3-4b87-bce0-8601ff20615e1033.mspx?mfr=true.

Dedicate a front-end server in a Web farm to be a crawl target server for large sites. Crawling content on large sites consumes a large amount of system resources on the server that hosts the sites. Dedicating one of the servers in the Web farm to be a crawl target server allows the index to be updated during peak periods of use.

Populate development, test, or pre-production environments by using backups of the SSP index in the production environment. Restore backups of the SSP index in the production environment for development and testing purposes. Although some indexing occurs in these environments after the backups are restored, restoring from the production index avoids the need to re-index the full set of content.

Ensure that all servers have the latest hotfixes, updates, and service packs. Installing the latest hotfixes, updates, and service packs on all servers helps to minimize any security threats and helps to minimize function-related or feature-related product problems. Ensure that all servers are running the same level of hotfixes and service packs, because running mixed levels can have unpredictable results.
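
The last recommendation (same hotfix and service pack level everywhere) is easy to check once you have an inventory. Here is a minimal Python sketch of my own, not from the whitepaper. The server names and patch-level labels are placeholders, and how you collect the real values is left to your own tooling.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_patch_level(patch_levels: Dict[str, str]) -> Dict[str, List[str]]:
    """Group server names by the patch level each one reports."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for server, level in sorted(patch_levels.items()):
        groups[level].append(server)
    return dict(groups)


def is_consistent(patch_levels: Dict[str, str]) -> bool:
    """True when every server reports the same patch level."""
    return len(set(patch_levels.values())) <= 1


# Hypothetical inventory: names and levels are placeholders, not real builds.
EXAMPLE_INVENTORY = {
    "query-01": "level-A",
    "query-02": "level-A",
    "index-01": "level-B",
}

if __name__ == "__main__":
    if not is_consistent(EXAMPLE_INVENTORY):
        for level, servers in group_by_patch_level(EXAMPLE_INVENTORY).items():
            print(f"{level}: {', '.join(servers)}")
```

In the example inventory, index-01 is the odd one out, which is exactly the mixed-level situation the paper warns about.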

Best practices in the area of administration include:

Minimize the number of content sources. Reducing the number of content sources reduces the administrative complexity. Fewer content sources equates to fewer entities to administer, manage, and monitor.

Use a separate content source for People Profiles. For large My Site deployments, creating a separate content source for People Profiles enables specific crawl configuration settings (such as the type of crawl or crawl frequency) for the My Site content.

Review start addresses. The larger the number of start addresses, the more content is indexed. Review the start addresses in each content source and eliminate any unnecessary start addresses.

Review and update crawl rules regularly. Crawl rules determine the content sources to crawl, file types to crawl, and other crawl-related criteria. Review and update the crawl rules to ensure that all content is included in the index.

Review and update Best Bets and keywords regularly. Best Bets and keywords determine the relevance of content. Review and update the Best Bets and keywords to help identify relevance for all content.

Review and update the search metadata property schema regularly. Crawled properties, managed properties, and the mapping between the properties determine the content metadata to index and include in search queries. Update the search metadata property schema to include the new metadata added to content.
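
For the "review start addresses" advice, one simple starting point is to look for start addresses that sit underneath another start address on the same host. The Python sketch below is my own and not from the whitepaper, and the URLs are made-up examples. Whether a nested address is really unnecessary depends on your crawl settings, so treat the output as candidates to review, not a delete list.

```python
from typing import List, Tuple
from urllib.parse import urlsplit


def _normalize(url: str) -> Tuple[str, str, str]:
    """Return (scheme, host, path) with a trailing slash on the path."""
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return parts.scheme.lower(), parts.netloc.lower(), path


def find_nested_start_addresses(start_addresses: List[str]) -> List[Tuple[str, str]]:
    """Return (child, parent) pairs where child lies under parent."""
    normalized = [(url, _normalize(url)) for url in start_addresses]
    pairs = []
    for child, (c_scheme, c_host, c_path) in normalized:
        for parent, (p_scheme, p_host, p_path) in normalized:
            same_site = (c_scheme, c_host) == (p_scheme, p_host)
            # The trailing slash keeps /sites/hrx from matching /sites/hr.
            if same_site and c_path != p_path and c_path.startswith(p_path):
                pairs.append((child, parent))
    return pairs


# Hypothetical start addresses, for illustration only.
EXAMPLE_START_ADDRESSES = [
    "http://intranet.example.com/sites/hr",
    "http://intranet.example.com/sites/hr/benefits",
    "http://teams.example.com/",
]

if __name__ == "__main__":
    for child, parent in find_nested_start_addresses(EXAMPLE_START_ADDRESSES):
        print(f"{child} is already under {parent}")
```

Here only the benefits address is flagged, because it sits under the hr start address on the same host.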

Best practices in the area of content management include:

Include keywords within file names. The indexing service identifies content where keywords are included in the file name within the URL path as more relevant than content where keywords are not included in the file name. If possible, make the file name (the rightmost part of the URL) readable. Use spaces (%20) between keywords in the file names so that the search engine is able to identify keywords in the URL.

Include keywords or descriptive text within anchor text. Anchor text is the text between the <A HREF=> and </A> tags in hyperlinks (the text that appears on the Web page and that the user clicks to go to the content). Typically, this text is highly indicative of the content referenced by the link. As such, the search engine uses this text as a separate relevance ranking computation after indexing.

Anchor text often carries better information about a document than the document itself (in relationship to indexing). Search crawls the following elements in the anchor text:

HTML anchor elements

Windows SharePoint Services link lists

Office SharePoint Portal Server 2003 listings

Microsoft Office Word 2007, Microsoft Office Excel® 2007, and Microsoft Office PowerPoint® 2007 hyperlinks (only for files that use the new Open XML Formats)

Include keywords and descriptive text within titles. Make sure that the titles of Web pages or documents convey the content and are readable as search results. Avoid titles like “Home” or “Index.” Instead, create titles that focus on the content covered on the page. Titles have a higher weight than the full text when ranked by the search engine.

Include relevant metadata in Microsoft Office documents and HTML pages. Files that Microsoft Office creates can contain information that describes attributes about the file (such as author, manager, keywords, or custom properties). The indexing service crawls the metadata, and the query processor uses the metadata to determine relevance of the file in a search query. Consider adding metadata, in the form of keywords, to Microsoft Office documents or HTML pages. This can be especially useful when the document is not rich in text (such as a spreadsheet or an image-rich document).

Include relevant metadata in files stored in SharePoint document libraries. Files stored in SharePoint document libraries contain information that describes attributes about the file (such as author, department, keywords, or custom properties). The indexing service crawls the metadata and uses the metadata to determine relevance of the file in a search query. For example, ensure that the author property reflects the person who actually authored the document, not the person who created the template or uploaded the document to a site.

Place important content higher in site hierarchy. Documents and Web pages that appear high in the URL hierarchy in an organization (with fewer slash marks) tend to be identified as more important. In instances where you can place a priority or importance on content, placing the sites that contain the more important content higher in the site hierarchy helps ensure that users will find the content that is of the most use to them.
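
Two of the tips above are about the URL itself: keywords in the file name (separated by spaces, i.e. %20) and how deep the content sits in the hierarchy. Here is a small Python sketch of my own, not from the whitepaper, that pulls both out of a URL so you can eyeball your own content. Splitting only on spaces is my simplification of the paper's advice and is not the actual SharePoint word breaker; the URLs are made-up examples.

```python
from typing import List
from urllib.parse import unquote, urlsplit


def file_name_keywords(url: str) -> List[str]:
    """Space-separated words in the file name (rightmost URL segment)."""
    path = urlsplit(url).path
    file_name = unquote(path.rsplit("/", 1)[-1])  # %20 becomes a space
    stem = file_name.rsplit(".", 1)[0] if "." in file_name else file_name
    return [word.lower() for word in stem.split()]


def url_depth(url: str) -> int:
    """Number of non-empty path segments (more slashes means deeper)."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


if __name__ == "__main__":
    examples = [
        "http://intranet.example.com/hr/Vacation%20Policy%202008.docx",
        "http://intranet.example.com/sites/teams/hr/docs/VacationPolicy2008.docx",
    ]
    for url in examples:
        print(url_depth(url), file_name_keywords(url))
```

The first URL yields three readable keywords (vacation, policy, 2008) at depth 2. The second yields a single run-together token at depth 5, which is the kind of file the paper suggests renaming and moving up.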

Best practices in the areas of monitoring and operations include:

Monitor indexing and search services with Microsoft Operations Manager 2005. Download the Office SharePoint Server 2007 Management Pack, SQL Server Management Pack, Windows Base Operating System Management Pack for Microsoft Operations Manager 2005, and Internet Information Services (IIS) Management Pack for Microsoft Operations Manager 2005 to monitor the indexing and search services. These management packs can help identify when the services are not running and collect statistics about the health and status of the services.

Pause index creation instead of stopping an index build if crawling must be temporarily halted. Stopping an index build requires that a full update is performed the next time the index is built. Pausing an index build enables an administrator to resume the index build at the point where he or she paused the process.

Perform systematic backups of the index and SSP databases. This helps ensure a quick recovery from a potential disaster. For example, a full crawl of an index such as the one on the Redmond SSP can take a few weeks to complete.
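
The pause-versus-stop advice is easier to see with a toy model. The Python class below is my own illustration, not from the whitepaper: it only models the behavior the paper describes (pause keeps your place, stop means starting over) and has nothing to do with the real SharePoint object model. All names are hypothetical.

```python
class CrawlJob:
    """Toy model: pausing keeps progress, stopping discards it."""

    def __init__(self, total_items: int) -> None:
        self.total_items = total_items
        self.items_done = 0
        self.state = "idle"

    def start(self) -> None:
        self.state = "running"

    def crawl(self, batch: int) -> None:
        """Process up to `batch` more items while running."""
        if self.state != "running":
            raise RuntimeError(f"cannot crawl while {self.state}")
        self.items_done = min(self.total_items, self.items_done + batch)
        if self.items_done == self.total_items:
            self.state = "complete"

    def pause(self) -> None:
        # Progress is kept, so resume() continues from the same point.
        self.state = "paused"

    def resume(self) -> None:
        if self.state != "paused":
            raise RuntimeError("only a paused job can be resumed")
        self.state = "running"

    def stop(self) -> None:
        # Progress is discarded: the next start() is a full crawl again.
        self.state = "stopped"
        self.items_done = 0

    @property
    def remaining(self) -> int:
        return self.total_items - self.items_done


if __name__ == "__main__":
    job = CrawlJob(total_items=100)
    job.start()
    job.crawl(40)
    job.pause()
    job.resume()
    job.crawl(10)
    print("after pause/resume:", job.remaining, "items left")
    job.stop()
    print("after stop:", job.remaining, "items left")
```

After the pause and resume, 50 items are left; after the stop, all 100 are left again. When a full crawl takes weeks, as the paper says it can for the Redmond SSP, that difference matters.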

Comments
  • The link to source document is currently broken

  • I think your link should be:

    http://www.microsoft.com/technet/itshowcase/content/deployingsearchtwp.mspx

  • Not sure how it broke, but it's fixed now.  Thanks.

  • Great article, but a "How To" section would be helpful for some of it. I have a couple of questions.

    Table 7 is confusing. Where is the performance level for indexing in the SSP settings? Or is this the same setting as the search service level of Reduced, Partly Reduced, Maximum? What is 5?

    Crawl exclusion rules - Good tip to exclude http:\\*.* but is there a list anywhere of other exclusions to add for a slim, trim index?

    Thanks,

    AB

  • You don't change a winning team, but you do have to think about optimizing it: performance and capacity planning

  • Great insight. Must say, I am totally blown away at the firepower required to achieve indexing and query processing on 27 million items. The overheads in SharePoint are ridiculous. It is a great product but very bloated, mostly because of SQL. Why not store the information in XML? You might think I am mad, but I have seen one XML engine provide ~60% compression over raw XML and achieve query speeds which would eat SQL for breakfast, especially with structured/unstructured data like that held in SharePoint. I understand MS even offers a data storage interface for MOSS. I know what I will be doing!

    spud@circint.co.uk to know more.
