Announcing: Availability of Infrastructure Updates
15 July 08 12:08 AM | enterprisesearch | 10 Comments   

As announced on the SharePoint Team blog this morning, we released to the web three important new updates that affect SharePoint Server 2007, Windows SharePoint Services 3.0, Project Server 2007, Search Server 2008, Search Server 2008 Express and Project Professional 2007.

The Infrastructure Update for Microsoft Office Servers (KB951297) (Download X86, Download X64) is particularly important from a Search and SharePoint Server 2007 perspective, as it contains the new Enterprise Search features that shipped in Search Server 2008 and Search Server 2008 Express but are not already in SharePoint Server 2007. These include Federated Search capability, a unified administration dashboard and several Search core platform performance updates.

For an overview of the new federation features please check out this short video which covers how to configure a federated location and configure one of the new federated search Web Parts to show results from that location. 

There’s also a growing number of articles on TechNet and MSDN that cover configuring and troubleshooting federation and extending federation with Federated Search Connectors.

The screen capture below shows how federated search results show up on a results page – the results on the right hand side and top left are federated results, the ones at the bottom left are from the local index.

image

The new Search Administration Dashboard consolidates all of the Search-related admin activities into a single place. There’s also some new functionality in the dashboard (for example, greater granularity for content source crawl history and a convenient list that shows currently running crawls and their durations), and it makes the Search Administrator’s job much easier by keeping everything close at hand.

The UI looks like the screen capture below:

image

The update leaves the old Search admin pages intact, and the links to them stay in Central Admin (along with a link to the new dashboard), so if you’ve made any changes to them or just prefer to use the existing admin pages you’re free to do so.

The other changes are all under the hood and improve Index and Query performance as well as fixing a few bugs.  Check out the KB articles below for more details.

Description of the Infrastructure Update for Microsoft Office Servers (KB951297)

Fixes Included in the Infrastructure Update for Microsoft Office Servers (KB953750)

Please read this post on the SharePoint Team blog and the installation instructions thoroughly before you install the Infrastructure Update for Microsoft Office Servers (KB951297) and the Infrastructure Update for Windows SharePoint Services 3.0 (KB951695) on SharePoint Server 2007 or Search Server 2008.

Install the Infrastructure Update for Microsoft Office Servers (Office SharePoint Server 2007)

Install the Infrastructure Update for Microsoft Office Servers (Search Server 2008)

Finally, if you’re wondering why both the Infrastructure Update for Microsoft Office Servers (KB951297) and the Infrastructure Update for Windows SharePoint Services 3.0 (KB951695) apply to Search Server 2008, you’re probably not alone!

There’s a very good reason for each: Search Server 2008 and Search Server 2008 Express are built on the Windows SharePoint Services 3.0 platform (hence the need for the Infrastructure Update for Windows SharePoint Services 3.0 (KB951695)), and the Search features come from SharePoint Server 2007 (hence the need for the Infrastructure Update for Microsoft Office Servers (KB951297)).  The latter update also includes a few bug fixes made since Search Server 2008 and Search Server 2008 Express launched.

We strongly recommend that you install the updates that apply to you as soon as your patching and maintenance schedules permit.

Richard Riley
Senior Technical Product Manager
Microsoft Corp.

Announcing: conceptClassifier for SharePoint
07 July 08 02:41 PM | enterprisesearch | 3 Comments   

conceptClassifier for SharePoint adds automatic document classification and taxonomy management to Microsoft SharePoint and works without the need to build another search index. It is installed as a set of Features that, when activated, cause new columns to be displayed in the document library listings and new menu options to appear that allow authorised users to edit the automatically generated metadata, if required.

Adding Taxonomy navigation to SharePoint

Classification results are saved directly into SharePoint Properties where Microsoft Enterprise Search can utilise the metadata for enhanced searching, such as faceted search or results filtering.

clip_image002

The accuracy of the automatic classification is driven by the underlying technology, which is based on compound term processing. This means that the classification engine performs its matching using multi-word concepts rather than simply looking for selected keywords or phrases. Taxonomy creation and maintenance is a simple process conducted using natural language rules, making it much simpler and quicker than alternative approaches.

More information about conceptClassifier for SharePoint can be found here:

http://www.conceptSearching.com

and a SharePoint demonstration can be seen here:

http://moss.conceptSearching.com

John Challis
CTO
Concept Searching Limited

Announcing: SharePoint Web Parts for FAST ESP
20 June 08 10:06 AM | enterprisesearch | 4 Comments   

It’s been around 45 days since the acquisition of FAST Search and Transfer closed and we’re moving quickly to provide interoperability for Microsoft customers between FAST ESP and Microsoft SharePoint Server.

The first deliverables from this work are a set of FAST ESP Search Web Parts for quickly integrating results from FAST ESP into SharePoint Server 2007 and a FAST ESP Search site template. 

Using these Web Parts and the Site Template, SharePoint administrators will be able to quickly and easily build FAST ESP-based search sites inside SharePoint 2007 by simply dropping in and configuring the appropriate components.

The Web Parts and Site Template are available as a free download (both compiled code and source code) from CodePlex at www.codeplex.com/espwebparts and are part of the Search Community Toolkit.

Some of the FAST ESP search capabilities that can be exposed within SharePoint Server 2007 using these Web Parts include:

Search Box Web Part -- Provides a search box for query term submission and includes “did you mean” functionality for query correction

Result List Web Part -- Displays search results and supports sorting, pagination, and navigator-based filtering

Navigator Web Part -- Displays dynamic navigators that profile search results across a set of pre-defined dimensions and allow users to refine the search through navigation clicks

Breadcrumb Web Part -- Displays the search term(s) and list of navigators used to obtain the current result set

The FAST ESP Web Parts are designed to be open and extensible, and we’re actively encouraging customers and partners to download them, customize them to align with their branding and extend them to fit their search and user experience requirements.

Expect the features, functionality and range of ESP Web Parts to grow through contributions from the search developer community as well as further contributions from the FAST & Microsoft Search Team!

FAST & Microsoft Search Teams.

Indexing Exchange Server 2007 Public Folders
06 June 08 05:56 PM | enterprisesearch | 8 Comments   

I've had several questions recently about how to index Exchange Server 2007 Public Folders with SharePoint Server 2007.

Unfortunately, with the RTM versions of both products it's not actually possible, due to a couple of issues with both Exchange Server 2007 and SharePoint Server 2007.

The good news, however, is that everything is back in working order if you install Exchange Server 2007 Service Pack 1 *and* your SharePoint Server also has Service Pack 1 installed.

The RTM versions of Search Server 2008 and Search Server 2008 Express are unaffected (as they already include the Service Pack 1 changes), so provided Exchange Server 2007 has Service Pack 1 installed they will both work.

None of this affects Exchange Server 2003, which works with the RTM versions of SharePoint Server 2007, Search Server 2008 and Search Server 2008 Express, as well as with SharePoint Server 2007 with SP1.

Hopefully this stops people scratching their heads...

Richard Riley
Senior Technical Product Manager
Microsoft Corp.

Introducing Protocol Handler.NET
04 June 08 11:13 PM | enterprisesearch | 2 Comments   

"Protocol Handler.Net is a set of .Net wrappers for the protocol handler interfaces that enable developers to create and deploy protocol handlers for SharePoint search and Search Server. 

Developers can index data and documents from any system they can connect to.

Much of the complexity and time around the development of protocol handlers, such as COM interoperability, are reduced and hidden in the wrappers themselves letting developers just concentrate on code to connect to a content source and pull data.

Protocol Handler.Net makes it possible to develop protocol handlers in C# or VB.Net and simplifies the handling of security, metadata, streaming content, deployment and management just to name a few things. It also comes with a help system and sample project to further help developers."

A big thank you to Chris Gomez from http://www.FastSharePoint.com for creating and sharing these tools and samples!

They are available now on CodePlex at http://www.codeplex.com/phdotnet and are part of the Search Community Toolkit.


Announcing: Release to Web of Documentum and FileNet Indexing Connectors
27 May 08 05:45 PM | enterprisesearch | 4 Comments   

Today marks the release of 2 new Microsoft Enterprise Search Indexing Connectors (formerly known as Protocol Handlers) for EMC Documentum 5.3 (Service Pack 4) and IBM FileNet P8 3.5.1 or 3.5.2.

The connectors are compatible with the 32-bit English-language versions of SharePoint Server 2007 (Service Pack 1), Search Server 2008 and Search Server 2008 Express, and are available as FREE downloads from:

Enterprise Search Indexing Connector 2008 for EMC Documentum

Enterprise Search Indexing Connector 2008 for IBM FileNet

Installation and configuration documentation is included in the download, and the release notes are available here:

Release Notes for Indexing Connector 2008 for EMC Documentum

Release Notes for Indexing Connector 2008 for IBM FileNet

A couple of overview videos to get you up and running with the connectors quickly are available through the links below:

Overview Video - Installing and configuring the EMC Documentum Connector

Overview Video - Installing and configuring the IBM FileNet Connector

Both of these connectors are fully supported through your existing service contract with Product Support Services or through the regular pay per incident channel.

Introducing the ExportCrawlLog STSADM Command Extension
26 May 08 08:46 PM | enterprisesearch | 2 Comments   

In versions of SharePoint prior to MOSS 2007, each time a crawl was executed a new group of log messages was stored in the database.  Also, the name of the log itself was changed in the documentation and the user interface. Formerly this log was known as the Gatherer Log, but it is now called the Crawl Log.

When troubleshooting problems with the crawl of a particular content source it was (and still is) sometimes useful to compare and contrast the messages logged between one crawl and the next.  In MOSS 2007, the storage of the crawl log messages has been optimized/minimized such that only the most recent message for a given URL is stored in the database.  As a consequence, the results from a prior crawl are overwritten by results from subsequent crawls.  In other words, you can only ever see the most recent log message for a given URL.

This is where the STSADM command extension “ExportCrawlLog” comes in. The motivation for preparing this tool is to provide a way to make a “snapshot” of the Crawl Log information at a point in time to facilitate post-mortem analysis of crawl problems.  As a bonus, in addition to extracting crawl log detail, it also provides some summary reporting features.  The goal of the tool is to provide a means of gathering data by which you can track and manage the health of your index over time.  For instance, you could set up a scheduled task to run this command once a day and generate summary reports that can provide data for trend monitoring.
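To make the "compare one crawl with the next" idea concrete, here is a minimal Python sketch that diffs two point-in-time snapshots. It assumes each snapshot has already been loaded into a dictionary mapping URL to the latest log message; that in-memory shape, the function name and the sample URLs are hypothetical and are not ExportCrawlLog's actual output format.

```python
def diff_snapshots(previous, current):
    """Compare two {url: message} crawl log snapshots.

    Returns (changed, added, removed):
      changed - {url: (old_message, new_message)} for URLs whose message differs
      added   - sorted URLs present only in the newer snapshot
      removed - sorted URLs present only in the older snapshot
    """
    changed = {
        url: (previous[url], message)
        for url, message in current.items()
        if url in previous and previous[url] != message
    }
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    return changed, added, removed


if __name__ == "__main__":
    # Hypothetical sample data standing in for two daily snapshots.
    monday = {"http://intranet/a": "Crawled", "http://intranet/b": "Access denied"}
    tuesday = {"http://intranet/a": "Crawled", "http://intranet/b": "Crawled",
               "http://intranet/c": "Error"}
    changed, added, removed = diff_snapshots(monday, tuesday)
    print(changed, added, removed)
```

Keeping one snapshot per day and diffing consecutive pairs gives you the history that the database itself no longer retains.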

ExportCrawlLog uses only the published APIs of the SharePoint Object Model and must be run on the index server of your SharePoint farm. ExportCrawlLog is available as source code on CodePlex at http://www.codeplex.com/ExportCrawlLog and is part of the Search Community Toolkit.

Please use the Discussion tracking and Issue tracking features of CodePlex to offer your feedback.

Larry Kuhn
Architect
Microsoft Consulting Services.

Understanding Total Hits & Paging in the MOSS 2007 Search API
22 May 08 11:45 AM | enterprisesearch | 3 Comments   

One of the more discussed topics I’ve seen (and struggled with myself) is around the concept of obtaining the total number of hits in a set of search results when working with the MOSS API. For instance, when I search for “sales forecast” in my SharePoint site, I want to not only see a set of paged results, 10 hits per page, but also see that my search found 127 matches. Those of you who’ve worked with the SharePoint Search Web Parts know this is a piece of cake using the Search Core Results, Search Paging and Search Statistics Web Parts.

But what if you need to roll your own solution? How can you get the same data out of your search query using the MOSS 2007 Search API? At first this can be a bit tricky but this post will hopefully show you how to knock it down to being a trivial task.

Executing a search query against the SharePoint API has you working with one of two classes that derive from the abstract class Microsoft.Office.Server.Search.Query.Query: Microsoft.Office.Server.Search.Query.KeywordQuery and Microsoft.Office.Server.Search.Query.FullTextSqlQuery. The former, KeywordQuery, is useful for simple queries, whereas the latter, FullTextSqlQuery, is much more powerful. Both provide the Execute() method, which executes the defined query and returns a collection of results of type Microsoft.Office.Server.Search.Query.ResultTableCollection. Using this object, you can get the specific results you are interested in. For instance, to get the relevant results, use the following to get an instance of a specific Microsoft.Office.Server.Search.Query.ResultTable:

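// Requires: using Microsoft.Office.Server.Search.Query; using Microsoft.SharePoint;
using (FullTextSqlQuery query = new FullTextSqlQuery(SPContext.Current.Site))
{
    // SQL-syntax full-text query; the selected columns become the result columns
    query.QueryText = "SELECT Rank, Title, Url FROM Scope() WHERE FREETEXT(defaultproperties,'sales proposal') ORDER BY Rank Desc";

    // Run the query and pick out the relevance-ranked hits
    ResultTableCollection results = query.Execute();
    ResultTable relevantResults = results[ResultType.RelevantResults];

    // do work with the results
}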

Simple enough, but the project requires much more than that as usual. What we need to do is page the results to show only 15 items per page. No problem… let’s just modify that query a bit to set the Query.StartRow & Query.RowLimit properties of the query to say what page we’re on and tell SharePoint how many results we want to get back. Take for instance if we’re on page 2 of the results… we want to start with the 16th hit as 1-15 were on page 1:

using (FullTextSqlQuery query = new FullTextSqlQuery(SPContext.Current.Site))
{
    // Page 2 at 15 hits per page: start with the 16th hit
    query.StartRow = 16;
    query.RowLimit = 15;

    query.QueryText = "SELECT Rank, Title, Url FROM Scope() WHERE FREETEXT(defaultproperties,'sales proposal') ORDER BY Rank Desc";

    ResultTableCollection results = query.Execute();
    ResultTable relevantResults = results[ResultType.RelevantResults];

    // do work with the results
}
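The start row for any page follows directly from the page number and the page size. Here is a minimal Python sketch of that arithmetic, following this post's convention that page 2 of a 15-per-page result set starts at hit 16. The function name is hypothetical, and it is worth checking the API documentation for whether StartRow itself is counted from zero or from one before relying on the exact offset.

```python
def start_row(page_number: int, page_size: int) -> int:
    """Return the position of the first hit on a page, counting hits from 1."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be positive")
    # Every earlier page consumed page_size hits; the next hit starts this page.
    return (page_number - 1) * page_size + 1


if __name__ == "__main__":
    print(start_row(2, 15))
```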

Again… pretty straightforward. Now is where it gets a bit tricky. You need to show links to provide paging… but in order to do that you need a good idea of the total size of the result set for your search query, because if there were only 43 hits, you don’t want to show options to jump to page 9. The property that gives you the number you’re looking for is ResultTable.TotalRows. Now there’s something special about this guy: he doesn’t give you an exact number… he gives you an estimate. Why an estimate? Quite simply, with all the security trimming and other complex logic inherent to search algorithms, it’s just too expensive to get a specific number. Sites like Live.com can do this because they don’t have to concern themselves with the security trimming of hits.

But this is not all… there’s another property you should pay attention to: Query.TotalRowsExactMinimum. This property tells SharePoint this is the minimum number of hits to be included in the search. It’s used to generate the estimate of total results. Think of it like a hint to search… saying “you only have to work this hard on this query.” Most search implementations only show the next few paging options… they don’t show ALL the options. For instance, if you’re on page 5, your paging control may show the following:

«Previous« 2 3 4 5 6 7 8 »Next»

In this case, you don’t need search to find ALL the results… you only need it to determine whether there are enough results to fill the page options you want to show, so you don’t show too many or too few. In the above example, you have an additional 3 pages of results you want to show. Continuing this example, with a page size of 15 and an additional 3 pages to show, the Query.TotalRowsExactMinimum property would be 45, as it already factors the Query.StartRow property into the equation:

using (FullTextSqlQuery query = new FullTextSqlQuery(SPContext.Current.Site))
{
    query.StartRow = 16;
    query.RowLimit = 15;

    // TotalRowsExactMinimum = [number of pages to show] * [page size]
    query.TotalRowsExactMinimum = 45;

    query.QueryText = "SELECT Rank, Title, Url FROM Scope() WHERE FREETEXT(defaultproperties,'sales proposal') ORDER BY Rank Desc";

    ResultTableCollection results = query.Execute();
    ResultTable relevantResults = results[ResultType.RelevantResults];

    // do work with the results
}

That’s all there really is to it! One parting word of advice: use the Query.TotalRowsExactMinimum property with care, as the higher it’s set, the greater the performance impact will be on each search query executed.
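The paging arithmetic from this post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the three-pages-either-side window are hypothetical choices, and the total passed in is the estimate returned by search, so the last link can be off when the estimate is.

```python
import math


def exact_minimum(pages_ahead: int, page_size: int) -> int:
    """The post's formula: [number of pages to show] * [page size]."""
    return pages_ahead * page_size


def page_links(current_page, page_size, total_results,
               pages_behind=3, pages_ahead=3):
    """Return the page numbers a paging control should offer."""
    # The last page that actually contains hits, given the (estimated) total.
    last_page = max(1, math.ceil(total_results / page_size))
    first = max(1, current_page - pages_behind)
    last = min(last_page, current_page + pages_ahead)
    return list(range(first, last + 1))


if __name__ == "__main__":
    print(exact_minimum(3, 15))
    print(page_links(5, 15, 127))
    print(page_links(2, 15, 43))
```

With 127 estimated hits and 15 per page, page 5 offers pages 2 through 8, matching the paging control shown earlier; with only 43 hits the control stops at page 3 instead of offering a jump to page 9.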

A special shout out & thanks to Puneet Narula @ Microsoft for helping uncover this very helpful nugget of info.

Andrew Connell (blog)
Microsoft MVP

System Center Data Protection Manager & Search
21 May 08 10:44 AM | enterprisesearch | 0 Comments   

Microsoft® System Center Data Protection Manager (DPM) is designed to help you take backups and restore data easily. For a Microsoft SharePoint® farm, DPM understands the objects within the farm and backs up the most relevant data with least amount of user intervention. DPM 2007 backs up:

1. Configuration database which stores most of the farm settings.

2. Administrator content database which stores the content of the central admin website.

3. Individual content DBs that store information about specific sites, their subsites, document libraries and documents.

Apart from this, DPM customers have highlighted a need to back up SharePoint Search service as well – both Windows SharePoint Services (WSS) Search and Microsoft Office SharePoint Server (MOSS) 2007 Search.

Backing up search is not a trivial problem, as the Search service is continuously running background processes (like crawlers) on the SharePoint farms and other locations to index the files in the Search catalog. These programs keep updating the Search service data on a continuous basis. Moreover, the search data is not restricted to one data source. It is spread across multiple data sources, which may in turn be located across different servers. The following table shows the list of data sources used by the Search service:

Data Source                             Data Type     WSS 3.0 Search  MOSS 2007 Search
Search database                         SQL database  X               X
Index Files                             Files         X               X
Shared Service Provider (SSP) database  SQL database                  X

DPM uses VSS snapshot technology, which ensures consistency of the data in a data source at a point in time. However, even this technology fails when data is being written across multiple data sources, causing inconsistent backups that result in corrupt data at the time of recovery.

To ensure a good backup of SharePoint Search service, all background activity or processes updating the search data across its multiple data sources must be paused. The steps involved in backing up are:

1. Pause all the background activity and processes updating the search data sources

2. Create VSS snapshots of the data sources and send them to backup application (in this case DPM)

3. Resume the background activities and processes.
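The three steps above amount to a pause, snapshot, resume pattern in which the resume must happen even if a snapshot fails. A minimal Python sketch of that ordering follows; the callables are hypothetical placeholders, and this is not how DPM or VSS are actually invoked.

```python
from contextlib import contextmanager


@contextmanager
def paused(pause, resume):
    """Pause background activity for the duration of the block, then resume."""
    pause()
    try:
        yield
    finally:
        # Resume even if a snapshot raised, so crawls are never left paused.
        resume()


def backup_search(pause, resume, snapshot_fns):
    """Snapshot every data source while background activity is paused."""
    results = []
    with paused(pause, resume):
        for take_snapshot in snapshot_fns:
            results.append(take_snapshot())
    return results


if __name__ == "__main__":
    log = []
    backup_search(lambda: log.append("pause"),
                  lambda: log.append("resume"),
                  [lambda: log.append("snapshot search DB"),
                   lambda: log.append("snapshot index files")])
    print(log)
```

Taking all the snapshots inside a single paused window is what keeps the search database, index files and SSP database consistent with one another.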

DPM has the infrastructure to enable such backups and ensure that the MOSS 2007 Search service is backed up in a consistent manner. The steps to back up and restore the MOSS 2007 Search service using DPM 2007 can be found at WhitePaper: MOSS 2007 Search Backup for DPM 2007

Saurabh Bansal
Program Manager
Microsoft India.

Search training material available on TechNet!
20 May 08 05:09 PM | enterprisesearch | 3 Comments   

We recently published ~17 hours of training videos for Enterprise Search on TechNet.

The 14 recorded presentations are based on training modules from a three-day, in-person training course ‘Implementing and Deploying an Enterprise Search solution’.

The presentations provide details about key Enterprise Search capabilities in Microsoft Office SharePoint Server 2007.

The content is recorded in English.

Michal Gideoni
Senior Product Manager
Microsoft Corp.

SQL Monitoring and I/O
19 May 08 08:45 PM | enterprisesearch | 2 Comments   

Hello again, the interest in the existing posts is starting to climb. As always feel free to post comments or questions to any of the posts and I will either answer your questions directly or add them as topics for the next set of posts.

This week I wanted to call out some information around monitoring SQL from a Search perspective and provide some guidance around configuration of the SQL machine. There are many great articles out there that already document much of this information. So I am attempting to consolidate this information and provide you with my "reading list." I'll also summarize some of this data in the key areas you should focus on when working with the SQL box that Search is using.

Primary list of documents to read through, ordered in recommended reading order:

1. Planning and Monitoring SQL Server Storage for SharePoint - This is a good document to start with as it discusses many of the topics at a high-level. However, it does not discuss much about the SQL utilization that Search has.

2. SQL Storage Top 10 Best Practices - A great primer for issues to consider when building out hardware for the SQL box in a Search deployment.

3. Optimizing tempdb Performance - This is very applicable to Search and I strongly recommend you follow the guidance of this article. The tempdb is used in every end-user query executed. Plus, the crawler makes reasonable use of the tempdb. Making sure the tempdb is performant will directly impact the throughput and latencies of end-user queries.

a. Working with tempdb in SQL Server 2005 - A more detailed description of how and when the tempdb is used. Useful as supportive documentation as to why you should optimize the tempdb performance.

4. SQL Predeployment I/O Best Practices - This is a great article discussing the steps you should take to validate the I/O system of a SQL box prior to going into production.

5. Troubleshooting Performance Problems in SQL Server 2005 - This is a great paper discussing all of the various bottle-necks that SQL can have.

As hinted at above, Search uses SQL in a very I/O-intensive fashion.  It is sensitive to I/O latencies on the TempDB and the Query and Crawl file groups.  The basic recommendation from the SQL team is to keep your latency (Avg. Disk sec/Read and Avg. Disk sec/Write) less than 20 ms for OK performance.  For Search I would strongly recommend:

  • 10ms or less for TempDB.  Both Search and content hosting make heavy use of TempDB; at this scale point it is recommended that you split the content away from the SSP/Search.
  • 10ms or less for the Query file group
  • 20ms or less for the Crawl file group
  • 20ms or less for the database Log file
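A spot check against these targets is easy to script. The Python sketch below assumes you have already collected Avg. Disk sec/Read and Avg. Disk sec/Write per drive role, in seconds, as the counters report them; the role names, function name and sample readings are hypothetical.

```python
# Recommended latency ceilings from the list above, in milliseconds.
THRESHOLDS_MS = {"tempdb": 10.0, "query": 10.0, "crawl": 20.0, "log": 20.0}


def over_threshold(samples_sec):
    """Return {role: worst latency in ms} for roles exceeding their ceiling.

    samples_sec maps a drive role to (avg sec/read, avg sec/write).
    An unknown role raises KeyError rather than being silently skipped.
    """
    flagged = {}
    for role, (read_s, write_s) in samples_sec.items():
        worst_ms = max(read_s, write_s) * 1000.0  # counters report seconds
        if worst_ms > THRESHOLDS_MS[role]:
            flagged[role] = worst_ms
    return flagged


if __name__ == "__main__":
    # Hypothetical readings, not measurements from any real system.
    readings = {"tempdb": (0.004, 0.012), "crawl": (0.015, 0.018)}
    print(over_threshold(readings))
```

Here the TempDB drive is flagged because its write latency is above 10 ms, while the Crawl file group stays under its 20 ms ceiling.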

As you'll note below, SearchBeta is currently unable to do this; we are in the process of obtaining new hardware (more disks) to rectify this.  Knowing I had limited hardware going into this project, I allocated more spindles to the TempDB and Query file group than to the other files.  See SearchBeta Hardware definition for more details.

With our hardware we are close to meeting our crawl freshness goals of less than 24 hours for the high-value repositories.  We see 24 hours fresh for the smaller sites and around 50 hours for the bigger ones.  Our query latencies tend to be in the 2 second range for 95% of our measured queries.            

IOPs, Throughput and Latencies:

Drive             IOPs Read (max)  IOPs Write (max)  Ratio Read/Write  Throughput Read (bytes)  Throughput Write (bytes)  Latency Read (sec)  Latency Write (sec)
Search DB Logs    14.67            1,777.29          0.01              901,126                  64,557,167                0.3060              0.8550
Temp DB           1,110.98         1,492.01          0.74              72,808,827               97,770,866                1.6870              3.5660
Query file group  3,507.26         1,631.96          2.15              148,370,386              126,034,214               3.4360              3.2140
Crawl file group  3,043.93         371.65            8.19              104,533,884              10,261,624                15.0840             15.8720

For comparison purposes I have included the current IOPs, throughput and latency numbers that I am getting on SearchBeta. These numbers should be useful as a starting point for configuring your hardware.  However, you should make sure that you test your pre-deployment environment to verify that you are within the recommended latency ranges. It is also recommended that you baseline your production system and periodically spot check to verify that you have not deviated from your baseline.  Over time you will see growth in the amount of content you are indexing and the volume of queries being executed.  Having a baseline in place and a process to verify the live system latencies will allow you to detect problems sooner.

I've dedicated most of this post to I/O intentionally, as this will be the key bottleneck that you will want to architect the system for.  The white paper Planning and Monitoring SQL Server Storage for SharePoint discusses memory a little more.  In general, if you are deploying a larger-scale (> 10 million indexed documents) MSS environment, your SQL box should be OK with 16GB of RAM.  However, if you are deploying a MOSS environment where you will be hosting People, Usage Analysis and other MOSS features, you will want to start with 32GB.

That is all for this post.  I know you are all waiting for information on SQL maintenance and information on how to create the Crawl and Query file groups.  These are the next items on my priority list and I hope to have the details out soon.

-Thanks

Dan Blood
Senior Test  Engineer
Microsoft Corp

Search Relevance Tuning
14 May 08 05:37 PM | enterprisesearch | 2 Comments   

Changing the advanced relevance settings for Search Server 2008 and SharePoint Server 2007 (such as the global ranking parameters, property weights and property length normalization) isn't something that you'll want to do without a lot of thought, planning and testing.

And in the vast majority of cases you'll never need to go near the settings, but in certain specialized deployment scenarios it might be something you want to consider.

This MSDN article and TechNet White Paper give you a very detailed review of the relevance knobs and wheels you can turn along with how to structure and plan your testing to see if you're achieving your desired end result.

To help you on your way, Christopher Even (http://www.sharepointsearch.com) has written and shared on CodePlex a small utility that allows you to change all of the advanced relevance settings quickly and easily.

The tool is here and is part of the Search Community Toolkit.

This tool comes with a relevance health warning: Read the MSDN and TechNet content before you start, use the tool with care, make a note of all of the values that you change before you change them and ensure you're consistent and diligent in your relevance testing after you make any changes!

There's nothing that the tool does that can't be done through the search object model; it just saves you having to write code to tune the various advanced parameters.

Richard Riley
Senior Technical Product Manager
Microsoft Corp

Crawling Novell Netware with SharePoint Server 2007 and Search Server 2008
12 May 08 03:07 PM | enterprisesearch | 0 Comments   

Search in SharePoint Server 2007 and Search Server 2008 can crawl a variety of content sources. For several of the “out of the box” content sources, security information in the form of ACLs is collected as part of the indexing process.  At query time, search results are trimmed based on the identity of the user submitting the query and the ACLs collected from crawling.

In certain scenarios, the built-in security trimming is not sufficient for your requirements or the indexing connector isn’t able to collect ACLs at crawl time. In these scenarios you'll need to implement a Custom Security Trimmer. For example, content in Novell Netware file shares can be crawled by SharePoint Server 2007 & Search Server 2008 (using the built-in Fileshare Indexing Connector along with Windows Services for Netware), but the indexing connector doesn’t know how to collect ACLs from Netware.

For security trimmed results, that scenario will require a custom security trimmer (CST) to be developed. We recently released a Novell Netware CST on CodePlex as part of the Search Community Toolkit.  This can be installed and configured in less than 30 minutes and has been tested against Netware 6.5 although it should work with earlier versions too.  It's compatible with SharePoint Server 2007, Search Server 2008, and Search Server 2008 Express.
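Conceptually, a custom security trimmer answers one question for each result in a batch: may this user see this URL? The Python sketch below illustrates only that idea. It is not the SharePoint trimmer interface or the Netware CST's code, and the function names, the ACL lookup callable and the sample paths are all hypothetical.

```python
def check_access(urls, user, acl_lookup):
    """Return one boolean per URL: True if `user` may see that result.

    acl_lookup(url) returns the set of users allowed to read the URL.
    An empty set denies access, so unknown content fails closed.
    """
    return [user in acl_lookup(url) for url in urls]


def trim(urls, user, acl_lookup):
    """Keep only the results the user is allowed to see, preserving order."""
    flags = check_access(urls, user, acl_lookup)
    return [url for url, allowed in zip(urls, flags) if allowed]


if __name__ == "__main__":
    # Hypothetical ACL data standing in for a live lookup against the file server.
    acls = {"file://nw1/sales/q3.doc": {"alice", "bob"},
            "file://nw1/hr/pay.xls": {"carol"}}
    lookup = lambda url: acls.get(url, set())
    print(trim(list(acls), "alice", lookup))
```

Because the check runs at query time against the source system, it adds latency to every query, which is worth keeping in mind when deciding how much work the lookup does.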

It includes documentation on how to configure a content source to crawl Netware and how to install the CST.  We’ll post the source code for the CST once we’ve finished some final testing. The code will be commented and will include a doc that provides an overview of its structure and explains where you might want to make your own modifications.  This will be a great example of how to build your own CST if you’re looking for a good sample to start from.

The initial version was released with an open source license at http://www.codeplex.com/sctcstn and includes the following features:

  • Netware Custom Security Trimmer Binaries
  • Documentation Providing Click Through Guidance
    • For Crawling Netware Content
    • For Installing the Binaries
    • For Security Trimming Netware Results
  • Semi-Automated Installation In Under 30 minutes
  • Forms Authentication Support
  • 64 bit Platform Support

Mitch Prince
Architect
Microsoft

John Kelly
Consultant
Microsoft

Creating crawl schedules and starvation - How to detect it and minimize it.
09 May 08 01:20 PM | enterprisesearch | 1 Comments   

Hello again, Dan Blood here.  For this post I want to discuss starvation as it relates to the Search crawler.  One of the more difficult tasks that the Search admin is going to face is figuring out how to build out the myriad crawl schedules needed to keep content freshly indexed.  When you are building out these schedules you will want to keep a close eye on the system using the monitoring information below and add new schedules slowly, to minimize starving the crawl of resources while maximizing the utilization of the crawler.  Starvation for Enterprise Search is defined as the crawler's inability to allocate another thread to retrieve the next document in the queue of work.  Taken broadly, this can be caused by resource (I/O) contention on the SQL machine, too many hosts concurrently participating in the crawl, "hungry" hosts that do not quickly relinquish a thread and, finally, backups (since crawls are paused during this time).

To make this conversation a little more tractable I need to define what I mean by a "hungry" host.  Hungry hosts are defined as hosts that lock up resources on the crawling side in one or more of the following circumstances:

  • Slow hosts: This is the obvious case where a host being crawled does not have the capacity to service all of the requests that the crawler is sending to it.  Sending more concurrent requests to this server can cause it to slow down even further.   
  • Hosts requiring extra work for incremental crawls: The primary example of this is SharePoint 2003.  This store tends to have a high rate of security changes and the crawler processes the entire document when a security change is detected.  Basic HTTP crawls are partially in this bucket since each document requires a round trip to the server, but the modified date is checked prior to downloading the entire document.
  • Hosts and content that are rich in properties: You will see this more commonly with the following content store types: BDC, People Import and People crawls, but any store can exhibit this trait.  BDC, People Import and People crawls by default have a large number of properties per document, which causes the SQL machine to do more work than average.

The most efficient types of crawls are:

  • SharePoint 2007:  These content stores keep a log of changes that have been made, allowing the crawler to be very selective about what content to download for incremental crawls.
  • File Shares:  Detecting if a document has been changed still requires a round-trip but the check can be done at folder levels allowing an entire folder hierarchy to be skipped if nothing lower in the hierarchy has changed.
  • Exchange Public Folders: These crawls behave just like File Share crawls.

Using the above as a guide, follow these guidelines when you start building out your content sources and crawl schedules:

1. Minimize the number of content sources that you have.  Group hosts of the same repository type and similar size into individual content sources.  The intent here is to reduce the overall count of crawls that your system will have.

2. Crawl your large SharePoint 2007 data stores first and do so until you reach steady state.

For SearchBeta this typically means crawling the large repository for approximately 7 days.  Then 3 to 4 incremental crawls are required after this to clear out any timeout errors seen in the initial full crawl.  Keep an eye on your error count per crawl; when this number is low relative to the amount of content in the crawl and does not change from one crawl to the next, you have reached steady state.  Once this state is reached, incremental crawls of very large repositories can take only a couple of hours if the change rate in the store is relatively low.

One trick that I commonly use for the initial crawl of these sites is to start with a schedule that starts an incremental every 30 minutes.  This allows the successive incremental crawls to start in the middle of the night when you are not around to see the crawl complete.

3. Do not schedule more than one (1) "hungry" content source at a time.

4. Start with a minimum of 4 concurrent crawls.  This is your starting point; use the data below to determine if your system has the head-room to add additional concurrent crawls.

5. If you reach a starved state it is best to pause your "hungry" crawls to let the remaining crawls complete. 
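Guideline 3 is easy to check mechanically once you have your schedule written down.  The following is only a minimal sketch in Python: the CrawlWindow record, its fields and the sample content source names are hypothetical, not anything exposed by SharePoint, and it assumes every crawl starts and finishes on the same day.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class CrawlWindow:
    """Hypothetical record for one scheduled crawl (not a SharePoint type)."""
    name: str
    start_hour: float  # hour of day the crawl starts, 0-24
    end_hour: float    # hour of day the crawl is expected to finish
    hungry: bool       # True for slow hosts, SharePoint 2003, BDC, People, etc.


def overlaps(a: CrawlWindow, b: CrawlWindow) -> bool:
    """Two same-day windows overlap if each one starts before the other ends."""
    return a.start_hour < b.end_hour and b.start_hour < a.end_hour


def hungry_conflicts(schedule: list[CrawlWindow]) -> list[tuple[str, str]]:
    """Return every pair of "hungry" crawls whose windows overlap."""
    hungry = [c for c in schedule if c.hungry]
    return [(a.name, b.name) for a, b in combinations(hungry, 2) if overlaps(a, b)]


# Made-up schedule, purely for illustration.
schedule = [
    CrawlWindow("SharePoint 2007 portal", 0, 6, hungry=False),
    CrawlWindow("SharePoint 2003 sites", 1, 5, hungry=True),
    CrawlWindow("People import", 4, 7, hungry=True),
    CrawlWindow("BDC catalog", 8, 10, hungry=True),
]

conflicts = hungry_conflicts(schedule)
print(conflicts)
```

Any pair this reports is a candidate for rescheduling so that only one "hungry" content source is crawling at a time.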

 

Determining if the crawler is in a starved state

The following data should be analyzed periodically while you are building and maintaining your crawl schedule(s).  You will want to look through the data below several times during this process.  Initially you will use this information to create your content sources and crawl schedules, verifying that you are not starved before adding the next crawl schedule.  Then you will want to look at this data at different times during the crawl, paying specific attention to the beginning and end of crawls containing a large amount of data.  Finally you will want to look at this data on a periodic basis.  The content you are crawling will change and grow.  This growth may be enough to drive you into a starved state and thus cause you to miss your freshness goals.

First you need to understand how many Crawl threads are used for your hardware and the maximum number of threads that can be used per host.  This number is based on the number of processors that the indexer has and the Indexer Performance setting in the Configure Office SharePoint Server Search Service Settings UI.  You can also modify the number of Crawl threads per host via Crawler Impact Rules.

These threads are the critical resource that will get starved.  The goal of minimizing starvation is to make sure you are not constrained on these resources, while maximizing their usage.  As such you want to avoid having more hosts in the crawl than you have threads to support, and you want the majority of these threads to spend only a small amount of time accessing any single document in the crawl.

The number of threads the system will use is based on the Indexer Performance setting, as follows.  In large scale environments it is recommended that you set this to Maximum, keeping in mind that you can use Crawler Impact Rules to reduce or increase the number of threads per host to adjust the load you are placing on each repository:

  • Indexer Performance - Reduced
    • Total number of threads: number of processors
    • Max Threads/host:  number of processors
  • Indexer Performance - Partially reduced
    • Total number of threads: 4 times the number of processors
    • Max Threads/host: number of processors plus 4
  • Indexer Performance - Maximum
    • Total number of threads: 16 times the number of processors
    • Max Threads/host: number of processors plus 4

There is a hard-coded max on the number of crawl threads of 64.
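To make the table above concrete, here is a minimal Python sketch that computes both limits for a given processor count and Indexer Performance setting.  The function name and the lower-case setting keys are illustrative only; the multipliers and the cap of 64 are taken from the list above.  The per-host value is returned exactly as the table gives it, with no attempt to cap it against the total.

```python
MAX_CRAWL_THREADS = 64  # hard-coded ceiling on total crawl threads, per the post

# Total threads = multiplier * number of processors, per the table above.
TOTAL_MULTIPLIER = {
    "reduced": 1,
    "partially reduced": 4,
    "maximum": 16,
}


def crawl_thread_limits(processors: int, setting: str) -> tuple[int, int]:
    """Return (total_threads, max_threads_per_host) for an indexer.

    Illustrative helper only; it mirrors the table in the post and is not
    an API of the product.
    """
    key = setting.strip().lower()
    if key not in TOTAL_MULTIPLIER:
        raise ValueError(f"unknown Indexer Performance setting: {setting!r}")
    if processors < 1:
        raise ValueError("processors must be at least 1")

    total = min(TOTAL_MULTIPLIER[key] * processors, MAX_CRAWL_THREADS)
    # Reduced allows one thread per processor per host; the other two
    # settings allow the number of processors plus 4.
    per_host = processors if key == "reduced" else processors + 4
    return total, per_host


for name in ("Reduced", "Partially reduced", "Maximum"):
    print(name, crawl_thread_limits(4, name))
```

For a 4 processor indexer set to Maximum this gives 64 total threads and 8 threads per host, so moving to 8 processors raises the per-host limit to 12 but leaves the total at the 64 thread cap.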

Monitoring

1. The first thing to look at, and the most common bottle-neck, is the pair of performance counters below for the Archival Plugin.  If they are both consistently at 500 for the Portal_Content instance or 50 for the ProfileImport instance, then you are in a starved state and you are likely bottle-necked in SQL for I/O on the Crawl DB drive.  Look into tuning SQL for better I/O.  (An upcoming post will cover diagnosing SQL I/O bottle-necks and recommended practices for configuring SQL.)

  • The counters are in the object Office Server Search Archival Plugin
    • Total Docs in first queue
    • Total docs in second queue
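If you export samples of these two counters from Performance Monitor, the check described above can be scripted.  This is a minimal sketch under stated assumptions: the function name is made up, the samples are plain lists of numbers you have already collected, and "consistently" is interpreted as at least 90% of samples, which is my own threshold rather than anything from the product.

```python
# Queue sizes that indicate a full queue, per instance, as given in the post.
QUEUE_LIMITS = {"Portal_Content": 500, "ProfileImport": 50}


def archival_plugin_starved(instance, first_queue, second_queue, min_fraction=0.9):
    """Return True if both queue counters sit at the limit in most samples.

    first_queue / second_queue are equal-length lists of sampled values for
    "Total Docs in first queue" and "Total docs in second queue".
    min_fraction is an assumed definition of "consistently".
    """
    limit = QUEUE_LIMITS[instance]
    if not first_queue or len(first_queue) != len(second_queue):
        raise ValueError("need two non-empty sample lists of equal length")

    # Count the samples where BOTH queues are pinned at the limit.
    pinned = sum(
        1 for a, b in zip(first_queue, second_queue) if a >= limit and b >= limit
    )
    return pinned / len(first_queue) >= min_fraction


print(archival_plugin_starved("Portal_Content", [500] * 10, [500] * 10))
print(archival_plugin_starved("Portal_Content", [500, 120, 500, 80], [500, 500, 30, 10]))
```

The first call reports starvation because both queues are full in every sample; the second does not, because the queues are only full together in one sample out of four.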

2. Assuming you are not bottle-necked in the Archival Plugin, the following data is used to determine if you are in a starved state.  Crawl threads can be in one of 4 states: non-existent, idle, on the network, or in a plug-in.  You can see what state they are in via Performance Monitor.  Note that these counters change rapidly, so it is advisable to look at them over time in a chart to see trending and averages.  Also, a thread will not stay in the idle state for an extended period; if there is consistently no work for a thread to do it will be terminated.

  • The counters are in the object Office Server Search Gatherer
    • Idle Threads – These threads are not currently doing any work and will eventually be terminated.  If you consistently have more than Max Threads/host idle threads you can schedule an additional crawl.  If this number is 0 then you are starved.  Do not schedule another crawl in this time period, and analyze the durations of your crawls during this time to see if they are meeting your freshness goals.  If your goals are not being met you should reduce the number of crawls.
    • Threads Accessing Network – These threads have sent or are sending their request off to the remote data store and are either waiting for a response or consuming the response and filtering it.  You can distinguish between actually waiting on the network and filtering the document by looking at a combination of CPU usage and Network usage counters.  If this number is consistently high then you are either network bound or you are bound by a "hungry" host.  If you are not meeting your crawl freshness goals, you can either change your crawl schedules to minimize overlapping crawls or look at the remote repositories you are crawling to optimize them for more throughput.
    • Threads In Plug-ins – These threads have the filtered documents and are processing them in one of several plug-ins.  This is when the index and property store are created.  If you have a consistently high number for this counter check the Archival plugin counters mentioned above.
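The Idle Threads rules above can be turned into a small decision helper.  The sketch below is only an illustration: the function name and the three labels are invented, and it reads "consistently" strictly, meaning every sample in the window must satisfy the condition.

```python
def gatherer_state(idle_samples, max_threads_per_host):
    """Classify a window of "Idle Threads" samples.

    Returns one of three made-up labels:
      "starved"  - no idle threads in any sample; do not add crawls
      "headroom" - every sample has more idle threads than the per-host
                   limit; an additional crawl can be scheduled
      "hold"     - anything in between; leave the schedule as it is
    """
    if not idle_samples:
        raise ValueError("need at least one sample")
    if max(idle_samples) == 0:
        return "starved"
    if min(idle_samples) > max_threads_per_host:
        return "headroom"
    return "hold"


# Example with a per-host limit of 8 threads.
print(gatherer_state([0, 0, 0], 8))
print(gatherer_state([12, 10, 9], 8))
print(gatherer_state([3, 0, 9], 8))
```

Because the counters change rapidly, feed this a reasonably long window of samples rather than a single reading.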

3. Given the above information you know how many threads can be active at a given time and the maximum number of concurrent hosts that can be crawled at one time.  With this information and the performance counters above you will see starvation occur in four different ways:

a. Starved by time spent in the Archival plug-in.  The only way to fix this is to improve I/O latency on your SQL machine, notably on the spindles hosting the Query portion of the SharedServices_Search_DB database.  Stay tuned for a white paper discussing how you can separate the Query data from the Crawl data into separate file-groups within SQL, allowing you to individually tune the disks behind these two key pieces of data.

b. Starved by a "hungry" data store(s).  The crawler has a limited set of threads that it can allocate to perform a crawl, having a single "hungry" host that is being crawled does starve the gatherer slightly as threads in use for this host are not quickly made available for the next item in the queue.  However, the problem is dramatically worse with multiple "hungry" hosts.  It is recommended that you identify your “hungry” hosts (see discussion above for key type of "hungry" stores) and build out your crawl schedules such that you never have more than a single big "hungry" host being simultaneously crawled.    

c. Starved by a large number of hosts.  Again, there is a limited number of Crawl threads; this, coupled with the number of threads per host, sets a very hard limit on the number of hosts that can be concurrently crawled.  If the crawler is maxed out on the number of hosts, adding another host to crawl will not only starve this host in the crawl but will also starve all other hosts in the crawl.  This increases the overall duration of all of the concurrent crawls and reduces the likelihood that the system will be able to maintain a steady state.  The recommended solution for this is to reduce the number of concurrent crawls.

d. Starved by a crawl queue predominantly filled with items from a single host.  This is a state caused by a host that contains a lot of content which is laid out in a manner that is very wide and not very deep.  All types of data stores can exhibit this behavior, but it is easiest to describe with the file system.  If you have a directory system that has a single folder with a hundred thousand documents within it you will see this type of starvation.  Effectively the crawl queue is filled with these 100k items; the first 8 threads (number is hardware dependent) are able to do work on these items, but due to the threads/host limit and the lack of another host in the queue the remaining threads will not do any work.  Three types of stores always exhibit this: BDC, People Imports, and People Crawls, as they are all flat containers.  The recommended solution here is to consider these types of stores as “hungry” stores and follow the recommendation of limiting the number of concurrently crawled “hungry” stores to one.
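The effect described in (d) can be shown with a deliberately simplified model: each host can keep at most Max Threads/host threads busy, and the sum across hosts can never exceed the total thread count.  The Python sketch below uses made-up host names and assumes the 64 total threads and 8 threads per host that a 4 processor indexer gets at the Maximum setting; the real gatherer's scheduling is more involved than this.

```python
def busy_threads(queued_per_host, total_threads, max_threads_per_host):
    """Estimate how many crawl threads can be doing work at once.

    queued_per_host maps a host name to the number of its items waiting in
    the crawl queue.  Simplified model, not the gatherer's actual algorithm.
    """
    # Each host can occupy at most max_threads_per_host threads, and no more
    # threads than it has items queued.
    wanted = sum(min(count, max_threads_per_host) for count in queued_per_host.values())
    # The crawler cannot run more threads than it has in total.
    return min(wanted, total_threads)


TOTAL, PER_HOST = 64, 8  # assumed: 4 processors at Indexer Performance - Maximum

# One flat folder with 100k documents fills the queue: only 8 threads work.
single_host = busy_threads({"fileserver": 100_000}, TOTAL, PER_HOST)

# The same queue with two more hosts present keeps 24 threads busy.
three_hosts = busy_threads({"fileserver": 100_000, "portal": 500, "wiki": 20}, TOTAL, PER_HOST)

print(single_host, three_hosts)
```

In the single host case 56 of the 64 threads have nothing they are allowed to work on, which is why flat stores are best treated as "hungry" and crawled one at a time.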

This post took a lot of effort from other members of the team.  I would like to explicitly thank: Sid, Mircea and Joe for their help in putting this together.

Thanks and I look forward to speaking with you all in a few weeks.  The next post should cover SQL monitoring.

Dan Blood
Senior Tester
Microsoft Corp

SearchBeta Hardware Configuration
03 May 08 05:50 PM | enterprisesearch | 3 Comments   

Hello again, Dan Blood here.  As I lay out some of the lessons I have learned hosting SearchBeta, I thought it would be beneficial to let you all know what kind of hardware I am using to support this environment.  Be aware that I am not using the most optimal hardware for the task and in some instances I have too much hardware for the job.  You should not take the hardware I have listed below verbatim and implement your solution on top of it.  If I were to rebuild SearchBeta from scratch with purpose-purchased hardware I would do it differently.  I've highlighted some of the changes I would make below.  As these postings progress it is my intent that you will be able to use all of them as a starting point for hardware and monitoring decisions.

SearchBeta is a 3 box farm with one server for each of the three main roles: Indexer, SQL, and a machine with the Query and Web Front End roles combined.  The first thing I would change about this configuration is the number of boxes.  We should really have 2, if not 3, Query/Web Front End machines to allow for fail-over and high availability.  As it stands now the service is unavailable when I apply OS updates or perform other server maintenance activities.  The second thing I would change is to mirror SQL with a second machine, allowing periodic maintenance and updates without any impact to the service.

This farm is currently running MOSS bits; however, it is only using Search functionality.  There is no content on the farm, nor do I have Usage Analysis or People import features enabled in the farm.  As a result the SQL box is optimized for the MSS feature set.  Below I have called out how I would change this if I were taking full advantage of the MOSS feature set.

The machines in the farm are defined below:

Query/Web Front End

  • X64 CPU - total of 8 cores
    • Two - Intel(R) Xeon(R) CPU E5345 @2.33GHz
  • RAM
    • 8GB
  • Disk
    • OS -- Raid 1 / Two - 146GB SAS / 15k RPM
    • Index -- Raid 1+0 / Four - 146GB SAS / 15k RPM 

Indexer

  • X64 CPU - total of 4 cores
    • Two - Dual-Core AMD Opteron™ Processor 2218
  • RAM
    • 16GB
  • Disk
    • OS -- Raid 1 / Two - 146GB SAS / 10k RPM
    • Index -- Raid 1+0 / Four - 146GB SAS / 10k RPM

SQL

  • X64 CPU - total of 4 cores
    • Two - Dual-Core AMD Opteron™ Processor 2220
  • RAM
    • 16GB
  • Disk
    • OS  -- Raid 1 / Two - 146GB SAS / 15k RPM
    • Search DB Logs  -- Raid 1+0 / Four - 300GB / 15k RPM
    • *Query File Group -- Raid 1+0 / Six - 300GB / 15k RPM
    • *Crawl File Group -- Raid 1 / Two - 146GB SAS / 15k RPM
    • Other  -- Raid 1 / Two - 300GB SAS / 15k RPM
    • Temp -- Raid 1+0 / Four - 146GB SAS / 15k RPM

*SearchBeta is running with a pre-release configuration allowing two SQL file groups to be used.  One supports the Crawling tables, while the other contains the tables that are used during the end-user Query.    Do not try to do this on your own.  Wait for us to publish guidance on how to explicitly do this.

How big is the Data on SearchBeta?

  • Index: ~126GB.  Remember you need to allocate 2.8 times the size of the index of usable space to account for Master Merges and prevent out of disk space issues.  For example the drive hosting the index for both the Query and Index box on SearchBeta must have ~353GB of space allocated for the index. 
  • Crawl DB : ~117GB *(see note above about file groups)
  • Query DB: ~261GB *(see note above about file groups)
  • Search DB Logs: ~83GB
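The 2.8 times rule for index drives is simple arithmetic, shown here as a small Python sketch.  The function name is made up, and rounding up to a whole GB is my own choice so that the result errs on the side of more space.

```python
import math

INDEX_SPACE_FACTOR = 2.8  # usable space needed relative to index size, per the post


def required_index_space_gb(index_size_gb: float) -> int:
    """Usable disk space (GB, rounded up) to allocate for an index of this size."""
    return math.ceil(index_size_gb * INDEX_SPACE_FACTOR)


# SearchBeta's index is ~126GB.
print(required_index_space_gb(126))  # 353
```

This matches the ~353GB figure quoted for SearchBeta's index drives on both the Query and Index boxes.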

Content Crawled (~28 million documents  total)

  • SharePoint 2007: ~25.5 million documents.  Most of this content comes from 3 large sites containing  11, 8 and 4 million documents
  • SharePoint 2003: ~1.4 million documents
  • Exchange: ~300 thousand documents
  • HTTP: ~263 thousand documents
  • People:  ~160 thousand documents
  • File Shares:  ~91 thousand documents

Query/Web Front End

Even though I am running both the WFE and the Query role on this box, it still has excess CPU capacity.  If I were to replace this box, I would go with the exact same config, except I would only use 4 cores.  Because the sole purpose of this machine is to respond to Search queries I am able to get away with only 8GB of RAM; using the farm with full MOSS features would require the RAM to be increased.

Indexer

If I were to replace this box I would not change much.  I could live with a little less RAM (12GB) and I would like to see how it performs with 8 cores versus 4, but this is out of curiosity, not necessity.  The majority of the time this box is under 70% CPU utilization.  There are cases when the filter daemon (mssdmn.exe) consumes a lot of CPU and the box spikes at 100%, but this is rare.  Adding more CPU capacity may improve crawl speeds.

The performance of this machine does vary quite a bit based on the type of content you are indexing, specifically the file format and the filter you are using.  Deb Haldar covers a lot of details about the different Filters available on his blog (http://blogs.msdn.com/ifilter/default.aspx).  I recommend reading through this if you are installing the Filter Pack or a third party filter.  You may want to consider going with 8 cores on the Indexer if you are using some of the more expensive filters, but you will want to investigate and validate how expensive the filter is with your content before making this decision.

Finally, the content that SearchBeta crawls is primarily English.  We know that word breakers for Japanese, other non-white space breaking languages, and German consume a lot of additional CPU.  Keep this in mind and consider 8 cores if the majority of your content is in one of these locales.

SQL

This is the machine that I would change quite a bit if I were to replace it.  Regular operation reveals that our initial disk configuration could be improved.  Both the Crawl and Query file groups are overly I/O bound and we know that the bottle-neck seen on the Crawl file-group limits the I/O pressure on the Query file group.  We want to bump the spindle count up to 10 for both the Crawl and the Query File groups, but there is a concern that by unblocking the Crawl spindles we may need to increase the spindle count even more for the Query drive.

Note also that the "other" drive mentioned above contains the SharedService_DB, Config, Admin Content, and the corresponding log files for these DBs.  This is not an optimal config, and it will not perform well under a reasonable I/O load on these databases.  However, this works well on SearchBeta because it is not hosting content, Usage Analysis, People Import or other MOSS features.  If I were to host the MOSS feature set on the site I would need to run with 8 cores, 32GB of RAM, and build out an additional R1+0 drive for the SharedService_DB.

Look for another post to detail SQL optimization, planning and maintenance in the future... 

Backup

There is one final note around hardware and backup that I would like to call out.  During a backup the crawls are paused.  The backup for SearchBeta takes approximately 14 hours, which is a significant chunk of time out of your active crawl times.  So anything you can do to optimize the speed of the backup and reduce its duration will directly benefit the freshness of the data in the index.  SearchBeta is currently backed up to a remote file share, so both the databases and the Index need to be written across the network before the backup completes.  A more optimal solution is to back up to a drive that is local to the SQL box, allowing the biggest chunk of data to bypass the network.  Ultimately this will reduce the duration of the backup, providing more time to crawl your content.

Network Gigabit versus Megabit

SearchBeta is running on a Megabit network.  In general we do not see the network continuously bottle-necked.  There are a small number of peak periods where the Indexer is bounded by the network during a crawl (network card performance counter showing "Output Queue Length" greater than 2).  We might see a slight improvement in crawl performance if we upgraded to a Gigabit network.  This improvement would be very difficult to measure and would not be enough to justify re-cabling your lab to do so.  However, as mentioned above, reducing the backup duration is something you should pursue.  SearchBeta is definitely Network bound when it is backing up.

Thanks and I look forward to speaking with  you all soon.  The next posting is targeted at building out Crawl schedules and maximizing the crawling that you are doing.  If you have a specific topic that you would like to see more information about please post comments to the thread and we will look at getting it into a future article.   

Dan Blood
Senior Tester
Microsoft Corp
