Partner Post: One Stop Search from the Microsoft Office Research Task Pane
26 September 08 01:04 PM | enterprisesearch | 0 Comments   

Since the release of Microsoft Office 2003, Microsoft desktop applications such as MS Word, PowerPoint, Excel, Outlook and Internet Explorer have contained an internal federated or meta-search capability known as the ‘Research Pane’. To see this in action in Office 2003 (see link for instructions for Office 2007), select (i.e. highlight) a word or phrase within MS Word or MS Outlook, then on a PC right-click the highlighted word and choose the “Look Up” option. Another way to do this is to hold down the ‘Alt’ key while left-clicking on a highlighted word (on Macs use a command-click). The Research Pane should then open in the application window and execute a search on the highlighted text. Out of the box, MS Office ships with several research sources such as the Microsoft Encarta Dictionary, Microsoft Live Search, MSN Money and some third-party offerings from Factiva and Thomson Gale, among others. Here is a screenshot of content returned from three enterprise search engines as well as from some public biomedical websites.

[Screenshot: Research Pane search results]

The list of sources that can be searched from the Research Pane is expandable by adding connections to Research Pane service providers. Armed with a URL to a Research Pane “registration service”, a user can install the source into their MS applications using the “Research options…” link. This potentially gives users access to a large set of data sources to choose from. Once a source is installed, the user can select the source from a dropdown list (which causes the search to be executed) or can select a set of sources based on certain pre-defined categories.

Raritan Technologies specializes in federated search solutions and has created an array of search connectors to web sites, web services, search engines, databases and directory services (to name a few) using our Search Integration Framework Toolkit (SIFT) and Federation Manager. We and our partner in this effort, New Idea Engineering, have also provided a number of ways to deploy these federated search connectors to web applications and within web services such as SOAP and OpenSearch. We have recently added to this list by providing an MS Research Pane service ‘front-end’ to our federated connectors. This enables the following sources to be ‘plugged in’ to any MS Office application: search engines such as Autonomy IDOL, K2 or Ultraseek, Dieselpoint, Endeca, Exalead, Fast, Lucene, Mark Logic (and others) as well as SharePoint (out of the box); SQL databases; LDAP directories; SOAP and OpenSearch web services; Z39.50 sources; Internet web sites that have search boxes (a very large list that includes general web search engines and specialized sites such as news or research sites); Content Management Systems such as Alfresco, Documentum and eRoom; and archival systems like Symantec Enterprise Vault. The modular design of the Raritan Search Integration Framework enables other connectors to be added to this list, and as this happens, these new sources will automatically be available to users of the Research Pane once configured as a service.

The ability to combine internal content sources from content management systems, enterprise search engines, databases and directory services with external content from subscription or public web sites and web services into MS Office applications provides a huge potential for search integration at the “tip of the sword” where thought and knowledge are combined to create new content.

For more information on the Raritan Technologies “Research Pane Integration” or to arrange for a trial connector please visit http://www.raritantechnologies.com/ResearchPane.shtml.

Barry Freindlich
President
Raritan Technologies, Inc.

How to: Customize the Thesaurus in SharePoint Search and Search Server
23 September 08 02:49 PM | enterprisesearch | 2 Comments   

The thesaurus is an XML file that lets queries be automatically expanded or rewritten to include synonyms, acronyms, etc. For example, in a chemical company, product ID 1234, oxygen, O2 and LOX could all refer to the same item.

A SharePoint Search administrator can modify the thesaurus file to substitute all these words at search query time. This document explains how to set up a thesaurus and where to find the relevant files.

Supported Thesaurus Syntax:
To use the sample files provided by the product, you need to remove the comment beginning (<!--) and ending (-->) lines from the XML file.

Explanation of terms:

Term                  Meaning
thesaurus             Marks the beginning (and end) of the thesaurus.
diacritics_sensitive  Diacritics are marks, such as accents, that are added to letters and change their pronunciation. For example, an acute accent over an e gives you é. 0 = ignore diacritics; 1 = respect diacritics.
expansion             A list of alternative forms, each marked by a <sub> element.
sub (in expansion)    One of several alternatives in an expansion.
replacement           One or more patterns that will be replaced with a substitution.
pat                   A pattern to be replaced.
sub (in replacement)  The item substituted for the patterns.

Example:

<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
  </thesaurus>
</XML>

The example means:

  • We have elected to ignore accents, etc. in the thesaurus.
  • A query containing any one of the <sub> terms of the expansion (for example, “IE”) is expanded to also match the others (“Internet Explorer” and “IE5”).
  • If a query contains the terms “NT5” or “W2K”, they will be replaced by “Windows 2000”.
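
To make the two rule types concrete, here is a minimal Python sketch that reads the example thesaurus and shows what a single query term would turn into. It is only an illustration of the semantics described above, not how the search service is implemented; the function names (load_rules, rewrite_term) are hypothetical, and case-insensitive matching is an assumption.

```python
import xml.etree.ElementTree as ET

# The example thesaurus from this post, as a string.
THESAURUS_XML = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
  </thesaurus>
</XML>"""

# Elements inside <thesaurus> live in the default namespace declared on it.
NS = "{x-schema:tsSchema.xml}"


def load_rules(xml_text):
    """Return (expansions, replacements) parsed from a thesaurus document."""
    thesaurus = ET.fromstring(xml_text).find(NS + "thesaurus")
    expansions = [
        [sub.text for sub in exp.findall(NS + "sub")]
        for exp in thesaurus.findall(NS + "expansion")
    ]
    replacements = {}
    for rep in thesaurus.findall(NS + "replacement"):
        target = rep.find(NS + "sub").text
        for pat in rep.findall(NS + "pat"):
            replacements[pat.text.lower()] = target
    return expansions, replacements


def rewrite_term(term, expansions, replacements):
    """Return the list of terms a single query term stands for."""
    key = term.lower()  # assumption: matching ignores case
    if key in replacements:
        return [replacements[key]]  # replacement: the pattern itself is dropped
    for group in expansions:
        if key in (alt.lower() for alt in group):
            return list(group)  # expansion: every alternative is searched
    return [term]


expansions, replacements = load_rules(THESAURUS_XML)
for term in ("ie", "W2K", "browser"):
    print(term, "->", rewrite_term(term, expansions, replacements))
```

Note the asymmetry: an expansion keeps the original term alongside its alternatives, while a replacement discards the pattern and searches only for the substitute.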

How to Customize the Thesaurus:

  1. Find the appropriate thesaurus file in the config folder under the path given by the registry value [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager]"DefaultApplicationsPath"
  2. Update the thesaurus file(s) for each appropriate language with each desired <expansion> or <replacement>.
  3. Replace the file(s) on each index, query and web frontend server for each search application path:
    %programfiles%\Microsoft Office Servers\12.0\Data\Office Server\Applications\[GUID]\Config 
    Note: index propagation does not sync these files across the servers in the farm.
  4. Stop and restart the search service (this is needed to load the new thesaurus files). For example, in a console window, run “net stop osearch & net start osearch” without quotes, or launch Programs\Administrative Tools\Services, right-click Office SharePoint Search Service, then choose Restart.

Notes:

See “Finding Important Files” below for a summary of where to find the key files to manage your thesaurus.

  1. (optional) If you want to have the same thesaurus files apply to all newly created SSPs, put your thesaurus files under the main config folder
    (e.g., %programfiles%\Microsoft Office Servers\12.0\Data\config).
  2. If there is a syntax error in the thesaurus file, all expansions and replacements will be ignored.
  3. If a word in the thesaurus file matches a stop word in the stop word file, it will be ignored.   To avoid this, remove it from the appropriate stop word file.
  4. Thesaurus terms are broken into words at query time.  Add words you do not want to be broken into the custom dictionary file customLANG.lex (see Finding Important Files for more details).
  5. Search first applies the thesaurus, and then expands words into their alternate forms, when “stemming” functionality is turned on.   Care should be taken to avoid expanding into too many unnecessary forms as this may harm search performance and accuracy.
  6. The “All words” option on the Advanced Search page might no longer work when using multiple term substitution with the thesaurus. This is because an implicit “+” is used between every term.  For example, if we used our example thesaurus above and typed “browser ie” in the “All words” field, it would look for “+browser +ie”, and would no longer match “Internet Explorer”.
  7. There is a 10,000 term limit per language in the thesaurus.
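
Because a single syntax error disables every expansion and replacement, it can be worth checking a thesaurus file before copying it to the servers. The sketch below is one possible pre-flight check in Python, under two assumptions: that "syntax error" is approximated by XML well-formedness, and that each <sub> or <pat> entry counts as one term toward the limit. The function name check_thesaurus is hypothetical.

```python
import xml.etree.ElementTree as ET

TERM_LIMIT = 10_000  # per-language limit quoted in the notes above


def check_thesaurus(xml_text):
    """Return (ok, message) for a thesaurus document given as a string."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, f"not well-formed XML: {exc}"
    # Count every <sub> and <pat> element, ignoring any namespace prefix.
    # Assumption: each such entry counts as one term toward the limit.
    terms = [
        el for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] in ("sub", "pat")
    ]
    if len(terms) > TERM_LIMIT:
        return False, f"{len(terms)} terms exceeds the limit of {TERM_LIMIT}"
    return True, f"{len(terms)} terms"


sample = "<XML><thesaurus><expansion><sub>a</sub><sub>b</sub></expansion></thesaurus></XML>"
print(check_thesaurus(sample))
print(check_thesaurus("<XML><thesaurus>"))  # unclosed tags are rejected
```

A result of zero terms on a file you have edited is a hint that the comment markers from the sample file are still in place, since commented-out rules are not counted.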

Finding Important Files:

The following are the most important files used to manage your thesaurus.

There are 50 default stop word files and 48 thesaurus sample files for the languages we support.

The search service install path can be located by examining the registry value [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager]"DefaultApplicationsPath"

The default location of the thesaurus files (for each index, query and web frontend server) is:
%programfiles%\Microsoft Office Servers\12.0\Data\Office Server 
When a search application is created, a copy of the thesaurus file will also be placed under: %programfiles%\Microsoft Office Servers\12.0\Data\Office Server\Applications\[GUID]\Config

Stop word files for each language can be found as noiseLANG.txt, where LANG is the 3 letter acronym for that language. For example, US English is noiseENU.txt, and the language neutral list is noiseNEU.txt.

To find the appropriate acronym for your language(s), please look them up under: http://www.microsoft.com/globaldev/nlsweb/default.mspx.

Ping Lin
Senior Test Lead
Microsoft Corp.
Victor Poznanski
Senior Program Manager
Microsoft Corp.
SharePoint Image Search
19 September 08 03:03 PM | enterprisesearch | 2 Comments   

Matthew McDermott, a SharePoint MVP, has written a great 4 part blog post on how to make SharePoint 2007 search (and Search Server) render image results in a way that looks very similar to http://images.live.com.

Not only does this make searching images much easier, it’s also a very thorough step-by-step tutorial on how to customize results using the built in Web Parts and XSL – it’s well worth a read.

SharePoint Image Search (Part 1)

SharePoint Image Search (Part 2)

SharePoint Image Search (Part 3)

SharePoint Image Search (Part 4)

The end result makes SharePoint Image results look like the screencap below.

[Screenshot: SharePoint image search results]

Richard Riley
Senior Technical Product Manager
Microsoft Corp.

SQL File groups and Search
16 September 08 03:17 PM | enterprisesearch | 1 Comments   

This article has been a long time coming, but it is finally here.  In the post below I will cover how to configure the Search database to span multiple filegroups.  First I'll cover a little about the benefits of doing so:

General references on what SQL file groups are:

The method that we have chosen to implement filegroups on the Search database is one of segregation.  We have identified all of the tables and indexes within the database that are solely used for crawling and not used at all to satisfy end-user queries.  The remaining tables and indexes are used for end-user queries.  However, the nature of the Search and indexing problem still dictates that the "query" tables are written to during a crawl.  The crawl only tables and indexes are isolated into their own filegroup.  With the crawl and query centric filegroups identified you can now ensure that the IO intensive process of crawling has a reduced impact on the IO subsystem that is hosting the query filegroup by ensuring that these filegroups are on separate spindles.

The whole goal of using filegroups is to improve the performance of the system.  This is done by providing an additional file.  This file must be placed on a different set of spindles to see any kind of performance enhancement.  If your SQL machine is not IO bound for the Search database then implementing filegroups will not provide you with any benefits. 

To make the migration process easier we did not actually create a query filegroup.  We simply created a new filegroup called "CrawlFileGroup" and moved the crawl tables out of the PRIMARY filegroup, so that PRIMARY effectively becomes the query filegroup.  This migration process can be quite expensive and could take hours to finish.  Keep this in mind when scheduling it on your production servers.  Because the move involves dropping and recreating numerous clustered indexes, you should assume that the DB is offline during the move, as many long-running locks will be taken to recreate the indexes.  

Issues and concerns with using filegroups:

Back-up and Restore

One concern that you will need to be aware of in your planning for deploying filegroups on the Search database is that your restore process will be slightly impacted.  Out of the box, Search restore is unaware of the filegroup that will exist within the backup image.  Because of this there is no way to indicate where this file should be restored to.  As a result the restore process is going to try to place the crawl filegroup file onto the same drive that it existed on when you ran the back-up.  Once you enable filegroups you will be committed to making sure that all future machines that you restore your back-up to have a drive with the same drive letter that you initially created the filegroup on.   

Future upgrades, Service packs and Hot fixes

Each hotfix, service pack and update that you apply to the server has the potential to modify an index that was moved into the CrawlFileGroup or add a new index to one of the tables moved to the filegroup.  When/if this happens the index will be moved back to, or created in, the PRIMARY filegroup.  Updates will also clean out any non-product sproc.  Because of this, you will need to reinstall the stored proc and run the scripts again after each update you apply.

The risk of a new index being added or modified is quite low at this time.  We have confirmed that this does not occur when upgrading from RTM to SP1, but it does happen when upgrading from SP1 to the Infrastructure Update.  Future updates are less likely to modify the set of indexes.

However, the risk still exists and you will want to re-run the scripts below after each update that you apply to your system.  In the case when you apply an update and the index did not change running the script is a no-op and nothing gets moved.  So it is very cheap to run the script on a system that already has the indexes moved. 

SQL 2005 and greater

The script that is moving the indexes is utilizing new features that were released in SQL 2005.  As such you cannot perform this optimization with SQL 2000. 

Step-by-step instructions for applying filegroups to your environment

To deploy this you will need to manually create a file group on the Search database.  To do this execute the following steps:

a. Go to the Filegroups section of the Search database properties within SQL Server Management Studio.

b. From the Filegroups section click Add and fill in the name "CrawlFileGroup". The scripts assume the filegroup has this name; using a different name will cause the scripts to fail early.

[Screenshot]

c. Once you have a new filegroup with the name CrawlFileGroup you need to add a file to this group.  To do this select the Files section of the database properties dialog and add a new file to the CrawlFileGroup.  Be sure that you place this file on a separate drive with isolated spindles.

[Screenshot]

d. Next you need to install the stored proc that will move the indexes and tables to the new filegroup.  Open the script named MoveTableToFileGroup.sql within Management Studio and execute it, ensuring that you are working with the Search database.  This will create a stored proc named proc_MoveTableToFileGroup.  Confirm that this sproc does indeed exist within the Search database.

e. Open and execute the second script, named MoveCrawlTablesToFileGroup.sql. This is the script that does all of the work by calling proc_MoveTableToFileGroup for each table that is dedicated to crawling.

That is all there is to it.  You have now moved your crawl tables onto a separate set of spindles. 

Thank you for your time, and as always I welcome any feedback or questions.

Dan Blood
Senior Test  Engineer
Microsoft Corp

Partner Post: Announcing conceptClassifier for SharePoint – Automatic Classification within Office
02 September 08 05:10 PM | enterprisesearch | 0 Comments   

Enterprise customers are increasingly struggling with how to apply policy and governance at the desktop. End user adoption is cited as the single most critical barrier to success in ECM and Records Management initiatives. Using Concept Searching’s unique compound term processing, conceptClassifier for SharePoint can now be used to automatically classify content from Microsoft Office applications, upload the documents directly to SharePoint, store the metadata in SharePoint properties and write the classifications back to the custom properties of the document for use within knowledge and workflow applications or enterprise applications such as ECM, Document Management, Records Management, or eDiscovery.

The classification can take place automatically without end user intervention. Optionally, Subject Matter Experts can be granted the authority to manually adjust the classification based on the taxonomy. A ribbon bar has been added to the familiar Office interface enabling automatic classification of content. When the end user classifies a document the system will retrieve existing concepts as an aid to the classification process as shown below. Subject Matter Experts also have the ability to add or delete classes in the taxonomy.

[Screenshot: classifying a document from within Office]

Documents are uploaded to SharePoint and the classification metadata is stored in the properties fields. The classification status automatically reflects the manual classification so as to not overwrite the classification classes the Subject Matter Expert entered. The systems administrator features currently enabled include the ability to edit the classifications, classify the document, a batch of documents or the full library. This metadata can now be used by Microsoft Enterprise Search to improve identification of relevant documents when searching.

[Screenshot]

For more information visit www.conceptsearching.com or click here to view a webcast demo of the integrated technology.

Martin Garland
President
Concept Searching, Inc

SQL Index defrag and maintenance tasks for Search
02 September 08 04:47 PM | enterprisesearch | 0 Comments   

Hi all, this topic is an area that has caused me much pain and work.  My goal for this was to follow the recommended SQL guidelines while minimizing the impact that these maintenance jobs have on crawling and queries.  We know from the SQL Monitoring and I/O post that Search is extremely I/O intensive.  As it turns out, so is all of the regular maintenance that SQL recommends, so finding the right balance between the two is an interesting scheduling task.

As a starting point much information about SQL maintenance and MOSS is covered in the following paper:

There are some key areas from the above paper that I would like to augment here.

  1. The stored procedure (proc_DefragIndexes) identified in this paper will work, but it is extremely expensive to run on the Search DB as it defrags all of the indexes in the table.
  2. Maintenance plans generated with the Maintenance Plan Wizard in SQL Server 2005 can cause unexpected results (KB 932744.)  While this was fixed in SQL 2005 SP2 these maintenance plans also do more work than is necessary to have a healthy functional system.   
  3. Shrinking  the Search DB  should not be a necessary task that you need to perform.  The process of Shrinking the database does not provide a performance benefit.  SQL best practices for DBCC SHRINKFILE suggest that this operation is most effective after an operation that creates lots of unused space.  Search does not regularly perform these types of operations.  The only time that a SHRINKFILE may make sense is after you have cleaned out your index by removing a Content Source.     
  4. Rebuilding an index can cause latency issues with SQL Mirroring if the SQL I/O subsystem is constrained.  If you are using SQL Mirroring, be sure you are following the SQL best practices and the SharePoint mirroring white paper.  Because Search, SQL Mirroring, and defrag are all very I/O intensive you will want to be extra cautious with your deployment plan for this defrag script and make sure you test the script prior to going into production.

DBCC CHECKDB

DBCC CHECKDB is a command used to check the logical and physical integrity of all the objects in a database.  SQL best practices recommend that you run DBCC CHECKDB periodically.  For a Search deployment we would recommend that you run DBCC CHECKDB WITH PHYSICAL_ONLY on a regular basis.  The PHYSICAL_ONLY option will reduce the overhead of the command.  However, due to the cost of running this you should schedule it during off-peak times.  The frequency of execution depends on your business needs, but a good place to start is once a week just prior to your back-up.  You still need to run a full DBCC CHECKDB (without PHYSICAL_ONLY), but less frequently, again based on business needs: perhaps before every second or third back-up.  

When running these commands make sure that you have a monitoring process in-place.  DBCC only reports errors, it does not fix them unless explicitly specified by other options.  You either want to archive the output of the DBCC command for post processing or make sure you have event log monitoring set-up (for example MOM) to check for DBCC errors.

In very large environments you can run DBCC on an off-line (sandbox) copy of the database.  This will be less intrusive to end-users and the crawl.  In this scenario you would restore your back-up to a separate sandbox and run DBCC CHECKDB in the restored  environment.        

Fragmentation and index statistics freshness

We started with the proc_DefragIndexes script mentioned above.  After running it, it became obvious that the script was just too expensive to run on a regular basis.  To reduce the load placed on the I/O system we took a look at all of our indexes in the Search DB and defragged them one by one, measuring query performance along the way.  Doing this we were able to identify the indexes that provided a performance benefit to the system when they were defragmented.  These indexes are listed below:

  • IX_MSSDocProps
  • IX_MSSDocSdids
  • IX_AlertDocHistory
  • IX_MSSDEFINITIONS_DOCID
  • IX_MSSDEFINITIONS_TERM
  • PK_Sdid
  • IX_SDHash
  • IX_DOCID

Optionally there are two additional indexes that you may want to include in your defrag maintenance plan.  These indexes do not see much use in typical out of box situations and are commented out in the script.  But if your environment is built on a custom UI or makes extensive use of the Advanced Search UI you will see improvements in query latencies if you defrag them.

  • IX_int -- defrag this index if you have a lot of queries that use numeric properties in the property store.  The classic case is date range queries.
  • IX_Str -- defrag this index if you have a lot of queries that use string properties in the property store.  There is not a common case for this, but if you have made changes to your managed properties and are driving your search UI off of exact matches for a string-based property you will want to regularly defrag this index.

Once we knew which indexes to defrag we looked at how long it took for each index to reach a 10% fragmentation rate.  From this we adjusted the FILLFACTOR so we could maintain a longer period of time between actually needing a defrag.  At this point we are seeing somewhere around 2+ weeks between defrags.  Do note that leaving more free space on each index page (a lower FILLFACTOR value) did grow the size of the database slightly; the growth on SearchBeta was not that large.

We then looked at the cost/benefit of doing a Reorganize versus a Rebuild.  This was an interesting discovery for us.  Initially we had a script in place similar to proc_DefragIndexes that would choose to Reorganize or Rebuild based on percent fragmentation, with 30% being the decision point (i.e. greater than 30% would do a Rebuild).  What we found was that a Reorganize was taking over 8 hours with a 10% fragmentation rate, and during this time end-user queries suffered dramatically.  Out of curiosity and desperation we tried a Rebuild, which is supposed to be the more expensive of the two operations.  The Rebuild completes in approximately 1 hour while the Reorganize takes as long as 8 hours.  The Rebuild is more expensive in the sense that you will see some failed queries during the hour that it runs, whereas the Reorganize doesn't have as drastic an effect on the queries, but its overall cost is much higher since you have an 8 hour window where query performance is degraded.

UPDATE STATISTICS:  In the experiments we ran we found that, because the rebuild also updates statistics, it was not necessary to run this command regularly.

Finally we deployed the script into an environment that utilized SQL Mirroring.  Unfortunately this didn't work out very well.  The mirror got so far behind that we eventually had to disconnect the mirror and stop the defrag.  Going through an analysis of this it became clear that the root cause was that the environment was heavily I/O bound and the defrag script generated more I/O than the system could keep up with.   While the mirror was behind end-user query latencies suffered dramatically.  To recover from this we ultimately had to improve the hardware by increasing the number of spindles. 

To mitigate this we have added a parameter to the script that allows you to reduce the MAXDOP used in the index rebuild.  Setting this parameter to 1 on a SQL box that is minimally I/O bound helps, but it may not be enough depending on how constrained the system is.  If you are in an environment  that is I/O bound (with or without SQL Mirroring) we strongly recommend that you go through a test of the defrag before you go live with the deployment.  The easiest thing to try is the following SQL statement:

ALTER INDEX IX_MSSDocProps ON [dbo].[MSSDocProps]
REBUILD WITH (MAXDOP = 1, FILLFACTOR = 80, ONLINE = OFF);

The statement above rebuilds the largest index using the lowest possible MAXDOP. This index must be rebuilt offline, so you will need to run this on a test system or during a maintenance window.   While this command is running keep an eye on the following:

  • The duration of the command.  Will it complete within your service window?  For comparison purposes, this command completes in under an hour on the SearchBeta hardware.
  • SQL I/O latencies
  • If you have mirroring in place
    • The Database Mirroring Monitor
    • Send and Redo Queues  within perfmon.  The monitor above will tell you if mirroring is too far out of sync, but these counters are useful for comparison if you start changing the MAXDOP parameter.

Bottom line we feel the rebuild is a much better operation to run and recommend that you:

  1. Run the script on a regular basis; once a night or on the weekends depending on your service windows.
    • Weekends or weekly - reduce the fragmentation rate (sproc parameter) to 5.0 or lower to prevent missing the defrag due to a fraction of a percent (for example, an index at 9.5%)
    • Nightly - use the defaults for fragmentation rate. The largest index (MSSDocProps) gets rebuilt approximately every 2 weeks on SearchBeta. Running the script nightly will ensure that your indexes are up to date more often, but gives you less control over the exact time that the index rebuild occurs.
  2. Before running the script the first time test out how your system will behave when rebuilding MSSDocProps.
  3. Reduce MAXDOP - If your environment shows poor I/O response time or unacceptable durations (cannot complete a defrag inside your service window) reducing the MAXDOP value may reduce the duration of the script and put less pressure on the I/O system.  Reducing the MAXDOP will not help enough if the system is very I/O bound. 
  4. SQL Mirroring - SQL mirroring is sensitive to I/O latencies; adding the defrag may be too much I/O for the system to handle.
  5. Poor I/O latency - You should focus on improving the I/O subsystem of your SQL environment before you begin running this script.    

Stored Procedure syntax:

exec proc_DefragSearchIndexes [MAXDOP value], [fragmentation percent]
  • MAXDOP value - Integer value. Default is 0, which means that all available CPUs will be used.
  • Fragmentation percent - Decimal value. Default is 10.0.  This value was explicitly chosen because we were able to measure query latency improvements on SearchBeta when defragging at the 10% boundary.  
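
To see why a weekly schedule calls for a lower fragmentation percent, here is a small Python illustration. It is not the logic of the attached script: the function name, the sample fragmentation figures and the "greater than or equal" comparison are all assumptions made for the example.

```python
def indexes_to_rebuild(fragmentation_by_index, threshold=10.0):
    """Return the index names whose fragmentation meets the threshold.

    Assumption: an index qualifies when its fragmentation percent is
    greater than or equal to the threshold.
    """
    return sorted(
        name
        for name, percent in fragmentation_by_index.items()
        if percent >= threshold
    )


# Hypothetical measurements taken just before a weekend maintenance run.
measured = {"IX_MSSDocProps": 9.5, "IX_DOCID": 12.0, "IX_SDHash": 3.1}

print(indexes_to_rebuild(measured))                 # default 10.0
print(indexes_to_rebuild(measured, threshold=5.0))  # weekly setting
```

With the default of 10.0 the index at 9.5% is skipped and has to wait a full week for the next run; with 5.0 it is picked up in the same window.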

-Thanks

Dan Blood
Senior Test  Engineer
Microsoft Corp

Attachment(s): DefragSearchIndexes.sql
How to: Mine the ULS logs for query latency
02 September 08 04:38 PM | enterprisesearch | 0 Comments   

Tracking query latencies can be made easier through the use of the product's ULS logs.   Below you will find information on how to enable the specific ULS traces as well as information on how to parse the logs.  The primary usage of this information is to monitor the ongoing health of your system.  It is one tool in the toolbox to make sure that the system is running in a viable state.  It is also necessary when you are making small changes to your environment so you can measure the benefits or detriments of the changes made.  Another key usage of the query latency ULS logs is the ability to see where the larger portions of time are being spent in the query.  For example, you can see the time spent in SQL improve after doing index defrags.      

ULS logging

Making changes to ULS log settings can impact performance and cause more disk space to be consumed.  However, the category and level changes mentioned below are what SearchBeta is running with, and the cost of this is negligible given the benefit it provides.  Just make sure your log files are not on a drive that is tight on disk space.

You will need to change the following ULS settings to get the events that we need traced. 

From "Central Admin.Operations.Diagnostic Logging" set the following;  
Category: "MS Search Query Processor"
Least critical event to report to the trace log: "High"

LogParser

There are a number of interesting traces that you get with the above setting.  To really look at this data you will need to use some kind of log parsing utility to strip out the interesting traces and perform some additional post processing.  I recommend that you use logparser.exe to do this parsing.  Below I give examples of Log Parser queries to get at the data. Additionally you should provide the following input parameters to logparser.exe since the ULS log files are Unicode, tab separated text files. 

  • -i:TSV -iCodepage:-1 -fixedSep:ON

Traces

With the above ULS trace settings you will get the following messages in the log (location of these log files can be found in the above UI for changing the logging level):

  • Completed query execution with timings: v1 v2 v3 v4 v5 v6
    • The six numbers v1, v2, v3, v4, v5 and v6 are time measurements in milliseconds
      • v6 = Cumulative time spent in various calls to SQL, except the property fetching
      • v5 = Time spent waiting for the full-text query results from the query server (TimeSpentInIndex)
      • v4 = Latency of the query measured after the joining of index results with the SQL part of the query. This includes v5 and the time spent in SQL for resolving the SQL part of advanced queries (e.g. queries sorted by date or queries including property based restrictions like AND size > 1000).
      • v4 - v5 = Join time
      • v3 = Latency of the query measured after security trimming. It includes v4 plus retrieval of descriptors from SQL and the access check.
      • v3 - v4 = Security trimming time
      • v2 = Latency of the query measured after the duplicate detection.
      • v2 - v3 = Duplicate detection time
      • v1 = Total time spent in QP. (TotalQPTime)
      • v1 - v2 = Time spent retrieving properties and hit highlighting. (FetchTime)
  • Join retry v1 v2 v3
    • Retry caused because there were not enough results from SQL that matched the results returned from the full-text index.
  • Security trimming retry v1 v2 v3
    • Caused by the user executing a query that returns a number of results that they do not have permission to read.  The query is retried until enough results are available to display the first page of results.
  • Near duplicate removal retry v1 v2 v3
    • There were so many virtually identical documents trimmed out that the query processor did not have an adequate number of documents to display.
    • The 3 numbers v1, v2 and v3 are counts of documents.  If you see one of these messages in the log it means that the query processor was unable to satisfy the requested number of results on the first attempt and had to execute the SQL portion of the query a second (or later) time with a larger number of requested results.  The numbers themselves are not especially useful; most of the analysis you will do is around the existence of this trace.  The existence of the trace and the relative frequency of each type of retry allow you to determine why so much time is being spent in a given phase of the query.
      • v1 is the current upper bound on the number of documents to work with (this will go up on subsequent retries)
      • v2 is the number of documents before the operation that caused the retry
      • v3 is the number of documents after the operation that caused the retry.
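
As a rough illustration of the arithmetic in the timing list, the Python sketch below splits one "Completed query execution" message into its derived phase times. The function name, the dictionary keys other than TotalQPTime, TimeSpentInIndex and FetchTime, and the sample message are made up for illustration. Duplicate detection is computed as v2 - v3 and reported as 0 when v2 is 0, mirroring the Log Parser query given later in this post. Because it looks for the marker text rather than counting tokens, it should not care whether the message starts with a QueryID prefix.

```python
MARKER = "Completed query execution with timings:"

def parse_timings(message):
    """Return derived phase times in milliseconds, or None if not a timing trace."""
    _, found, tail = message.partition(MARKER)
    if not found:
        return None
    values = tail.split()[:6]
    if len(values) < 6:
        return None
    # Assumes the six values are plain integers, as described in the list above.
    v1, v2, v3, v4, v5, v6 = (int(v) for v in values)
    return {
        "TotalQPTime": v1,
        "FetchTime": v1 - v2,
        # Same guard as the CASE expression in the Log Parser query.
        "DuplicateDetectionTime": 0 if v2 == 0 else v2 - v3,
        "SecurityTrimmingTime": v3 - v4,
        "JoinTime": v4 - v5,
        "TimeSpentInIndex": v5,
        "SqlTime": v6,
    }

# Hypothetical message; real values will differ.
sample = "QueryID: 42. Completed query execution with timings: 500 450 400 300 200 80"
timings = parse_timings(sample)
```

A message that does not contain the marker (for example one of the retry traces) simply yields None, so the same function can be run over every line of a log.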

Where is all of the time being spent for the queries executed in the system?

The answer to this question is primarily within the "Completed query execution…" trace.  The number of retries helps explain why the time spent in any one location is so high.  Given all of the timing information that you can get from a single query, and the fact that this data is available for each and every query executed, the problem becomes more of an exercise in figuring out how to store the data and provide a mechanism to summarize or chart it.  Without doing this there is just too much data to try to interpret.  The solution we have on SearchBeta is to collect the data on a regular basis (hourly) and import it into a SQL reporting server that is segregated from the SQL machine hosting the Search farm.

Once the data is in SQL we have created a number of Excel spreadsheets that query the data directly from SQL and chart it using Excel Pivot Tables/Charts.  We have also gone further to provide a set of dashboards within a MOSS system that use Excel Server to provide up to date reports on the health of the system that are available for anyone to look at.  
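
To make the "store it, then summarize it" idea concrete, here is a minimal Python sketch that loads a few timing rows into a database and produces an hourly summary. It uses an in-memory SQLite database purely as a stand-in for the SQL reporting server; the table layout, the timestamp format and the sample rows are assumptions for illustration, not the SearchBeta schema.

```python
import sqlite3

# Hypothetical rows: (timestamp, TotalQPTime in ms, TimeSpentInIndex in ms).
rows = [
    ("2008-09-01 10:05:00", 500, 200),
    ("2008-09-01 10:40:00", 300, 100),
    ("2008-09-01 11:15:00", 900, 600),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE QTiming (Timestamp TEXT, TotalQPTime INTEGER, TimeSpentInIndex INTEGER)"
)
conn.executemany("INSERT INTO QTiming VALUES (?, ?, ?)", rows)

# One summary row per hour: query count and average latencies.
hourly = conn.execute(
    """
    SELECT strftime('%Y-%m-%d %H:00', Timestamp) AS Hour,
           COUNT(*) AS Queries,
           AVG(TotalQPTime) AS AvgTotalQPTime,
           AVG(TimeSpentInIndex) AS AvgTimeSpentInIndex
    FROM QTiming
    GROUP BY Hour
    ORDER BY Hour
    """
).fetchall()
```

A summary table of this shape is the kind of thing a pivot table or chart can be pointed at directly.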

Once you have the basics of this system set up there are a multitude of other reports and health monitoring that are possible, from collecting performance counters to mining IIS logs.  The IIS logs provide a key piece of information about query latencies that is missing from the ULS trace: they primarily answer the question of how much additional time is spent rendering the UI.

A sample of one of the charts that we are able to produce with the ULS log data is below:

clip_image001

The Log Parser query that we use to mine the ULS logs is below.  Note that there are a number of output options for LogParser; I am using a simple CSV file below, but you can also import the data directly into SQL.

*Remember that the numbers in the log are in milliseconds.  The query as shown below leaves them in milliseconds; divide by 1000 if you want the times in seconds.

Select  Timestamp
      , TO_INT(Extract_token(Message,7, ' ')) as TotalQPTime
      , TO_INT(Extract_token(Message,8, ' ')) as v2
      , TO_INT(Extract_token(Message,9, ' ')) as v3
      , TO_INT(Extract_token(Message,10, ' ')) as v4
      , TO_INT(Extract_token(Message,11, ' ')) as TimeSpentInIndex
      , TO_INT(Extract_token(Message,12, ' ')) as v6
      , SUB(v4, TimeSpentInIndex) as JoinTime
      , SUB(v3, v4) as SecurityTrimmingTime
      , CASE v2
            WHEN 0 THEN 0 
            ELSE SUB(v2, v3) 
        End as DuplicateDetectionTime
      , SUB(TotalQPTime, v2) as FetchTime
INTO QTiming
FROM \\%wfeHost%\ULSlogs\%wfeHost%*.log
WHERE Category = 'MS Search Query Processor' 
      AND Message LIKE '%Completed query execution with timings:%' 

*FYI -- Prior to the MSS release and Infrastructure Update updating MOSS with the MSS changes, the first two "tokens" (QueryID: XXX.) at the beginning of the trace did not exist.  So you will need to subtract 2 from the second parameter of each "Extract_token" predicate in the above SQL command.

What is the percentage of retries that the system has?

To get an idea for how many "retries" are occurring you need to correlate the number of retries with the number of queries executed and calculate a % of total retry values for each type of retry.  The timing data above does include time spent in a retry.    

Log Parser queries:

  • Total number of queries executed
SELECT count (Message) 
FROM *.log 
WHERE Category = 'MS Search Query Processor' and Message 
like '%Completed query execution%'
  • Total number of retries due to Security trimming
SELECT count (Message) 
FROM *.log 
WHERE Category = 'MS Search Query Processor' and Message 
like '%Security trimming retry%'
  • Total number of retries due to Join retries
SELECT count (Message) 
FROM *.log 
WHERE Category = 'MS Search Query Processor' and Message 
like '%Join retry%'
  • Total number of retries due to Duplicate Removal
SELECT count (Message) 
FROM *.log 
WHERE Category = 'MS Search Query Processor' and Message 
like '%Near duplicate removal retry%'
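
Once you have the four counts, the percentage calculation is simple division. The Python sketch below shows it; the function name and the counts are invented for illustration. Note that a single query can retry more than once, so strictly speaking each figure is "retries per 100 queries" rather than "percentage of queries that retried".

```python
def retry_percentages(total_queries, retry_counts):
    """Return each retry count as a percentage of the total queries executed."""
    if total_queries <= 0:
        # No queries in the sample, so report zero rather than divide by zero.
        return {name: 0.0 for name in retry_counts}
    return {
        name: 100.0 * count / total_queries
        for name, count in retry_counts.items()
    }

# Hypothetical counts, as returned by the four Log Parser queries above.
counts = {
    "Security trimming retry": 120,
    "Join retry": 30,
    "Near duplicate removal retry": 6,
}
percentages = retry_percentages(1200, counts)
```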

Thank you for your time, and as always I welcome any feedback or questions.

Dan Blood
Senior Test Engineer
Microsoft Corp

Search Server 2008 Express Redistribution Rights
21 August 08 04:26 PM | enterprisesearch | 0 Comments   

If you’re interested in using Search Server 2008 Express in your application or shipping it on hardware then take a look at the redistribution license page on the Enterprise Search site.

This redistribution license agreement grants you the right to redistribute Microsoft Search Server 2008 Express with your software application or hardware.

To obtain a Search Server 2008 Express redistribution license, you must:

  • Review the Search Server 2008 Express Redistribution End-User License Agreement (EULA).
  • Print and retain a copy of the Search Server 2008 Express Redistribution EULA for your records.
  • Register for Search Server 2008 Express redistribution rights.

The license is applicable for all 37 Search Server 2008 Express languages.

 

image

Announcing Faceted Search v2.5
12 August 08 02:50 PM | enterprisesearch | 5 Comments   

Starting with Faceted Search 2.5, the solution relies on Microsoft Enterprise Library to address common software requirements in caching, logging, exception handling, policy injection, etc. More importantly, 2.5 is a ground-breaking release that sets new targets for the Faceted Search. So, what’s new?

image

New Features

1. Caching – dramatically improves performance and decreases the load on the search engine

The solution uses 2 mechanisms for a manageable cache: quick and long. I built the caching logic on the assumption that the user knows what he/she is looking for. The Search Facets web part will cache the original result set and use it for search refinement, paging and other postbacks. If the initial result set doesn’t provide full coverage of the search, a smart 2nd thread will run against real-time data, providing an adjustment to the cached match.

2. Synchronization with Core Search Results web part

The MOSS search is adjusted by several parameters that a designer can set on the Core Search Results web part itself. These include remove duplicates, enable trimming and permit noise words. When you drop the Search Facets web part onto the search results page, it will find the Core Search Results web part, read its parameters and sync the search query parameters to exactly match the ones used by the Core.

image

3. Support for advanced search

This has been the most wanted feature since Faceted Search 1.0. With 2.5, the Facets are rendered for advanced search, although they do not yet extend to ranges. The functionality is accomplished by extending the SearchQuery structure to accommodate POST requests and sync back to the GET query.

image

4. Match of search counters

This release introduces an updated search syntax that is designed to provide counters that match the core search. In fact, the new search query uses both KeywordQuery and FullTextQuery through the use of generics.

public class GenericQuery<T> : IDisposable where T : Query
{
    private EventHandler _customLogic;

    public ResultTableCollection Execute(EventArgs args)
    {
        _customLogic(_query, args);
        return _query.Execute();
    }

    ...
}

Additionally, the WHERE clause of the search query was modified to provide a closer match to the Core counter.

5. Introducing Parent-Child relationships

By design, the facets can support only 2 levels. This release extends the Facets schema to allow management of nested layers. That eases the pain of displaying complex hierarchies such as geography or an org chart. A Parent-Child relationship can be set by facet name and facet value, or just by facet name.

<Column Name="BDCCity" DisplayName="City" ParentName="BDCState" />
<Column Name="BDCState" DisplayName="State" >
  <Mappings>
    <Mapping Match="Alberta"  ParentName="BDCCountry" ParentValue="Canada"/>    
    <Mapping Match="Manitoba" ParentName="BDCCountry" ParentValue="Canada" />
    <Mapping Match="Ontario"  ParentName="BDCCountry" ParentValue="Canada"/>
    <Mapping Match="Quebec"   ParentName="BDCCountry" ParentValue="Canada"/>
  </Mappings>
</Column>

In the configuration above, the City facets will display only after the user chooses a State. The State itself will match the country of origin.
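
To show how the two kinds of Parent-Child relationship in such a configuration could be read, here is a small Python sketch. It is not the web part's implementation: the wrapping Columns root element, the function names and the shape of the "selected" dictionary are all assumptions made so the fragment can be parsed and exercised on its own.

```python
import xml.etree.ElementTree as ET

# The <Columns> root is a hypothetical wrapper so the fragment is well-formed XML.
CONFIG = """
<Columns>
  <Column Name="BDCCity" DisplayName="City" ParentName="BDCState" />
  <Column Name="BDCState" DisplayName="State">
    <Mappings>
      <Mapping Match="Ontario" ParentName="BDCCountry" ParentValue="Canada" />
      <Mapping Match="Quebec" ParentName="BDCCountry" ParentValue="Canada" />
    </Mappings>
  </Column>
</Columns>
"""

def load_columns(xml_text):
    """Index the Column elements by their Name attribute."""
    return {col.get("Name"): col for col in ET.fromstring(xml_text).iter("Column")}

def is_visible(column, selected):
    """A facet with a name-level parent shows only once that parent is selected."""
    parent = column.get("ParentName")
    return parent is None or parent in selected

def parent_of_value(column, value):
    """Return (parent facet, parent value) for a mapped facet value, else None."""
    for mapping in column.iter("Mapping"):
        if mapping.get("Match") == value:
            return mapping.get("ParentName"), mapping.get("ParentValue")
    return None

columns = load_columns(CONFIG)
city_before = is_visible(columns["BDCCity"], selected={})
city_after = is_visible(columns["BDCCity"], selected={"BDCState": "Ontario"})
ontario_parent = parent_of_value(columns["BDCState"], "Ontario")
```

Here the City column uses the name-only form (it depends on any State being chosen), while the State column uses the name-and-value form to tie individual values to a country.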

6. Extending search to logical “OR” queries

The original facets always represented “AND” queries, which means the user could only narrow the search results by adding extra criteria. In this release I prototyped a way to expand the search by adding additional matches to the criteria. This in turn resulted in a revamped Bread Crumbs UI. The out-of-the-box support for languages that is now provided is a good example of how “OR” queries empower the search.

7. Simplified web part properties

The 2.5 release is friendly to modifications of the web part properties. I have all properties classified and broken down to groups for each of the web parts.

image

8. Other

There are numerous other fixes and enhancements, including improved security validation, code refactoring, extended facet sorting, and support for quoted search and duplicates.

What’s next

It’s my privilege to say that we now have a team that helps to shape new releases and brainstorm the future of the Faceted Search. At present we are looking at AJAX and Silverlight, and hopefully you’ll start seeing more and more of the power of Facets in the near future.

Leonid Lyublinski
Senior Consultant
Microsoft Consultancy Services

Announcing: Availability of Infrastructure Updates
15 July 08 12:08 AM | enterprisesearch | 12 Comments   

As announced on the SharePoint Team blog this morning we released to web three new important updates that affect SharePoint Server 2007, Windows SharePoint Services 3.0, Project Server 2007, Search Server 2008, Search Server 2008 Express and Project Professional 2007.

The Infrastructure Update for Microsoft Office Servers (KB951297) (Download X86, Download X64) is particularly important from a Search and SharePoint Server 2007 perspective as it contains the new Enterprise Search features that were shipped in Search Server 2008 and Search Server 2008 Express and that were not already in SharePoint Server 2007; this includes Federated Search capability, a unified administration dashboard and several Search core platform performance updates.

For an overview of the new federation features please check out this short video which covers how to configure a federated location and configure one of the new federated search Web Parts to show results from that location. 

There’s also a growing number of articles on TechNet and MSDN that cover configuring and troubleshooting federation and extending federation with Federated Search Connectors.

The screen capture below shows how federated search results show up on a results page – the results on the right hand side and top left are federated results, the ones at the bottom left are from the local index.

image

The new Search Administration Dashboard consolidates all of the Search related admin activities into a single place – there’s also some new functionality in the dashboard (for example, greater granularity for content source crawl history and a convenient list that shows currently running crawls and their durations) and it makes the Search Administrator’s job much easier by keeping everything close at hand.

The UI looks like the screen capture below:

image

The update leaves the old Search admin pages intact, and the links to them stay in Central Admin (along with a new link to the new dashboard), so if you’ve made any changes to them or just prefer to use the existing admin pages you’re free to do so.

The other changes are all under the hood and improve Index and Query performance as well as fixing a few bugs.  Check out the KB articles below for more details.

Description of the Infrastructure Update for Microsoft Office Servers (KB951297)

Fixes Included in the Infrastructure Update for Microsoft Office Servers (KB953750)

Please read this post on the SharePoint Team blog and the installation instructions thoroughly before you install the Infrastructure Update for Microsoft Office Servers (KB951297) and the Infrastructure Update for Windows SharePoint Services 3.0 (KB951695) on SharePoint Server 2007 or Search Server 2008.

Install the Infrastructure Update for Microsoft Office Servers (Office SharePoint Server 2007)

Install the Infrastructure Update for Microsoft Office Servers (Search Server 2008)

Finally, if you’re wondering why the Infrastructure Update for Microsoft Office Servers (KB951297) and the Infrastructure Update for Windows SharePoint Services 3.0 (KB951695) both apply to Search Server 2008 then you’re probably not alone!

There’s a very good reason for both – Search Server 2008 and Search Server 2008 Express are built on the Windows SharePoint Services 3.0 platform (hence the need for the Infrastructure Update for Windows SharePoint Services 3.0 (KB951695)) and secondly the Search features are from SharePoint Server 2007 (hence the need for the Infrastructure Update for Microsoft Office Servers (KB951297)).  The latter update includes a few bug fixes since Search Server 2008 and Search Server 2008 Express launched.

We strongly recommend that you install the updates that apply to you as soon as your patching and maintenance schedules permit.

Richard Riley
Senior Technical Product Manager
Microsoft Corp.

Announcing: conceptClassifier for SharePoint
07 July 08 02:41 PM | enterprisesearch | 5 Comments   

conceptClassifier for SharePoint adds automatic document classification and taxonomy management to Microsoft SharePoint and works without the need to build another search index. It is installed as a set of Features that, when activated, cause new columns to be displayed in the document library listings and new menu options to appear that allow authorised users to edit the automatically generated metadata, if required.

Adding Taxonomy navigation to SharePoint

Classification results are saved directly into SharePoint Properties where Microsoft Enterprise Search can utilise the metadata for enhanced searching, such as faceted search or results filtering.

clip_image002

The accuracy of the automatic classification is driven by the underlying technology, which is based on compound term processing. This means that the classification engine performs its matching using multi-word concepts rather than simply looking for selected keywords or phrases. Taxonomy creation and maintenance is a simple process and is conducted using natural language rules, making it much simpler and quicker than alternative approaches.

More information about conceptClassifier for SharePoint can be found here:

http://www.conceptSearching.com

and a SharePoint demonstration can be seen here:

http://moss.conceptSearching.com

John Challis
CTO
Concept Searching Limited

Announcing: SharePoint Web Parts for FAST ESP
20 June 08 10:06 AM | enterprisesearch | 5 Comments   

It’s been around 45 days since the acquisition of FAST Search and Transfer closed and we’re moving quickly to provide interoperability for Microsoft customers between FAST ESP and Microsoft SharePoint Server.

The first deliverables from this work are a set of FAST ESP Search Web Parts for quickly integrating results from FAST ESP into SharePoint Server 2007 and a FAST ESP Search site template. 

Using these Web Parts and the Site Template, SharePoint administrators will be able to quickly and easily build FAST ESP-based search sites inside SharePoint 2007 by simply dropping in and configuring the appropriate components.

The Web Parts and Site Template are available as a free download (both compiled code and source code) from CodePlex at www.codeplex.com/espwebparts and are part of the Search Community Toolkit.

Some of the FAST ESP search capabilities that can be exposed within SharePoint Server 2007 using these Web Parts include:

Search Box Web Part -- Search box for query term submission and includes “did you mean” functionality for query correction

Result List Web Part -- Displays search results and supports sorting, pagination, and navigator-based filtering

Navigator Web Part -- Displays dynamic navigators that profile search results across a set of pre-defined dimensions and allow users to refine the search through navigation clicks

Breadcrumb Web Part -- Displays the search term(s) and list of navigators used to obtain the current result set

The FAST ESP Web parts are designed to be open and extensible, and we’re actively encouraging customers and partners to download them, customize them to align with their branding and extend them to fit their search and user experience requirements.

Expect the features, functionality and range of ESP Web Parts to grow through contributions from the search developer community as well as further contributions from the FAST & Microsoft Search Team!

FAST & Microsoft Search Teams.

Indexing Exchange Server 2007 Public Folders
06 June 08 05:56 PM | enterprisesearch | 8 Comments   

I've had several questions recently about how to index Exchange Server 2007 Public Folders with SharePoint Server 2007.

Unfortunately, with the RTM versions of both products it's not actually possible due to a couple of issues with both Exchange Server 2007 and SharePoint Server 2007.

The good news however, is that everything is back in working order if you install Exchange Server 2007 Service Pack 1 *and* your SharePoint Server also has Service Pack 1 installed.

The RTM versions of Search Server 2008 and Search Server 2008 Express are unaffected (As they already include the Service Pack 1 changes), so providing Exchange Server 2007 has Service Pack 1 installed they will both work.

None of this affects Exchange Server 2003, which works with the RTM versions of SharePoint Server 2007, Search Server 2008 and Search Server 2008 Express, and with SharePoint Server 2007 with SP1.

Hopefully this stops people scratching their heads...

Richard Riley
Senior Technical Product Manager
Microsoft Corp.

Introducing Protocol Handler.NET
04 June 08 11:13 PM | enterprisesearch | 2 Comments   

"Protocol Handler.Net is a set of .Net wrappers for the protocol handler interfaces that enable developers to create and deploy protocol handlers for SharePoint search and Search Server. 

Developers can index data and documents from any system they can connect to.

Much of the complexity and time around the development of protocol handlers, such as COM interoperability, are reduced and hidden in the wrappers themselves letting developers just concentrate on code to connect to a content source and pull data.

Protocol Handler.Net makes it possible to develop protocol handlers in C# or VB.Net and simplifies the handling of security, metadata, streaming content, deployment and management just to name a few things. It also comes with a help system and sample project to further help developers."

Big thank you to Chris Gomez from http://www.FastSharePoint.com for creating and sharing these tools and samples!

They are available now on CodePlex at http://www.codeplex.com/phdotnet and are part of the Search Community Toolkit 

searchcomv2small

Announcing: Release to Web of Documentum and FileNet Indexing Connectors
27 May 08 05:45 PM | enterprisesearch | 5 Comments   

Today marks the release of 2 new Microsoft Enterprise Search Indexing Connectors (formerly known as Protocol Handlers) for EMC Documentum 5.3 (Service Pack 4) and IBM FileNet P8 3.5.1 or 3.5.2.

The connectors are compatible with the 32-bit English language versions of SharePoint Server 2007 (Service Pack 1), Search Server 2008 and Search Server 2008 Express, and are available as FREE downloads from:

Enterprise Search Indexing Connector 2008 for EMC Documentum

Enterprise Search Indexing Connector 2008 for IBM FileNet

Installation and configuration documentation is included in the download and the release notes are available here;

Release Notes for Indexing Connector 2008 for EMC Documentum

Release Notes for Indexing Connector 2008 for IBM FileNet

A couple of overview videos to get you up and running with the connectors quickly are available through the links below;

Overview Video - Installing and configuring the EMC Documentum Connector

Overview Video - Installing and configuring the IBM FileNet Connector

Both of these connectors are fully supported through your existing service contract with Product Support Services or through the regular pay per incident channel.
