Open Source Faceted Search for MOSS 2007 and Microsoft Search Server 2008 - Part 1 of 2
This first of two guest blog entries, written by Neil Hodgkinson, a Microsoft Premier Field Engineer based in the UK, will cover the "why we did it" aspect of the open source faceted search solution for MOSS 2007 and MSS 2008 that has been released on CodePlex at http://www.codeplex.com/facetedsearch. The second guest blog entry, scheduled to be posted within a couple of weeks, will be written by Leonid Lyublinski, a Microsoft Consultant based in Ohio, USA, and will cover the "how we did it" aspect of the solution.
<Lawrence />
Background
Metadata is information gathered about the resources that a user is trying to locate. Classically, it can be defined as information about information, but more precisely, it's structured information about resources. For companies that have large data libraries or repositories for their corporate information, this metadata is often much more than a simple hierarchical set of subject labels. Typically, the metadata has several facets -- that is, multiple attributes assigned to the resource being indexed.
Examples of faceted metadata include:
- Music catalog: songs have attributes such as artist, title, length, genre, date.
- Company white pages: directory of people with names, department, role, region.
- Recipes: cuisine, main ingredients, cooking style, holiday.
- Travel site: articles have authors, dates, places, prices.
- Regulatory documents: product and part codes, machine types, expiration dates.
- Image collection: artist, date, style, type of image, major colors, theme.
In all of these cases, there is no single way to provide navigation for everyone because users have disparate needs. One person might want to look through all the albums created by one band; others might be more interested in particular musical genres or instruments.
With traditional parametric searching techniques, users are expected to provide from one to several parameters in order to describe the object being searched for. The drawback with this approach is that by requiring the user to choose parameters, valid results may be excluded because the search criteria have been too confining.
An alternative to parametric searching is full text searching. While valid in its own right, this approach brings a certain loss of refinement. To a full text search engine, the fact that a recipe contains a particular ingredient is irrelevant, because the context in which the ingredient is used has not been preserved.
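The two drawbacks above can be illustrated with a minimal sketch. The recipe records, field names, and both helper functions are invented for illustration and are not part of the Faceted Search project.

```python
# Hypothetical sample data; names and fields are assumptions for illustration.
RECIPES = [
    {"name": "Roast chicken", "cuisine": "French", "main_ingredient": "chicken",
     "text": "Roast the chicken with thyme and butter."},
    {"name": "Mushroom risotto", "cuisine": "Italian", "main_ingredient": "mushroom",
     "text": "Simmer the rice in chicken stock, then add mushrooms."},
]

def parametric_search(records, **params):
    """Return records that match every supplied parameter exactly."""
    return [r for r in records if all(r.get(k) == v for k, v in params.items())]

def full_text_search(records, term):
    """Return records whose free text mentions the term, ignoring context."""
    return [r for r in records if term.lower() in r["text"].lower()]

# Too confining: guessing the wrong cuisine hides a valid chicken recipe.
too_narrow = parametric_search(RECIPES, cuisine="Italian", main_ingredient="chicken")

# Too loose: the risotto matches only because it mentions chicken stock.
too_broad = full_text_search(RECIPES, "chicken")
```

In this sketch the parametric query returns nothing even though a chicken recipe exists, while the full text query returns both recipes, including one where chicken is not the main ingredient.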
Faceted Metadata Search Solution
A good solution to these problems involves exposing the facets in dynamic taxonomies so that the user can see all of the refinement options at any time. The user can easily switch between searching and metadata browsing, using familiar terminology while learning the organization and vocabulary of the data.
Key features for metadata search include:
- Displaying aspects of the current results set in multiple categorization schemes.
- Showing only categories that have a result set, no dead-ends (links leading to empty lists).
- Displaying a count of the contents of each category, so the user knows what size of result set to expect if they choose that facet.
- Generating groupings on the fly, such as size, price or date.
- Drilling down by facet, so a record enthusiast could choose genre, artist, title, year.
- Adding special facets within categories -- e.g. a Yellow Pages site would want to show cuisine and location for restaurant listings but not plumbers.
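Several of the features listed above (counts per category, no dead ends, groupings generated on the fly, and drill-down) can be sketched in a few lines. This is a minimal illustration under assumed data, not the CodePlex implementation; the records, the `FACETS` table, and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical sample records; titles, artists, and fields are made up.
SONGS = [
    {"title": "Song A", "artist": "Band X", "genre": "Jazz", "year": 1957},
    {"title": "Song B", "artist": "Band Y", "genre": "Jazz", "year": 1959},
    {"title": "Song C", "artist": "Band Z", "genre": "Rock", "year": 1968},
    {"title": "Song D", "artist": "Band Z", "genre": "Rock", "year": 1970},
]

def decade(record):
    """Grouping generated on the fly: derive a decade label from the year."""
    return f"{record['year'] // 10 * 10}s"

# Each facet maps a name to a function that extracts its value from a record.
FACETS = {
    "artist": lambda r: r["artist"],
    "genre": lambda r: r["genre"],
    "decade": decade,
}

def facet_counts(results):
    """Count values per facet over the current result set only.

    Values with no hits never appear, so no link leads to an empty list.
    """
    return {name: Counter(get(r) for r in results) for name, get in FACETS.items()}

def drill_down(results, facet, value):
    """Narrow the result set to records whose facet equals the chosen value."""
    get = FACETS[facet]
    return [r for r in results if get(r) == value]

rock = drill_down(SONGS, "genre", "Rock")
rock_counts = facet_counts(rock)
```

After drilling down to Rock, recomputing the counts over the narrowed set means Jazz and the 1950s no longer appear as options, which is how dead ends are avoided in this sketch.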
Implementing Faceted Search in MOSS 2007 and MSS 2008
The solution started in June 2007 as a field research project for one of Microsoft's customers. Leonid Lyublinski, a Microsoft Consultant, delivered the architectural design and development of a Faceted Search solution as an add-on to MOSS 2007 and MSS 2008. The initial version was released with an open source license at http://www.codeplex.com/facetedsearch and has been very well received. A second major version was released just last week and includes the following features:
- Support for all content sources: BDC, file shares, web sites, and SharePoint lists.
- Asynchronous processing based on flexible number of facets.
- Support of choice, lookup, and lookup with multiple selection fields.
- Sorting of facets by name, hits, and max.
- Configurable display name, icon per facet.
- Adjustable facet exclusion based on wildcard match.
- Client-side collapse/expand option.
- Crop with tooltip for cropped values and quick info for the Facet.
- Customizable styles consistent with SharePoint.
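As a rough illustration of two of the features above, wildcard-based exclusion and sorting by name or hits, here is a minimal sketch. The function name, its parameters, and the use of shell-style wildcards are assumptions; the actual web part's configuration and matching rules may differ, and the "max" sort order is left out because its meaning is not described here.

```python
from fnmatch import fnmatchcase

def visible_facet_values(counts, exclude_patterns=(), sort_by="hits"):
    """Filter facet values by wildcard patterns, then sort them.

    counts: mapping of facet value -> number of hits.
    exclude_patterns: shell-style wildcards (assumed syntax), e.g. "tmp*".
    sort_by: "name" (alphabetical) or "hits" (most hits first, ties by name).
    """
    kept = [
        (value, hits)
        for value, hits in counts.items()
        if not any(fnmatchcase(value, pattern) for pattern in exclude_patterns)
    ]
    if sort_by == "name":
        return sorted(kept, key=lambda item: item[0].lower())
    if sort_by == "hits":
        return sorted(kept, key=lambda item: (-item[1], item[0].lower()))
    raise ValueError(f"unknown sort order: {sort_by!r}")

# Hypothetical hit counts for a file-type facet.
file_types = {"docx": 12, "pdf": 30, "tmp_cache": 4, "aspx": 12}
by_hits = visible_facet_values(file_types, ["tmp*"])
by_name = visible_facet_values(file_types, ["tmp*"], sort_by="name")
```

Here the `tmp*` pattern removes `tmp_cache` from the displayed values before either sort order is applied.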
Here are screenshots of a couple of example implementations:
Another major version of Faceted Search is scheduled for release within the next few weeks, and it will encapsulate foundational changes in the design and code that will provide a balance between search accuracy and performance. Key enhancements will include:
- Multithreaded processing. The first thread processes up to 500 facets synchronously, while the second thread runs asynchronously against up to ~30,000 facets.
- Client-side refresh (not AJAX) that updates only the facets web part without a page refresh.
- Web part connections to pass Facet settings to the bread crumbs.
- Extended facet schema now supports:
  - Facet icons -- a default icon per facet name, complemented by an icon per facet value.
  - Friendly names for facet values.
  - Exclusions -- allow a facet to be excluded when its values match a pattern.
  - Built-in wildcard match, especially useful for exclusions.
- Improved search syntax, with added support for sentences and quoted phrases.
This new version will also include numerous bug fixes and be complemented by updated documentation for installation, configuration, and styling. It will be first demonstrated by Leonid and me at the Office Developer Conference 2008 in San Jose, California on February 10-13 and then released on CodePlex shortly thereafter.
Neil Hodgkinson, Microsoft PFE