I have been getting some questions about how MOSS does word stemming, so here is what I have been able to gather from my research:
MOSS Search Stemming:
The term stemming refers to the process of stripping the endings off words at query or index time so that different search terms match and retrieve documents containing related words in the index. A hypothetical example would be reducing the related words “diet”, “diets”, “dieting”, “dieted”, “dietary”, “dietician”, “dieticians”, etc. to the single stem “diet”, so that a query for any one of these words is matched against documents containing any of the others. Computational implementations of stemming are typically not based on linguistic notions of stem and affix, but simply on frequently occurring character sequences.
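To make the "frequently occurring character sequences" idea concrete, here is a minimal sketch of a naive suffix stripper. It is not the MOSS stemmer: the suffix list, the `naive_stem` name, and the minimum stem length are all assumptions chosen only to reproduce the hypothetical "diet" example above.

```python
# Illustrative only: a purely string-based stemmer, not how MOSS works.
# Suffixes are ordered longest first so "icians" is tried before "s".
SUFFIXES = ["icians", "ician", "ary", "ing", "ed", "s"]


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["diet", "diets", "dieting", "dieted", "dietary", "dietician", "dieticians"]:
        print(w, "->", naive_stem(w))
```

Because a rule like this knows nothing about the language, it also over-stems: with the same suffix list, “during” is reduced to “dur”.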
The correct linguistic terminology for the process of stemming described above is morphological processing, meaning the linking of a given word form to its base form and other related word forms.
Morphological processing has two main aspects:
1. Morphological Analysis
2. Morphological Generation.
Morphological Analysis refers to the process of analyzing a given word form by assigning its correct morphosyntactic data (such as person, number, gender, tense, etc.) and identifying its internal structure in terms of its base form and any prefixes and suffixes. These affixes can be inflectional (they do not change the part of speech or meaning of the word, e.g. diet, diets) or derivational (they do change the part of speech or meaning of the word, e.g. diet, dietary).
Morphological Generation goes one step further. In addition to analyzing an inflected form down to its base form, it also generates all inflected forms which are related to the same base form. For example, the inflected verb form “loved” is reduced to the base form “love” by Morphological Analysis, and then Morphological Generation generates all related inflected forms, i.e. “love”, “loves”, “loving”, “loved”. These are then matched against the document index, and all matches are retrieved and displayed to the user (normally in search engines exact matches are given a higher weighting than the related words and are displayed at the top of the results list).
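The two steps can be sketched with a tiny hand-written lexicon. This is only an illustration of the idea: the `LEXICON` table and the `analyze` and `generate` functions are hypothetical names, and a real morphological component would use language-specific rules and dictionaries rather than a lookup table.

```python
# Hypothetical toy lexicon: base form -> all inflected forms.
LEXICON = {
    "love": ["love", "loves", "loving", "loved"],
}

# Reverse lookup used by the analysis step: inflected form -> base form.
FORM_TO_BASE = {form: base for base, forms in LEXICON.items() for form in forms}


def analyze(word):
    """Morphological Analysis: reduce a word form to its base form."""
    word = word.lower()
    return FORM_TO_BASE.get(word, word)


def generate(base):
    """Morphological Generation: list every inflected form of a base form."""
    return LEXICON.get(base, [base])


if __name__ == "__main__":
    base = analyze("loved")
    print(base, generate(base))
```

Unknown words simply fall through unchanged in this sketch, which is one reasonable choice for a search engine: a term it cannot analyze is still searched as typed.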
The decision as to whether to apply Morphological Analysis or both Morphological Analysis and Morphological Generation in Search depends on two major considerations: (a) whether there are any restrictions on the size of the search index, and (b) the level of morphological complexity of the language.
(a) If we do not want the size of the index to increase as a result of this morphological processing, then we leave the index unchanged and use both Morphological Analysis and Morphological Generation at query time to expand a given search query term into a list of inflectionally and/or derivationally related word forms. These are all matched against the search index, and the results are returned with exact matches listed first. If the size of the index is permitted to increase, then we can use Morphological Analysis at index time to store a base form along with each inflected form in the index. At query time we also use Morphological Analysis to attach a base form to the query term and then search on both items. All matches for both the query term and its base form are retrieved, and any exact matches on the query term are listed first.
(b) If the morphology of a particular language is very rich, then a very large number of query terms could be generated at run time. This could have a profound effect on processing efficiency. In languages such as Arabic and Hebrew, which have a very large number of forms of a single word (the number can get into the thousands), it is therefore preferable to avoid Morphological Generation and just use Morphological Analysis at both index and query time. This increases the size of the index but has the advantage that the number of forms of a word stored in the index and generated from the query will not exceed 2 (the word itself and its associated base form).
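The two options in (a) can be compared side by side in a small sketch. Everything here is assumed for illustration: the toy documents, the lexicon, the `base:` key prefix used to keep stored base forms apart from surface forms, and the function names. It says nothing about how the MOSS index is actually laid out.

```python
from collections import defaultdict

# Hypothetical toy data.
LEXICON = {"love": ["love", "loves", "loving", "loved"]}
FORM_TO_BASE = {f: b for b, forms in LEXICON.items() for f in forms}
DOCS = {1: "she loved the film", 2: "he loves opera", 3: "loving it", 4: "weather report"}


def analyze(word):
    return FORM_TO_BASE.get(word, word)


def build_index(docs, store_base_forms):
    """Inverted index; optionally also store a 'base:<form>' entry per token."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
            if store_base_forms:
                index["base:" + analyze(token)].add(doc_id)
    return index


def rank(exact, related):
    """Exact matches first, then the remaining related matches."""
    return sorted(exact) + sorted(related - exact)


def search_with_query_expansion(index, term):
    """Index unchanged: analyze and generate at query time, search every form."""
    forms = LEXICON.get(analyze(term), [term])
    exact = index.get(term, set())
    related = set().union(*(index.get(f, set()) for f in forms))
    return rank(exact, related)


def search_with_base_forms(index, term):
    """Larger index: search only the term and its base form (two lookups)."""
    exact = index.get(term, set())
    related = index.get("base:" + analyze(term), set())
    return rank(exact, related)


if __name__ == "__main__":
    plain = build_index(DOCS, store_base_forms=False)
    enriched = build_index(DOCS, store_base_forms=True)
    print(search_with_query_expansion(plain, "loves"), len(plain))
    print(search_with_base_forms(enriched, "loves"), len(enriched))
```

Both searches return the same documents in the same order; the trade-off is that the first performs one lookup per generated form, while the second always performs two lookups against a larger index.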
PS: For Arabic-related issues and questions, please see:
Arabic specific MOSS http://www.microsoft.com/middleeast/arabicdev/office/office2007/SharePoint.aspx , Arabic MOSS search http://www.microsoft.com/middleeast/arabicdev/office/office2007/Search.aspx , Arabic 2007 office servers http://www.microsoft.com/middleeast/arabicdev/office/office2007/office2007server.aspx
The key difference between wildcard searching and word stemming (or morphological processing) is that the former is purely string-based, which in some cases allows the user to find both inflectional and derivational variants of the query term. The stemming approach deals very well with inflectional variants, but we currently don’t handle derivational morphology for most languages. In languages where we do have some limited treatment of derivational morphology, such as Arabic and Hebrew, this treatment is restricted to high-frequency terms.
Thanks to Ian Johnson from the Natural Language Group at Microsoft for providing his feedback on this.
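A short sketch makes the contrast visible. The vocabulary and the inflection table below are made up for illustration, and the wildcard matching uses Python's standard `fnmatch` module rather than anything from MOSS.

```python
from fnmatch import fnmatchcase

# Hypothetical index vocabulary; "diethyl" is an unrelated chemistry term.
VOCAB = ["diet", "diets", "dieting", "dieted", "dietary", "dietician", "diethyl"]

# Hypothetical inflection table: inflectional variants only, no derivations.
INFLECTIONS = {"diet": ["diet", "diets", "dieting", "dieted"]}


def wildcard_matches(pattern, vocab):
    """String-based matching: anything that fits the pattern is returned."""
    return [w for w in vocab if fnmatchcase(w, pattern)]


def stemming_matches(base, vocab):
    """Morphology-based matching: only known inflected forms are returned."""
    forms = set(INFLECTIONS.get(base, [base]))
    return [w for w in vocab if w in forms]


if __name__ == "__main__":
    print(wildcard_matches("diet*", VOCAB))
    print(stemming_matches("diet", VOCAB))
```

The wildcard picks up the derivational variants “dietary” and “dietician”, but also the unrelated “diethyl”; the stemming lookup returns only the four inflected forms.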
Stay tuned for Part 2, where I will explain how MOSS does Stemming in more detail.
Hope that helps