The thesaurus is an XML file that lets users' queries be automatically expanded or rewritten to include synonyms, acronyms, and similar variants. For example, in a chemical company, product ID 1234, oxygen, O2 and LOX could all refer to the same item.
A SharePoint Search administrator can modify the thesaurus file to substitute all these words at search query time. This document explains how to set up a thesaurus and where to find the relevant files.
Supported Thesaurus Syntax:
To use the sample files provided with the product, remove the comment opening line (<!--) and the comment closing line (-->) from the XML file.
Explanation of terms:
Diacritics are marks, such as accents, that are added to letters to change their pronunciation. For example, an acute accent over an e gives é. The diacritics_sensitive setting takes one of two values:
0 – ignore diacritics
1 – respect diacritics
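To illustrate what ignoring versus respecting diacritics means for term matching, here is a minimal Python sketch. It is not how the search service implements the setting; the helper names strip_diacritics and terms_match are hypothetical and exist only for this illustration.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (accents) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def terms_match(a, b, diacritics_sensitive):
    """Compare two terms, mimicking diacritics_sensitive = 0 or 1.

    Illustrative assumption: real matching also involves word breaking
    and other rules that are not modeled here.
    """
    if diacritics_sensitive:
        return a == b
    return strip_diacritics(a) == strip_diacritics(b)
```

With a setting of 0, terms_match("café", "cafe", 0) treats the two spellings as the same term; with a setting of 1 they are treated as different.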
Example:
<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
  </thesaurus>
</XML>
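The example means:
Diacritics are ignored when matching terms (diacritics_sensitive is 0).
A query for any one of Internet Explorer, IE, or IE5 is expanded to include all three terms.
A query for NT5 or W2K is replaced with Windows 2000.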
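The effect of the expansion and replacement sets in the example above can be sketched in Python. This is only an illustration of the semantics, not the search engine's implementation: the function names load_rules and rewrite_term are hypothetical, and matching is done by exact string comparison, which is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

# The example thesaurus from this document, with the root element closed.
SAMPLE = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
  </thesaurus>
</XML>"""

# The thesaurus element declares a default namespace, so lookups need it.
NS = {"ts": "x-schema:tsSchema.xml"}


def load_rules(xml_text):
    """Return (expansion groups, replacement map) from thesaurus XML."""
    root = ET.fromstring(xml_text)
    thesaurus = root.find("ts:thesaurus", NS)
    expansions = [
        [sub.text for sub in exp.findall("ts:sub", NS)]
        for exp in thesaurus.findall("ts:expansion", NS)
    ]
    replacements = {}
    for rep in thesaurus.findall("ts:replacement", NS):
        subs = [sub.text for sub in rep.findall("ts:sub", NS)]
        for pat in rep.findall("ts:pat", NS):
            replacements[pat.text] = subs
    return expansions, replacements


def rewrite_term(term, expansions, replacements):
    """Return the terms a single query term would be rewritten to."""
    if term in replacements:          # replacement: pattern is dropped
        return list(replacements[term])
    for group in expansions:          # expansion: all synonyms are kept
        if term in group:
            return list(group)
    return [term]                     # no rule applies
```

Calling load_rules(SAMPLE) and then rewrite_term("IE", ...) yields all three Internet Explorer terms, while rewrite_term("W2K", ...) yields only Windows 2000, because a replacement discards the original pattern.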
How to Customize the Thesaurus:
Notes:
See “Finding Important Files” below for a summary of where to find the key files to manage your thesaurus.
Finding Important Files:
The following are the most important files used to manage your thesaurus.
There are 50 default stop word files and 48 thesaurus sample files for the languages we support.
The search service install path can be found in the "DefaultApplicationsPath" value of the registry key [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager].
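As a convenience, the registry value can also be read from a script. The following is a minimal sketch using Python's standard winreg module; it assumes it is run on the Windows server itself by an account that can read the key, and the function name read_default_applications_path is hypothetical.

```python
# Key and value names are taken from the text above.
KEY_PATH = r"SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager"
VALUE_NAME = "DefaultApplicationsPath"


def read_default_applications_path():
    """Return the DefaultApplicationsPath registry value as a string."""
    import winreg  # Windows-only module, imported lazily on purpose

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return value
```

If the key or value does not exist, winreg raises an OSError (FileNotFoundError), which usually means the search service is not installed on that machine.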
The default location of the thesaurus files (for each index, query and web frontend server) is:
%programfiles%\Microsoft Office Servers\12.0\Data\Office Server
When a search application is created, a copy of the thesaurus file will also be placed under:
%programfiles%\Microsoft Office Servers\12.0\Data\Office Server\Applications\[GUID]\Config
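Because each search application gets its own [GUID] folder, it can help to list every Config folder that may hold a copy of the thesaurus. The sketch below assumes the directory layout described above; the function name config_dirs is hypothetical, and DEFAULT_DATA_ROOT only expands correctly on Windows.

```python
import os
from pathlib import Path

# Default data root from the text; %programfiles% is expanded on Windows only.
DEFAULT_DATA_ROOT = os.path.expandvars(
    r"%programfiles%\Microsoft Office Servers\12.0\Data\Office Server"
)


def config_dirs(data_root=DEFAULT_DATA_ROOT):
    """Return the Applications\\[GUID]\\Config folders under data_root."""
    apps = Path(data_root) / "Applications"
    return sorted(p for p in apps.glob("*/Config") if p.is_dir())
```

Running config_dirs() on each index, query and web frontend server gives the folders to check when a thesaurus edit needs to be applied to every copy.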
Stop word files for each language are named noiseLANG.txt, where LANG is the three-letter acronym for that language. For example, the US English file is noiseENU.txt, and the language-neutral list is noiseNEU.txt.
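The naming rule can be expressed as a small Python sketch, for example to work out which languages have a stop word file in a folder listing. The helper names noise_file_name and language_of are hypothetical, and matching the name case-insensitively is an assumption based on Windows file systems normally ignoring case.

```python
import re

# noise + three-letter language acronym + .txt
NOISE_NAME = re.compile(r"^noise([A-Za-z]{3})\.txt$", re.IGNORECASE)


def noise_file_name(lang):
    """Build the stop word file name for a three-letter language acronym."""
    return f"noise{lang.upper()}.txt"


def language_of(file_name):
    """Return the language acronym in a stop word file name, or None."""
    match = NOISE_NAME.match(file_name)
    return match.group(1).upper() if match else None
```

For instance, noise_file_name("enu") gives noiseENU.txt, and language_of("noiseNEU.txt") gives NEU, while a name that does not follow the pattern gives None.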
To find the appropriate acronym for your language(s), please look them up under: http://www.microsoft.com/globaldev/nlsweb/default.mspx.