Before reading this post, please check the following links, which will help you better understand search relevance in SharePoint.
Enterprise Search Architecture: http://msdn.microsoft.com/en-us/library/ms570748.aspx
Building Search Queries: http://msdn.microsoft.com/en-us/library/ms470199.aspx
SharePoint Search SQL Syntax: http://msdn.microsoft.com/en-us/library/ms443580.aspx
Now let us look at how relevance works in MOSS Search.
When a search query is executed, the query engine passes the query through a language-specific wordbreaker. If there is no wordbreaker for the query language, the neutral wordbreaker is used. It does whitespace-style wordbreaking, meaning that words and phrases are broken wherever whitespace occurs. After wordbreaking, the resulting words are passed through a stemmer to generate language-specific inflected forms of each word. Using the wordbreaker and stemmer in both the crawling and query processes makes search more effective, because more relevant alternatives to the user's query phrasing are generated.
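As a rough illustration of the neutral wordbreaker described above, here is a small standalone Python sketch that splits text on whitespace. The function name is made up for this post, and this is not the MOSS implementation; it only shows why the crawl side and the query side need to break text the same way for terms to line up.

```python
def neutral_wordbreak(text):
    """Whitespace-style wordbreaking: split wherever whitespace occurs."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty strings.
    return text.split()


# The same breaker is applied to crawled content and to the query.
document_terms = neutral_wordbreak("Enterprise  search\trelevance in SharePoint")
query_terms = neutral_wordbreak("search relevance")

# Because both sides were broken the same way, query terms can be looked up
# directly among the document terms.
matched = [term for term in query_terms if term in document_terms]
print(matched)
```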
When the query engine executes a property value query, the content index is checked first to get a list of possible matches. The properties for the matching documents are loaded from the property store, and the properties in the query are checked again to ensure that there was a match. The result of the query is a list of all matching results, ordered according to their relevance to the query words. Relevance is about how closely the search results that are returned to the user match what the user wanted to find. Ideally, the results that are returned on the first page are the most relevant, so the user does not have to look through several pages of results to find the best matches for their search.
The relevance ranking engine is based on information retrieval algorithms adapted from Stephen Robertson’s BM25F algorithm. It is specifically tuned for the unique requirements of searching enterprise content. This approach orders results by decreasing probability of relevance to the query. The query terms are what connect the document to the query, and statistics about the terms and the result make up the ranking: the document length, the number of occurrences of the term in the document, and the number of documents in which each term occurs at all (this is repeated for each property). The model is further enhanced by tracking body text and properties, such as title or author, individually. Each enhancement to the model that adds features and facts about the document or the query contributes to better results.
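To make the statistics above concrete, here is a minimal Python sketch of a textbook BM25F-style score for one query term. It is not the tuned MOSS formula, which this post does not spell out. The function name, the property weights, the b values, and k1 are all hypothetical numbers chosen for illustration.

```python
import math


def bm25f_term_score(tf_by_prop, weight_by_prop, len_by_prop, avg_len_by_prop,
                     b_by_prop, k1, docs_total, docs_with_term):
    """Score one query term against one document, BM25F style (simplified)."""
    weighted_tf = 0.0
    for prop, tf in tf_by_prop.items():
        b = b_by_prop[prop]
        # Length normalization per property: hits in long fields are damped.
        norm = (1.0 - b) + b * (len_by_prop[prop] / avg_len_by_prop[prop])
        weighted_tf += weight_by_prop[prop] * tf / norm
    # Saturation: each extra occurrence adds less than the previous one.
    saturated = weighted_tf / (k1 + weighted_tf)
    # Terms that occur in few documents count more than common terms.
    rarity = math.log(docs_total / docs_with_term)
    return rarity * saturated


# Hypothetical settings shared by both examples below.
common = dict(
    weight_by_prop={"title": 5.0, "body": 1.0},
    len_by_prop={"title": 5, "body": 900},
    avg_len_by_prop={"title": 6, "body": 1000},
    b_by_prop={"title": 0.5, "body": 0.75},
    k1=1.2, docs_total=1000, docs_with_term=50,
)
title_hit = bm25f_term_score({"title": 1}, **common)
body_hit = bm25f_term_score({"body": 1}, **common)
print(title_hit > body_hit)
```

With these made-up weights, one hit in the title outscores one hit in the body, which is the effect of tracking properties individually.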
Relevance Ranking:
Search queries return integer relevance values in the column named "rank".
The sections below describe the factors that are used to calculate ranking.
Improve Ranking
The ranking of documents can be improved by following the guidelines below.
Avoid the following practices in your Web pages:
· Naming all pages in the site with the same page title.
· Including a specific keyword or keywords or a phrase too often in the <meta> tags or content of a Web page, also called keyword stuffing. In this scenario, the crawler might determine that these keywords or phrases are suspect, and they might be discarded when the search engine is calculating relevance.
· Using hidden text to fill a page with keywords that a search engine can recognize but that are not visible to a visitor.
· Using complex URLs, in which a page is multiple levels deep in a site (for example, http://someserver.com/subsite/pages/somepage.aspx). Such pages might not be easily crawled. You can use a combination of a URL rewriter and a sitemap file to address this in Office SharePoint Server. This crawler behavior also highlights the importance of a proper site structure (which is part of the information architecture).
· Using temporary redirects (this can be a significant issue with a SharePoint landing page). For more information, see Welcome Page Redirect.
· Using complex pages. Take care to keep the pages in a SharePoint site simple, minimizing the use of items such as inline styles or ECMAScript (JScript, JavaScript). The more elements that a page contains, the more difficult it can be for the crawlers to process the page properly.
Components involved in Search Ranking
SharePoint performs two types of ranking: dynamic ranking and static ranking. Dynamic ranking happens on the query servers and depends on the query and on term matching, whereas static ranking is query independent and is computed at index time. Let's dive deeper into each of these:
Dynamic Ranking:
This looks at the content or property values for a content item such as:
Anchor Text
This evaluates the text of a hyperlink, which describes the link's target. E.g. <A href="http://portal/site">Company Name Enterprise Gateway Portal</A>
Property Weighting
Property weighting implies that a match on a specific property value can be more relevant than a match on other property values or in the document’s body.
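// Lists the name and value of every ranking parameter for the SSP.
using Microsoft.SharePoint;
using Microsoft.Office.Server.Search.Administration;

string strURL = "http://<SiteName>";
SearchContext myContext;
using (SPSite site = new SPSite(strURL))
{
    // Get the search context for the site collection.
    myContext = SearchContext.GetContext(site);
}
Ranking ranking = new Ranking(myContext);
foreach (RankingParameter param in ranking.RankingParameters)
{
    RankingParameter RP = ranking.RankingParameters[param.Name];
    Console.WriteLine(RP.Name + ": " + RP.Value);
}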
Title Extraction
The title is a very important property for ranking, but titles are often wrong (e.g. “Slide 1” or “Word Template Name”). MOSS 2007 has an intelligent way of overcoming this problem: it uses a text extraction algorithm that generates a shadow title. How does it find a shadow title if one does not exist? It uses the headings inside your document, which are normally marked with text formatting such as Heading 1 or Heading 2.
Please note that this only works for Office file types, in other words, for files handled by the Office IFilter that MOSS 2007 search uses to pick up this information.
URL Matching
The name of a website is a common type of query. MOSS Search matches the site name in the query to its URL equivalent.
Static Ranking
This describes the ranking that is not impacted by the content or property values for a content item.
File Type Biasing
In most search scenarios, certain file types are more relevant than others. This affects the rank that MOSS Search calculates; the weight for each file type is one of the ranking parameters listed later in this post.
Automatic Language Detection
Foreign-language results are less relevant than results in the user’s language.
Click Distance from authoritative pages
NOTE: Click distance is different from URL depth. It is not based on URL depth but on the path the user takes through pages to get to the information.
Authoritative pages are configured in SharePoint Central Administration.
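One way to picture click distance is as the fewest link clicks needed to reach a page from any authoritative page. The Python sketch below computes that with a breadth-first search over a made-up link graph. The function name and the URLs are hypothetical, and the value MOSS actually stores may be calculated differently (it is not necessarily a whole number of clicks).

```python
from collections import deque


def click_distances(links, authoritative_pages):
    """Breadth-first search: fewest clicks from any authoritative page."""
    distances = {page: 0 for page in authoritative_pages}
    queue = deque(authoritative_pages)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distances:  # first visit is the shortest path
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances


# Hypothetical link graph: page -> pages it links to.
links = {
    "http://portal/": ["http://portal/hr/", "http://portal/a/b/c/policy.aspx"],
    "http://portal/hr/": ["http://portal/hr/forms.aspx"],
}
distances = click_distances(links, ["http://portal/"])
for url, clicks in distances.items():
    print(clicks, url)
```

Here policy.aspx sits deep in the URL hierarchy but is only one click from the authoritative page, while forms.aspx has a shorter URL yet is two clicks away.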
URL Depth
Items with shorter URLs are more relevant than items with longer URLs, e.g. http://msw/ vs http://portal/divisionalsite/ProjectSite1/MeetingSite/. Short URLs are like prime real estate, and organisations tend to allocate them to the most important content.
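For a feel of what "depth" means for the two example URLs above, here is a tiny Python sketch that counts the non-empty path segments of a URL. Counting segments this way is my assumption for illustration; exactly how MOSS counts URL depth is not covered in this post.

```python
from urllib.parse import urlparse


def url_depth(url):
    """Count the non-empty path segments of a URL (illustrative only)."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


for url in ("http://msw/",
            "http://portal/divisionalsite/ProjectSite1/MeetingSite/"):
    print(url_depth(url), url)
```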
The SharePoint object model supports modifying relevance parameters through the Ranking class. The Ranking class has a RankingParameters property, which represents the collection of all ranking parameters for the SSP. The values of existing parameters can be edited, but parameters cannot be added, deleted, or renamed.
Analysis of the Parameters that affect Search Relevance:
Crawled properties are discovered by the search service indexer while it crawls content. Administrators later map these crawled properties to managed properties so that they can be used in search. Both crawled and managed properties have their own configurable settings, which affect relevance ranking. As mentioned earlier, when a query that uses a managed property is executed, it first goes to the content index to get a list of possible matches; the query is then rebuilt from that list and run to get the search results. However, if the managed property is not scoped, the query is directed to SQL rather than to the content index, which hurts query performance. Queries that bypass SQL and hit the content index directly are much faster. We can also modify a property's Weight or LengthNormalization for relevance tuning. The parameters that affect how relevance is calculated are listed below:
Types of parameters to customize
Parameter: Description
k1: Saturation constant for term frequency. This relates to how many times the query term occurs in the item.
Kqir: Saturation constant for click distance.
Wqir: Weight of click distance for calculating relevance.
Kud: Saturation constant for URL depth.
Wud: Weight of URL depth for calculating relevance.
Languageprior: Weight for ranking applied to content in a language that does not match the language of the user.
Filetypepriorhtml: Weight of HTML content type for calculating relevance.
Filetypepriordoc: Weight of Microsoft Office Word content type for calculating relevance.
Filetypepriorppt: Weight of Microsoft Office PowerPoint content type for calculating relevance.
Filetypepriorxls: Weight of Microsoft Office Excel content type for calculating relevance.
Filetypepriorxml: Weight of XML content type for calculating relevance.
Filetypepriortxt: Weight of plain text content type for calculating relevance.
Filetypepriorlistitems: Weight of list item content type for calculating relevance.
Filetypepriormessage: Weight of Microsoft Office Outlook e-mail message content type for calculating relevance.
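To show how a weight and a saturation constant can work together, here is a small Python sketch. It assumes the form weight * k / (k + value), which is a common way to turn a "smaller is better" feature such as click distance or URL depth into a score contribution in BM25F-style rankers. That form is my assumption; this post does not document the exact formula MOSS uses, and the numbers below are hypothetical.

```python
def saturating_bonus(weight, saturation_constant, value):
    """Assumed form: weight * k / (k + value), for 'smaller is better' features."""
    return weight * saturation_constant / (saturation_constant + value)


# Hypothetical weight and saturation constant for URL depth.
weight, k = 10.0, 2.0
for depth in (0, 1, 2, 4, 8):
    print(depth, round(saturating_bonus(weight, k, depth), 3))
```

In this form the weight sets the largest possible contribution, and the saturation constant sets how quickly the contribution falls off: when the value equals the constant, the contribution is half the weight.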
Using special Search predicates to influence relevance:
The CONTAINS predicate is used for exact matches, whereas the FREETEXT predicate is used for finding items that contain combinations of the search words. Queries that contain only the CONTAINS predicate can return results with unexpected rank ordering.
We can indicate a column or a column group on which FREETEXT will test. However, if we do not want to specify a particular column, it is recommended that we use DefaultProperties as the column to get better relevance. When we specify DefaultProperties, all indexed text properties that have a non-zero weight are searched. Some non-zero-weighted properties are not retrievable and cannot be modified, but they are still taken into consideration when an item is ranked. If we combine a Boolean restriction clause with a FREETEXT clause by using the AND operator, we reduce the number of possible matches without affecting the rank values, which are calculated from the FREETEXT clause of the query. If we do not specify a column reference, only the Contents column, which contains the body of the item, is searched. This in turn might filter out results that the user would otherwise have been interested in.
Each FREETEXT clause represents a separate query, and ranking is done separately for each one. Hence, it is not recommended to use more than one FREETEXT clause in a single query, as in this example:
e.g. ...WHERE FREETEXT(DefaultProperties, 'hello') AND FREETEXT(DefaultProperties, 'world')...
To specify multiple query terms in a FREETEXT clause, add all the terms to a single string:
e.g. ...WHERE FREETEXT(DefaultProperties, 'hello world')...
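If the query text is assembled in code, the single-clause advice above could look something like the following Python sketch. The helper name, the selected columns, and the ORDER BY clause are my own choices for illustration, and doubling single quotes is the usual SQL-style escaping that I am assuming applies here; check the SQL syntax reference linked at the top before relying on it.

```python
def build_freetext_query(terms, column="DefaultProperties"):
    """Put all query terms into one FREETEXT clause (hypothetical helper)."""
    # Assumption: a single quote inside the search string is escaped by
    # doubling it, as in standard SQL string literals.
    text = " ".join(terms).replace("'", "''")
    return ("SELECT Title, Path, Rank FROM SCOPE() "
            f"WHERE FREETEXT({column}, '{text}') ORDER BY Rank DESC")


query = build_freetext_query(["hello", "world"])
print(query)
```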
The LIKE keyword performs a pattern-matching comparison on the specified column. It only works with single-valued fields, not multi-valued ones, and the column must be available in the property store. The wildcards that can be used with LIKE are “%”, “_”, “[]”, and “[^]”. Multiple wildcards can be used in one match string.
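To get a feel for what the four wildcards mean, the Python sketch below translates a LIKE match string into a regular expression, using the usual SQL meanings: % is any run of characters, _ is exactly one character, [] is one character from a set or range, and [^] is one character not in the set. This is only a local illustration of those semantics, not how the search engine evaluates LIKE, and details such as case sensitivity are left out.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE match string into a regex (illustration only)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        elif ch == "[":
            end = pattern.find("]", i + 1)
            body = pattern[i + 1:end] if end != -1 else ""
            negate = body.startswith("^")
            chars = body[1:] if negate else body
            # Keep "-" so ranges such as a-f still work inside the set.
            chars = chars.replace("\\", "\\\\").replace("[", "\\[")
            if chars:
                parts.append("[" + ("^" if negate else "") + chars + "]")
                i = end          # skip to the closing bracket
            else:
                parts.append(re.escape(ch))  # no usable set: literal "["
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


for pattern, value in [("Share%", "SharePoint"), ("rep_rt", "report"),
                       ("[a-c]at", "bat"), ("[^a-c]at", "bat")]:
    print(pattern, value, bool(like_to_regex(pattern).fullmatch(value)))
```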
NEAR specifies that two content search terms must be located relatively close to one another to be recognized as matching by the CONTAINS predicate. When the words joined by NEAR are found within approximately 50 words of one another in the column that is being searched, the NEAR term returns a match. The closer together the two words are, the higher the calculated rank for the NEAR term; the farther apart they are, the lower the rank. The number of words is approximate and can be less than 50. If both words are found in the column being searched but are farther apart than that, the result is still returned but has a rank of 0.
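The proximity idea can be sketched in a few lines of Python. The code below finds the smallest gap, in word positions, between two terms and compares it with a window of 50. The function names are made up, the window is treated as an exact cut-off only for simplicity (the text above says it is approximate), and the real rank formula is not reproduced here.

```python
def min_word_distance(words, first, second):
    """Smallest gap in word positions between the two terms, or None."""
    best = None
    last_first = last_second = None
    for position, word in enumerate(words):
        if word == first:
            last_first = position
        if word == second:
            last_second = position
        if last_first is not None and last_second is not None:
            gap = abs(last_first - last_second)
            if best is None or gap < best:
                best = gap
    return best


def within_near_window(distance, window=50):
    """True when both terms occur and are no farther apart than the window."""
    return distance is not None and distance <= window


words = "the search engine ranks every result".split()
distance = min_word_distance(words, "search", "result")
print(distance, within_near_window(distance))
```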
FORMSOF is used inside the CONTAINS predicate to match other linguistic forms of a word. There are two types of FORMSOF word generation. INFLECTIONAL chooses alternative inflected forms of the match words: if the word is a verb, alternative tenses are used; if the word is a noun, the singular, plural, and possessive forms are used to detect matches. THESAURUS chooses words that have the same meaning, taken from a thesaurus.
ISABOUT matches columns against a group of one or more search terms. An ISABOUT term can have one or more components, separated by commas. The columns specified in the CONTAINS predicate are tested against each component, and the document is included in the results if at least one of the components matches.
To retrieve the ranking values as computed by the server at run time, add the rankdetail property to the query, for example:
"select rank, rankdetail, title, description, size, path, hithighlightedSummary"
The rankdetail column then comes back as XML-formatted text such as the following:
<?xml version="1.0"?>
<RankDetail xmlns="x-schema:RankDetailSchema.xml">
  <QIR WID="4" URLDepth="4" ClickDistance="5.333332" Language="1034" FileType="0" External="0" IsDemoted="No"></QIR>
  <Rank DocId="4" Score="13165" NodeType="Prob" OriginalScore="23.714941" QueryIndependentRank="21.913649">
    <Term ID="0" Score="1.801292" TFW="95.807739" n="87" N="527" RW="1.801292" TFK="1.000000" FFK="1.000000">
      <Prop Pid="1" W="100.000000" TF="1.000000" TFW="100.000000" DL="971.000000" AVDL="905.000000" DL_AVDL="1.072928" B="0.600000" DLNORM="1.043757" TFN="95.807747"></Prop>
    </Term>
  </Rank>
</RankDetail>
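The numbers in the sample rankdetail output above can be cross-checked with a short Python sketch. The relationships it tests are my own reading of this one sample, not documented behavior: DLNORM looks like (1 - B) + B * DL / AVDL, TFN looks like TFW / DLNORM, RW looks like the natural log of N / n, and OriginalScore looks like QueryIndependentRank plus the term score. Treat them as observations that happen to fit this sample.

```python
import math
import xml.etree.ElementTree as ET

# The sample rankdetail output from above, unchanged apart from line breaks.
RANK_DETAIL = """<?xml version="1.0"?>
<RankDetail xmlns="x-schema:RankDetailSchema.xml">
<QIR WID="4" URLDepth="4" ClickDistance="5.333332" Language="1034"
     FileType="0" External="0" IsDemoted="No"></QIR>
<Rank DocId="4" Score="13165" NodeType="Prob" OriginalScore="23.714941"
      QueryIndependentRank="21.913649">
<Term ID="0" Score="1.801292" TFW="95.807739" n="87" N="527" RW="1.801292"
      TFK="1.000000" FFK="1.000000">
<Prop Pid="1" W="100.000000" TF="1.000000" TFW="100.000000" DL="971.000000"
      AVDL="905.000000" DL_AVDL="1.072928" B="0.600000" DLNORM="1.043757"
      TFN="95.807747"></Prop>
</Term></Rank></RankDetail>"""

NS = {"rd": "x-schema:RankDetailSchema.xml"}
root = ET.fromstring(RANK_DETAIL)
rank = root.find("rd:Rank", NS)
term = rank.find("rd:Term", NS)
prop = term.find("rd:Prop", NS)

# Assumed relationships, inferred from this sample only.
b = float(prop.get("B"))
dl_norm = (1 - b) + b * float(prop.get("DL")) / float(prop.get("AVDL"))
tfn = float(prop.get("TFW")) / dl_norm
rw = math.log(float(term.get("N")) / float(term.get("n")))
total = float(rank.get("QueryIndependentRank")) + float(term.get("Score"))

print("DLNORM", round(dl_norm, 6), "reported", prop.get("DLNORM"))
print("TFN   ", round(tfn, 6), "reported", prop.get("TFN"))
print("RW    ", round(rw, 6), "reported", term.get("RW"))
print("Score ", round(total, 6), "reported", rank.get("OriginalScore"))
```

If these readings hold, the sample shows the query-independent (static) part contributing far more to this item's score than the single matched term.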