Stemming Functionality in Full Text Search using CONTAINS

Stemming Functionality in Full Text Search using CONTAINS

Rate This
  • Comments 8

Many a times you may have encountered issues related to Full Text Search functionality in SQL Server wherein you won’t be able to search words with different verb forms.  These different verb forms are also called as Stemmers. For a given language, a stemmer generates inflectional forms of a particular word based on the rules of that language. Stemmers are language specific. When you use Full Text Search, most of the times you would use CONTAINS predicate to search for the words. Contains is a predicate used to search columns containing character-based data types for precise or fuzzy (less precise) matches to single words and phrases, the proximity of words within a certain distance of one another, or weighted matches. Now, the problem is whenever you try to do a search for a particular word SQL should have resolved the different forms of that word but practically which is not the case. You would think that there is a problem in the Full Text Engine and try to rebuild the Full Text catalog but still encounter the same issue. However, everything works well when you try using LIKE operator. So what is the problem? You would think as if this is an issue with your SQL Engine or Full Text catalog or the language that you are using for the full text indices but still you have no luck!

I also encountered similar kind of issues, did some research work and got explanations related to this problem.

The query I was using was like the following:

SELECT * FROM TableName WHERE CONTAINS (ColumnName, ' "word" ')
GO

Scenario:

In the table I had word "CHAINS" and whereas I was searching for word "CHAIN"! So, whenever I was searching for CHAIN it is not giving me expected result. I recreated the problem on sample database. I created a FT Index on Production.Product table on Name column.

Ran following command:

SELECT Name FROM Production.Product WHERE CONTAINS(Name, ' "Chain*" ');
GO

Results:

Name

--------------------------------------------------

Chains

(1 row(s) affected)

--This result was expected. Remember here I am using a "simple_term" along with "*" to define a phrase to for search condition.

NOTE: We are getting the expected result because "*" specifies that all columns in the table registered for full-text searching should be used to search for the given contains search condition and <simple_term> specifies a match for an exact word or a phrase.

Then I ran following command:

SELECT Name FROM Production.Product WHERE CONTAINS(Name, ' "Chain" ') -------- Removing *
GO

Results:

Name

--------------------------------------------------

(0 row(s) affected)

Why this result?? Are you really expecting an output after running the above statement??

So, the point is why FT engine is not using Stemming at all if we don't use "*" with simple_term in CONTAINS predicate. So, the explanation of this behaviour is that if you want to use CONTAINS predicate and want to make use of language dependent stemmer, you should use a parameter called INFLECTIONAL in the CONTAINS predicate. INFLECTIONAL would cause Full Text Engine to use language specific stemmer for the specified simple terms. Stemmer behaviour is defined based on stemming rules of each specific language. The neutral language does not have an associated stemmer. The column language of the column(s) being queried is used to refer to the desired stemmer. If language_term is specified, the stemmer corresponding to that language is used.

Now, if I modify my query like the following:

SELECT Name FROM Production.Product WHERE CONTAINS(Name, 'FORMSOF (INFLECTIONAL, Chain)');
GO

Results:

Name
--------------------------------------------------
Chains

(1 row(s) affected)

This is irrespective of whether I use asterisk "*" or simple_term. This is the result that I want!! But this result will not distinguish the difference between the use of INFLECTIONAL parameter and asterisk "*" with simple_term. So, you would obviously think that why should I use INFLECTIONAL when I am getting the same result using "*"? To verify my above theory, I tried one more test. I added 3 more rows in the table where Name= Ran\Run\Running. The intention behind this test is to find whether INFLECTIONAL is able to search the verb forms of the word as well or not and whether we get the same results if we use asterisk "*" with simple_term.

SELECT Name FROM Production.Product WHERE CONTAINS(Name, '"Run*"');
GO

Results:

Name

--------------------------------------------------

Run

Running

(2 row(s) affected)

--This is an expected result because we don't expect Full Text to give "Ran" as one of the outputs.

Then used following:

SELECT Name FROM Production.Product WHERE CONTAINS(Name, 'FORMSOF (INFLECTIONAL, Run)');
GO

Name

--------------------------------------------------

Ran

Run

Running

(3 row(s) affected)

WOW!! I got the verb forms for the word Run!!

So, here comes the difference between "*" with simple_term and INFLECTIONAL. When we use INFLECTIONAL we are we are actually searching the Verb and Noun forms of the word we are searching and hence utilizing the correct functionality of the Stemmer. Whereas when we use a given <simple_term> within a "*" will not match both nouns and verbs.

So, here INFLECTIONAL comes to your rescue :)

Another piece of information that you may be interested in would be related to usage of FREETEXT predicate.

FREETEXT is a predicate used to search columns containing character-based data types for values that match the meaning and not the exact wording of the words in the search condition. When FREETEXT is used, the full-text query engine internally performs the following actions on the freetext_string, assigns each term a weight, and then finds the matches"

1. Separates the string into individual words based on word boundaries (word-breaking).

2. Generates inflectional forms of the words (stemming).

3. Identifies a list of expansions or replacements for the terms based on matches in the thesaurus.

So, FREETEXT predicate will give you the same stemming functionality as that of INFLECTIONAL in CONTAINS predicate. If you use the above example using FREETEXT you will see same output as that of CONTAINS:

SELECT Name FROM Production.Product WHERE FREETEXT (Name, 'Run');
GO

Results:

Name

--------------------------------------------------

Ran

Run

Running

(3 row(s) affected)

Now, you may argue why one should use INFLECTIONAL in CONTAINS predicate as against FREETEXT?? So, the answer to this is: Full-text queries using FREETEXT are less precise than those full-text queries using CONTAINS. The SQL Server full-text search engine identifies important words and phrases. No special meaning is given to any of the reserved keywords or wildcard characters that typically have meaning when specified in the <contains_search_condition> parameter of the CONTAINS predicate.

Also, remember that stemming is fully dependent on language you have set on the column on which you have created the full text index [Language for Word Breaker]. This means if you have a Korean as a Language for Word Breaker for a column, the stemming would depend on Korean language and not on any other language.

For example I changed the Language for Word Breaker to Korean from English in the above example and the result of the query is:

SELECT Name FROM Production.Product WHERE CONTAINS(Name, 'FORMSOF (INFLECTIONAL, Run)');
GO

Result:

Name

--------------------------------------------------

Run

(1 row(s) affected)

Important point to notice here is to wait until the Full Population of the FT Index is completed after you make changes to the properties index. Else, you might see the same result that you may be expecting.

PS: When you use Neutral Language word-breaker as opposed to English (UK/US), then it will give you a different result set. You will see different items showing up in the FT output as Neutral does not understand Grammar – so there is no cause for alarm, its expected behavior.

Another thing that could affect the output is whether the catalog is accent-sensitive or insensitive. Class example would be: cafe vs. café

select fulltextcatalogproperty('Catalog_Name','AccentSensitivity')

I hope this would give some relief to you if you were stuck on a situation similar to this.

Sumit Sarabai
SE, Microsoft Sql Server

Reviewed by

Mukesh Nanda
TL,Microsoft Sql Server

&

Sudarshan Narshiman
TL,Microsoft Sql server

 

Leave a Comment
  • Please add 6 and 5 and type the answer here:
  • Post
  • Hi guys,

    Great post. I've got a question.

    I have a table column in which I am storing keywords so that I can get better FTE search results.

    However, some of my keywords contain multiple words i.e. soft drink, orange juice etc.

    Is there a way in which, in the mark up of my keyword I could hint to SQLServer FTE to treat these multiple word keywords as a single unit rather than multiple single words? i.e. like use "orange juice" or 'orange juice' or orange_juice or orange+ juice etc?

    i.e. I would like the document with a keyword of "Orange Juice" to come up only when I search for "orange juice" rather than "orange".

    In other words, my "orange juice" document should not appear when I search for "orange".

    Does this make sense?

    Thanks heaps,

    Devashi

  • Hi Devashi,

    Good Question!

    Full-text queries perform linguistic searches against the data, by operating on words and phrases based on rules of a particular language. Also, this linguistic analysis is done by the Word Breakers and Stemmers which again depends on different languages. If you search for "Orange" FT Engine will actually search for all the rows which contain Orange throughout the coulmn. If you have "Orange Juice" in one row it will be interpretated as "Orange" and "Juice" by the FT enginer due to Work Breaker functionality. Hence, whenever you try to search for word Orange it will actually search all the rows which have word Orange irrespective of the fact whether Orange comes first in the phrase or in the middle or at the last, which is an ideal behavior I suppose :)

    In contrast, LIKE predicate will only works on character patterns and does not do any linguistic analysis. As a result if you search for word Orange in LIKE predicate you will only get one row which have only "Orange" as the value.

    However, I am doing some more test on your question and will keep you posted.

  • Hi,

    I have a question. Say I'm searching for the term 'John' in my table named Persons, which has columns FirstName, LastName, address etc. So if I use:

    SELECT * FROM PERSONS WHERE CONTAINS(*, '"John*"')

    it returns all the rows where John appears, even returns people whose last name is 'JOHNSON' and address contains say Johns St. etc.

    However, if I use INFLECTIONAL formsOf with contains and search for 'John', it does not return rows that contain JOHNSON. I understand that Johnson is not a verb of John, but how do I get the same results as I get when I search for 'John*'?

    Thanks

  • Hi,

    I have another question. When I use the Inflectional formof with Contains and search for 'John' it does not return results which have 'Johns Hopkins' or 'Johns'. But if I use the same inflectional formOf with 'john' (lowercase), it returns  the rows which have 'Johns' or 'Johns Hopkins'.

    Am confused!

    Thanks

  • I have a problem with the word 'restore' and 'restoration'.  

    Am I expecting too much for the word 'restore' to return records containing 'restoration'?

    Thanks.

  • Hi Tim,

    Thanks for your question.

    Restoration is not a verb\noun form of Restore.

    When we use INFLECTIONAL we are we are actually searching the Verb and Noun forms of the word. In your case you are searching for Restore which will give you "Restore", "Restoring" and "Restored". But if you want to search for Restoration using Restore, I am afraid you can't search it.  

    Thanks.

    Sumit

  • Hi Swati,

    As you understand FORMSOF Inflectional will search for the verb form of the word, it will not show you Johnson when you use "John" as your search criteria. Now, when you use the statement like the one above <SELECT * FROM PERSONS WHERE CONTAINS(*, '"John*"') >, you are actually using "prefix_term" in your search condition. In this case, when you enclose a prefix term in double quotation marks ("") and add an asterisk (*) before the ending quotation mark, all text starting with the simple term specified before the asterisk is matched. The asterisk matches zero, one, or more characters. If the text and asterisk are not delimited by double quotation marks, so the predicate reads CONTAINS (*, 'John*'), full-text search considers the asterisk as a character and searches for exact matches to John*.

    Regarding your second question: I am looking after this issue as it was reproduced at my end as well. I am not sure if it is by behaviour or something else. I will post you with some updates sometime later. Do check this blog for updates.

  • I need to fetch data based on pattern search, not search condition is firts 3 charaters is 486 and fourth one is 8 or 3 and fifth digit between 1 to 9. for search exmaple string are 48685, 48629....

    I done this using like operator -- select col1 from table1 where col1 like '486[83][1-9]'. This same query I want to write using free text contains operator.

    Please post if you have any idea.

Page 1 of 1 (8 items)