Continuing on my last post which dealt with Chinese
characters, I will cover some user scenarios on Germanic characters like German/Swedish/Norwegian with SQL
FTS. Read on...
Mr Customer: Hey, we have some
full text search catalogs for German
words, and we noticed that FREETEXTTABLE
fails to find text clearly defined in the catalog!
Me: Hmm, can you give me more details on this
Mr Customer: Sure. Here is what I see happening,
E.g. in the German
catalog, define a column to be searchable for the table and set the word
breaker language to German and let’s assume that the column has a product name
Now, when I do a search for ‘d.3’ I got 0 rows returned back. What’s up with
Me: Ah, gotcha! Here is what is happening…I setup a sample repro query and as expected this did not return any rows.
FROM dbo.German AS FT_TBL
INNER JOIN FREETEXTTABLE(dbo.German, data,'d.3') AS KEY_TBL
ON FT_TBL.id = KEY_TBL.[KEY]
The word ‘d.3’ is a noise word (and Noise words are not
stored in a full-text index). A perfect example would be words like “an”, “the”, “but” etc. To
prove this, I again go back to the trusted full-text DMV’s.
SELECT * FROM sys.dm_fts_parser ('d.3', 1031, 0, 0) --1031 is the LCID for GERMAN
As you can see all the words when broken down are NOISE
WORDS. This means that they will not be stored in the full-text index, which is
why your queries do not return any results.
In fact, if you run this query from Management Studio and
in the results pane, switch to the Messages tab and you will see this
full-text search condition contained noise word(s).
Most people miss out on this indicator, since it’s a little tucked away.
The above message means that “some string” in the search query was a noise word
which was ignore by SQL Server when doing the search lookup.Okay,
so what do you do if you really really want to search for some keywords like “d.3”
For the above scenario#2, FREETEXTTABLE is the incorrect function to
use. FREETEXT predicate is to search for values that match the meaning,
but not the exact wording. If you want to find the exact match, you should use
either CONTAINS or CONTAINSTABLE.
With SQL Server 2008, by default the system Stoplist is used for
a catalog. So I created an empty stoplist and associated it with the above
full-text index. This time when I run the query it did return 1 row for the
id data RANK
--- ----- --------
1 d.3 186Note:
Stoplists/Stopwords are available starting from SQL Server 2008 onwards. Please
check this link for more information http://msdn.microsoft.com/en-us/library/ms142551.aspx.If you are using SQL 2005, you can manually edit the noiseDEU.txt in the
FTDATA folder and remove the alphabets and numbers. This will require a restart
of the full-text search service + full population for the changes to show up. The
disadvantage of doing this is that you full-text index gets bloated and becomes
larger in size.
Mr Customer: I have a Full-text
index created on a table that contains data in both Swedish and Norwegian.
Certain queries produce incorrect results and the behaviour is inconsistent.
Me: I am all ears!
Mr Customer: My Swedish users raised a point that they aren’t able to search using
some special chars like Ø. Here’s my
I have lots of rows in my table which contains the word OPPGJØR. Now when I run the following
query using a prefix search, it gives me 0 rows as output. (FYI - OPPGJØR translates to SETTLEMENT in Norwegian).
select * from TBL1 where contains(*,'"OPPGJØ*"')
But if I remove full-text and do a LIKE search I see the rows containing
OPPGJØR being returned.
select * from TBL1 where datacolumn like 'OPPGJØ*'
Even more strange, if I place the prefix search like this I see
the rows being returned.
select * from TBL1 where contains(*,'"OPPGJØR*"')
What is the difference in search between OPPGJØ*
vs. OPPGJØR?Me: Hmm! Let us breakdown the problem
and see why this is happening.
First off I see that the word Ø is being used in the search condition. This is an
accented word. So words such as café
and cafe are treated differently. I
checked the ACCENT SENSITIVITY property of the catalog and it was ON.I turned it off using the following command,
alter fulltext catalog CATALOGNAMEREBUILD WITH ACCENT_SENSITIVITY=OFF;
After this, the above query started giving me the rows for 'OPPGJØ*’. The
disadvantage of doing this is that, it can start to give you more result rows.
In fact it gave me results for OPPGJO since this
the treated the same as OPPGJØ. In fact
this customer of mine had other special Norwegian words like Æ and Å which are
Here is the breakdown using the same DMV as before.
Select document_id, occurrence_count, display_term from sys.dm_fts_index_keywords_by_document ( DB_ID('DBNAME'), OBJECT_ID('TBL1') ) Order by Keyword
This gave me the following output,
Next, I tried to pass in the LCID parameter to the search condition to
see if that helped.
select * from Item itwhere contains(*,'"OPPGJØ*"', LANGUAGE 1053)
select * from Item itwhere contains(*,'"OPPGJØ*"', LANGUAGE 1044)
1053 --> Swedish
1044 --> Norwegian
You can look up this information in sys.fulltext_languages.
But this also gave me 0 rows when accent_sensitivity was turned ON. Like I
mentioned earlier these word-breakers are shipped out the box with SQL Server
and NaturalLanguage6.dll is the one responsible for the above. More information
on this here.
I tested this same on a SQL Server 2005 SP3 instance since the word-breakers
were different back then (nls400.dll).
select * from dbo.swede where contains (text, '"OPPGJØ*"')
select * from dbo.swede where contains (text, '"åä*"')
And to my surprise, I got the expected output. So this looks like
an issue with the new SQL 2008 word-breaker for Swedish/Norwegian. To be more
specific this appears to be an issue in SQL 2008 specific to umlaut & other
Germanic alphabets. You can read about umlaut here.
I did prefix searches (word*) for the
following characters and they all returned me incorrect rows with sensitivity
Now, we are really getting somewhere close to nailing this issue. Since we know
the problem has been isolated to prefix search on accented characters, I did
some searching and found this KB article that talks about the exact problem
with “extended characters or accents in a prefix search”!
FIX: Prefix search
with extended characters or accents returns incorrect result in SQL Server 2008
The solution here was to apply SQL 2008 SP1 Cumulative Update 4 which
is described in the above article. After I tested this fix, the prefix searches
started returned me expected results similar to SQL 2005.Also to get better localized results, I changed the word-breaking language for
the column datacolumn in TBL1 to Norwegian.
& considerations for international languages
Windows LCID Information
International Considerations for
I hope this was useful in demonstrating some interesting case scenarios on
international characters and its usage with SQL Full-text search.
The ultimate search engine would basically understand everything in
the world and it would always give you the right thing.