Continuing on my last post which dealt with Chinese characters, I will cover some user scenarios on Germanic characters like German/Swedish/Norwegian with SQL FTS. Read on...


Scenario1 – German language 



Mr Customer: Hey, we have some full text search catalogs for German words, and we noticed that FREETEXTTABLE fails to find text clearly defined in the catalog!
Me: Hmm, can you give me more details on this behaviour.
Mr Customer:
Sure. Here is what I see happening,

E.g. in the German catalog, define a column to be searchable for the table and set the word breaker language to German and let’s assume that the column has a product name = ‘d.3’
Now, when I do a search for ‘d.3’ I got 0 rows returned back. What’s up with that?

Me: Ah, gotcha! Here is what is happening…

I setup a sample repro query and as expected this did not return any rows.

SELECT FT_TBL.id,FT_TBL.data,KEY_TBL.RANK
FROM dbo.German AS FT_TBL
INNER JOIN FREETEXTTABLE(dbo.German, data,'d.3') AS KEY_TBL
ON FT_TBL.id = KEY_TBL.[KEY]


The word ‘d.3’ is a noise word (and Noise words are not stored in a full-text index). A perfect example would be words like “an”, “the”, “but” etc. To prove this, I again go back to the trusted full-text DMV’s.

SELECT * FROM sys.dm_fts_parser ('d.3', 1031, 0, 0) --1031 is the LCID for GERMAN

 

keyword         

occurrence 

special_term    

display_term 

expansion_type

source_term

0x0064          

1

Noise Word      

d            

0             

d.3

0x0033          

2

Noise Word      

3            

0             

d.3

0x006E006E0033

2   

Noise Word      

nn3          

0             

d.3

 

As you can see all the words when broken down are NOISE WORDS. This means that they will not be stored in the full-text index, which is why your queries do not return any results.

In fact, if you run this query from Management Studio and in the results pane, switch to the Messages tab and you will see this warning,

Informational: The full-text search condition contained noise word(s).


Most people miss out on this indicator, since it’s a little tucked away. The above message means that “some string” in the search query was a noise word which was ignore by SQL Server when doing the search lookup.
Okay, so what do you do if you really really want to search for some keywords like “d.3” or “AT&T”?


Solution1
For the above scenario#2, FREETEXTTABLE is the incorrect function to use. FREETEXT predicate is to search for values that match the meaning, but not the exact wording. If you want to find the exact match, you should use either CONTAINS or CONTAINSTABLE.

Solution2
With SQL Server 2008, by default the system Stoplist is used for a catalog. So I created an empty stoplist and associated it with the above full-text index. This time when I run the query it did return 1 row for the string ‘d.3’

id    data  RANK
---    -----   --------
1      d.3    186

Note: Stoplists/Stopwords are available starting from SQL Server 2008 onwards. Please check this link for more information http://msdn.microsoft.com/en-us/library/ms142551.aspx.

If you are using SQL 2005, you can manually edit the noiseDEU.txt in the FTDATA folder and remove the alphabets and numbers. This will require a restart of the full-text search service + full population for the changes to show up. The disadvantage of doing this is that you full-text index gets bloated and becomes larger in size.


Scenario2 – Swedish/Norwegian characters



Mr Customer: I have a Full-text index created on a table that contains data in both Swedish and Norwegian. Certain queries produce incorrect results and the behaviour is inconsistent.
Me: I am all ears!
Mr Customer:
My Swedish users raised a point that they aren’t able to search using some special chars like Ø. Here’s my complete situation.

I have lots of rows in my table which contains the word OPPGJØR. Now when I run the following query using a prefix search, it gives me 0 rows as output. (FYI - OPPGJØR translates to SETTLEMENT in Norwegian).

select * from TBL1 where contains(*,'"OPPGJØ*"')


But if I remove full-text and do a LIKE search I see the rows containing OPPGJØR being returned.

select * from TBL1 where datacolumn like 'OPPGJØ*'


Even more strange, if I place the prefix search like this I see the rows being returned.

select * from TBL1 where contains(*,'"OPPGJØR*"')


What is the difference in search between OPPGJØ* vs.  OPPGJØR?

Me:
Hmm! Let us breakdown the problem and see why this is happening.


First off I see that the word Ø is being used in the search condition. This is an accented word. So words such as café and cafe are treated differently. I checked the ACCENT SENSITIVITY property of the catalog and it was ON.
I turned it off using the following command,

alter fulltext catalog CATALOGNAME
REBUILD WITH ACCENT_SENSITIVITY=OFF;


After this, the above query started giving me the rows for 'OPPGJØ*’. The disadvantage of doing this is that, it can start to give you more result rows. In fact it gave me results for OPPGJO since this the treated the same as OPPGJØ. In fact this customer of mine had other special Norwegian words like Æ and Å which are again accented.

Here is the breakdown using the same DMV as before.

Select  document_id, occurrence_count, display_term from sys.dm_fts_index_keywords_by_document ( DB_ID('DBNAME'), OBJECT_ID('TBL1') ) Order by Keyword
 

This gave me the following output,

1

oppgj

1

oppgj

1

oppgj

1

oppgj

1

oppgjor

1

oppgjor

1

oppgjor

1

oppgjor



Next, I tried to pass in the LCID parameter to the search condition to see if that helped.

select * from Item it
where contains(*,'"OPPGJØ*"', LANGUAGE 1053)
select * from Item it
where contains(*,'"OPPGJØ*"', LANGUAGE 1044)


1053 --> Swedish

1044 --> Norwegian

You can look up this information in sys.fulltext_languages. But this also gave me 0 rows when accent_sensitivity was turned ON. Like I mentioned earlier these word-breakers are shipped out the box with SQL Server and NaturalLanguage6.dll is the one responsible for the above. More information on this here.

I tested this same on a SQL Server 2005 SP3 instance since the word-breakers were different back then (nls400.dll).

select * from dbo.swede where contains (text, '"OPPGJØ*"')
select * from dbo.swede where contains (text, '"åä*"')


And to my surprise, I got the expected output. So this looks like an issue with the new SQL 2008 word-breaker for Swedish/Norwegian. To be more specific this appears to be an issue in SQL 2008 specific to umlaut & other Germanic alphabets. You can read about umlaut here.

I did prefix searches (word*) for the following characters and they all returned me incorrect rows with sensitivity ON.

ä
ö
ü
Ä
Ö
Ü
Ø
Æ


Now, we are really getting somewhere close to nailing this issue. Since we know the problem has been isolated to prefix search on accented characters, I did some searching and found this KB article that talks about the exact problem with “extended characters or accents in a prefix search”!

FIX: Prefix search with extended characters or accents returns incorrect result in SQL Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;973090

 

Solution
The solution here was to apply SQL 2008 SP1 Cumulative Update 4 which is described in the above article. After I tested this fix, the prefix searches started returned me expected results similar to SQL 2005.

Also to get better localized results, I changed the word-breaking language for the column datacolumn in TBL1 to Norwegian.


Other tips & considerations for international languages

Windows LCID Information
http://msdn.microsoft.com/en-us/library/ms788149.aspx

International Considerations for Full-Text Search
http://msdn.microsoft.com/en-us/library/ms142507(SQL.90).aspx


I hope this was useful in demonstrating some interesting case scenarios on international characters and its usage with SQL Full-text search.



The ultimate search engine would basically understand everything in the world and it would always give you the right thing.