November, 2007

  • Jie Li's GeekWorld

    Improve User Experience in Enterprise Search Step By Step - Part IV - Relevancy Tuning by WordBreaker


    We have been talking about XSL/XML for a long time. Now let's give relevancy a shot.

    Hey, relevancy?

    Yes, relevancy.

    Relevancy is the most important thing for a search engine: more important than the number of pages it has crawled, and in most cases more important than how often the results are updated, because users always look at the top results.

    Relevancy is a very complex problem. It is affected by many factors, and it differs considerably between languages. In this article, we will take English and Chinese as examples. (I know a little German and Japanese as well but ...)

    In this series we will go through wordbreakers, weighting, and other useful stuff. Because I'm at a karaoke party right now, I cannot describe everything in detail. I assume you already know how to deal with Best Bets and the "Did you mean" feature. If you want more information, please read Luca Bandinelli's multilingual whitepaper.


    The wordbreaker is the first thing to look at if you hate a search engine's results. Although word breaking in Latin-script languages (English, Spanish, French, German, Dutch...) is much easier than in other languages (Chinese, Japanese, Korean, Arabic...), it's still a boring thing to deal with.

    MOSS/MSS comes with many wordbreakers, but sometimes you may not be very satisfied with them. Is there any third-party wordbreaker you can use? Yes, some Microsoft partners have quite a long history of delivering word breaking technology for production use. For example, Hylanda is a leading Chinese word breaking technology company. They built a pretty good Chinese wordbreaker for SQL Server 2000. Since the wordbreaker interface has not changed even in MOSS/MSS, it can be used directly here.

    To change a wordbreaker, you need to do the following things.

    a. Register the wordbreaker.

    This depends on the installation manual of the wordbreaker:). But generally speaking, it should be something like this:

    regsvr32 YourWordBreaker.dll

    b. Get the GUID string of your new wordbreaker. We will need it in the next step.

    Search for your wordbreaker DLL's name in the registry, and you will find an entry under the CLSID branch. For example:


    Copy this GUID string for later use.

    c. Navigate to the registry branch for your language and replace the values with those of your wordbreaker.

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\LanguageResources\Default\YourLanguage

    WBDLLPathOverride is the path of your wordbreaker DLL. In my case, the wordbreaker is located at C:\Hylanda\HlChsBrKr.dll

    WBreakerClass is the GUID string you just got.


    Don't forget to restart the search service with net stop osearch followed by net start osearch. Then do a FULL crawl of every content source. Word breaking happens both at crawl time and at query time, so if the index was built with a different wordbreaker than the newly installed one, the query terms will not match the indexed terms and you will get bad search results.
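
    If you prefer not to edit the registry by hand, step (c) can be captured as a .reg file. The following is only a minimal sketch: the key path and the two value names come from the steps above, while the language name, the GUID, and the assumption that both values are plain string values are placeholders you need to verify against your own installation.

```python
# Sketch: build the text of a .reg file for step (c).
# The key path and value names come from the article; the language name,
# DLL path and CLSID passed in below are placeholders.

KEY_TEMPLATE = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup"
    r"\ContentIndexCommon\LanguageResources\Default\{language}"
)


def build_reg_file(language: str, dll_path: str, clsid: str) -> str:
    """Return .reg text that points one language at a new wordbreaker."""
    key = KEY_TEMPLATE.format(language=language)
    # String values in a .reg file escape every backslash as two backslashes.
    escaped_path = dll_path.replace("\\", "\\\\")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"WBDLLPathOverride"="{escaped_path}"',  # assumed to be a string value
        f'"WBreakerClass"="{clsid}"',             # assumed to be a string value
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file(
        language="YourLanguage",  # placeholder, as in the article
        dll_path=r"C:\Hylanda\HlChsBrKr.dll",
        clsid="{00000000-0000-0000-0000-000000000000}",  # hypothetical GUID
    ))
```

    Export the existing key before importing anything, so you can switch back to the original wordbreaker if the new one misbehaves.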

    Sorry, I can't post images of the search results; they are still being tested. But I can tell you the improvement is HUGE.

  • Jie Li's GeekWorld

    Federated Location Definition Files Collection for MSS


    I just put some of my work here:

    You can download and import these FLDs into your Microsoft Search Server installation. Some of them are not polished, and others may be used to show special effects. Later I will explain how they work.

    The locations are the same as in my OpenSearch post :)

    WikiPedia (en-us), WikiPedia (zh), Live Search, Live News

    Google Blog, Google Blog (zh-cn), Google News, Google News (zh-CN)

    Flickr, LinkedIn, MSDN, MSDN China, TechNet, TechNet China

    Technorati, Technorati (zh), Wired, Yahoo, Yahoo Images, Yahoo News

    Britannica, YouTube, Amazon

  • Jie Li's GeekWorld

    Improve User Experience in Enterprise Search Step By Step - Part III


    In the last two parts we discussed XSLT changes in the core search result webpart of SharePoint/Microsoft Search Server. Now some of us are facing another challenge: how do you add your custom properties to the results?

    Imagine the following scenario: in a bicycle factory, you have a product database table with several columns for each item, for example the model number of the bicycle, a link to its picture, the detailed description, the category... everything. Now you want to build a search engine so that when users search on this web page, they see the picture and the description of the bikes, not only boring model numbers...


    Scenario III. Add custom properties to search results

    Let's go over the property mapping feature of MOSS/MSS. When MOSS/MSS does a crawl, it automatically picks up all metadata through IFilters. This metadata can be anything in your documents, for example the Author field in Word documents, the Readers field in Lotus Notes databases, every column in SQL/Oracle databases (through BDC)...

    Yes, it is all metadata, and it is stored in the SQL database that MOSS/MSS uses. But this is not enough to search through it. There are too many pieces of metadata, and they have too many names. When we create a Lotus Notes view in Designer, we may use "DOC_Title" for the field that holds the title of our documents. But in HTML pages, the title is the string inside the <TITLE></TITLE> tag. When an end user wants to search for a title, he simply doesn't care about the differences between the two systems; he only wants to use something like "title:HelloWorld" to query everything. So it's our job to manually map the different types of metadata that have the same meaning to a single category. In MOSS/MSS, this category is called a managed property. Metadata that is not mapped is called an unmanaged property.

    In the settings page for managed properties, you can create custom properties, and you can also add your newly crawled metadata (properties) to existing managed properties. So you can map the "DOC_Title" field of Notes databases to the "Title" property to make sure it is displayed correctly on the search results page.
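
    To make the mapping idea concrete, here is a small sketch in Python. It is not how MOSS/MSS stores its mappings; it only illustrates that several crawled names collapse into one managed name. Only the DOC_Title to Title pair comes from the example above, and the other names are made up.

```python
# Sketch: several crawled property names map onto one managed property.
# Only DOC_Title -> Title comes from the article's example; the other
# crawled names are hypothetical.

MANAGED_PROPERTY_MAP = {
    "DOC_Title": "Title",      # Lotus Notes field from the article
    "html:title": "Title",     # hypothetical name for the HTML <TITLE> value
    "DOC_Author": "Author",    # hypothetical
}


def to_managed(crawled: dict) -> dict:
    """Translate crawled metadata into managed properties.

    Names without a mapping are dropped, which mirrors the idea that
    unmapped metadata cannot be addressed by a friendly name such as
    title:HelloWorld.
    """
    managed = {}
    for name, value in crawled.items():
        target = MANAGED_PROPERTY_MAP.get(name)
        if target is not None:
            # First value wins if two crawled names map to the same property.
            managed.setdefault(target, value)
    return managed


notes_doc = {"DOC_Title": "HelloWorld", "Readers": "opal"}
print(to_managed(notes_doc))
```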

    Now we have another problem: yes, the title is shown on the results page by default, but how can I get other properties to show up there?

    We need to modify the core search result webpart again. This time, we have to modify the XML in "Selected Columns". I can't remember the exact English name of this setting, so correct me if I'm wrong.

    <root xmlns:xsi="">
      <Columns>
        <Column Name="WorkId"/>
        <Column Name="Rank"/>
        <Column Name="Title"/>
        <Column Name="Author"/>
        <Column Name="Size"/>
        <Column Name="Path"/>
        <Column Name="Description"/>
        <Column Name="Write"/>
        <Column Name="SiteName"/>
        <Column Name="CollapsingStatus"/>
        <Column Name="HitHighlightedSummary"/>
        <Column Name="HitHighlightedProperties"/>
        <Column Name="ContentClass"/>
        <Column Name="IsDocument"/>
        <Column Name="PictureThumbnailURL"/>
      </Columns>
    </root>

    In this scenario, I created some custom properties: "xmltitle", "xmlcate", "xmlurl" and "xmlcont". We store a lot of information in XML files. Each item in the XML is a type of metadata, so I mapped them to the above four properties. Now we need to add the following before </Columns>.

    <Column Name="xmltitle"/> 
    <Column Name="xmlcate"/> 
    <Column Name="xmlurl"/>
    <Column Name="xmlcont"/>
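
    If you maintain several result webparts, the same change can be scripted. This is only a sketch using Python's standard xml.etree.ElementTree on a shortened copy of the Selected Columns XML; how you read the XML out of the webpart and write it back is left out on purpose.

```python
import xml.etree.ElementTree as ET

# A shortened Selected Columns document; the real one lists more columns.
SELECTED_COLUMNS = """<root>
  <Columns>
    <Column Name="Title"/>
    <Column Name="Write"/>
  </Columns>
</root>"""


def add_columns(xml_text: str, names: list) -> str:
    """Append a <Column Name="..."/> for every name not already present."""
    root = ET.fromstring(xml_text)
    columns = root.find("Columns")
    existing = {col.get("Name") for col in columns.findall("Column")}
    for name in names:
        if name not in existing:
            ET.SubElement(columns, "Column", {"Name": name})
            existing.add(name)
    return ET.tostring(root, encoding="unicode")


updated = add_columns(SELECTED_COLUMNS, ["xmltitle", "xmlcate", "xmlurl", "xmlcont"])
print(updated)
```

    Running it twice does no harm, because names that are already listed are skipped.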

    Now our search will return these four properties in the result XML. But they are not displayed by default; we need to modify the XSLT to show them. Open the XSL Editor and insert the following after the call-template that uses select="write".

    <xsl:call-template name="DisplayString">
          <xsl:with-param name="str" select="xmltitle" />
    </xsl:call-template>

    Do the same thing for each of the properties you want to show, and add some effects to them: bold font, font size, string processing... We already talked about that in Part I and II. You are smart enough to show a picture in your results by modifying the XSLT, aren't you?

    Original XML file:


    Search result displayed:


    You may have already noticed there's a "Rank" property in our Selected Columns XML. What's that?

    The answer is simple: it is the relevancy rank. By default, results are sorted by relevancy, and these rank values determine the order.

    Here's a result table returned from the query web service. You can see that the higher the rank, the higher the item is placed. In another article we will discuss how this rank value is calculated.


    Btw, although I wrote these articles, I still hope that someone will develop a much simpler way to modify the XSLT for search results. Maybe a GUI-based program? It's still boring to deal with XSL/XML if you are not a programmer.

    That's all for today:)

  • Jie Li's GeekWorld

    Improve User Experience in Enterprise Search Step By Step - Part II


    In Part I we discussed how to change the style of the results. But this is not enough; what about showing other things in the result items? (This is a request from

    Scenario II. Show me the path!

    When you search for something, an XML result table is returned and transformed by XSLT to produce the final result. It contains different properties that you can use, for example:

       <title>Hello World!</title>

    So we can get the url value from this table and show it.

    How do we show the path file://superman/opal ? If I only want the path part of the URL and not the file, "myresume.doc" should be cut off. XSL is not good at string processing, but luckily some people have already published examples that we can use.

    In this example, we used a template in

    Insert the following template into the core result webpart XSL.

    <xsl:template name="substring-before-last">
      <xsl:param name="text"/>
      <xsl:param name="chars"/>
      <xsl:choose>
        <xsl:when test="string-length($text) = 0"/>
        <xsl:when test="string-length($chars) = 0">
          <xsl:value-of select="$text"/>
        </xsl:when>
        <xsl:when test="contains($text, $chars)">
          <xsl:call-template name="substring-before-last-aux">
            <xsl:with-param name="text" select="$text"/>
            <xsl:with-param name="chars" select="$chars"/>
          </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$text"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

    <xsl:template name="substring-before-last-aux">
      <xsl:param name="text"/>
      <xsl:param name="chars"/>
      <xsl:choose>
        <xsl:when test="string-length($text) = 0"/>
        <xsl:when test="contains($text, $chars)">
          <!-- Recurse on everything after the first separator. -->
          <xsl:variable name="after">
            <xsl:call-template name="substring-before-last-aux">
              <xsl:with-param name="text" select="substring-after($text, $chars)"/>
              <xsl:with-param name="chars" select="$chars"/>
            </xsl:call-template>
          </xsl:variable>
          <xsl:value-of select="substring-before($text, $chars)"/>
          <xsl:if test="string-length($after) &gt; 0">
            <xsl:value-of select="$chars"/>
            <xsl:copy-of select="$after"/>
          </xsl:if>
        </xsl:when>
        <xsl:otherwise/>
      </xsl:choose>
    </xsl:template>

    This adds a "substring-before-last" template to your XSL. Then search for <xsl:with-param name="str" select="write" /> and insert the following after the </xsl:call-template> that closes it.

    <xsl:variable name="urlpath">
         <xsl:call-template name="substring-before-last">
             <xsl:with-param name="text" select="url" />
             <xsl:with-param name="chars" select="'/'" />
         </xsl:call-template>
    </xsl:variable>
    <a href="{$urlpath}"  title="{$urlpath}"><xsl:value-of select="$urlpath"/></a>

    This uses substring-before-last to cut strings like file://superman/opal/myresume.doc down to file://superman/opal .
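
    If the recursion in XSLT is hard to follow, the same idea takes only a few lines in Python. This is just a sketch of what the template is meant to compute, handy for checking expected output before you paste the XSL into the webpart. It is not a line-by-line port, so unusual inputs (for example a path with two separators in a row right before the file name) may come out differently.

```python
def substring_before_last(text: str, chars: str) -> str:
    """Return everything before the last occurrence of chars in text."""
    if not text:
        return ""
    if not chars or chars not in text:
        # Nothing to cut: return the text unchanged, like the XSLT does.
        return text
    head, _separator, _tail = text.rpartition(chars)
    return head


print(substring_before_last("file://superman/opal/myresume.doc", "/"))
# file://superman/opal
```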

    This part of XSLT now looks like this:


    Apply the setting, and you will see the following result:


    You can try different things in XSLT; it's very interesting.

  • Jie Li's GeekWorld

    Improve User Experience in Enterprise Search Step By Step - Part I


    Here comes the first part of my "improve user experience" series. In this part I'll cover XSLT tricks for the core result webpart in the SharePoint family (WSS, MOSS, MSS, MSSE). If you are an XSLT guru, just skip this article; I'm sure many people can do such things better than me. But if you are not very familiar with it, follow me and you will have a quick start.

    When you search for something on the Internet, most of the time you only want a simple description in the search results. But when you search for something on an intranet, you may want the results customized for different content sources. For example, some departments rely heavily on custom metadata to tag their documents, and this metadata should be displayed as part of the search result. What should you do? Or you don't like the default appearance of SharePoint search results. How can you change it?

    The answer is the XSLT in the core search result webpart. As its name says, XSLT transforms XML. Using XSLT, you can pick out different parts of the XML and combine them into a different appearance.  Although you can edit XSLT with any text editor, or even the dialog inside the webpart settings, I suggest you use an editor that can highlight the syntax and check whether the file is well-formed.  Visual Studio is a good editor, but if you don't want that huge monster, you may want to try EmEditor and install the XSLT syntax file. The free version is enough, and it's pretty fast compared to other editors.

    Scenario 1. Change hit highlight appearance in search result by XSL


    This is the default appearance of a search result. Because "SharePoint" is the keyword, it is highlighted in BOLD. But maybe you want to change the style: make it red, blue, or even white text on a blue background! Such customization is easy, but you need to know where to do it...

    First, open the settings panel of the core search result webpart. Click the XSL Editor button, and you will see this dialog.


    Copy the content of the XSL to your favorite editor. Search for <xsl:template match="c0">; you will find it at about line 199. This part of the XSL looks like this:

    <xsl:template match="c0">
        <b><xsl:value-of select="."/></b>
    </xsl:template>

    <b></b> means this part of the string is shown in bold. Say we also want it to be italic. Add <i></i> to the XSL, and it changes to:

    <xsl:template match="c0">
        <b><i><xsl:value-of select="."/></i></b>
    </xsl:template>

    Save it back to the webpart and apply the settings, and you will see the result:


    Piece of cake. Yeah, this is easy, but what does "c0" mean? And what are the others, "c1", "c2"...?

    Do a search for "SharePoint resources", and you will find that although "SharePoint" is italic, "resources" is not.


    That's why there are so many of them: c0, c1, c2... Each one represents a word. If you change c1, this affects the second word. If you are searching in languages like Chinese, Korean and Japanese, which do not use spaces to separate words, the query is broken into several parts by a wordbreaker. In this case, c0 represents the first part and c1 the second part. So we can modify each of them to show a different style. You can use <u></u> for underlining too.

    Well, it's boring that we only have black and white here. So now we want to change the color.

    <xsl:template match="c0">
        <b style="background-color:#ffff00;color:#ff0033"><i><xsl:value-of select="."/></i></b>
    </xsl:template>

    color:#ff0033 sets the font color to red. background-color:#ffff00 sets the background of this part of the sentence to yellow. So it will show the following style.
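
    Editing every cN template by hand gets tedious if you want the same style for all highlighted words. Here is a small sketch that prints the templates for you to paste into the XSL. The default of ten slots (c0 to c9) is my assumption; check your own XSL to see how many it actually defines.

```python
def highlight_templates(style: str, count: int = 10) -> str:
    """Build one <xsl:template match="cN"> per highlighted-term slot.

    count defaults to 10 (c0..c9), which is an assumption to verify
    against the XSL shipped with your webpart.
    """
    blocks = []
    for n in range(count):
        blocks.append(
            f'<xsl:template match="c{n}">\n'
            f'    <b style="{style}"><xsl:value-of select="."/></b>\n'
            f'</xsl:template>'
        )
    return "\n\n".join(blocks)


print(highlight_templates("background-color:#ffff00;color:#ff0033", count=3))
```

    Replace the existing c0, c1, ... templates with the generated ones rather than adding them next to the old ones, so each match pattern is defined only once.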


  • Jie Li's GeekWorld

    When, why and how to deal with Custom Security Trimmer in Enterprise Search? - Part II


    In part II we will go through the code a little deeper.

    Check permission against different systems

    Please open this page,

    Look at this part of the code.

    for (int x = 0; x < crawlURLs.Count; x++)
    {
        // To fully implement the security trimmer, add code to perform the security check
        // and determine if strUser can access crawlURLs[x].
        // If strUser can access crawlURLs[x], then:
        retArray[x] = true;
        // If not:
        // retArray[x] = false;
    }

    A quite simple explanation. But how do you actually do it?

    1. Use WindowsIdentity.GetCurrent().Name to get the current username, or if you are using FBA, use HttpContext.Current.User.Identity.Name.

    2. Then use this username to check with the target system. If the user has permission to access crawlURLs[x], return true.

    Different systems have different ways of checking security. Here are some suggested ways to check security:

    Content source | Method
    Web sites with SQL Server in the back end | Use System.Data.SqlClient directly to query the database and get the permission.
    Web sites with Oracle in the back end | System.Data.OracleClient. You must install the Oracle client first.
    Web sites with DB2 in the back end | DB2 .NET Data Provider
    Web sites with MySQL in the back end | MySQLDriverCS or MySQL Connector/NET
    File share | File.GetAccessControl. SharePoint already has a built-in security trimming function for file shares, so it would be very uncommon to need a CST in this scenario. But be aware: if you want extra security trimming for file shares, the built-in security trimmer (the one applied at query time that we talked about in Part I) is applied first, and there is no way to replace it. And if you are using FBA, which means the current identity changes from a Windows user to an HttpContext user, you will get nothing in your search results unless the file share is a public one.
    Lotus Notes | Lotus Domino Objects, a COM object that can be used from other languages

    If you want better performance when a CST is applied...

    I suggest that you cache the permission settings on your own box and check them in the CST. Remote calls can have a huge impact on performance, especially with Lotus Notes. Meanwhile, checking security against a remote machine also puts load on the target system. If that system is critical, this will affect the customer's business.

    The caching can be done with some small tools. For example, you can write a small application that uses Lotus Domino Objects to pull all the Notes ACLs into a SQL table; the approach is up to you.

    Another important thing is to set a CheckLimit in your CST. When the CheckLimit is reached, the CST reports something back to the user, or does whatever you defined, and stops checking. The message can be something like "too many results, please refine your keywords" or "Please try keyword1+keyword2+keyword3"... That will help.
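
    The two ideas above, a local permission cache and a check limit, fit in a few lines. The sketch below is plain Python and only models the logic; a real trimmer is a .NET class built against the SDK, and every name here (check_access, acl_cache, CheckLimitReached) is made up for illustration.

```python
class CheckLimitReached(Exception):
    """Raised when too many items were checked; the caller shows a hint."""


def check_access(user, crawl_urls, acl_cache, check_limit, checked_so_far=0):
    """Return one boolean per URL, reading permissions from a local cache.

    acl_cache maps a URL to the set of users allowed to read it. It stands
    in for the locally cached permission table, so no remote call is made.
    """
    if checked_so_far + len(crawl_urls) > check_limit:
        raise CheckLimitReached("Too many results, please refine your keywords.")
    # URLs missing from the cache are treated as not readable.
    return [user in acl_cache.get(url, set()) for url in crawl_urls]


cache = {
    "http://localhost:8100/a": {"opal"},
    "http://localhost:8100/b": set(),
}
urls = ["http://localhost:8100/a", "http://localhost:8100/b", "http://localhost:8100/c"]
print(check_access("opal", urls, cache, check_limit=200))
# [True, False, False]
```

    Denying access when a URL is missing from the cache is a deliberate choice here: a stale cache should hide a document rather than leak it.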

    Register a custom security trimmer

    The trimmer must be compiled with a strong name. You must first install it into the global assembly cache (GAC) with the following command (there are some errors in the SDK):

    C:\Program Files\Microsoft Visual Studio 8\SDK\v2.0\Bin>gacutil.exe /i c:\Trimmer\CustomSecurityTrimmerSample.dll /f

    C:\Trimmer\CustomSecurityTrimmerSample.dll is my trimmer's path; change it to your own.

    A very important step: create an "include" crawl rule for the URL you want to bind this CST to. If you don't create it, you cannot deploy the trimmer. In this sample, the path is http://localhost:8100/*.

    Then deploy it with stsadm:

    C:\Program Files\Common Files\Microsoft Shared\web server extensions\12\BIN>stsadm -o registersecuritytrimmer -ssp SharedServices1 -id 2 -typeName "CustomSecurityTrimmerSample.CustomSecurityTrimmer, CustomSecurityTrimmerSample, Version=, Culture=neutral, PublicKeyToken=b6c7fa67516b1230" -rulepath http://localhost:8100/*

    PublicKeyToken is the token you can see in the windows\assembly directory. rulepath is the crawl rule path you just created.

    And don't forget iisreset. After that, whenever a search result matches the crawl rule, the CST is launched to check the permission.

  • Jie Li's GeekWorld

    Microsoft Search Server 2008 Express RC go live!


    If you are interested in Enterprise Search, now is your chance to take it for a test drive on your own box!

    After several months' work, we launched Microsoft Search Server Express to the public :). Download it here:

    One of the most important enhancements over MOSSfS is the federated search function. That's why I talked about OpenSearch in my earlier post, Some OpenSearch sites for reference. You can use those RSS feeds to integrate extranet search engines and other fun applications. I'm the guy who created most of the federation connectors on the official site, so if you have any questions, you can ask on our Search Server forum and I'll give you the answer asap :)

    Here's a sample I created that integrates different image/video sites such as Flickr Tag Search, Yahoo Image Search and YouTube.


    Another sample: search for Lotus Notes, and show Lotus Notes experts from LinkedIn at the same time.


  • Jie Li's GeekWorld

    When, why and how to deal with Custom Security Trimmer in Enterprise Search? - Part I


    First of all, security trimming is very important in Enterprise Search. Users who have no rights to a document should not see its description in their search results. They should not be aware of those items at all.

    When building Enterprise Search solutions with Microsoft Office SharePoint Server 2007 (MOSS), you will find that MOSS supports security trimming for file shares, SharePoint and Lotus Notes out of the box. This means the protocol handler picks up the ACL at index time, and security trimming is applied at query time. The query behaves like this SQL statement:

    SELECT * from scope() where freetext("keyword") and YourCurrentUserRight="True"  (This is not the real sql sentence, just to give you an idea)

    So query performance will not be impacted.

    But what about other stuff like websites, databases, or a custom content source?

    Custom Security Trimmer (CST) is used in MOSS to support security trimming for such things. The behavior of a CST is quite different from the built-in security trimmer. It also runs at query time, but because the YourCurrentUserRight value is not in the index, the CST accesses the target system to retrieve this value after the search results come out. It checks the items one by one. For example, say there are 4,000 items in the result for "jokes", but you can only access 100 of them. The process changes to:

    1. Do a search for "jokes". This is like SELECT * from scope() where freetext("keyword"), with no security trimming applied. 4,000 items are returned in the result. This is not displayed to the user.

    2. Because a CST is registered with "crawl rules" (this is one of the worst names I have ever seen at Microsoft; wth, a rule applied at query time is called a CRAWL RULE?), if the path of an item matches the rule, the CST is launched to check whether the current user has permission to read this item. If so, the CST reports "True". Note that multiple CST instances are launched at the same time to check different items, and it seems you cannot control this number. I think it's around 4-5.

    3. Once enough items for one page have been reported as "True", for example 10 items after checking about 200, the first page of results is displayed.

    Let's do some basic calculation. What happens if a bad CST is applied? The key point is how much time the CST needs to check a permission. If a CST needs one second to check one item and 4 CSTs run in parallel, checking 200 items takes 50 seconds. This means you have to wait 50 seconds for the search results to show up in your browser!
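
    The calculation is easy to replay with your own numbers. This is a back-of-the-envelope sketch that assumes the checks are shared evenly between the parallel trimmers and that nothing else adds delay.

```python
import math


def first_page_wait(items_to_check: int, seconds_per_check: float,
                    parallel_trimmers: int) -> float:
    """Seconds until all items are checked, assuming evenly shared work."""
    rounds = math.ceil(items_to_check / parallel_trimmers)
    return rounds * seconds_per_check


print(first_page_wait(200, 1.0, 4))  # 50.0
```

    Cutting the per-item check time has a direct effect: at 0.05 seconds per check, the same 200 items take 2.5 seconds instead of 50.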

    Terrible, right? Even worse, if you have 100,000 items in a result array and you only have permission for 4 of them, the search service will crash because of the timeout.

    So that's why a CheckLimit is also needed in the implementation of a CST.

    Now, the best practices when you create a CST:

    1. Reduce the time needed to check the permission. You can use some tricks to make it faster, for example storing the permission mapping in a local SQL table first and having the CST check the local table instead of the remote system, so you avoid the network delay.

    2. Set CheckLimit correctly. Return a more user-friendly message when the limit is reached.

    To implement a CST, you can refer to the SDK, or these articles:

    These are not very good articles; some of the information is misleading. As far as I know, some much better articles are on the way, but I don't know exactly when they will be published on MSDN.  Later, in another post, I will go through the code to explain more.

  • Jie Li's GeekWorld

    Some OpenSearch sites for reference


    Do you know OpenSearch? It is an open standard for search engines that came out of Amazon.

    Why does it matter? Never mind all the boring specification documents; generally speaking, this is a standard that requires search engines to return an RSS/Atom feed. So you can use your favorite feed reader and get the best answers just in time.

    Meanwhile, a standard RSS feed allows you to integrate different search engine results into your own application. For example, you have an enterprise search engine inside your company, but you want to integrate some search results from outside. An iframe is okay but not very good.  With XML+XSLT, you can integrate them very easily. This is not limited to search engines; you can try different ideas: Yahoo Image Search, Google News Search, Live Spaces Search, LinkedIn expert search... anything.
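
    To show how little code it takes once results arrive as a feed, here is a minimal sketch that pulls the title, link and description out of an RSS 2.0 document with Python's standard library. The feed content is invented sample data; a real application would download it from the search engine first.

```python
import xml.etree.ElementTree as ET

# A tiny RSS 2.0 document standing in for a real search feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Results for sharepoint</title>
    <item>
      <title>First hit</title>
      <link>http://example.com/1</link>
      <description>Summary of the first hit</description>
    </item>
    <item>
      <title>Second hit</title>
      <link>http://example.com/2</link>
      <description>Summary of the second hit</description>
    </item>
  </channel>
</rss>"""


def parse_results(feed_xml: str) -> list:
    """Return a list of {title, link, description} dicts, one per <item>."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        results.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return results


for hit in parse_results(SAMPLE_FEED):
    print(hit["title"], hit["link"])
```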

    Well, what if a site does not return an RSS feed? Wikipedia, LinkedIn: they don't have anything. Oh, Wikipedia says it supports OpenSearch, but it only supports part of the standard. What it returns can only be used in IE7/Firefox live search. has a very useful feature to solve this problem. It can return an RSS feed for any results. To give Wikipedia an RSS feed, just use{searchTerms}

    And that's all.

    Here are some of the sites that can be used with an RSS feed:

    WikiPedia (en-us){searchTerms}
    WikiPedia (zh){searchTerms}
    Live Search{searchTerms}&format=rss
    Live News{searchTerms}&amp;format=rss
    Google Blog;q={searchTerms}&amp;lr=&amp;ie=utf-8&amp;num=10&amp;output=rss
    Google Blog (zh-cn);q={searchTerms}&amp;lr=&amp;ie=utf-8&amp;num=10&amp;output=rss
    Google News;ned=us&amp;ie=UTF-8&amp;q={searchTerms}&amp;output=rss
    Google News (zh-CN);ned=cn&amp;q={searchTerms}&amp;ie=UTF-8&amp;output=rss
    MSDN China;query={searchTerms}&amp;lang=zh-cn&amp;feed=rss
    TechNet China;query={searchTerms}&amp;lang=zh-cn&amp;feed=rss
    Technorati (zh){searchTerms}?language=zh
    Yahoo Images;query={searchTerms}&amp;adult_ok=0
    Yahoo News{searchTerms}&amp;ei=UTF-8
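
    Every entry above is a URL template in which {searchTerms} is replaced by the user's query. Here is a small sketch of that substitution in Python; the example.com template is hypothetical and only stands in for a real one.

```python
from urllib.parse import quote


def build_search_url(template: str, terms: str) -> str:
    """Substitute URL-encoded terms for the OpenSearch {searchTerms} token."""
    return template.replace("{searchTerms}", quote(terms, safe=""))


# Hypothetical template, not one of the real sites listed above.
TEMPLATE = "http://example.com/search?q={searchTerms}&format=rss"
print(build_search_url(TEMPLATE, "lotus notes"))
# http://example.com/search?q=lotus%20notes&format=rss
```

    Note that some templates in the list use &amp;amp; because they were copied from XML files. When you use such a URL outside XML, write a plain & instead.
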
  • Jie Li's GeekWorld

    Hi there!


    So I finally created this blog, more than a year after I registered here.

    Let me introduce myself first. My name is Jie Li, and I am now a Technology Specialist at Microsoft China. The technology I focus on here is Enterprise Search. But I may talk about SharePoint, Lotus Domino/Notes, Linux/Unix, Windows Mobile and nearly everything else.

    You may ask: what kind of environments have you worked with? Well, I was a "consultant" at an IT consulting company before I joined Microsoft. So I dealt with Linux (mainly RHCE/CentOS and Debian), Solaris 8/9, Lotus Notes/Domino R5/R6 and R8 (sorry, no R7), Citrix MetaFrame, Windows NT 3.5 - 2003, Oracle 9/10, SQL Server 2000/2005, MySQL 3-5, VMware, Virtual PC/Virtual Server/Windows Server Virtualization. The programming languages I have used include ASM, C/C++, ASP/ASP.NET, C#/VB, Pascal/Delphi, HTML/JS/XML/XSLT...

    So that's a lot of different things. I'm not an expert in everything, but this experience really helps when I want to integrate different technologies and make things interoperate with each other.

    And hey, although I don't like GPLv3, I love the words Open Source. This didn't change even after I joined Microsoft. You can check some of my work, for example

    Why is my nickname "opal"? This is actually the name I have used for my first website since 1999. Many people call me by that name in the communities, and paradiso is another nickname of mine.
