February, 2008

  • Jie Li's GeekWorld

    Build Custom Federated Search Connector in Microsoft Search Server (and SharePoint) - Solve Problems and Extend Your Ideas

    I assume readers of this article already understand what federated search is. As you know, to use the Federated Search web part in Search Server, you need to give it an RSS feed, the kind of output also known as "OpenSearch".

    But not every application you search returns this kind of RSS/Atom feed. Google, Baidu and many other web sites do not. So how can you federate search results from sites like these?

    http://msdn2.microsoft.com/en-us/library/bb931083.aspx

    Scenario 2: Connecting to an External Search Site That Returns Results in HTML Format

    Scenario background: The site is configured to use Anonymous access.

    Possible solution: Use a Web application outside of the context of a SharePoint site, which contains a lightweight ASPX page that does the following:

    1. Submits a search request to the site by using the search terms passed in the initial request URL.

    2. Converts the results in the HTML response received from the external search site to RSS format.

    3. Returns the RSS XML in the response to the search server.

    In this scenario, the federated connector’s Web application could be located on a remote server; however, a simpler solution is to create the Web application within the _layouts folder for the SharePoint site. For more information about creating this type of Web application, see How to: Modify Configuration Settings for an Application to Coexist with Windows SharePoint Services.

    In a variation for this federated connector solution, you can add support for multiple external search sites by modifying the ASPX page to include details for more than one site within a case statement. The query template specified for these locations could then include a custom parameter that specifies which site in the case statement receives the federated query. Another variation is to combine the results for multiple external search providers, incorporating logic to order the results based on relevance.
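
    The multi-site variation can be sketched roughly like this. It is only a sketch: the class and method names, the "site" key values and the second URL (search.example.com) are made up for illustration, and the Baidu URL pattern is the one used later in this post.

```csharp
using System;

public static class FederatedTargets
{
    // Maps a hypothetical "site" parameter from the query template to the
    // search URL of one external site. The keys are illustrative assumptions.
    public static string BuildSearchUrl(string site, string encodedTerms)
    {
        switch (site)
        {
            case "baidu":
                return "http://www.baidu.com/s?wd=" + encodedTerms;
            case "example":
                // Placeholder site, not a real search service.
                return "http://search.example.com/find?q=" + encodedTerms;
            default:
                throw new ArgumentException("Unknown site: " + site, "site");
        }
    }

    public static void Main()
    {
        Console.WriteLine(BuildSearchUrl("baidu", "sharepoint"));
    }
}
```

    The query template for each federated location would then carry the extra parameter, something like Search.aspx?site=baidu&q={searchTerms} (again, a made-up page and parameter name).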

    Some people have already done a nice job of this, for example Andrew Woodward:

    http://www.21apps.com/2008/01/search-server-2008-federated-sites-that.html

    I would like to go a little further. Here I take Baidu as an example. Baidu is the biggest Internet search engine in China. (Google China? God knows where they are. Baidu has introduced many interesting applications that Chinese users love. Google China, on the other hand, is famous only for stealing the input method dictionary of another major Internet company, SOHU, to build its own Pinyin input method. After this was exposed to the public, they offered a not-so-honest "apology" and said two interns had done it. Perfect: this later became a popular phrase in China. Anyone caught doing evil things now says it was the fault of an intern or a temporary employee. What a shame for this "not to be evil" company. A little off topic.)

    Baidu.com does not return any RSS feed. What's more, it uses the GB2312 encoding for its results. So if you run a regex directly over Baidu's output, you will capture nothing but meaningless squares.

    ASP.NET's Request.QueryString also has a limitation here: it decodes the values for you and cannot correctly process GB2312 encoding. So the Page_Load method must be changed to the following code:

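    // Raw, still percent-encoded search terms taken from the request URL.
    private string query;

    protected void Page_Load(object sender, EventArgs e)
    {
        if (Request.QueryString["q"] != null)
        {
            // Request.Url.Query is the raw query string, e.g. "?q=%E7%8C%AB".
            // Strip the leading "?q=" (3 characters). This assumes q is the
            // first and only parameter in the URL.
            query = Request.Url.Query.Remove(0, 3);
        }
    }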

    This way the raw query string is preserved, so you can decode and re-encode it yourself. If you use QueryString, it stupidly decodes the value for you with the wrong character set, and the result is a disaster. Stupid, stupid, stupid. I want to slap the guy who wrote this method. Does he know there is more than English in this world?

    For example, my nickname, opal, is 猫眼石 in Chinese. When queried from IE, it is encoded as UTF-8. But Baidu can only consume GB2312.

    In UTF-8, 猫眼石 is %E7%8C%AB%E7%9C%BC%E7%9F%B3.

    In GB2312, 猫眼石 is %C3%A8%D1%DB%CA%AF.

    They are quite different. If you search for %E7%8C%AB%E7%9C%BC%E7%9F%B3 and it is treated as a GB2312 string, the nine bytes become 4.5 Chinese characters (GB2312 uses two bytes per character), and none of them make sense.
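
    To see the difference for yourself, here is a small sketch that percent-encodes the same word in both encodings. PercentEncode and Gb2312 are helper names made up for this illustration, and the RegisterProvider call assumes a modern .NET runtime (.NET 5 or later), where code page encodings such as GB2312 are not registered by default.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingDemo
{
    // Percent-encodes every byte of the text in the given encoding,
    // using upper-case hex like the examples in this post.
    public static string PercentEncode(string text, Encoding encoding)
    {
        return string.Concat(encoding.GetBytes(text).Select(b => "%" + b.ToString("X2")));
    }

    // Assumption: on modern .NET the code pages provider must be registered
    // before GetEncoding("gb2312") succeeds.
    public static Encoding Gb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    public static void Main()
    {
        string word = "猫眼石";
        Console.WriteLine(PercentEncode(word, Encoding.UTF8)); // 9 bytes, 3 per character
        Console.WriteLine(PercentEncode(word, Gb2312()));      // 6 bytes, 2 per character

        // Misreading the 9 UTF-8 bytes as GB2312 yields unrelated characters.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(word);
        Console.WriteLine(Gb2312().GetString(utf8Bytes));
    }
}
```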

    Okay, complain less, do more. Next we need to decode the query string.

    // Requires: using System.Net; using System.Text;
    //           using System.Text.RegularExpressions; using System.Web;
    private string getRssItemXml(string query)
    {
        // First decode the terms as UTF-8, because when IE accesses a UTF-8 based web site it passes the terms UTF-8 encoded.
        // Of course, you could modify web.config to make this application use GB2312, but that doesn't make sense.
        query = HttpUtility.UrlDecode(query, Encoding.UTF8);
        // Then re-encode the terms as GB2312. Baidu can only consume that.
        query = HttpUtility.UrlEncode(query, Encoding.GetEncoding("gb2312"));
        string url = string.Format("http://www.baidu.com/s?wd={0}", query);

        byte[] byteData;
        using (WebClient client = new WebClient())
        {
            byteData = client.DownloadData(url);
        }
        // The returned results are also in GB2312, so decode the raw bytes explicitly.
        string strData = Encoding.GetEncoding("gb2312").GetString(byteData);
        // Captures link, title and description of each hit from Baidu's result HTML.
        Regex searchPattern = new Regex("\\)\" href=\"(?<link>.*?)\" target=\"_blank\"><font size=\"3\">(?<title>.*?)</font></a><br>(?<desc>.*?)<br>");
        StringBuilder sb = new StringBuilder();

        foreach (Match m in searchPattern.Matches(strData))
        {
            sb.AppendFormat("<item><title><![CDATA[{0}]]></title><link><![CDATA[{1}]]></link><description><![CDATA[{2}]]></description></item>", m.Groups["title"].Value, m.Groups["link"].Value, m.Groups["desc"].Value);
        }

        return sb.ToString();
    }
     

    Then put this aspx file on a web site and point your federated search web part at it, with a query template like http://www.abc.com/Baidu.aspx?q={searchTerms}, and you get Baidu federated search in Microsoft Search Server 2008.
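
    Note that getRssItemXml only returns the <item> elements, so the page still has to wrap them in a channel before writing the response. A minimal sketch of such a wrapper, assuming RSS 2.0 and made-up names (RssEnvelope, Wrap); the channel title, link and description are whatever you choose:

```csharp
using System;
using System.Security;
using System.Text;

public static class RssEnvelope
{
    // Wraps already-built <item> elements in a minimal RSS 2.0 document.
    // The channel fields are XML-escaped; itemsXml is inserted unchanged.
    public static string Wrap(string title, string link, string description, string itemsXml)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
        sb.Append("<rss version=\"2.0\"><channel>");
        sb.AppendFormat("<title>{0}</title>", SecurityElement.Escape(title));
        sb.AppendFormat("<link>{0}</link>", SecurityElement.Escape(link));
        sb.AppendFormat("<description>{0}</description>", SecurityElement.Escape(description));
        sb.Append(itemsXml);
        sb.Append("</channel></rss>");
        return sb.ToString();
    }

    public static void Main()
    {
        // Sample item in the same shape that getRssItemXml produces.
        string items = "<item><title><![CDATA[Example]]></title>"
            + "<link><![CDATA[http://www.example.com/]]></link>"
            + "<description><![CDATA[An example hit]]></description></item>";
        Console.WriteLine(Wrap("Baidu results", "http://www.baidu.com/", "Federated results from Baidu", items));
    }
}
```

    In the page itself you would set Response.ContentType to "text/xml" and write the wrapped string to the response.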

    I put part of my work here:

    http://cid-8007edf5c56fc334.skydrive.live.com/self.aspx/Microsoft%20Search%20Server/CaptureWeb.rar

    It contains:

    Baidu Federated Search Web Service
    Baidu News Federated Search Web Service
    iCiba (English-Chinese Dictionary) Federated Search Web Service
    Dictionary.com Federated Search Web Service

    Yes! You can put dictionaries on your federated search page, so anybody who searches for a word gets its meaning immediately! You can also set up triggers so that this happens only for numbers, or only for certain characters, and so on.

  • Jie Li's GeekWorld

    Don't compare things that are not on one level

    These days we hear a lot of interesting talk about "killers" in newspapers and online media. They talk about the "Office killer": maybe OpenOffice, maybe Lotus Symphony (I worked with Lotus Notes for years, and compared with Notes I think it is really a shame to have this stupid Symphony suite under the Lotus brand). They even hold up Google Docs as an example. Okay, enough.

    This time it's Google Sites! "Another SharePoint killer!" the media cried.

    What then? After people got their hands on it, they realized the truth. The truth hurts, but it's the truth. Do a Yahoo or Live search yourself and see what people are saying.

    There may be some areas where this application finds a place. But I still don't like this PR talk. Yes, you can run foolish advertisements to spread FUD, like an idiot jumping around crying "Mac is safer", but don't be lame all the time.

  • Jie Li's GeekWorld

    Okay...Here comes Smart Search bug fix and installation guide...

    It's my fault for ignoring this for so long. It was not originally written by me but by a colleague of mine, Gang Chen. He asked me to do him a favor and create a project on CodePlex, so now I maintain it.

    http://www.codeplex.com/smartsearch

    And finally, I fixed the foolish bug and replaced the Chinese characters with English words. I spent a whole afternoon getting it installed on my WSS + Search Server Express box. Everybody can try it here.

    http://www.mssearch.cn:5000/Search/results.aspx?k=lotus

    An installation guide is also there. I think an experienced SharePoint user can install it in 10 to 30 minutes.

    It is quite interesting that the installation guide is also the 11111th release on Codeplex. :)

    The reason I spent so much time is that WSS uses Windows Internal Database. This code needs to create a separate SQL table in the content DB, so of course it failed. To work around the problem, you need to manually modify the code and point it to a SQL Server (Express) instance.

  • Jie Li's GeekWorld

    Try Microsoft Search Server Express for Federated Search

    I just installed Search Server Express on my public box, so everyone can try it with just a few clicks! If you don't have time to install one yourself, now is your chance to get your hands on it.

     

    Play with default interface:

    http://www.mssearch.cn:5000/Search/default.aspx

    Example:

    http://www.mssearch.cn:5000/Search/results.aspx?k=lotus

     

    Play with Yahoo Image, Youtube and Flickr tag search:

    http://www.mssearch.cn:5000/Search/Sandbox/media.aspx

    Example:

    http://www.mssearch.cn:5000/Search/Sandbox/media.aspx?k=%E5%A4%A9%E5%9D%9B

     

    I will add more fun stuff later...

    And don't forget to check here for more information.

    http://www.microsoft.com/enterprisesearch

  • Jie Li's GeekWorld

    The Greatness of Interoperability

    Today we talk a lot about openness and interoperability at Microsoft. Most people probably could not have imagined this a few years ago.

    I remember that when I joined Microsoft, I talked a lot about how "we should be interoperable with other software; we need to know it more deeply than we do now". Then a senior colleague challenged me and nearly shouted at me. She said I was crazy and foolish, and that I should be "shot" immediately.

    Time goes really fast, and things have changed a lot. Now we have open source licenses, we have opened our .NET Framework to developers, we have opened many APIs and communication protocols, we have opened Office Open XML and the binary formats, we have CodePlex, we have Linux and open source labs, and we emphasize interoperability wherever we can... Should I still be shot today? Isn't this what we are aiming for?

    Software+Services is a good story, but its foundation is openness and interoperability. If we really want to change the way we do business, if we really want to innovate more and more, if we really want to gather the community to improve ideas from the grass roots... we definitely need to be OPEN. This is no longer the time when we were not allowed to say so out loud. You can see that we talk about openness in nearly every important meeting.

    I think that's why people like me, who have worked on other technologies and software such as Linux and Lotus Notes, are joining Microsoft. Gradually you will see more and more people bring their interoperability experience to Microsoft. And this company will change itself to become more OPEN.

    This is only the beginning.

  • Jie Li's GeekWorld

    Real World Lotus Notes Index Result in SharePoint 2007

    Crawled Notes data: 1240 Gbytes

    Crawled Notes Items: 2.7 Million

    SharePoint Index: 70 Gbytes

    SharePoint Search DB: 120 Gbytes

    Data vs Index = 124:7 ≈ 18:1, so the Index is about 6% of the Data.

    Index vs Search DB = 7:12, so the Search DB is about 171% of the Index.

    In total, (Search DB + Index) / Data = 190/1240 ≈ 15%.
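
    If you want to run the same arithmetic on your own crawl numbers, here is a quick sketch. The class and method names are made up; the three sizes are the figures listed above.

```csharp
using System;

public static class IndexRatios
{
    // Returns part as a percentage of whole, rounded to one decimal place.
    public static double Percent(double part, double whole)
    {
        return Math.Round(100.0 * part / whole, 1);
    }

    public static void Main()
    {
        double dataGb = 1240;     // crawled Notes data
        double indexGb = 70;      // SharePoint index
        double searchDbGb = 120;  // SharePoint search DB

        Console.WriteLine("Index / Data:      " + Percent(indexGb, dataGb) + "%");
        Console.WriteLine("Search DB / Index: " + Percent(searchDbGb, indexGb) + "%");
        Console.WriteLine("Total / Data:      " + Percent(searchDbGb + indexGb, dataGb) + "%");
    }
}
```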

    Remember, this ratio varies between environments, depending on how you use Notes applications, how many attachments there are, and many other factors. I put it here only as a reference. I may update it later.

  • Jie Li's GeekWorld

    Next steps of SharePoint Search Enhancement, and...

    I have been collecting feedback from many people these days about the enhancements we made to SharePoint Search in China. Good feedback, bad feedback, it's all okay. Here is the whole list:

    Search As You Type, SAYT: The live search feel for your SharePoint search box.

    TODO: Check if there's any chance we can make use of AJAX.NET.

    http://www.codeplex.com/SearchAsYouType

    Predefined Search: A kind of saved search. With a single click, your query is remembered and can be shared with the public.

    TODO: Currently none. If you have ideas, contribute to the open source project!

    http://www.codeplex.com/MOSSPredefinedSearch

    Smart Search: 1. Displays the top ten hot search keywords. 2. Displays relevant search keywords.

    TODO: Bugfix, and a pure English version.

    http://www.codeplex.com/smartsearch

    SharePoint Search Admin: A GUI-based tool for SharePoint search administration, much better and easier to use than the SharePoint Search settings page itself.

    TODO: Bugfix, and add more tricks to help service people deliver things on time.

    http://www.codeplex.com/searchadmin

    Chinese "did you mean" webpart: Delivers a "did you mean" feature for Simplified Chinese.

    TODO: None

    http://www.codeplex.com/cndidyoumean

    Some people are concerned that these CodePlex tools are not supported by Microsoft. That's true. But I must emphasize that the CodePlex stuff is OPEN SOURCE. Yes, we are talking about open source at Microsoft. Personally I am something of an open source lover. I think that if you really want to do something like SOA or SaaS, the open source development model is a pretty good alternative to a complex API set. The pain of an API set is black-box development: product team people may not understand the business needs as deeply as we field people do, so it is quite possible that they do something unreasonable with the product. The gain of an API set is also the black-box model: it simplifies development and lowers the entry barrier for new developers. Of course, open source may lead to bad documentation and unreadable code, and to complexity in the code you write to extend it. So a balance between open source and APIs would be better for service development. So far, Firefox has proven to be a good example.

    A little off topic. But things really are different when you talk to IT pros. It depends on the model: a product-only solution or a product+service solution? For platforms like SharePoint, a product+service solution is definitely the wiser choice, because in the more successful implementations nearly everything needs to be customized for the organization. Here the CodePlex stuff is a great help.

    It may be hard to change people's minds immediately, but time will prove this.

  • Jie Li's GeekWorld

    Dealing with custom content source in SharePoint Search

    By default, the SharePoint 2007 family (MOSS/MSS) supports the following content sources: web sites, SharePoint sites (WSS/SharePoint Server), Exchange public folders (through OWA), file shares, and Lotus Notes databases on Domino servers. Through the BDC and user profiles/My Sites, Microsoft Office SharePoint Server 2007 has two more content sources: databases and people profiles.

    Yes, we can use protocol handlers (also called "connectors"; don't be confused, "protocol handler" is not a friendly name to most people, so we changed it) to index other content. A protocol handler can be implemented in C++, or in C# if you don't care about performance. So you can give Search Server 2008 the database ability, or connect to other things like FileNet and Documentum... By creating a protocol handler you can also control security trimming, a much better and wiser way than a custom security trimmer.

    But how can I add, remove, or edit a custom content source after the new protocol handler is registered?

    The answer is: use the object model. You can find CustomContentSource in the Microsoft.Office.Server.Search.Administration namespace. I have an open source administration tool at http://www.codeplex.com/searchadmin (SharePoint Search Admin).

    There's an interesting member in CustomContentSource: Tag. By setting this property, you can specify the URL of a page for modifying the settings of a custom content source. Don't forget to append CustomContentSource.Id to the page URL; otherwise, how would the page know which content source to open?

    For example, you can specify http://moss:90/Edit.aspx as the edit page. If the Id property of the current content source is 45, the actual link would be something like http://moss:90/Edit.aspx?cid=45.
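
    Building that link is plain string work. A tiny sketch, with made-up names (EditLink, Build); the "cid" parameter name simply follows the example above, and the two arguments stand in for the Tag and Id values of the content source:

```csharp
using System;

public static class EditLink
{
    // Appends the content source id to the edit page URL stored in Tag.
    // Uses "&" instead of "?" when the URL already has a query string.
    public static string Build(string tagUrl, int contentSourceId)
    {
        string separator = tagUrl.Contains("?") ? "&" : "?";
        return tagUrl + separator + "cid=" + contentSourceId;
    }

    public static void Main()
    {
        Console.WriteLine(Build("http://moss:90/Edit.aspx", 45));
    }
}
```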

    Last but not least, custom content sources are not limited to new protocol handlers. You can also add something like file://, http:// or even notes:// to them.

    So why not make a content source edit page for Lotus Notes? Yes! You can replace the original lame one and use your own! Check out SharePoint Search Admin, you will find the option is already there for you:)

    (Still jet-lagging...See you in Seattle!)

  • Jie Li's GeekWorld

    SharePoint Search Admin v.06: Support 250K Lotus Notes Databases Now

    http://www.codeplex.com/searchadmin

    We all know that SharePoint 2007 and Microsoft Search Server 2008 support Lotus Notes content source. So we can index the databases on Domino Servers.

    Y-E-S. But there ARE some issues you may face in the real world.

    The original idea for SharePoint Search Admin (or MOSS Search Admin, MSA) came from a customer request in Australia. They wanted to deal with Domino Document Manager, and they had thousands of Notes DBs to index. But on the SharePoint Search Settings page you have to add all these Notes DBs manually, one by one. So I wrote MSA to add them in batches.

    Later I found that MSA can solve many problems when you work on enterprise search solutions built on SharePoint, for example with troubleshooting work and workarounds for bugs... It is becoming more and more useful, especially in large Notes environments.

    We will talk about those problems later; for now we will only discuss this update.

    Yesterday an issue came up in my customer's environment. After they created 500 content sources for 500 Notes DBs, they could not create any more content sources.

    Okay, there's a limit of 500 content sources per SSP. This is fine for file shares and web sites, because we can have 500 start addresses per content source. But things are very tough for Notes content sources: you can only have ONE Notes DB per content source if you use the default content source creation page!

    This drove me crazy. We always have thousands of DBs to index, but now we could only index 500 of them. The customer didn't want multiple SSPs.

    Thanks to our great Mitch Prince, who immediately shed light on this for me: use the OM.

    In fact, a SharePoint Lotus Notes content source can support multiple DBs/directories in a single content source. The only reason we cannot do it OOB is that the content source creation page is really poorly made. This can be worked around by using the object model, which is exactly what MSA does.

    Damn, how could I forget my own tool?

    So I added this to MSA. It's quite simple, just a single line of code.

    LotusNotesCS.StartAddresses.Add(startAddress);

    So now you can add up to 250,000 Lotus Notes DBs (500 content sources × 500 start addresses each) in one SSP! Well, as long as you are sure you won't hit our performance barrier.
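
    To make the batching idea concrete, here is a rough sketch that splits a long list of start addresses into groups that fit one content source each. This is not MSA's actual code: the class, method and constant names are made up, the notes:// addresses are placeholders, and the two limits are the ones described in this post.

```csharp
using System;
using System.Collections.Generic;

public static class StartAddressBatcher
{
    public const int MaxAddressesPerSource = 500; // start addresses per content source
    public const int MaxSourcesPerSsp = 500;      // content sources per SSP

    // Splits the addresses into consecutive batches, one per content source.
    public static List<List<string>> Batch(IList<string> addresses)
    {
        if (addresses.Count > MaxAddressesPerSource * MaxSourcesPerSsp)
        {
            throw new ArgumentException("Too many start addresses for one SSP.", "addresses");
        }

        var batches = new List<List<string>>();
        for (int i = 0; i < addresses.Count; i += MaxAddressesPerSource)
        {
            int size = Math.Min(MaxAddressesPerSource, addresses.Count - i);
            var batch = new List<string>(size);
            for (int j = 0; j < size; j++)
            {
                batch.Add(addresses[i + j]);
            }
            batches.Add(batch);
        }
        return batches;
    }

    public static void Main()
    {
        var addresses = new List<string>();
        for (int i = 0; i < 1201; i++)
        {
            addresses.Add("notes://domino01/db" + i + ".nsf"); // placeholder addresses
        }
        Console.WriteLine(Batch(addresses).Count + " content sources needed");
    }
}
```

    Each batch would then be fed to one content source, one StartAddresses.Add call per address.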

    What about the future? In the next major release of SharePoint Search Admin, I'll completely reorganize the UI and add more useful functions to make it a powerful alternative to the SharePoint Search Settings UI. Hope I can do better than System Center!
