A while ago I built a search connector to index a MediaWiki content source, and I thought I should document our experience for future generations (and developers).
To start with, I learned that MediaWiki is the software behind the Wikimedia projects (which really confused me, and I kept saying “Wikime… no, MediaWi… no, Wikime… no, WikiWhatever!” whenever I wanted to reference the content source). It has an API and even a .NET wrapper (named DotNetWikiBot Framework). Of course, you can also use the API directly, or even export the whole content of the MediaWiki as XML and index that, or even import it to SharePoint …
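To give a feel for what “using the API directly” looks like, here is a small sketch in Python that builds a MediaWiki API query URL for fetching a page’s latest wikitext (the wiki host is made up for illustration; the `action=query` / `prop=revisions` parameters are the standard MediaWiki API ones):

```python
from urllib.parse import urlencode

def build_page_query(api_url: str, title: str) -> str:
    """Build a MediaWiki API URL that fetches the latest
    wikitext revision of a page, serialized as XML."""
    params = {
        "action": "query",        # the read module of the MediaWiki API
        "titles": title,
        "prop": "revisions",      # we want revision data ...
        "rvprop": "content",      # ... specifically the page content
        "format": "xml",
    }
    return api_url + "?" + urlencode(params)

# Example (the host is hypothetical):
url = build_page_query("http://wiki.example.com/w/api.php", "Main Page")
```

The same query works pasted straight into a browser’s address bar, which is exactly how we first tested it.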
Anyway, you may ask: why build a search connector to index an HTML content source instead of using the out-of-the-box (OOB) capabilities? Well, I’ll tell you: this specific MediaWiki had some enhancements that the OOB HTML protocol handler (PH) is not familiar with. And besides, what can be more fun than debugging a search connector? You’ll never know until you try it.
So, back to the starting point: we learnt the content source and its interaction capabilities and decided to build two connectors: one that indexes the exported XML and another that indexes through the API. The first is used for the FULL index and the latter for the INCREMENTAL index. Why use both methods? Well, that’s a long story, and maybe I’ll get into the alternatives and their pros and cons later on. Right now we’ll stick to the technical aspects.
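For the incremental side, the natural building block is the API’s `list=recentchanges` module, which enumerates pages changed since a given timestamp. A minimal sketch of building such a query (again with a hypothetical host; this is an illustration of the approach, not our actual connector code):

```python
from urllib.parse import urlencode

def build_recent_changes_query(api_url: str, since_iso: str) -> str:
    """Build a MediaWiki API URL listing pages changed since a given
    timestamp -- the basis of an incremental crawl."""
    params = {
        "action": "query",
        "list": "recentchanges",
        # Changes come back newest-first, so 'rcend' bounds the oldest
        # change we still care about (the last successful crawl time).
        "rcend": since_iso,
        "rcprop": "title|timestamp",
        "rclimit": "500",
        "format": "xml",
    }
    return api_url + "?" + urlencode(params)

u = build_recent_changes_query("http://wiki.example.com/w/api.php",
                               "2011-01-01T00:00:00Z")
```

The incremental connector then only re-fetches and re-indexes the titles returned, instead of walking the whole wiki.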
Being familiar with protocol handler and IFilter development, and quite unfamiliar with MediaWiki, I decided to start with a couple of POCs that, apart from proving feasibility, would also produce standalone versions of the index connectors. Those standalone versions would later be used for troubleshooting, if needed, and for any other manual operation required during the component’s lifetime. So that’s what we did.
During the POC we encountered our first challenge: the API responded well when used as a query string in the browser, but failed miserably when invoked from code (whether directly, via the above-mentioned bot, or any other way we attempted). The reason was the language of the MediaWiki, which is set in LocalSettings.php: our calls failed while it was set to “he” (that’s Hebrew), and as soon as we set it to English (“en”) the API responded well. But we wanted the users to keep a Hebrew, right-to-left MediaWiki, so we set up a MIRROR of the Hebrew MediaWiki, configured in English, and used the MIRROR for the API. The MIRROR does not duplicate the content; it only enables different settings. We overcame this challenge.
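For reference, the setting in question is the site language code in LocalSettings.php (a hedged sketch of the two configurations, not our exact files):

```php
<?php
// LocalSettings.php on the user-facing wiki -- Hebrew, right-to-left:
$wgLanguageCode = "he";

// LocalSettings.php on the MIRROR used by the connector's API calls:
// $wgLanguageCode = "en";
```

Both installations point at the same content; only settings such as this differ between them.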
Next we reviewed the exported XML and learnt that MediaWiki is quite forgiving and allows some not-quite-valid XML: a couple of missing closing tags and so on. That would be an issue when trying to parse this XML, so we decided to use a forward-only, streaming (“water hose”) reader and, mainly, to pray regularly. We apparently prayed well.
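Our connector used the .NET streaming reader, but the same idea can be sketched in Python with `iterparse`: pull one `<page>` at a time and discard it once processed, so a multi-gigabyte dump never has to fit in memory. (The sample XML below just mimics the `<page>`/`<title>` shape of a MediaWiki export; a real dump carries a namespace and much more structure.)

```python
import xml.etree.ElementTree as ET
from io import StringIO

# A tiny stand-in for an export dump, for illustration only.
SAMPLE = """<mediawiki>
  <page><title>First</title><revision><text>hello</text></revision></page>
  <page><title>Second</title><revision><text>world</text></revision></page>
</mediawiki>"""

def stream_titles(source):
    """Walk the export with a forward-only pull parser, yielding page
    titles and freeing each <page> element as soon as it is consumed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title")
            elem.clear()  # drop the subtree we just processed

titles = list(stream_titles(StringIO(SAMPLE)))
```

Note that a strict parser like this still chokes on genuinely broken markup, which is where the praying came in.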
However, during the POC we learnt that reading every item using the above method would take 138 days for a single FULL index. We considered uploading the content to SQL or splitting the file, but eventually decided against all of those and managed to reduce the index time to … 4 hours, with an incremental index of a couple of minutes. How? Well, stay tuned for the coming posts and I’ll tell you, along with architecture, debugging and troubleshooting tips, known issues, and references to official and other great resources on the internet.
Hope you enjoy and find it useful, Rona