How to use Lucene.Net in Windows Azure
Apache Lucene is a free/open source information retrieval software library, originally created in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License.
Lucene.Net is a source code, class-per-class, API-per-API and algorithmic port of the Java Lucene search engine to C# and the .NET platform, built on the Microsoft .NET Framework.
1. IndexWriter: An IndexWriter creates and maintains an index.
2. Document: The logical representation of a Document for indexing and searching.
3. Field: A field is a section of a Document. Each field has two parts, a name and a value.
4. Analyzer: An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
5. Directory: A Directory is a flat list of files.
6. IndexSearcher: Implements search over a single IndexReader.
7. QueryParser: A lexer that interprets a string into a Lucene Query.
8. Hits: A ranked list of documents, used to hold search results.
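To make the roles in the list above concrete, here is a minimal toy sketch in Python. It is not Lucene's or Lucene.Net's API: the names analyze, ToyIndex, add_document and search are hypothetical, and the scoring (a plain sum of term frequencies) is an assumption chosen for brevity, not Lucene's ranking formula. It only shows how an analyzer, documents with fields, an index, and a ranked list of hits fit together.

```python
import re
from collections import defaultdict


def analyze(text):
    """Analyzer role: turn raw text into lowercase index terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """Toy stand-in for the Directory, IndexWriter and IndexSearcher roles."""

    def __init__(self):
        self.documents = []                # doc id -> {field name: field value}
        self.postings = defaultdict(dict)  # (field, term) -> {doc id: term frequency}

    def add_document(self, fields):
        """IndexWriter role: store the document and index every field."""
        doc_id = len(self.documents)
        self.documents.append(fields)
        for name, value in fields.items():
            for term in analyze(value):
                frequencies = self.postings[(name, term)]
                frequencies[doc_id] = frequencies.get(doc_id, 0) + 1
        return doc_id

    def search(self, field, query_text):
        """QueryParser + IndexSearcher roles: return (doc id, score) pairs,
        best score first (the Hits role)."""
        scores = defaultdict(int)
        for term in analyze(query_text):
            for doc_id, frequency in self.postings.get((field, term), {}).items():
                scores[doc_id] += frequency
        return sorted(scores.items(), key=lambda hit: (-hit[1], hit[0]))


index = ToyIndex()
index.add_document({"title": "Azure storage", "body": "Blob storage holds the index"})
index.add_document({"title": "Lucene", "body": "Lucene builds an index; the index is searched"})
hits = index.search("body", "index")
```

In this sketch, hits ranks the second document first because the term "index" occurs twice in its body field.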
First, we need to check out the latest version of Lucene.Net using SVN from here.
Next, add a reference to the Lucene.Net library from the downloaded files to your project.
With the reference in place, the code is straightforward.
The Azure Library for Lucene.Net lets you create Lucene indexes through a Lucene Directory object that uses Windows Azure Blob Storage for persistent storage. In other words, it implements a class derived from the Directory class.
Download it from here. There is a testing project in the solution, but we need to modify the project before we can build and run it.
Open the App.config file in the testing project, locate the appSettings tag, and change the settings according to your storage account.
Remove the code on lines 67 and 68 of Program.cs.
Remove the reference to Lucene.Net.dll from both the AzureDirectory project and the testing project. Add the Lucene.Net project to the Azure Library for Lucene.Net solution, then add a reference to the Lucene.Net project in both the AzureDirectory project and the testing project.
Finally, we can build and run it. Each time it runs, the program adds 100 documents (each containing 3 fields) to the index and then performs some searches.
By running multiple instances for computing or storage, Windows Azure gives us a great deal of power to handle massive demand. However, if we run multiple instances of the testing project, the index in Blob Storage will end up in an inconsistent state.
One solution: each instance holds an index in memory temporarily and maintains a counter that tracks the number of documents in that index. When the counter exceeds a certain value, we check whether any other instance is modifying the index in Blob Storage at that moment. If none is, we merge the in-memory index into the one in Blob Storage. Otherwise, we abandon the check, continue adding documents, and check again later.
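The approach above can be sketched as follows. This is a minimal, in-process Python illustration under stated assumptions, not code from Lucene.Net or the Azure Library: SharedIndex, BufferedIndexer and try_merge are hypothetical names, a threading.Lock stands in for whatever mechanism tells us that another instance is modifying the index in Blob Storage, and plain lists stand in for the in-memory and persistent indexes.

```python
import threading


class SharedIndex:
    """Hypothetical stand-in for the index kept in Blob Storage."""

    def __init__(self):
        self.documents = []
        # Assumption: this lock models "another instance is modifying the index".
        self.write_lock = threading.Lock()


class BufferedIndexer:
    """One instance: buffers documents in memory and merges them opportunistically."""

    def __init__(self, shared_index, threshold):
        self.shared_index = shared_index
        self.threshold = threshold
        self.buffer = []  # temporary in-memory index; len(self.buffer) is the counter

    def add_document(self, document):
        self.buffer.append(document)
        # Only attempt a merge once the counter exceeds the threshold.
        if len(self.buffer) > self.threshold:
            self.try_merge()

    def try_merge(self):
        """Merge the buffer into the shared index if nobody else is writing.

        Returns True if the merge happened, False if the index was busy.
        """
        # Non-blocking attempt: if the index is busy, give up and retry later.
        if not self.shared_index.write_lock.acquire(blocking=False):
            return False
        try:
            self.shared_index.documents.extend(self.buffer)
            self.buffer.clear()
            return True
        finally:
            self.shared_index.write_lock.release()
```

Because a failed check leaves the buffer untouched, the counter stays above the threshold and the next add_document call retries the merge. An instance built this way would also need one final try_merge before shutting down, otherwise documents still in the buffer never reach the shared index.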
Lucene (an information retrieval software library with a fairly high demand for computing and storage) and Windows Azure (a platform that provides nearly unlimited computing and storage capacity with a pay-as-you-go pricing model) are a natural match.
Can you expand on your concurrency solution? It isn't clear to me what you are suggesting and I potentially have this issue (multiple worker roles indexing). Is this a mod to the Lucene.Net code or a change to the calling application? Are you releasing the index lock on each insert (or batch of inserts)?
We have a new post discussing how to handle concurrency issues: blogs.msdn.com/.../a-common-scenario-of-multi_2d00_instances-in-windows-azure-.aspx
Please let us know if it solves your problem. Thanks for the feedback!
Thanks for the post! Hope to see more on Lucene!