Many scenarios demand massive computing and storage capacity. Windows Azure provides both, and with its pay-as-you-go pricing model we can run multiple instances in Windows Azure to meet the requirements of our solutions.
For multiple instances to coordinate smoothly without conflicts or interruptions, a well-designed communication approach is a key part of the solution.
For simplicity and concreteness, this post builds a sample search engine on Windows Azure that makes use of Lucene.
Note: The code below is for demonstration purposes only, and our sample is just a tiny simulated search engine.
The above diagram demonstrates our implementation of the search engine. There are four layers.
The first layer is in charge of collecting information from the Internet. In our sample, this layer collects the title, content and URL of web pages, and stores that information in the second layer, the storage layer.
The third layer fetches the data provided by the first layer from the storage, analyzes the data, creates indexes, and then stores the indexes in the fourth layer.
Finally, we can create various external interfaces, such as a WCF service or a website, for users to run searches.
The first layer and the third layer work simultaneously. After the first layer collects some data from the Internet and stores it in the second layer, the third layer begins to fetch the data and perform analytic tasks. Meanwhile, the first layer continues to collect more data from the Internet.
The layers need to communicate with each other. We have two options: use Windows Azure Queue Storage, or use MessageBuffer in Service Bus. The following table summarizes these two options.
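The overlap between the collecting layer and the indexing layer can be sketched in-process. This is only an illustration under stated assumptions: `PageInfo` and `PipelineSketch` are hypothetical names, and a `BlockingCollection<T>` stands in for the storage layer; in the real solution the two sides are separate role instances communicating through Windows Azure storage.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical record for one crawled page (title, content and URL, as in the post).
public sealed class PageInfo
{
    public string Title { get; set; }
    public string Content { get; set; }
    public string Url { get; set; }
}

public static class PipelineSketch
{
    // Runs a collector and an indexer concurrently and returns the URLs
    // that the indexer side processed.
    public static List<string> Run(IEnumerable<PageInfo> crawled)
    {
        // Assumption: an in-memory bounded collection stands in for the storage layer.
        using (var storage = new BlockingCollection<PageInfo>(boundedCapacity: 100))
        {
            var indexed = new List<string>();

            // "First layer": keeps pushing pages while the indexer is already working.
            var collector = Task.Run(() =>
            {
                foreach (var page in crawled)
                    storage.Add(page);
                storage.CompleteAdding(); // signal that no more pages will arrive
            });

            // "Third layer": consumes pages as soon as they become available.
            var indexer = Task.Run(() =>
            {
                foreach (var page in storage.GetConsumingEnumerable())
                    indexed.Add(page.Url); // real code would analyze and index here
            });

            Task.WaitAll(collector, indexer);
            return indexed;
        }
    }
}
```

The bounded capacity makes the collector block when the indexer falls behind, which is one simple way to keep the two sides from drifting too far apart.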
Windows Azure Queue Storage: 8 KB per message; the message body is a string or bytes (a helper class can serialize other types); FIFO order is not guaranteed.
MessageBuffer: the message body can be any user-defined class with the DataContract attribute; message count overflow should be avoided.
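As a rough illustration of the kind of helper class mentioned for Queue Storage, the sketch below serializes a DataContract class to bytes with `DataContractSerializer` and rejects payloads over 8 KB. `PageMessage` and `QueueMessageHelper` are hypothetical names, not part of any Azure library, and the size check does not model any encoding overhead the queue service may add.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

// Hypothetical message type; only small fields are sent through the queue.
[DataContract]
public sealed class PageMessage
{
    [DataMember] public string Title { get; set; }
    [DataMember] public string Url { get; set; }
}

// Hypothetical helper that turns DataContract objects into queue-sized byte arrays.
public static class QueueMessageHelper
{
    // The 8 KB per-message limit quoted in the comparison above.
    public const int MaxMessageBytes = 8 * 1024;

    public static byte[] ToBytes<T>(T value)
    {
        var serializer = new DataContractSerializer(typeof(T));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, value);
            byte[] bytes = stream.ToArray();
            if (bytes.Length > MaxMessageBytes)
                throw new InvalidOperationException("Message exceeds the 8 KB limit.");
            return bytes;
        }
    }

    public static T FromBytes<T>(byte[] bytes)
    {
        var serializer = new DataContractSerializer(typeof(T));
        using (var stream = new MemoryStream(bytes))
        {
            return (T)serializer.ReadObject(stream);
        }
    }
}
```

Because of the size limit, a design along these lines would typically send only small fields or a reference through the queue and keep large page content elsewhere in storage.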
After the third layer fetches the data from the Queue or MessageBuffer, we use Lucene.Net and the Azure Library for Lucene.Net to analyze the data, create indexes, and store the indexes in Windows Azure Blob Storage.
When running multiple instances to create indexes, we have to handle concurrency. After instance A stores its indexes in the blob, instance B can overwrite the existing indexes and leave them in an inconsistent state.
Our strategy for solving this problem is for each instance to maintain a local index in memory. After adding a certain number of documents to its local index, the instance tries to merge that index into the one in the blob. If another instance tries to merge its index before the first instance completes the task, it waits for its next turn while continuing to add new documents.
private void checkForMerge()
{
    if (_count > 10)
    {
        // Check whether the index is locked (indexLocked is a placeholder; the actual check is elided).
        if (indexLocked)
            return; // Locked, so return and wait for the next turn…
        // Do merging here…
        _count = 0; // Reset the counter.
    }
}
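The try-to-merge-or-wait pattern above can be sketched without Lucene.Net or blob storage. In this minimal, in-memory illustration, `SharedIndex` and `IndexingInstance` are hypothetical names, plain string lists stand in for indexes, and `Monitor.TryEnter` stands in for whatever lock the real shared index uses; it is not the code used in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// In-memory stand-in for the shared index stored in the blob.
public sealed class SharedIndex
{
    private readonly object _gate = new object();
    private readonly List<string> _documents = new List<string>();

    // Tries to merge without blocking; returns false if another merge is in progress.
    public bool TryMerge(IList<string> localDocuments)
    {
        if (!Monitor.TryEnter(_gate))
            return false;
        try
        {
            _documents.AddRange(localDocuments);
            return true;
        }
        finally
        {
            Monitor.Exit(_gate);
        }
    }

    public int Count
    {
        get { lock (_gate) { return _documents.Count; } }
    }
}

// One indexing instance with its own local (in-memory) index.
public sealed class IndexingInstance
{
    private const int MergeThreshold = 10; // same threshold as in the snippet above
    private readonly SharedIndex _shared;
    private readonly List<string> _local = new List<string>();

    public IndexingInstance(SharedIndex shared) { _shared = shared; }

    public int PendingCount { get { return _local.Count; } }

    public void AddDocument(string document)
    {
        _local.Add(document);
        CheckForMerge();
    }

    private void CheckForMerge()
    {
        if (_local.Count <= MergeThreshold)
            return;
        // If the shared index is busy, keep the local documents and retry on the next add.
        if (_shared.TryMerge(_local))
            _local.Clear();
    }
}
```

The point of the non-blocking attempt is that an instance never stalls while another one is merging; it simply accumulates a slightly larger local index and merges on a later turn.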
If you are new to Lucene.Net, you can refer to our previous post about how to use Lucene.Net in Windows Azure.
Now that we have the index, we can expose interfaces for external clients to consume our index. Suppose we create a WCF Service for searching, and a website which invokes the service to display search results.
To keep the WCF service searching against up-to-date indexes, every time an instance merges its index, it notifies the WCF service by invoking an operation on it, so that the service updates its indexes. We can take advantage of Inter-Role Communication to keep all the service instances in a consistent state.
In conclusion, though it is just a sample, it demonstrates several mechanisms (Windows Azure Queue Storage, MessageBuffer, Inter-Role Communication, WCF services and others) that we can use to coordinate multiple instances. You can choose one or more of them in your solutions according to your practical situation.
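The notification step can be sketched as follows. This is a simplified, in-process illustration: `ISearchInstance`, `InMemorySearchInstance` and `IndexUpdateNotifier` are hypothetical names, and the version number is an assumption added here so that stale notifications can be ignored. In a real deployment each call would go to a WCF operation on a service instance rather than to a local object.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical contract: the operation an indexing instance invokes after a merge.
public interface ISearchInstance
{
    void ReloadIndex(long indexVersion);
}

// In-memory stand-in for one search service instance.
public sealed class InMemorySearchInstance : ISearchInstance
{
    public long LoadedVersion { get; private set; }

    public void ReloadIndex(long indexVersion)
    {
        // Ignore stale notifications that arrive out of order.
        if (indexVersion > LoadedVersion)
            LoadedVersion = indexVersion;
    }
}

// Notifies every known search instance so that they all end up on the same index version.
public sealed class IndexUpdateNotifier
{
    private readonly IList<ISearchInstance> _instances;

    public IndexUpdateNotifier(IList<ISearchInstance> instances)
    {
        _instances = instances;
    }

    // Returns the number of instances that accepted the notification.
    public int NotifyAll(long indexVersion)
    {
        int notified = 0;
        foreach (var instance in _instances)
        {
            try
            {
                instance.ReloadIndex(indexVersion);
                notified++;
            }
            catch (Exception)
            {
                // One unreachable instance should not prevent the others from updating.
            }
        }
        return notified;
    }
}
```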
Would be great to provide some code example of what you did here. Ideally the details in how you are merging the RAMDirectory to Blob.
Agree - are you able to post a sample project for this with some code ?