In my travels I hear about a lot of cloud computing patterns. One that I hear frequently is “upload a file and process it”. This file could be financial data, pictures, video, electric meter readings, and lots more. It’s so common, and yet when I Bing (yes, Bing) for blob uploaders, they’re all manual.
Now, I’m sure that lots of people have tackled this problem for their own needs, but I thought I’d put something together and share it with the community. Both client and server source code are here.
Sorry for the eye chart. I’ll try to explain (in order of the basic system flow).
A FileSystemWatcher detects new files in a directory in the local file store. A tracking record is inserted into a SQL Express tracking DB.
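The actual client is .NET (FileSystemWatcher plus SQL Express), but the detect-and-track step can be sketched in Python. This is an illustrative stand-in, not the real code: it polls the directory instead of receiving push events, uses SQLite in place of SQL Express, and the `tracking` table schema is my own invention.

```python
import os
import sqlite3

# Hypothetical tracking schema: one row per detected file.
def init_tracking_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tracking ("
        "  path TEXT PRIMARY KEY,"
        "  status TEXT NOT NULL DEFAULT 'new')"
    )
    conn.commit()

def scan_for_new_files(conn, upload_dir):
    """Insert a tracking record for any file not yet seen.
    (FileSystemWatcher pushes events; this sketch polls instead.)"""
    inserted = []
    for entry in os.scandir(upload_dir):
        if not entry.is_file():
            continue
        cur = conn.execute(
            "INSERT OR IGNORE INTO tracking (path, status) VALUES (?, 'new')",
            (entry.path,),
        )
        if cur.rowcount == 1:  # 0 means we already had a record for it
            inserted.append(entry.path)
    conn.commit()
    return inserted
```

The primary-key `INSERT OR IGNORE` makes detection idempotent, which is what you want if the watcher fires twice for the same file.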
Process Files Thread
If the file is big enough that the upload is attempted before the copy into the directory has finished, the thread waits by repeatedly attempting to take a write lock on it. The file is then uploaded to Azure blob storage using parallel blocks. The block size and the number of simultaneous blocks are configurable using the StorageClient API. Once the file is uploaded, a record is placed into the Notification Queue and the tracking record is updated.
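The two tricks in that paragraph can be sketched like so. This is a Python illustration of the idea only: `put_block` is a hypothetical stand-in for the real StorageClient block-upload call, and the write-lock probe only detects an in-progress copy if the copying process holds an exclusive handle (as a Windows copy typically does).

```python
import concurrent.futures
import os
import time

def wait_until_copied(path, timeout=60.0, poll=0.5):
    """Wait until the file can be opened for writing, i.e. whatever
    process is copying it into the directory has finished."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(path, "r+b"):
                return True
        except OSError:
            time.sleep(poll)
    return False

def upload_in_blocks(path, put_block, block_size=4 * 1024 * 1024, workers=4):
    """Split the file into numbered blocks and upload them in parallel.
    `put_block(index, data)` stands in for the real storage call."""
    blocks = []
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(block_size)
            if not data:
                break
            blocks.append((index, data))
            index += 1
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda b: put_block(*b), blocks))
    return index  # number of blocks uploaded
```

In the real service you would finish by committing the block list so the blocks become the blob; here that step is out of scope.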
The Azure worker detects the notification, determines that the blob is “ok” in some way (application specific), acknowledges receipt (in order to free up client resources), processes the blob (or sets it aside for later), then deletes the notification message.
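One iteration of that worker loop, sketched in Python with plain `deque`s and a dict standing in for the Azure queue and blob APIs (the real queues use get-with-visibility-timeout rather than a peek, and the "ok" check is whatever your application needs):

```python
def run_worker_once(notification_queue, ack_queue, blob_store, process):
    """One pass of the worker: validate, acknowledge, process, delete."""
    if not notification_queue:
        return False
    message = notification_queue[0]       # peek; real queues "get" the
    blob_name = message["blob"]           # message with a visibility timeout
    data = blob_store.get(blob_name)
    if data is None:                      # application-specific "ok" check
        notification_queue.popleft()      # drop a bad notification
        return False
    ack_queue.append({"blob": blob_name}) # acknowledge first, freeing the
    process(blob_name, data)              # client, then do the real work
    notification_queue.popleft()          # finally delete the message
    return True
```

Note the ordering: the acknowledgement goes out before processing, so client-side cleanup isn't held hostage to a slow processing step.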
Process Acknowledgements Thread
Receipt of the acknowledgement message triggers deletion of the file in the upload directory and update of the tracking record. Then the acknowledgement message is deleted.
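The ordering in that sentence matters, and a sketch makes it concrete. Again this is illustrative Python (SQLite and a `deque` standing in for SQL Express and the Acknowledgement Queue), with a hypothetical `tracking` table:

```python
import os
import sqlite3

def process_acknowledgement(conn, ack_queue):
    """Handle one acknowledgement: delete the local file, mark the
    tracking record done, then delete the message — in that order, so
    a crash partway through just means the ack is reprocessed, which
    is harmless (both steps are idempotent)."""
    if not ack_queue:
        return None
    message = ack_queue[0]
    path = message["path"]
    if os.path.exists(path):
        os.remove(path)
    conn.execute("UPDATE tracking SET status = 'done' WHERE path = ?", (path,))
    conn.commit()
    ack_queue.popleft()
    return path
```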
What could go wrong?
In a manual upload scenario, a person is there to catch mistakes as they happen. Automating the process demands care to ensure fault tolerance. Here are some things that can go wrong and how the architecture protects against them:
Alternatives to the Process
Following on #4 above, if there’s a lot of processing to be done on each data file (blob), but there are also lots of files to be uploaded, you can run into trouble. If you don’t have enough workers running (which costs more), files could back up on the client and potentially cause problems. By splitting the task load up, you can acknowledge receipt of files more quickly and clean up on the client more frequently. Just be sure you’re running enough workers to get the overall work done during your window of opportunity.
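Splitting the task load up amounts to a two-queue pattern: one worker role drains notifications and acknowledges immediately, parking the heavy work on a second queue for other workers. A minimal Python sketch of that hand-off (with `deque`s again standing in for the real queues):

```python
from collections import deque

def receive_and_defer(notification_queue, ack_queue, work_queue):
    """Acknowledge receipt immediately so the client can clean up,
    and park the real processing on a second queue for other workers."""
    while notification_queue:
        message = notification_queue.popleft()
        ack_queue.append({"blob": message["blob"]})
        work_queue.append(message)
```

The receiving role stays cheap and fast; you scale the number of workers reading `work_queue` to match your processing window.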
What about those queues?
Yes, you could use SQL Azure tables to manage your notifications and acknowledgements. Queues have the advantage of being highly available and highly accessible in parallel via HTTP, though this comes at a cost. If you have millions of files to process, these costs are worth considering. On the other hand, presumably your SQL Azure database will be busy with other work and you don’t want to load it down. Also, if you have lots of customers you would need to either wrap access to SQL Azure behind a web service or open its firewall to them all.
What about the FileSystemWatcher?
FSW has a buffer to hold events while they’re being processed by your code. This buffer can be expanded, but not infinitely. So you need to keep the code in your event logic to a minimum. If large numbers of files are being dropped into the upload directory, you can overwhelm the buffer. In a case like this it might make sense to set up multiple incoming directories, multiple upload programs, etc. An alternative to FSW is enumerating files, but this can be slow.
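The standard way to keep event logic minimal is to have the callback do nothing but enqueue the path, with all real work on a separate consumer thread. A Python sketch of that shape (the callback here is a hypothetical `on_created`; in .NET it would be your FSW `Created` handler):

```python
import queue
import threading

pending = queue.Queue()

def on_created(path):
    """Keep the event callback trivial: just enqueue the path.
    All real work happens on the consumer thread below."""
    pending.put(path)

def consumer(handle, stop):
    """Drain the queue until told to stop and the queue is empty."""
    while not stop.is_set() or not pending.empty():
        try:
            path = pending.get(timeout=0.1)
        except queue.Empty:
            continue
        handle(path)  # slow work: tracking insert, upload, etc.
        pending.task_done()
```

Because the callback returns almost instantly, the watcher's internal buffer only has to absorb the burst of events, not the processing time.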
As always, I’m interested in your thoughts. Comment freely here or send a mail. Full source for both client and server is here.