There are a few excellent articles out there already that introduce the concept of Azure Blob Storage and how it's accessed from HDInsight. In this post though I wanted to start giving some backstage looks at some of the decisions we made while exposing blobs to HDInsight, hopefully answering a tiny part of the oft-asked question "what the hell was Microsoft thinking when they came up with this @#$?" My focus today will be on the mismatch between key-value stores and hierarchical file systems, with a vague hope of addressing some other topics such as concurrency and URL choices in later posts.
Azure Blob Storage, like Amazon S3 and Google Cloud Storage and I'm sure other data storage systems in the cloud and on Earth, takes a nice simple do-one-thing-well approach: give me a bunch of bits and a name within a namespace, and I'll associate these bits with that name for you to ask for it later. In the case of Azure, the bunch of bits is called a blob and the namespace is called a container (within a storage account). The bits could be a video stream or an XML document, the name could be "Plan for next billion $.docx" or "My consciousness.vhd" and it all looks the same as far as storage is concerned. In particular though: the name could be "I\am\a\directory\with\a\file" or it could be "Psht/this/is/what/a/directory/looks/like" and it would still be treated as a plain old name in blob storage, since in the storage service itself these "\" and "/" characters have no special meaning and are just like "a" or "$". This is not true of the SDK and some of the tools on top of the service, but they are all providing their own illusions on top of a simple flat name-value store, just like we needed to provide our own illusion...
The problem is that Hadoop wants its backing file system to be a hierarchical file system, with the directory structure we've all grown familiar with and fond of. So we needed to code up our WASB FileSystem class (this is the code that presents the Azure Storage service as a FileSystem implementation in Hadoop) to bridge the two worlds in a way that:
Given the requirements, the most obvious approach is to do what the SDK and most tools over Azure storage do: if Hadoop asks us to store the file "Trip Durations.csv" in the directory "Buses" within the top-level directory "Seattle", then we just store it as the blob "Seattle/Buses/Trip Durations.csv". Then when Hadoop asks us to list all the files in the directory "Seattle/Buses", we looks for all blobs whose names start with the prefix "Seattle/Buses/" (and whose names don't contain "/" after that) and we return those, taking everything after the last "/" as the file name. And if it asks us to list all directories under "Seattle", we'll look for all blobs whose names start with the prefix "Seattle/" and whose names do contain "/" after that, find all the distinct directory names (sandwiched between the "/" characters) and return those: e.g. if we find "Seattle/Buses/Trip Durations.csv", "Seattle/Buses/Ridership.csv" and "Seattle/Bikes/Trip Durations.csv" then the two directories we return are "Buses" and "Bikes". Easy (sort-of).
But of course life can't be that easy. The first major problem comes when Hadoop does something like this:
Tricky case 1
Create directory "Cars" under "Seattle"
List all directories under "Seattle"
The trickiness here comes from the fact that we want to create (and see) an empty directory. The scheme above doesn't allow for that: we typically would know that there is a directory called "Cars" once we see a blob called "Seattle/Cars/Pollution.csv" but we have no blob called that yet. We liked the basic approach, so we considered two ways to supplement and salvage it:
We decided against the first approach mainly because of contention problems when many machines are creating/deleting directories all over the place and want to record this in the one blob. So we turned to the second, and now our problem became how do we mark these special blobs. Our first attempt (which some people saw in our very early previews) was to create empty blobs with a special suffix "_$folder$", which worked OK but its main drawback (aside from its ugliness when viewed outside HDInsight) was that it led to the following situation being possible:
Tricky case 2
(In Hadoop) Create directory "Thoughts" and put file "Brilliance" in there with very important thoughts
(Oustide Hadoop) Create blob "Thoughts" with very important thoughts in there
(In Hadoop) Tell me whether "Thoughts" is a directory or a file
With our implementation so far, the third step would be stumped: we have three blobs in there, "Thoughts_$folder$", "Thoughts/Brilliance" and "Thoughts", so the concept "Thoughts" is both a file and a directory! Of course we could error out, but we weren't very fond of that because a well-meaning customer could find themselves in that situation, with data in there to analyze and no simple remedy. Ideally we would've stopped the customer in the second step before the data is placed so they could place it in a consistent manner in the first place. So for this and other reasons, we decided to move to our current implementation: we create the marker blob as an empty blob with the exact name of the directory and no ugly suffixes, and mark it instead using attached metadata called "hdi_isfolder". So now the second step in case 2 above becomes nicely blocked: the customer would get an obvious conflict and would have to stubbornly delete the marker blob before creating the new "Thoughts" blob and shooting herself in the foot.
With the approach above of honoring the standard "this/is/a/directory/with/a/file" naming convention and using special metadata-marked blobs to create empty directories we were mostly OK. We hit some interesting situations though during implementation and testing, so for fun here is my favorite:
Tricky case 3
(Outside Hadoop) Create the blob "Songs/Get Lucky"
(In Hadoop) Check if the directory "Songs" exists
(In Hadoop) Delete the file "Songs/Get Lucky"
(In Hadoop) Check if the directory "Songs" exists
The correct behavior is to answer "Yes" both times we check if the directory exists: the first time because we decided to honor the naming convention and say that the existence of the blob "Songs/Get Lucky" implies the directory "Songs", and the second time because we already said the directory exists and didn't delete it so it should still exist (normal file systems don't delete a directory just because the last file is deleted). But the naïve implementation of our approach would not yield this behavior: our implementation would dutifully perform the complex "Check for prefix or marker blob" check in the second step and correctly answer "Yes", then in the third step delete the blob "Songs/Get Lucky", then in the fourth step say "No" since now the container is completely empty.
To deal with this (and other quirks) we codified the concepts of "materialized" and "implicit" directories: a materialized directory is one where we have the special marker blob, and an implicit one is one we infer from the existence of blobs that have it as a prefix (i.e. an implicit directory "Songs" exists because of the existence of the blob "Songs/Get Lucky"). Then whenever we delete a file, we check if its parent directory is implicit or materialized: if it's implicit, we materialize it first before we delete the file then the situation above would be resolved.
Note that we could've dealt with this situation in the second step instead: when we check for directory existence, we always materialize it if necessary before answering "Yes". The main reason we didn't do that is that we honor read-only operations: Azure Storage allows customers to give read-only permissions on a container to an HDInsight cluster (through the use of Shared Access Signature keys), so if we're doing a read-only operation like checking for directory existence, we avoid doing any write operations such as materializing a directory. When deleting a file though we assume we have write-access so we give ourselves leeway to do that.