Carl Nolan’s ramblings on development
Updated post can be found here: http://blogs.msdn.com/b/carlnol/archive/2013/02/08/hdinsight-net-hdfs-file-access.aspx
The Microsoft Distribution of Hadoop includes, in addition to the C library, a Managed C++ solution for HDFS file access. This solution enables one to consume HDFS files from within a .Net environment. The purpose of this post is first to ensure folks know about the new Windows HDFS Managed library (WinHdfsManaged), provided alongside the native C library, and secondly to give a few samples of its usage from C#.
Let’s start with a simple class diagram of the Win HDFS Managed library:
The main premise is that the HdfsFileSystem is your starting point, from which one can acquire a HdfsFileStream or a HdfsFileHandle. From the HdfsFileStream you can perform operations one would normally expect when working with .Net Streams. From the HdfsFileHandle you can perform operations analogous to normal HDFS file operations.
For brevity I have excluded samples using the HdfsFileHandle. So let’s run through some sample file operations.
As in all operations one first needs to get a connection to the HDFS cluster. This is achieved by calling a Connect() method and specifying the host (name or IP address) and access port:
Once one has the connection one can then easily perform a directory traversal to enquire into the files and directories:
Here is a sample output created from the test application:
In addition to getting directory information one can also query on a file or directory directly:
The HdfsFileSystem class also supports other operations such as copying and moving files, file renaming, deleting files, modifying security, checking a file exists, etc. The copy and move operations support copying and moving these files between systems.
So now onto creating and reading files.
Processing HDFS files is not that dissimilar from normal .Net file operations. Once one has opened a file for reading, operations are available for reading a byte, a line, or a block of bytes:
The OpenFile operations support parameter overrides for the file block size and replication factor, where a value of zero means the default value will be used.
If one wants to read the full contents of a file into a second Stream, the HdfsFileStream makes this a simple process:
There are other options available for reading the full contents of a file. The first option is to perform a ReadLine() until a null is returned, processed using a StreamReader:
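A minimal sketch of this ReadLine() loop follows. It uses a MemoryStream as a stand-in for the HDFS stream, on the assumption that an HdfsFileStream can be handed to a StreamReader like any other .Net Stream; the ReadLines helper and the class name are hypothetical and not part of the library:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ReadLinesSketch
{
    // Hypothetical helper: reads every line from any readable Stream.
    public static List<string> ReadLines(Stream source)
    {
        var lines = new List<string>();
        using (var reader = new StreamReader(source, Encoding.UTF8))
        {
            string line;
            // ReadLine() returns null once the end of the stream is reached.
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }

    public static void Main()
    {
        // Stand-in data; an HDFS stream would be passed here instead (assumption).
        byte[] data = Encoding.UTF8.GetBytes("first line\nsecond line\n");
        using (var stream = new MemoryStream(data))
        {
            foreach (string line in ReadLines(stream))
            {
                Console.WriteLine(line);
            }
        }
    }
}
```

Note that disposing the StreamReader also closes the underlying stream, so the stream should not be reused after the loop.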
Alternatively, for more efficient reading of files, one can read the blocks of data into a byte array:
Other operations that are supported are PositionalReadByte(), PositionalReadBytes(), and Seek(). These operations allow reading the contents of a file from specific positions.
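To illustrate the idea of a positional read, here is a sketch written against the standard .Net Stream API only. The signatures of PositionalReadByte() and PositionalReadBytes() are not shown in this post, so the hypothetical ReadAt helper below emulates the behaviour with Seek() and Read() on a MemoryStream stand-in:

```csharp
using System;
using System.IO;
using System.Text;

public static class SeekSketch
{
    // Hypothetical helper: reads up to 'count' bytes starting at 'position',
    // then restores the stream's previous position.
    public static byte[] ReadAt(Stream source, long position, int count)
    {
        long previous = source.Position;
        source.Seek(position, SeekOrigin.Begin);

        var buffer = new byte[count];
        int total = 0;
        int read;
        // Read() may return fewer bytes than requested, so loop until
        // the request is satisfied or the end of the stream is reached.
        while (total < count && (read = source.Read(buffer, total, count - total)) > 0)
        {
            total += read;
        }

        source.Seek(previous, SeekOrigin.Begin);
        Array.Resize(ref buffer, total);
        return buffer;
    }

    public static void Main()
    {
        using (var stream = new MemoryStream(Encoding.ASCII.GetBytes("0123456789")))
        {
            byte[] slice = ReadAt(stream, 4, 3); // bytes at offsets 4, 5 and 6
            Console.WriteLine(Encoding.ASCII.GetString(slice));
        }
    }
}
```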
One final sample worth noting is copying an HDFS file to a local file using byte reads:
The reason a chunk size is specified in this case is to sync the size being used for HDFS file access to the byte array used for writing the local file.
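The chunked copy pattern itself can be sketched with plain .Net streams. In the sketch below both ends are MemoryStream stand-ins; in practice the source would be the HDFS stream and the destination a local FileStream, assuming the HDFS stream exposes the usual Read() method. The CopyInChunks helper name and the 4096-byte chunk size are illustrative choices, not values taken from the library:

```csharp
using System;
using System.IO;

public static class ChunkCopySketch
{
    // Hypothetical helper: copies 'source' to 'destination' using a buffer
    // of 'chunkSize' bytes and returns the number of bytes copied.
    public static long CopyInChunks(Stream source, Stream destination, int chunkSize)
    {
        var chunk = new byte[chunkSize];
        long total = 0;
        int read;
        // Read() returns 0 only at the end of the stream.
        while ((read = source.Read(chunk, 0, chunk.Length)) > 0)
        {
            // Write only the bytes actually read, not the whole buffer.
            destination.Write(chunk, 0, read);
            total += read;
        }
        destination.Flush();
        return total;
    }

    public static void Main()
    {
        var payload = new byte[10000];
        new Random(1).NextBytes(payload);

        using (var source = new MemoryStream(payload))
        using (var destination = new MemoryStream())
        {
            long copied = CopyInChunks(source, destination, 4096);
            Console.WriteLine("Copied {0} bytes", copied);
        }
    }
}
```

The same loop works in the opposite direction by swapping which stream is the source and which is the destination.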
If one has a Stream reference one can also get the associated file information:
Also one can modify the file properties:
So now onto writing files.
As in the case for reading, writing operations are supported for writing a byte, line, and block of bytes:
The chunk size when opening a file is set to correspond to the size of the buffer used for writing the data.
As in the reading case, if one wants to copy a file from the local file system to an HDFS file one would write:
All one has to do is read, in byte chunks, data from the local file and write the corresponding bytes to the HDFS file.
Of course one can also use the CopyTo operation:
A quick word is warranted on appending to a file. Although the API currently supports opening files for Append, this is only supported in Hadoop version 1.0.0 and above.
The code for the managed and unmanaged libraries for HDFS file access can be found in the folder:
The download not only consists of the compiled libraries but also the full source code and sample C# application that this post is based upon. You can compile the source or just use the delivered assemblies. The source supports both x86 and x64 compilations. However one has to remember that if one does a 32-bit compilation then a 32-bit version of the JRE will also be required.
One final word is warranted about environment variables. As the C library being used by the Managed wrapper is actually calling Java code, one needs to define some additional directories in the Path and CLASSPATH environment variables. A recent addition to the code now manages this for the process. Before a connection is made through the HdfsFileSystem, the Hadoop installation and JRE path are located and the Path and CLASSPATH are adjusted accordingly.
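For anyone needing to make a similar adjustment by hand, the following is a rough sketch of extending Path and CLASSPATH for the current process only. It is not the library's own code: the AppendToVariable helper is hypothetical, and the directory and jar names in Main are placeholders that would need to be replaced with the real Hadoop and JRE locations:

```csharp
using System;
using System.IO;
using System.Linq;

public static class EnvironmentSketch
{
    // Hypothetical helper: appends 'entry' to a separator-delimited variable
    // for the current process only, skipping entries that are already present.
    public static void AppendToVariable(string name, string entry)
    {
        string current = Environment.GetEnvironmentVariable(name) ?? string.Empty;
        string[] parts = current.Split(
            new[] { Path.PathSeparator }, StringSplitOptions.RemoveEmptyEntries);

        if (parts.Contains(entry, StringComparer.OrdinalIgnoreCase))
        {
            return;
        }

        string updated = string.Join(
            Path.PathSeparator.ToString(), parts.Concat(new[] { entry }));

        // This overload changes the variable for the current process only.
        Environment.SetEnvironmentVariable(name, updated);
    }

    public static void Main()
    {
        // Placeholder locations (assumptions, not the library's actual layout).
        AppendToVariable("Path", @"C:\placeholder\jre\bin");
        AppendToVariable("CLASSPATH", @"C:\placeholder\hadoop\hadoop-core.jar");

        Console.WriteLine(Environment.GetEnvironmentVariable("CLASSPATH"));
    }
}
```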
Nice post, but I cannot find a URL to download the MS Distribution of Hadoop. Would you provide it please? Thanks
I have received an invitation code from the Microsoft survey. For installing Hadoop on Windows Azure, is it mandatory to install Microsoft Windows Server 2008 R2 and Windows Azure, or will any Windows OS be supported?
The Developer Preview for the Apache Hadoop-based Services for Windows Azure is available by invitation @ connect.microsoft.com/.../Survey.aspx. At the moment the code is only available when RDPing into the head node of an Azure deployment.
Great one, Carl. Please let me know the URL to download the MS Distribution of Hadoop.
Can you please share the code of the managed and unmanaged libraries? I am not able to access any of the resources mentioned in the various forums related to this blog. Need your help ASAP.
@Sanat_KM @Gourav: Currently the MS Hadoop distribution is only available by invitation. You can now check https://www.hadooponazure.com/ for more details.
In case you have not seen it, Microsoft HDInsight is now available as a CTP download, in addition to the Azure offering: www.microsoft.com/.../details.aspx
I need to run code on an SSIS server to feed HDFS. I cannot seem to get a working copy of your library and the code is missing includes. Also I am told that the MS libraries do not work with Linux installations.
Q: Where can we get a copy of the bits you mention above, or what environment must I build under?
Q: Will this work with Linux installations like Hortonworks and Cloudera?
The article implies that it is a nice, easy-to-use library providing Managed HDFS access to Hadoop, but nothing could be further from the truth.
Under the covers the C library is instantiating and invoking Java methods and objects, and it requires you to set up full Hadoop and a full Server JDK. It also makes silly hard-coded assumptions about HADOOP_HOME being under the JDK, and if there are any issues, no useful exceptions get bubbled up. The most useful one I have received was along the lines of "a third party component has thrown an error".
Unless you already have a working Java solution on your machine, it will take you days to get it to work and you will then be stuck with a hacky and fragile solution.