After reading Greg's article Using Apache Flume with HDInsight, I wanted to learn more about Flume, but my Linux skills are nonexistent and Flume is not currently included in HDInsight. For more information on HDInsight see Windows Azure HDInsight. For more information on Apache Flume see Apache Flume.
So I decided to set up a Windows Azure virtual machine and install HDP 2.0 for Windows so that I could use Flume alongside my HDInsight cluster. As the article describes, I could also use CloudBerry Drive on the virtual machine to move my Twitter data to a storage account that HDInsight can access. Because the data moves entirely within one data center, no data transfer cost is incurred, but there are storage costs (see Windows Azure Storage Costs) and, of course, the cost of the virtual machine.
Although I am going to set up a single HDP 2.0 server on a single virtual machine, in the past I set up a 6-node HDP 1.0 cluster on multiple Windows Azure virtual machines. Here are a couple of suggestions in case you want to do this.
For performance and cost reasons, choose a location (data center) and create all of your Windows Azure services within it. You can create a storage account in your chosen data center using Quick Create from the Windows Azure portal. After it is created, review the storage account's Configure tab and choose your local or geo replication option; these have cost implications. Also notice that you can turn monitoring for blobs on and off, which lets the Monitor tab be populated with IO information.
You can use the same storage account for your HDInsight cluster and your virtual machine; they will be placed in different containers. Take a quick look at the Containers tab. When you create a virtual machine, a vhds container is created to hold the virtual machine's VHD files. When you create an HDInsight cluster you can also use this existing storage account and create a container in it. Below are a couple of screenshots showing the creation of a storage account and the Configure tab. For more information on Windows Azure storage see Windows Azure Storage. For more information on Windows Azure virtual machines see Windows Azure Virtual Machine.
If you want to install an HDP 2.0 cluster, the virtual machines must be able to communicate with each other. You can use either a Windows Azure network (virtual private network) or an affinity group. I have found the network path to be easier. Below is a screenshot where you give your network a name, an IP address range and a data center location. You then install all of your virtual machines within this network and they will be able to communicate with each other. If you just want to install a single HDP 2.0 server, a network is not necessary and you can skip this step. For more information on Windows Azure Virtual Networks see Windows Azure Virtual Network.
Next you can install your virtual machine. Because I like to configure HDP 2.0 to use SQL Server for the Hive and Oozie metastores instead of Derby, I install a virtual machine image that has the Windows Server 2012 operating system and SQL Server 2012 Standard edition. Instead of choosing Quick Create, choose From Gallery. You can then choose an image with the operating system and SQL Server version you want. If you are not interested in using SQL Server for your Hive and Oozie metastores and want to use Derby, you can choose an image without SQL Server from the gallery. Give your virtual machine a name and choose a size, which determines how many CPUs, how much memory and how many additional disks the virtual machine gets. A2 (2 cores, 3.5GB) or A3 (4 cores, 7GB) should be fine for a single-server install of HDP 2.0. If needed, you can increase or decrease the size after the virtual machine is created. When configuring the virtual machine, leave the cloud service boxes at the defaults. Choose the data center or network you created before, and the storage account you created before. Leave the rest at the defaults. Below are some screenshots showing the steps.
After the provisioning is complete you can highlight the virtual machine and click the connect button in order to RDP to your new virtual machine!
Next we will make some configuration changes to enable SQL Server to be used as the Hive and Oozie metastore.
From the Windows Azure portal, click the Endpoints tab for your newly created virtual machine. Click Add, choose a stand-alone endpoint, select MSSQL (TCP port 1433), and leave the rest at the defaults. Once the endpoint is created, click the Dashboard tab and then click Connect to RDP to the virtual machine. Once logged in, start SQL Server Management Studio and connect to your instance. Right-click the server and choose Properties. On the Security tab, change the authentication to SQL Server and Windows Authentication mode. Create a hive database and an oozie database. Create hive and oozie SQL Server logins that are members of the db_datareader and db_datawriter roles in their respective databases. Also modify the Enforce password policy setting for these logins. Restart the SQL Server service for the authentication change to take effect. Below are some screenshots.
In SQL Server Configuration Manager, verify that SQL Server has TCP/IP enabled and is listening on port 1433.
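As a quick cross-check of the endpoint and TCP settings above, here is a minimal Python sketch that tests whether something accepts TCP connections on port 1433. The function name `port_open` and the default timeout are my own choices for illustration. A successful connection only shows that a listener is present; it does not prove that SQL Server logins work.

```python
import socket


def port_open(host: str, port: int = 1433, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unresolvable host all count as "not open"
        return False
```

On the virtual machine itself you could call `port_open("localhost")`; from outside, pass the public name of your virtual machine instead.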
Now that SQL Server is configured, it is time to prepare the operating system and install HDP 2.0 for Windows. Before proceeding, I suggest reviewing the Hortonworks blog post on installing HDP 2.0 on Windows: Install hadoop windows.
HDP needs the C++ runtime. If you chose the Windows Server 2012 operating system, it is already installed and you can skip that step. Download Python and run the installer. You will need to add your Python install location to the Path variable. In Control Panel, open the System icon, choose the Advanced tab and click the Environment Variables button. Find the Path variable and append the Python path. Don't forget the semicolon separator.
Download and install Java. You can install Java 1.7.x instead of 1.6.x. When you install Java, change the default install path for both the JDK and the JRE. Java does not like spaces in the path, and the default location is under C:\Program Files\. Choose a path without spaces, such as C:\Java\. Next, in the Control Panel System icon, click the Advanced tab and the Environment Variables button again. This time create a new system variable named JAVA_HOME that points to your Java install, for example C:\Java\jdk1.6.0_31.
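The two environment settings above are easy to get subtly wrong, so here is a small Python sketch that checks them. The function name `check_hdp_prereqs` is made up for this example, and it checks only the two things discussed here (JAVA_HOME without spaces, Python on the path), not the full HDP prerequisite list.

```python
import os
import shutil


def check_hdp_prereqs(env=None):
    """Return a list of problems found; an empty list means both checks passed."""
    env = os.environ if env is None else env
    problems = []

    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif " " in java_home:
        problems.append("JAVA_HOME contains a space: %r" % java_home)

    # shutil.which searches the given PATH string for an executable named "python"
    if shutil.which("python", path=env.get("PATH", "")) is None:
        problems.append("python was not found on PATH")

    return problems
```

Run it from a new command prompt after changing the variables, since prompts that were already open keep the old environment.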
Now we are ready to install our stand-alone instance of HDP 2.0 on Windows. Unzip the hdp-2.0.6-GA.zip file. Open a PowerShell prompt using "Run as Administrator" and execute the MSI file: msiexec /i "hdp-2.0.6.0.winpkg.msi". The install window will be displayed.
Once the install is successful you can open a command prompt or windows explorer and change directories to c:\hdp\ and execute the start_local_hdp_services.cmd. This will start the Hadoop services. You can also start and stop the services from your services icon in administrative tools in control panel. Below are screenshots of the c:\hdp folder and the services icon. Your flume service should remain stopped until we configure it.
Next let's look at Flume under C:\hdp\flume-1.4.0.2.0.6.0-0009. It has bin, conf, lib, and tools folders. In the conf folder there is a flume.conf file, which is the Flume configuration file. We will be modifying this file.
Before we start to configure Flume we will need two things: a Twitter account with a streaming application, and the custom Flume source jar from Hortonworks.
Once you have your Twitter account, go to the Twitter Create Application link, click the Create New App button and follow the instructions. Below is an example of a Twitter streaming application I created on Twitter.
Once you have created your Twitter streaming application you will need four pieces of information from Twitter: the API key, the API secret, the access token and the access token secret. These can be found on the API Keys tab of your Twitter application. We will need them in order to configure our flume.conf file.
We need a way for Flume to communicate with our Twitter streaming application. In the Hortonworks blog post on refining and visualizing sentiment data there is a SentimentFile.zip download. See the links above. Within this zip file there is a flume-custom-sources-1.0.0-SNAPSHOT.jar file that contains a poc.hortonworks.flume.source.twitter.TwitterSource class. Copy this jar file to the C:\hdp\flume-1.4.0.2.0.6.0-0009\lib folder. We will configure Flume to use this jar file to stream tweets for certain Twitter keywords.
You can use the Java jar command to see the contents of the flume-custom-sources-1.0.0-SNAPSHOT.jar file.
C:\Java\jdk1.7.0_51\bin>jar -tvf C:\hdp\flume-1.4.0.2.0.6.0-0009\lib\flume-custom-sources-1.0.0-SNAPSHOT.jar
     0 Thu Mar 14 14:38:32 UTC 2013 META-INF/
   130 Thu Mar 14 14:38:32 UTC 2013 META-INF/MANIFEST.MF
  2227 Thu Mar 14 14:38:32 UTC 2013 flume.conf
  1219 Thu Mar 14 14:38:32 UTC 2013 log4j.xml
     0 Thu Mar 14 14:38:32 UTC 2013 poc/
     0 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/
     0 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/
     0 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/source/
     0 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/source/twitter/
  2718 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/source/twitter/TwitterSource$1.class
  3975 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/source/twitter/TwitterSource.class
   772 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/source/twitter/TwitterSourceConstants.class
<Truncated output>
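If you would rather script this check than read the listing by eye, the following Python sketch does the same lookup. A jar file is a zip archive, so the standard zipfile module can read it. The function name `jar_has_class` is my own, and the jar path you pass in is whatever location you copied the file to.

```python
import zipfile


def jar_has_class(jar_path: str, class_name: str) -> bool:
    """Return True if the jar contains the given fully qualified class."""
    # A class a.b.C is stored in the archive as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, passing the jar in Flume's lib folder and `"poc.hortonworks.flume.source.twitter.TwitterSource"` should return True if the copy succeeded.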
Now we need to configure our C:\hdp\flume-1.4.0.2.0.6.0-0009\conf\flume.conf file. As Greg's article describes, Flume has sources, channels and sinks.
Our source is the Hortonworks jar file, which connects to our Twitter application and writes the tweets matching our keywords to the channel. The source type is poc.hortonworks.flume.source.twitter.TwitterSource; Flume knows to look for this class in the jar files in its lib folder. In the flume.conf file we configure the source with our Twitter API key and access token information, and provide the list of keywords we want to capture from Twitter. The channel is in memory, with a capacity and a transaction capacity of 10000 each. Our sink then reads from the channel and writes to local disk. I created a c:\wcarroll\flume directory, which is where the sink writes our tweets. This directory could be a CloudBerry drive that writes to Windows Azure blob storage. The sink rolls over to a new file every 7200 seconds (2 hours).
agent.sources = Twitter1
agent.channels = MemChannel
agent.sinks = k1
agent.sources.Twitter1.type = poc.hortonworks.flume.source.twitter.TwitterSource
agent.sources.Twitter1.channels = MemChannel
agent.sources.Twitter1.consumerKey = <Twitter API Key>
agent.sources.Twitter1.consumerSecret = <Twitter API Secret>
agent.sources.Twitter1.accessToken = <Twitter Access Token>
agent.sources.Twitter1.accessTokenSecret = <Twitter Access Secret>
agent.sources.Twitter1.keywords = Ukraine,Crimea,Crimean peninsula,Kiev,Kharkiv,Donetsk,Luhansk,Rostov,Kursk,Belgorod,Tambov
agent.sinks.k1.type = file_roll
agent.sinks.k1.sink.rollInterval = 7200
agent.sinks.k1.channel = MemChannel
agent.sinks.k1.sink.directory = c:\\wcarroll\\flume
agent.channels.MemChannel.type = memory
agent.channels.MemChannel.capacity = 10000
agent.channels.MemChannel.transactionCapacity = 10000
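A mistyped channel name is an easy mistake in a file like this, so here is a rough Python sketch that checks the wiring before you start the service. It is not part of Flume; `parse_properties` and `check_agent` are names I made up, and the parser is deliberately simple (one key = value pair per line, no handling of backslash escapes or line continuations).

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_agent(props, agent="agent"):
    """Return a list of wiring problems for one agent; empty means none found."""
    problems = []
    channels = set(props.get(agent + ".channels", "").split())

    # Every source must name at least one declared channel
    for src in props.get(agent + ".sources", "").split():
        used = props.get("%s.sources.%s.channels" % (agent, src), "").split()
        if not used:
            problems.append("source %s has no channels" % src)
        for ch in used:
            if ch not in channels:
                problems.append("source %s uses undeclared channel %s" % (src, ch))

    # Every sink must name exactly one declared channel
    for sink in props.get(agent + ".sinks", "").split():
        ch = props.get("%s.sinks.%s.channel" % (agent, sink), "")
        if ch not in channels:
            problems.append("sink %s uses undeclared channel %r" % (sink, ch))

    # Compare the two capacities only when both are set explicitly
    for ch in sorted(channels):
        cap = props.get("%s.channels.%s.capacity" % (agent, ch))
        tx = props.get("%s.channels.%s.transactionCapacity" % (agent, ch))
        if cap and tx and int(tx) > int(cap):
            problems.append("channel %s: transactionCapacity exceeds capacity" % ch)

    return problems
```

To use it, read your flume.conf into a string, pass it through `parse_properties`, and print whatever `check_agent` returns. An empty list only means these particular checks passed; the Flume log is still the final word.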
After configuring your flume.conf file, go ahead and start the flume service. If you have issues starting Flume, review the log files at C:\hdp\flume-1.4.0.2.0.6.0-0009\bin\logs. If it starts successfully you should see a file generated in your sink directory. Below is a screenshot showing Flume writing tweets to a file in my sink directory.
Now that we are successfully collecting Twitter data and storing it in our Windows Azure storage account, we can access the data from HDInsight. The Twitter data is in JSON format. Next, Greg and I will discuss using Pig and Hive for ETL tasks so that we can start to analyze the Twitter data further.
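Before moving on to Pig and Hive, you may want a quick look at what was collected. The Python sketch below reads the files in the sink directory and parses them as JSON. It assumes each tweet is written as one JSON object per line, which I have not confirmed for this source, so lines that do not parse are simply skipped. The function name `read_tweets` is made up for this example.

```python
import json
from pathlib import Path


def read_tweets(sink_dir):
    """Yield one dict per line that parses as a JSON object; skip other lines."""
    for path in sorted(Path(sink_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    # Partial line (file still being written) or non-JSON content
                    continue
                if isinstance(record, dict):
                    yield record
```

For example, `sum(1 for _ in read_tweets(r"c:\wcarroll\flume"))` gives a rough count of the tweets captured so far.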
Have you also tried using hdfs://namenode/path with the sink type set to hdfs instead of file_roll? Does that work for you?
works like a charm. Did you create a similar post on Pig and Hive for the ETL?