Keep track of all the latest news and events on developer tools and technologies you care about
By Jen Stirrup
Social Media data is a big driver for Big Data technologies such as HDInsight. However, it isn’t enough just to have lots of data. Data, on its own, is valueless. In order for data to become valuable, we need to be able to do something with the data.
If we have large datasets, they can become difficult to navigate, evaluate and to use. Visualising data, in tools like Power View, will help business users to turn their valueless data into a valuable data asset, since it will help users to make meaningful data-based hypotheses and assertions about the facts that they uncover by visualising data.
In the current series, we will:
The following sections will show how we can obtain Twitter data, and place it into HDInsight. We can then run Hive queries on it before visualising the data in Power View.
Using cURL or the Analytics for Twitter PowerPivot dashboard?
If you’re not familiar with command line instructions, then it might be better to use the Analytics for Twitter dashboard. cURL is open source, and occasionally it might be an issue to install open source software on machines; IT administrators do not always like it! This article offers you two options in order to obtain Twitter data.
To get the twitter feed, you will need to install cURL. To do this, download cURL and then unzip it to a specific location eg. C:\Curl Note that any spaces in folder names will cause problems!
We obtain the twitter stream by invoking the following command, which will produce a file called tweets.txt which will hold our tweets. Note that it all goes in one line.
C:\Curl>curl -u username:'<Password goes in here without arrows>' http://stream.twitter.com/1/statuses/filter.json >>tweets.txt
This might produce a large file, so be careful! Control + C should stop the flow. We can tailor the command so that it gives us specific keywords. To do this, create a file called TwitterSearchTerms.txt file with the following content. This will use keywords to assist in tracking tweets as follows:
We will then create a command file, which will run the curl command plus give it parameters.
In notepad, create a file called getTweets.cmd. Enter the command script, adding your twitter username in place of USER and password in place of PASSWORD on the following line:
C:\Curl>curl -u jenstirrup:'<Password goes in here without arrows>' http://stream.twitter.com/1/statuses/filter.json >>tweets.txt
Once you have done this, you can save and close the file.
It’s possible to obtain twitter data using the Analytics for Twitter download from the Microsoft Download Center. This free download allows users to query Twitter directly in Microsoft® Office Excel 2010. The download uses PowerPivot Excel Add-in users can perform their own analysis on Twitter, and includes the facility to search Twitter for queries such as who are the top Tweeters, what #hashtags are they using and do they have a positive or negative tweet tone.
In this example, we will use the Analytics for Twitter download to create a static text file that will hold our tweets. If you use the Analytics for Twitter download, then you will receive a maximum of 1500 tweets per search string, per day.
For this example, our search string will be as follows:
The resulting PowerPivot window will appear, after it has been refreshed:
This means that the Excel dashboard will appear as follows:
The Analytics for Twitter dashboard has a nice feature, in that it classifies the sentiment of the tweets as positive, neutral or negative. We can export the data from the PowerPivot dashboard, and place it into HDInsight for further analysis. To do this, go to the ‘Details’ tab, copy and paste the table into a new Excel spreadsheet, and save it as a CSV file.
Putting the Twitter Data into Hadoop
Go to your HDInsight cluster, and navigate to the Remote Desktop. This will take us to the head node, where we will transfer our twitter data into the HDFS file system.
To do this, copy and paste your text file onto the head node, which you can access using the Remote Desktop session. To make it clear, I’ve made a directory called Data, which I’ve stored on the C drive. I will put it into a demo folder in HDFS.
To do this, open the Hadoop Command Line prompt, which is located on the desktop. Here is the command:
hadoop fs -put c:\data\twitter.txt hdfs:///user/demo/twitter/twitter.txt
When you’ve done this, the command prompt will simply show a new command line, awaiting instructions. This means that it has worked successfully. No news is good news!
The next step is to create a new table to store your data. We do this in Hive, which is a part of the Hadoop ecosystem that allows you to create tables and impose structure on Hadoop data. It is available in HDInsight, which is Microsoft’s distribution of Hadoop. HDInsight is currently available for public preview in both Azure (HDInsight Service) and on-premises (HDInsight Server) versions.
To access hive in the Hadoop Command Prompt, type in the keyword hive to open the interactive Hive command line interface console (CLI). If this doesn’t work, you may have to first change to the Hive bin directory in the Hadoop Command Prompt. To do this, type the following command:
Then, type ‘hive’ again to enter the Hive CLI from the Hadoop Command Prompt:
This will allow you to create the table. Here is an example script.
CREATE EXTERNAL TABLE twitter
COMMENT 'Twitter sample data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\'
STORED AS TEXTFILE LOCATION '/user/demo/twitter';
You will notice that Hive looks very like SQL. Essentially, they do the same job. Here, the script to create the table is very similar to a SQL script. Further, if you want to double-check that the table was created, the following command will ensure that the table has been populated:
SELECT * FROM TWITTER;
Note: don’t forget the semi-colon at the end, or hive will sit and wait for it!
We could also put this syntax as:
When we do this, we get a message that says:
Number of reduce tasks is set to 0 since there's no reduce operator
This basically means that the query didn’t require any reduce tasks to be set up and run. The query runs successfully.
We then get the results of this query, which is basically retrieving the data from the twitter table.
We can select the contents of the twitter table if we run the following command:
SELECT * FROM TWITTER WHERE tone = ‘Positive’;
The number of reduce jobs is set to zero, because there is no reduce operator at this point. This query will simply return all of the positive tweets.
We can also do some descriptive work on the table, for example, by finding out how many rows are in the table. As before, the command is the same as the SQL version:
SELECT COUNT(*) from twitter;
This time, a MapReduce job is running. We know this, because the output appears as follows:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
After a few seconds, the number of rows in the table is in the output screen. In this example, we have only 178 rows. We can see an example of this output below:
We can select the data from the twitter table, and insert it into a different table. When we do a straightforward ‘Select’ command, the results are only placed into the output window. It’s obviously more useful to have the selected data stored somewhere, for re-use. To do this, the hive command is as follows:
INSERT OVERWRITE DIRECTORY '/ user/demo/twitteroutput’ ’ SELECT * FROM TWITTER WHERE tone = ‘Positive’;
This outputs the data to a file called twitteroutput. To prove it has worked, you can insert the output into a table by using the following hive command:
CREATE EXTERNAL TABLE twitteroutput
STORED AS TEXTFILE LOCATION '/user/demo/twitteroutput';
You can then retrieve the data from the twitteroutput table, in order to check your results.
Why would this be useful? Generally, users like to be able to spawn copies of data for specific purposes. This is particularly useful when conducting specific analyses where you need a subset of the whole data, particularly if the whole data set is very large. Hadoop is designed for serious, large amounts of data, and it will be a useful tool for Business Intelligence projects as a way of storing and analysing Big Data sets.
To summarise, we have used Hadoop as a data source for analysing Twitter data. We have looked at Curl and the Analytics for Twitter tool as separate ways of getting Twitter snapshot data, and then placed it into HDFS, the Hadoop file store. We have then used Hive in order to filter some of the data, and ‘hive’ it off to a separate table for further analyses.
In our next part, we will look at taking a subset of the data in Hadoop, and mashing it up with other data sources to conduct business analytics. We will also look at visualising the mashed up Hadoop data – with other data sources - in Power View in order to provide a unified data asset for business users to use and explore.
Copper Blue Consulting
http://www.copperblueconsulting.com/ Jen Stirrup is a SQL Server MVP, with a Bachelors degree in Psychology and two Masters in Artificial Intelligence and Cognitive Science. Jen is joint owner of Copper Blue Consulting, delivering data and Business Intelligence solutions to global clients. Jen is a veteran SQL Server Developer with over 12 years SQL experience in delivering end-to-end Business Intelligence solution. Jen blogs for SQLServerPedia as well as her own blog. She is very active in many SQL Server User Groups in the UK, helping to run the SQLHerts group in England. Jen's focus is on the Microsoft BI stack, with a specific passion for Data Visualisation.