Avkash Chauhan's Blog

Windows Azure, Windows 8, Cloud Computing, Big Data and Hadoop: All together at one place.. One problem, One solution at One time...

Apache Hadoop on Windows Azure : Running Hive Scripts from Interactive Hive Console


The Microsoft Distribution of Apache Hadoop comes with Hive support, along with an Interactive Hive shell where users can run their Hive queries immediately without any additional configuration. The Apache Hadoop distribution running on Windows Azure has built-in support for Hive.

 

What is Hive?

  • Hive is a data warehousing infrastructure built on top of Hadoop that provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.
  • Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data.
  • Hive provides a simple query language called Hive QL, which is based on SQL, to do ad-hoc querying, summarization and data analysis easily.
  • Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
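As an illustration of the last point, Hive QL's TRANSFORM clause streams rows through an external script acting as a custom mapper. The script name and output columns below are hypothetical, purely for illustration:

```sql
-- Hypothetical example: stream rows through a custom mapper script.
-- 'url_mapper.py' is an illustrative name, not a script shipped with
-- the distribution; it would read tab-separated rows from stdin and
-- emit (userid, domain) pairs.
ADD FILE url_mapper.py;

SELECT TRANSFORM (userid, page_url)
  USING 'python url_mapper.py'
  AS (userid, domain)
FROM page_view;
```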

What is NOT Hive?

  • Hadoop is a batch processing system and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling.
  • Hive query latencies are therefore generally high (on the order of minutes), even when the data sets involved are very small.
  • Hive queries cannot be compared with queries against traditional SQL/Oracle databases, where analyses are conducted on significantly smaller amounts of data but proceed much more iteratively, with response times between iterations of less than a few minutes.
  • Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries.
  • Hive is not designed for online transaction processing and does not offer real-time queries and row level updates.
  • Hive is best used for batch jobs over large sets of immutable data (like web logs).
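A typical batch-style Hive query of the kind described above, an aggregate scan over an immutable log table, might look like the following sketch (it assumes a web-log table like the page_view table used later in this post; the date value is illustrative):

```sql
-- Daily page-view totals from an immutable web-log table.
-- This runs as a batch MapReduce job, so expect latency in
-- minutes even if the partition is small.
SELECT dt, COUNT(*) AS total_views
FROM page_view
WHERE dt = '2012-01-10'
GROUP BY dt;
```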

To learn more about Apache Hive, please see the Apache Hive wiki.

When you log in to the Windows Azure Hadoop portal and have your cluster configured, you can launch Interactive Hive simply by selecting the “Interactive JavaScript/Hive” tile, as below:

 

 

Next, select “Interactive Hive” to open the web-based shell and start writing Hive queries instantly, as below:

The Interactive Hive shell will look like this:

 

Alternatively, you can remote-desktop into your Hadoop cluster's head node and launch the Hadoop Command Shell to run Hive queries directly, as below:
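From the Hadoop Command Shell on the head node, the hive command-line tool can run a query inline with -e or execute a saved script file with -f. The script file name below is just an example:

```shell
# Run a single query inline:
hive -e "SHOW TABLES;"

# Run all queries from a script file (illustrative file name):
hive -f C:\Apps\dist\examples\myquery.hql
```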

Here are a few examples of Hive Queries:


SHOW TABLES;

hivesampletable
page_view
page_view_asc

Hive history file=C:\Apps\dist\logs\history/hive_job_log_avkash_201201100621_493568624.txt
OK
Time taken: 3.265 seconds

DESCRIBE page_view;

viewtime int
userid bigint
page_url string
referrer_url string
ip string IP Address of the User
dt string
country string

Hive history file=C:\Apps\dist\logs\history/hive_job_log_avkash_201201100555_19122541.txt
OK
Time taken: 3.64 seconds

DESCRIBE EXTENDED page_view;

viewtime int
userid bigint
page_url string
referrer_url string
ip string IP Address of the User
dt string
country string

Detailed Table Information Table(tableName:page_view, dbName:default, owner:avkash, createTime:1326174755, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:viewtime, type:int, comment:null), FieldSchema(name:userid, type:bigint, comment:null), FieldSchema(name:page_url, type:string, comment:null), FieldSchema(name:referrer_url, type:string, comment:null), FieldSchema(name:ip, type:string, comment:IP Address of the User), FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:country, type:string, comment:null)], location:hdfs://10.28.202.165:9000/hive/warehouse/page_view, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:country, type:string, comment:null)], parameters:{transient_lastDdlTime=1326174755, comment=This is the page view table}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE) 
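Given the page_view schema shown above, which is partitioned by dt and country, a simple analytical query might look like this sketch (the dt and country values are illustrative):

```sql
-- Top 10 URLs for US traffic on a single day. Because dt and
-- country are partition keys, the WHERE clause prunes the scan
-- to a single partition.
SELECT page_url, COUNT(*) AS views
FROM page_view
WHERE dt = '2012-01-10' AND country = 'US'
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```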


Keywords: Apache Hadoop, Windows Azure, BigData, Cloud, MapReduce
