Avkash Chauhan's Blog
Windows Azure, Windows 8, Cloud Computing, Big Data and Hadoop: All together at one place.. One problem, One solution at One time...
Browse by Tags
hadoop
Tagged Content List
Blog Post:
Hadoop adventures with Microsoft HDInsight
Avkash Chauhan - MSFT
What is HDInsight? HDInsight is the product name for Microsoft's installation of Hadoop and for the Hadoop on Azure service. HDInsight is Microsoft’s 100% Apache-compatible Hadoop distribution, supported by Microsoft. HDInsight, available both on Windows Server and as a Windows Azure service, empowers...
on
3 Nov 2012
Blog Post:
Programmatically retrieving Task ID and Unique Reducer ID in MapReduce
Avkash Chauhan - MSFT
For each Mapper and Reducer you can get both the task attempt ID and the task ID. This can be done when you set up your map using the Context object. You may also know that when a Reducer is set up, a unique reduce ID is used inside the reducer class's setup method. You can get this ID as well. There are multiple...
on
10 Apr 2012
Blog Post:
Programmatically setting number of reducers with MapReduce job in Hadoop Cluster
Avkash Chauhan - MSFT
When submitting a Map/Reduce job to a Hadoop cluster, you can provide the number of map tasks for the job, and the number of reducers created depends on the mappers' input and the Hadoop cluster capacity. Or you can submit the job and let the Map/Reduce framework adjust it per the cluster configuration. So setting...
on
5 Apr 2012
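The reducer count matters because it fixes how many partitions the map output is split into: every key is routed to exactly one reducer. Below is a minimal Python sketch of that routing, in the spirit of Hadoop's default hash partitioning (hash of the key modulo the number of reduce tasks). The helper names `partition_for` and `partition_all` are hypothetical, and `zlib.crc32` is only a stand-in for the Java key's `hashCode()`.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_reducers: int) -> int:
    """Return the reducer index for a key (hypothetical helper).

    crc32 stands in for Java's hashCode(); it is deterministic across runs,
    unlike Python's built-in hash() for strings.
    """
    if num_reducers < 1:
        raise ValueError("num_reducers must be at least 1")
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_all(pairs, num_reducers):
    """Group (key, value) pairs into one bucket per reducer index."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition_for(key, num_reducers)].append((key, value))
    return dict(buckets)


if __name__ == "__main__":
    # Sample map output; the same key always lands in the same bucket.
    map_output = [("apple", 1), ("pear", 1), ("apple", 1), ("plum", 1)]
    for reducer, items in sorted(partition_all(map_output, 3).items()):
        print(reducer, items)
```

With a single reducer every key falls into partition 0, so all map output funnels through one task; raising the count spreads keys across more reduce tasks.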
Blog Post:
Processing already sorted data with Hadoop Map/Reduce jobs without performance overhead
Avkash Chauhan - MSFT
While working with Map/Reduce jobs in Hadoop, it is quite possible that you have “sorted data” stored in HDFS. As you may know, the “sort function” runs not only after the map process in the map task but also with the merge process during the reduce task, so having sorted data to sort...
on
3 Apr 2012
Blog Post:
How to submit Hadoop Map/Reduce jobs in multiple command shell to run in parallel
Avkash Chauhan - MSFT
Sometimes you need to run multiple Map/Reduce jobs in the same Hadoop cluster; however, opening several Hadoop command shells (or Hadoop terminals) can be troublesome. Note that, depending on your Hadoop cluster size and configuration, you can run only a limited number of Map/Reduce jobs in parallel; however, if you...
on
2 Apr 2012
Blog Post:
Listing current running Hadoop Jobs and Killing running Jobs
Avkash Chauhan - MSFT
When you have jobs running in Hadoop, you can use the Map/Reduce web view to list the currently running jobs. But what if you need to kill a running job because it started malfunctioning or, in the worst case, is stuck in an infinite loop? I have seen several...
on
1 Apr 2012
Blog Post:
How to troubleshoot MapReduce jobs in Hadoop
Avkash Chauhan - MSFT
When writing MapReduce programs you are definitely going to hit problems such as infinite loops, crashes in MapReduce, incomplete jobs, etc. Here are a few things that will help you isolate these problems: Map/Reduce log files: All MapReduce job activities are logged by default...
on
30 Mar 2012
Blog Post:
How to chain multiple MapReduce jobs in Hadoop
Avkash Chauhan - MSFT
When running MapReduce jobs, an overall job scenario may consist of several MapReduce steps, meaning the output of the last reduce is used as input for the next map job: Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3... While searching for an answer to my MapReduce job, I stumbled...
on
29 Mar 2012
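The Map1 -> Reduce1 -> Map2 -> Reduce2 pattern can be sketched in plain Python, with in-memory lists standing in for the HDFS output that would sit between real jobs. This is an illustration of the chaining idea only, not the Hadoop API; `run_mapreduce`, `run_chain`, and the mapper/reducer functions are all hypothetical names.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one in-memory map -> shuffle -> reduce step."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # keys reach the reducer in sorted order
        output.extend(reducer(key, groups[key]))
    return output


# Step 1: word count.
def map_words(line):
    for word in line.split():
        yield word.lower(), 1


def reduce_sum(word, counts):
    yield word, sum(counts)


# Step 2: group words by how often they occurred.
def map_by_count(pair):
    word, count = pair
    yield count, word


def reduce_collect(count, words):
    yield count, sorted(words)


def run_chain(lines):
    """Feed the output of the first step into the second step."""
    step1 = run_mapreduce(lines, map_words, reduce_sum)        # Map1 -> Reduce1
    return run_mapreduce(step1, map_by_count, reduce_collect)  # Map2 -> Reduce2


if __name__ == "__main__":
    print(run_chain(["a b a", "b c a"]))
```

The key point is that the second step's mapper consumes the (word, count) pairs exactly as the first step's reducer emitted them, so the record format between steps has to be agreed on up front.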
Blog Post:
How to wipe out the DFS in Hadoop?
Avkash Chauhan - MSFT
If you format only the Namenode, it will remove the metadata stored by the Namenode; however, all the temporary storage and Datanode blocks will still be there. To remove the temporary storage and all the Datanode blocks, you need to delete the main Hadoop storage directory from every node. This directory...
on
28 Mar 2012
Blog Post:
Running Apache Mahout at Hadoop on Windows Azure (www.hadooponazure.com)
Avkash Chauhan - MSFT
Once you have access to Hadoop on Windows Azure, you can run any Mahout sample on the head node. Here I run an original Apache Mahout ( http://mahout.apache.org/ ) sample, derived from the clustering sample on Mahout's website ( https://cwiki.apache.org/confluence/display/MAHOUT...
on
6 Mar 2012
Blog Post:
Primary Namenode and Secondary Namenode configuration in Apache Hadoop
Avkash Chauhan - MSFT
The Apache Hadoop primary Namenode and secondary Namenode architecture is designed as below: Namenode Master: The conf/masters file defines the master nodes of any single-node or multi-node cluster. On the master, conf/masters looks like this: ———————-...
on
27 Feb 2012
Blog Post:
Master Slave architecture in Hadoop
Avkash Chauhan - MSFT
Apache Hadoop is designed with a master-slave architecture. Master: NameNode, JobTracker. Slave: {DataNode, TaskTracker}, ….. {DataNode, TaskTracker}. HDFS is one of the primary components of a Hadoop cluster, and HDFS is also designed with a master-slave architecture. Master: NameNode Slave...
on
24 Feb 2012
Blog Post:
Keys to understand relationship between MapReduce and HDFS
Avkash Chauhan - MSFT
Map Task (HDFS data localization): The unit of input for a map task is an HDFS data block of the input file. The map task functions most efficiently if the data block it has to process is available locally on the node on which the task is scheduled. This approach is called HDFS data localization....
on
15 Feb 2012
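The data-localization preference can be sketched as a small placement function: given where the replicas of a block live and which nodes are free, prefer a node that already holds the block. This is a simplified illustration, not Hadoop's scheduler (which also weighs rack locality, among other things); `pick_node` and the block and node names are hypothetical.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Choose a node for the map task that processes block_id.

    Prefer a free node that already stores the block (data-local);
    otherwise fall back to any free node, which implies a remote read.
    Returns (node, is_local), or (None, False) if no node is free.
    """
    for node in block_locations.get(block_id, []):
        if node in free_nodes:
            return node, True
    for node in sorted(free_nodes):
        return node, False
    return None, False


if __name__ == "__main__":
    # Hypothetical replica placement: block id -> nodes holding a copy.
    locations = {"blk_1": ["n1", "n2"], "blk_2": ["n3"]}
    free = {"n2", "n4"}
    print(pick_node("blk_1", locations, free))  # a replica holder is free
    print(pick_node("blk_2", locations, free))  # no holder free: remote read
```

When the fallback branch is taken, the block has to travel over the network to the chosen node, which is exactly the cost that data localization tries to avoid.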
Blog Post:
Hadoop Performance: How storage disk types in individual node will impact the job performance?
Avkash Chauhan - MSFT
As you may already know, a Hadoop cluster is network- and disk-I/O intensive. Recently I ran a test scenario in which I replaced the SATA hard disks with high-performance SSDs while keeping the rest of the cluster hardware the same. I was running the TeraSort test to validate if having...
on
14 Feb 2012
Blog Post:
Internals of Hadoop Pig Operators as MapReduce Job
Avkash Chauhan - MSFT
I was recently asked to show that Pig scripts are actually MapReduce jobs, so to explain it in a very simple way I created the following example: read a text file using a Pig script, then dump the content of the file. As you can see below, when the “dump” command was used a...
on
8 Feb 2012
Blog Post:
Which one to choose between Pig and Hive?
Avkash Chauhan - MSFT
Technically, both will do the job. If you are looking at it from an "either Hive or Pig" perspective, it means you don't know what you are doing yet. However, if you first define the data source, scope, and result representation and then decide between Hive and Pig, you will find they are...
on
7 Feb 2012
Blog Post:
Customizing your Hadoop cluster running on your own Windows Azure Subscription
Avkash Chauhan - MSFT
In this article we will learn how to customize the same Hadoop cluster. To learn about creating your own Hadoop cluster on Windows Azure by using your own Windows Azure Subscription account, click here. To add more worker nodes to the Hadoop cluster on Windows Azure we...
on
29 Jan 2012
Blog Post:
Creating your own Hadoop cluster on Windows Azure by using your own Windows Azure Subscription account
Avkash Chauhan - MSFT
[ As of now this functionality is not available with Hadoop on Windows Azure. These instructions are not currently applicable; when things change I will add more info here. Thanks for your support ] The Apache Hadoop distribution (currently in CTP) allows you to set up your own Hadoop cluster...
on
28 Jan 2012
Blog Post:
Setting Amazon S3 Storage as data source (s3n://) in Hadoop on Azure (hadooponazure.com) portal
Avkash Chauhan - MSFT
To set up your Amazon S3 account with an Apache Hadoop cluster on Windows Azure you just need your AWS security credentials, which look much like the ones below: After you have finished creating the Hadoop cluster in Windows Azure, you can log into your Hadoop Portal. In the portal, you can select “Manage...
on
27 Jan 2012
Blog Post:
Understanding Map/Reduce job in Apache Hadoop on Windows Azure (A Reverse Approach)
Avkash Chauhan - MSFT
When you run a Map/Reduce job in a Hadoop cluster on Windows Azure you get aggregated progress and a log directly on the portal, so you can see what is happening with your job. This log is different from what you see when you check individual job status on the datanode; instead, this log gives you cumulative...
on
20 Jan 2012
Blog Post:
Setting Windows Azure Blob Storage (asv) as data source directly from Portal at Hadoop on Azure
Avkash Chauhan - MSFT
After you log into your Hadoop Portal and configure your cluster, you can select the “Manage Data” tile as below: On the next screen you can select: “Set up ASV” to set your Windows Azure Blob Storage as the data source, or “Set up S3” to set your Amazon S3...
on
13 Jan 2012
Blog Post:
Running Apache Pig (Pig Latin) at Apache Hadoop on Windows Azure
Avkash Chauhan - MSFT
The Microsoft distribution of Apache Hadoop comes with Pig support along with an interactive JavaScript shell where users can run their Pig queries immediately without adding specific configuration. The Apache distribution running on Windows Azure has built-in support for Apache Pig. Apache Pig is a platform...
on
10 Jan 2012
Blog Post:
Apache Hadoop on Windows Azure : Running Hive Scripts from Interactive Hive Console
Avkash Chauhan - MSFT
The Microsoft distribution of Apache Hadoop comes with Hive support along with an interactive Hive shell where users can run their Hive queries immediately without adding specific configuration. The Apache distribution running on Windows Azure has built-in support for Hive. What is Hive? Hive...
on
9 Jan 2012
Blog Post:
Using Windows Azure Blob Storage (asv://) for input data and storing results in Hadoop Map/Reduce Job on Windows Azure
Avkash Chauhan - MSFT
The Microsoft distribution of Apache Hadoop comes with direct connectivity to cloud storage, i.e. Windows Azure Blob storage or Amazon S3. Here we will learn how to connect to your Windows Azure Storage directly from your Hadoop cluster. To learn how to connect your Hadoop cluster to Windows Azure Storage...
on
5 Jan 2012
Blog Post:
Apache Hadoop on Windows Azure: Connecting to Windows Azure Storage from Hadoop Cluster
Avkash Chauhan - MSFT
The Microsoft distribution of Apache Hadoop comes with direct connectivity to cloud storage, i.e. Windows Azure Blob storage or Amazon S3. Here we will learn how to connect to your Windows Azure Storage directly from your Hadoop cluster. As you know, Windows Azure Storage access needs the following two things...
on
5 Jan 2012
Page 1 of 2 (37 items)