We have seen multiple customers contact us because an oozie job appears to hang. The oozie job involves a sqoop action that exports data from a file in HDInsight to a table in a SQL Azure database. For background on Sqoop, see Getting Started with Sqoop. We will use this blog post to better understand HDInsight's behavior in this scenario. The actual problem is that the SQL Azure database raises a primary key violation.

Typically a primary key violation is not going to be resolved by another attempt unless the conflicting record is removed from the SQL Server table before you re-execute the command.

Within SQL Server you will see error 2627 when a primary key violation is encountered:

Msg 2627, Level 14, State 1, Line 1
Violation of PRIMARY KEY constraint 'PK_Table_4'. Cannot insert duplicate key in object 'dbo.Table1'. The duplicate key value is (1).
The statement has been terminated.
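If the export genuinely needs to succeed, the conflicting row has to be removed (or its key changed) before retrying. A minimal sketch, assuming sqlcmd is available on a machine that can reach the SQL Azure server, using the connection values from the example later in this post, and assuming the key column is named id (hypothetical; the error message shows the key value but not the column name):

REM Remove the row whose key value (1) conflicts with the incoming data, then re-run the export
sqlcmd -S xxxxx.database.windows.net -U hdp@xxxxx -P xxxxx -d wpc-wadb -Q "DELETE FROM dbo.Table1 WHERE id = 1"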

 

In the HDInsight tasktracker logs, the same primary key violation error shows up:

2014-03-15 13:51:00,704 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2014-03-15 13:51:02,501 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.WindowsResourceCalculatorPlugin@671aeb3
2014-03-15 13:51:04,094 INFO org.apache.hadoop.mapred.MapTask: Processing split: Paths:/user/hdp/Table1/Table1.csv:26+13
2014-03-15 13:51:04,547 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available
2014-03-15 13:51:04,547 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library loaded
2014-03-15 13:51:04,719 INFO org.apache.sqoop.mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
2014-03-15 13:51:05,329 WARN org.apache.sqoop.mapreduce.SQLServerExportDBExecThread: Error executing statement: java.sql.BatchUpdateException: Violation of PRIMARY KEY constraint 'PK_Table_4'. Cannot insert duplicate key in object 'dbo.Table1'. The duplicate key value is (1).

 

I first followed the "Getting Started with Sqoop" blog above and created a table in my SQL Azure database with four records. I then created a Table1.csv file with the same records and placed it in the default container of my WASB storage account for my HDInsight cluster. Trying to export the Table1.csv records into Table1 in my SQL Azure database should raise the primary key violation. The actual sqoop command is:

sqoop export --connect "jdbc:sqlserver://xxxxx.database.windows.net:1433;username=hdp@xxxxx;password=xxxxx;database=wpc-wadb" --table Table1 --export-dir /user/hdp/Table1 --input-fields-terminated-by ","
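For reference, here is a minimal sketch of how Table1.csv could have been staged into the export directory from the head node before running the command above. The local folder and the non-key column values are illustrative assumptions; the only important detail is that the key values already exist in dbo.Table1, which is what triggers the violation.

REM Table1.csv holds four rows whose key values already exist in dbo.Table1, e.g.:
REM   1,value1
REM   2,value2
REM   3,value3
REM   4,value4
REM Create the export directory and upload the file to the cluster's default WASB container
hadoop fs -mkdir /user/hdp/Table1
hadoop fs -copyFromLocal c:\temp\pk\Table1.csv /user/hdp/Table1/Table1.csv
hadoop fs -ls /user/hdp/Table1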

 

The map reduce job took 1 hour and 21 minutes to fail! Although the job eventually failed, to end users it appears to be hung because it runs for so long before failing. Why did it take so long to fail? By default on HDInsight, a map task will be attempted 8 times, and each attempt has a task timeout of 600 seconds. In the tasktracker output below you can see that the total finish time is 1 hour and 21 minutes: eight attempts were made, each attempt took 600 seconds, and 8 x 600 seconds is roughly 80 minutes, which gets us to the 1 hour and 21 minutes it took to fail the job completely. The url is http://jobtrackerhost:50030/jobtasks.jsp?jobid=job_201403141723_0012&type=map&pagenum=1

 

 

In your mapred-site.xml, located at C:\apps\dist\hadoop-1.2.0.1.3.6.0-0862\conf, there are two properties that affect this behavior: mapred.map.max.attempts and mapred.task.timeout.

 

<property>
<name>mapred.map.max.attempts</name>
<value>8</value>
</property>

<property>
<name>mapred.task.timeout</name>
<value>600000</value>
</property>

  

Let's change these properties, restart the namenode and jobtracker services, and run the sqoop command again to see if the behavior changes. Set mapred.map.max.attempts to 2 and mapred.task.timeout to 120000 (2 minutes). This time the map reduce job failed in 4 minutes and 40 seconds: it tried two map attempts and each attempt timed out at 120 seconds.
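For completeness, here is a sketch of the restart step on the head node. It assumes the Hadoop daemons are registered as Windows services named namenode and jobtracker; verify the exact service names in services.msc on your cluster before running this.

REM Restart the services so the edited mapred-site.xml is picked up
net stop jobtracker
net stop namenode
net start namenode
net start jobtracker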

 

 

Let's put mapred.map.max.attempts and mapred.task.timeout back to the HDInsight defaults and restart the namenode and jobtracker services. Next, let's create an oozie workflow and job.properties file and execute an oozie job with the sqoop action.

 

Workflow.xml

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
<start to = "OozieSqoopAction"/>
<action name="OozieSqoopAction">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<command>export --connect jdbc:sqlserver://xxxxx.database.windows.net:1433;username=hdp@xxxxx;password=xxxxx;database=wpc-wadb --table Table1 --export-dir /user/hdp/Table1 --input-fields-terminated-by ","</command>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>

 

Job.properties file

#oozie properties
oozie.wf.application.path=wasb://xxx21@portalvhdszmhjyc3mxxxxx.blob.core.windows.net/user/wcarroll/wf
oozie.use.system.libpath=true
#Hadoop mapred.job.tracker
jobTracker=jobtrackerhost:9010
#Hadoop fs.default.name
nameNode=wasb://xxx21@portalvhdszmhjyc3mxxxxx.blob.core.windows.net
#Hadoop mapred.queue.name
queueName=default

 

I then RDP'd into my head node, opened a Hadoop command prompt, and changed directories to C:\apps\dist\oozie-3.3.2.1.3.6.0-0862\oozie-win-distro\bin. I copied my workflow.xml file into wasb (a minimal sketch of that copy follows), copied the job.properties file to a local folder, and then issued the command further below to start the oozie job.
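A sketch of the copy step, assuming the cluster's default file system is the same wasb container referenced by nameNode in job.properties (so the relative path /user/wcarroll/wf resolves to the oozie.wf.application.path above), and assuming the local copies live in c:\temp\pk alongside job.properties:

REM Stage the workflow definition where oozie.wf.application.path points
hadoop fs -mkdir /user/wcarroll/wf
hadoop fs -copyFromLocal c:\temp\pk\workflow.xml /user/wcarroll/wf/workflow.xml
hadoop fs -ls /user/wcarroll/wf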

oozie job -oozie http://namenodehost:11000/oozie -config c:\temp\pk\job.properties -run
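The run command prints a job id. You can follow the workflow from the same prompt while it runs, or after it fails, with the standard oozie CLI options, substituting the id that was returned:

oozie job -oozie http://namenodehost:11000/oozie -info <job-id>
oozie job -oozie http://namenodehost:11000/oozie -log <job-id>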

 

If we look at the tasktracker logs again, we see the job took 41 minutes and 1 second to fail. It turns out that the default mapred.map.max.attempts on the Hortonworks Data Platform is actually 4, and the oozie service overrides the cluster's mapred-site.xml value for this parameter, so the map task is only attempted four times (4 x 600 seconds, roughly 40 minutes) instead of the 8 times we saw when issuing the sqoop command outside of oozie.

 

This is a little better, but we would still like to control this behavior. Let's modify the workflow.xml file to pass in mapred.map.max.attempts = 2 and mapred.task.timeout = 120000 and run the oozie job again. The new workflow.xml with the property changes is below. Copy it back to wasb and re-run the oozie job.

New Workflow.xml

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
<start to = "OozieSqoopAction"/>
<action name="OozieSqoopAction">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>   
<property>
<name>mapred.map.max.attempts</name>
<value>2</value>
</property>        
<property>
<name>mapred.task.timeout</name>
<value>120000</value>
</property>
</configuration>
<command>export --connect jdbc:sqlserver://xxxxx.database.windows.net:1433;username=hdp@xxxxx;password=xxxxx;database=wpc-wadb --table Table1 --export-dir /user/hdp/Table1 --input-fields-terminated-by ","</command>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}] </message>
</kill>
<end name="end"/>
</workflow-app>

 

Now we are back to the job failing in 4 minutes and 33 seconds, with 2 attempts each timing out after 120 seconds. Setting the mapreduce configuration properties in the workflow lets us change the mapreduce parameters of a specific oozie workflow without affecting all the other mapreduce jobs on the cluster. In this case it also allows our mapreduce job to fail faster, so we don't have to wait 1 hour and 21 minutes to realize the job failed because of a primary key violation in the SQL Azure database. We also now have a method to pass these configuration parameters per job.

 

 

For more information on using oozie with HDInsight, see Use oozie with HDInsight. That article describes how to submit an oozie job remotely with PowerShell, passing the mapred configuration properties along with it.

For more information on passing configuration parameters for a job on HDInsight, see Passing configuration parameters for a job on HDInsight.

Before we start changing cluster properties, it is best to understand HDInsight/Hadoop behavior. Many errors are transient, and for those a retry attempt is beneficial; a SQL Server deadlock error, for example, might succeed on retry. However, when you understand the nature of the error, passing configuration properties in your oozie workflow is a good option to have: you can change the configuration properties of a single mapreduce job issued from oozie without affecting all the mapreduce jobs on the cluster.

 

Hope this helps!

Bill