Apache Oozie is a workflow scheduler system used to manage and execute Hadoop jobs. When building a Directed Acyclic Graph (DAG) of tasks using Oozie, you define a workflow where each task or action is a node, and the edges between them dictate the order of execution. Here’s a step-by-step guide on how to create a DAG with Oozie:
1. Set Up Oozie Environment
Before building the DAG, ensure that Oozie is installed and configured on your Hadoop cluster. You’ll need:
• Oozie Server: Running and accessible
• HDFS: Where you will store workflow definitions and dependencies
• Oozie Client: To submit and manage workflows (a quick connectivity check is sketched below)
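Once these pieces are in place, you can confirm that the client can reach the server by checking its status. The URL below assumes the Oozie server runs at oozie-server on the default port 11000; adjust it for your cluster. A healthy server typically reports System mode: NORMAL.
oozie admin -oozie http://oozie-server:11000/oozie -status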
2. Define the Workflow XML
The DAG is defined in an XML file, typically named workflow.xml, which specifies each task and the dependencies between them. Each node in the DAG can represent an action such as a MapReduce, Spark, Pig, or Hive job, or a custom script.
Here’s a basic structure of a workflow XML file for Oozie:
<workflow-app xmlns="uri:oozie:workflow:0.5" name="example_workflow">

    <!-- Start node of the workflow -->
    <start to="first_task"/>

    <!-- Define actions -->
    <action name="first_task">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- Configuration parameters for the job -->
            </configuration>
        </map-reduce>
        <ok to="second_task"/>
        <error to="kill"/>
    </action>

    <action name="second_task">
        <spark xmlns="uri:oozie:spark-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>${sparkMaster}</master>
            <mode>cluster</mode>
            <name>example_spark_job</name>
            <class>com.example.SparkJob</class>
            <jar>${sparkJobJar}</jar>
            <!-- Additional Spark job arguments if necessary -->
        </spark>
        <ok to="end"/>
        <error to="kill"/>
    </action>

    <!-- Kill node for error handling -->
    <kill name="kill">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <!-- End node -->
    <end name="end"/>

</workflow-app>
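Before uploading the file, you can catch schema mistakes early with the Oozie client's validate command. Depending on your Oozie version this validates locally or against the server (in which case add -oozie <oozie-url>):
oozie validate workflow.xml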
3. Configure the Properties File
Oozie uses a properties file, conventionally named job.properties, to define configuration values. This file includes the workflow application path, the NameNode and ResourceManager addresses, and any other variables referenced in workflow.xml (such as ${sparkMaster} and ${sparkJobJar}). Note that on YARN clusters the jobTracker value points to the ResourceManager. Example:
nameNode=hdfs://namenode:8020
jobTracker=jobtracker:8032
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/oozie/workflows/example_workflow
sparkMaster=yarn
sparkJobJar=${nameNode}/user/${user.name}/spark-jobs/example-job.jar
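Note that the queueName property above is not referenced by the sample workflow.xml. If you want actions to run in that YARN queue, it is typically wired into each action's <configuration> block; a sketch using the classic MapReduce queue setting is shown below (some distributions use mapreduce.job.queuename instead):
<configuration>
    <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
    </property>
</configuration>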
4. Upload the Workflow to HDFS
Upload the workflow files (workflow.xml and any job-specific files such as JARs) to a directory in HDFS. The properties file is read from the local filesystem at submission time, but it is common to keep a copy alongside the workflow:
hadoop fs -mkdir -p /user/<username>/oozie/workflows/example_workflow
hadoop fs -put workflow.xml /user/<username>/oozie/workflows/example_workflow
hadoop fs -put job.properties /user/<username>/oozie/workflows/example_workflow
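In this example the Spark JAR lives under /user/<username>/spark-jobs/ (the sparkJobJar path in the properties file). An alternative convention is to place JARs in a lib/ subdirectory next to workflow.xml, which Oozie adds to the action classpath automatically:
hadoop fs -mkdir -p /user/<username>/oozie/workflows/example_workflow/lib
hadoop fs -put example-job.jar /user/<username>/oozie/workflows/example_workflow/lib/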
5. Submit and Monitor the Workflow
Submit the workflow to Oozie using the oozie job command with the properties file:
oozie job -oozie http://oozie-server:11000/oozie -config job.properties -run
To monitor the workflow, use:
oozie job -oozie http://oozie-server:11000/oozie -info <job-id>
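A few other common lifecycle commands from the same client, where <job-id> is the ID printed when the workflow was submitted:
oozie job -oozie http://oozie-server:11000/oozie -log <job-id>      # fetch the job log
oozie job -oozie http://oozie-server:11000/oozie -suspend <job-id>  # pause a running workflow
oozie job -oozie http://oozie-server:11000/oozie -resume <job-id>   # resume a suspended workflow
oozie job -oozie http://oozie-server:11000/oozie -kill <job-id>     # abort the workflow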
6. Define Coordinators or Bundles (Optional)
For recurring workflows, you can define coordinators that run the workflow on a schedule or when input data becomes available. A coordinator XML (coordinator.xml) defines the frequency, start and end times, and optional dataset dependencies that trigger your DAG workflow; a minimal sketch follows.
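As a rough illustration, a minimal time-based coordinator that launches the example workflow once a day might look like this (the name, dates, and schema version are placeholders to adapt):
<coordinator-app xmlns="uri:oozie:coordinator:0.4" name="example_coordinator"
                 frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z" timezone="UTC">
    <action>
        <workflow>
            <!-- HDFS directory containing workflow.xml -->
            <app-path>${nameNode}/user/${user.name}/oozie/workflows/example_workflow</app-path>
        </workflow>
    </action>
</coordinator-app>
It is submitted the same way as the workflow, with oozie.coord.application.path (instead of oozie.wf.application.path) pointing at the directory containing coordinator.xml.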
Additional Tips
• Transitions: Each action names its next node in its ok (success) and error (failure) transitions; for data-driven branching you can also add <decision> nodes, allowing complex DAGs with conditional paths.
• Fork and Join: You can parallelize tasks with the <fork> and <join> elements, where <fork> splits execution into parallel paths and <join> synchronizes them back together (a minimal fragment is sketched below).
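A minimal fork/join fragment, assuming two hypothetical actions named task_a and task_b that should run in parallel before continuing to next_task (each forked action would transition with <ok to="join_tasks"/>):
<fork name="parallel_tasks">
    <path start="task_a"/>
    <path start="task_b"/>
</fork>
<join name="join_tasks" to="next_task"/>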
Using these steps, you can build a DAG in Oozie to handle complex workflows, orchestrating a series of dependent and independent jobs in Hadoop.