Real-time data representation often demands quality content. In most cases, users are flooded with so many options and methodologies that choosing the right technique becomes overwhelming.
Quick notes from the many experts across the internet are a good start; in some instances a concise article may meet our requirement, while in other cases it may only offer a few pointers.
I believe that knowledge sharing is an effective way to grow on mutual terms. Over time, I realized how important it is to document my experience in a way that contributes back to the community.
This is an effort to recollect my 14 years of IT experience in the areas of data, data integration, and data warehousing, and to derive various data solutions through the Hadoop ecosystem.
In this article, I will:
- Discuss the prerequisites
- Walk through the migration from a traditional data warehouse to a Hadoop system
- Cover tools and techniques
Note: I will touch on a number of tools and technologies over the course of this series, including Hadoop, HDFS, Sqoop, Scala, Python, Hive, HBase, Spark 2, MapReduce, Kafka, Flume, Oozie, Unix shell, SSH, SQL Server, Oracle, Teradata, Informatica PowerCenter, Informatica Data Quality, Syncsort, Salesforce, Databricks, Dremio, Beeline, APIs, execution plans, Spark physical joins, automation, and best practices.
In this section, you will see how to build a data product:
- Integrate data available in existing data warehouses, legacy applications, and near-real-time sources
- Implement data standardization
- Discuss business logic and reporting needs
- Prepare data for rendering reports on the Hadoop platform
Note: We will keep infrastructure needs, planning, and platform setup out of scope for this article.
In this section, let us discuss the steps:
- Set up a data lake and ingest data
- Data processing using Apache Spark
- Automate using Apache Oozie
Let us walk through each of these steps.
Enterprise data can be structured, semi-structured, or unstructured. In our case, we will consider structured and semi-structured data. To perform the data ingestion, we need to build two types of process:
- Sqoop import for handling structured data; sources could be relational databases or delimited files
- Spark Streaming reading from a Kafka topic
To start with, you can perform a Sqoop import using the sqoop command from a Unix host with a Hadoop release installed. To support a variety of sources such as RDBMSs, flat files, and complex files, you can build a framework with a scripting or programming language, anywhere from shell scripts to more advanced Java-based programs. With no bias, the decision will rely on the availability of resources and the amount of time you are ready to spend building the reusable component; you are the right one to decide what is right for you. In either case you will realize the benefits, in the short term or the long term respectively.
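As a minimal sketch of such a reusable shell component (the function name, parameters, and example values below are hypothetical, not from an actual framework), a wrapper might assemble the Sqoop command from a handful of arguments:

```shell
#!/bin/bash
# Build a Sqoop import command from a few parameters so every source
# is ingested the same way. The helper echoes the command; the caller
# can review it and then eval it.
build_sqoop_import() {
  local jdbc_url="$1"      # e.g. jdbc:oracle:thin:@host:1521/service
  local user="$2"          # database user
  local pwd_file="$3"      # HDFS path to the password file
  local query="$4"         # free-form query; must contain $CONDITIONS
  local target_dir="$5"    # HDFS target directory
  local mappers="${6:-1}"  # parallel mappers, defaulting to 1

  echo "sqoop import" \
    "--connect \"$jdbc_url\"" \
    "--username $user" \
    "--password-file $pwd_file" \
    "--query '$query'" \
    "-m $mappers" \
    "--target-dir $target_dir"
}

# Example: print the command for review before running it.
build_sqoop_import "jdbc:oracle:thin:@dbhost:1521/orcl" BIGDATA_USER_T \
  /user/tbdingd01/sqoop/BIGDATA_USER_T \
  'select 1 from dual where $CONDITIONS' /tmp/test_ora 2
```

Printing the command rather than executing it keeps the helper easy to review and test; a thin caller can `eval` the output once it is trusted.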
Here is a sample sqoop command for a relational source:

```shell
sqoop import \
  --connect "jdbc:oracle:thin:@<hostname:datasource>" \
  --username BIGDATA_USER_T \
  --password-file /user/tbdingd01/sqoop/BIGDATA_USER_T \
  --query 'select 1 from dual where $CONDITIONS' \
  -m 1 \
  --target-dir /tmp/kiran/test_ora.txt
```
Once the sqoop import runs successfully, you can either create an external table in Hive with the required table configuration or perform Hadoop file operations. By default, Sqoop imports files in text format with a comma as the delimiter.
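For illustration (the table name, columns, and location below are hypothetical), a small helper can emit external-table DDL matching Sqoop's default text/comma layout; the output can then be run through beeline or the hive CLI:

```shell
#!/bin/bash
# Emit a CREATE EXTERNAL TABLE statement matching Sqoop's default
# output format: text files with comma-separated fields.
emit_external_ddl() {
  local table="$1"    # e.g. stage.test_ora
  local columns="$2"  # e.g. "id INT, name STRING"
  local location="$3" # HDFS directory written by sqoop import
  cat <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS $table (
  $columns
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '$location';
EOF
}

# Example: print the DDL for review, then run it through beeline or hive.
emit_external_ddl stage.test_ora "id INT, name STRING" /tmp/kiran/test_ora.txt
```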
In the process of building a data product, one ends up applying many resource-intensive analytical operations on medium to large data sets, and doing so efficiently matters. Apache Spark is a strong bet in this scenario: it speeds up job execution by caching data in memory and enabling parallelism in distributed data environments.
Components involved in a Spark implementation:
- Initialize a Spark session using a Scala program
- Ingest data from the data lake through Hive queries
- Apply business logic using Scala constructs or Hive queries
- Load data into HDFS or Hive targets
- Execute Spark programs through spark-submit
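The submission step above can be wrapped in a small driver script so every job is launched, logged, and checked the same way (the function name, jar path, and arguments below are hypothetical); a DRY_RUN switch lets you inspect the command before launching it on a cluster:

```shell
#!/bin/bash
# Wrap spark-submit so every job gets the same defaults, and a
# non-zero exit code is reported instead of silently ignored.
run_spark_job() {
  local class="$1"   # fully qualified main class
  local jar="$2"     # path to the application jar
  shift 2            # remaining arguments go to the application
  local cmd=(spark-submit --class "$class"
             --master yarn --deploy-mode client  # Spark 2 spelling of yarn-client
             --num-executors 1
             --driver-memory 512m --executor-memory 512m
             --executor-cores 1 "$jar" "$@")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "${cmd[@]}"   # print the command instead of submitting
    return 0
  fi
  "${cmd[@]}" || { echo "spark job failed: $class" >&2; return 1; }
}

# Example: inspect the command without touching a cluster.
DRY_RUN=1 run_spark_job org.apache.sparksample.sample /appl1/jars/sample.jar 2019-01-01
```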
We will learn the basics of the implementation using Scala. Please refer to this Git branch: click here.
Let us look at a simple spark-submit and the meaning of each configuration item:

```shell
spark-submit --class org.apache.sparksample.sample \
  --master yarn-client \
  --num-executors 1 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  <application-jar> [application-arguments]
```

- `--class`: the fully qualified main class, the entry point of your application
- `--master yarn-client`: run on YARN with the driver on the submitting machine (in Spark 2 this is written `--master yarn --deploy-mode client`)
- `--num-executors`: the number of executor containers to request
- `--driver-memory`: the memory allocated to the driver process
- `--executor-memory`: the memory allocated to each executor
- `--executor-cores`: the number of CPU cores per executor
Please refer to the link Running Sample Spark 2.x Applications for the Apache documentation.
Automate using an Oozie workflow
There are different ways to automate data pipelines. In this section, we will see the components required to configure Oozie:
- Sample Oozie property file
- Sample Oozie workflow
- Nodes in Oozie workflow
- Starting Oozie Workflow
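A minimal job.properties for such a workflow might look like the following sketch (host names, ports, and paths are placeholders, not from a real cluster):

```properties
nameNode=hdfs://hostname:8020
jobTracker=hostname:8050
resourceManager=hostname:8050
queueName=default
# Pull shared action libraries (Hive, Spark, etc.) from the Oozie sharelib
oozie.use.system.libpath=true
# HDFS directory that holds workflow.xml
oozie.wf.application.path=${nameNode}/appl1/jobs/oozie_sample
# Used by the SSH action that submits the Spark 2 job
sshHost=hostname
myFolder=oozie_sample
```

The same names can then be overridden or supplemented with -D options at submission time.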
Note: As Oozie does not support Spark 2 out of the box, we will customize the Oozie workflow to support Spark 2 and submit the workflow through SSH.
Please refer to my Git oozie sample branch for the XML and configuration files to build your Oozie workflow. You can build the workflow using an XML file or through the Oozie Workflow Manager in the Ambari UI. Once you build the workflow, I prefer to validate it once using a workflow import from the HDFS path. You should be able to see the workflow view as below, as long as your workflow.xml file is valid.
A simple Oozie workflow job submission will look as below:

```shell
oozie job -oozie $OozieURL -config $jobPropertiesName -run \
  -DnameNode=$nameNode -DjobTracker=$jobTracker -DresourceManager=$resourceManager \
  -DjdbcUrl=$JDBCUrl -Dhs2Principal=$hs2Principal -DqueueName=$QueueName \
  -DhdpVersion=$HdpVersion -DzeroCountTblVar=$ZeroCountTblVar \
  -DsshHost=$SSHHost -DmyFolder=$myScriptFolder
```
For example, with concrete values:

```shell
oozie job -oozie http://hostname:11000/oozie -config /appl1/jobs/oozie_sample.properties -run \
  -DnameNode=hdfs://hostname -DjobTracker=hostname:8050 -DresourceManager=hostname:8050 \
  -DjdbcUrl="jdbc:hive2://hostname:2181,hostname2:2181,hostname3:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;transportMode=http;httpPath=cliservice;" \
  -Dhs2Principal=hive/_HOST@hostname -DqueueName=default -DhdpVersion=18.104.22.168-2 \
  -DsshHost=hostname -DmyFolder=oozie_sample
```
Oozie does support re-running failed workflows; below is a sample:

```shell
oozie job -oozie http://hostname:11000/oozie -config /appl1/jobs/script/rerun/wf_oozie_sample.xml \
  -rerun 0000123-19234432643631-oozie-oozi-W
```

Here, 0000123-19234432643631-oozie-oozi-W is the job ID, which you can find on the failed workflow in the Oozie monitoring UI.
There are a variety of ways to get things done; I have opted for the simplest, and there may well be better ways to build Hadoop data pipelines, enable logging, and schedule the jobs. I have tried my best to keep the scripts simple and clean, and to follow basic standards for an easier learning exchange.
If you found this interesting or wish to leave a comment, please feel free to provide your inputs to improve the quality of my content.