Big Data & Data Science: Big Data: Apache Tez Execution engine on Apache Hive

What is Tez?

Tez is an application framework built on Hadoop YARN that can execute complex directed acyclic graphs (DAGs) of general data processing tasks. In many ways, it can be considered a much more flexible and powerful successor to the MapReduce framework.

Tez provides developers an API framework to write native YARN applications on Hadoop that bridges the spectrum of interactive and batch workloads. It allows those data access applications to work with petabytes of data over thousands of nodes.

In simple terms, Tez is a processing engine on top of the Hadoop ecosystem that runs on YARN and performs well within mixed-workload clusters.

Tools information:

The tools I am using in this project are listed below.

Hadoop: 2.7.5 version
Hive: 2.2.0 version
Tez: 0.9.0 version
Protobuf: 2.5.0 version

Tez Installation

Step 1: After installing Maven, install Protocol Buffers 2.5.0 or a later version. You can download protobuf-2.5.0 from the following link:

Protocol buffer-2.5.0

By default, it will be downloaded into your ‘Downloads’ folder.

Step 2: Now, untar the file using the below command:

tar -xvf protobuf-2.5.0.tar.gz

Now, you can view the extracted file in the ‘Downloads’ folder itself.

Step 3: Next, let’s move the protobuf-2.5.0 folder into your desired location. Here, we are moving it into the home directory.

mv protobuf-2.5.0 $HOME/

Step 4: Now, change into the protobuf-2.5.0 folder in its new location using the command

cd $HOME/protobuf-2.5.0

Step 5: Now type the below commands to configure protocol buffer.

sudo apt-get install autoconf autogen

./autogen.sh

./configure --prefix=/usr/local

Step 6: Execute the make command

make

Once configure has done its job, we can invoke make to build the software. This runs a series of tasks defined in a Makefile to build the finished program from its source code.

Step 7: Type the make install command

sudo make install

Now that the software is built and ready to run, the files can be copied to their final destinations. The make install command copies the built program, along with its libraries and documentation, to the correct locations.

It will take some time for the process to be completed.

Step 8: Now, check whether Protocol Buffers is installed with the below command:

protoc --version

Output: libprotoc 2.5.0

Note: If you see an error like the one below

ERROR: protoc: error while loading shared libraries: libprotoc.so.8: cannot open shared object file: No such file or directory

then use this command: export LD_LIBRARY_PATH=/usr/local/lib
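The export above only lasts for the current shell session. A minimal sketch for making it permanent, assuming a Bash setup that reads ~/.bashrc at startup; add_line_once is a hypothetical helper name used only for illustration and is not part of protobuf or Hadoop:

```shell
# add_line_once FILE LINE
# Appends LINE to FILE unless an identical line is already present,
# so running it more than once does not duplicate the export.
add_line_once() {
  file=$1
  line=$2
  # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
  if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}
```

For example, add_line_once "$HOME/.bashrc" 'export LD_LIBRARY_PATH=/usr/local/lib' adds the line once, and a new terminal will then pick it up.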

We have successfully installed the protocol buffer! Let’s install Apache Tez now.

Step 9: You can download Apache Tez from the following link:

wget http://redrockdigimark.com/apachemirror/tez/0.9.0/apache-tez-0.9.0-bin.tar.gz

wget saves the archive in the directory you run it from (a browser download goes to your ‘Downloads’ folder by default). You can move the archive to your desired location. Here, we are moving it into the ‘Home’ directory.

Step 10: Now, untar the file using the below command:

tar -xvf apache-tez-0.9.0-bin.tar.gz

Step 11: Now move this folder to /usr/local, $HOME, or any other safe location you prefer. Writing to /usr/local normally requires sudo.

sudo mv apache-tez-0.9.0-bin /usr/local

Also make a new directory in HDFS and copy the untarred Tez folder there.

hdfs dfs -mkdir -p /app/tez

hdfs dfs -copyFromLocal /usr/local/apache-tez-0.9.0-bin /app/tez

Step 12: Now edit tez-site.xml and mapred-site.xml in the tez/conf directory as shown below (in both the local and the HDFS Tez folders). If either file doesn’t exist there, create it first.

tez-site.xml (edit the property below in tez-site.xml)

<property>
<name>tez.lib.uris</name>
<value>hdfs://localhost:9000/app/tez/apache-tez-0.9.0-bin/share/tez.tar.gz</value>
<type>string</type>
</property>

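mapred-site.xml (add this property in mapred-site.xml; yarn-tez is the value the Apache Tez install guide uses to run MapReduce jobs on Tez)

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn-tez</value>
</property>
</configuration>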

After making these changes, save both files.
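If tez-site.xml does not exist yet, it can be generated from the shell. The following is only a sketch: it assumes the property being set is tez.lib.uris (the Tez setting that points at the Tez tarball in HDFS), and write_tez_site is a hypothetical helper name, not a Tez command.

```shell
# write_tez_site DIR URI
# Writes a minimal tez-site.xml into DIR with tez.lib.uris set to URI.
# Any existing tez-site.xml in DIR is overwritten.
write_tez_site() {
  conf_dir=$1
  lib_uri=$2
  mkdir -p "$conf_dir"
  cat > "$conf_dir/tez-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>$lib_uri</value>
  </property>
</configuration>
EOF
}
```

With the paths used in this post, the call would be write_tez_site /usr/local/apache-tez-0.9.0-bin/conf hdfs://localhost:9000/app/tez/apache-tez-0.9.0-bin/share/tez.tar.gz (run with sufficient permissions for /usr/local).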

Step 13: Add the properties below to the files in hadoop/etc/hadoop/conf/

hadoop-env.sh (add these lines to the hadoop-env.sh file)

# for tez engine

export TEZ_CONF_DIR=/usr/local/apache-tez-0.9.0-bin/conf

export TEZ_JARS=/usr/local/apache-tez-0.9.0-bin/

export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*:${HADOOP_CLASSPATH}:${JAVA_JDBC_LIBS}:${MAPREDUCE_LIBS}
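A long colon-separated classpath is hard to read on one line. As a quick sanity check, the sketch below prints one entry per line so the Tez entries are easy to spot; list_classpath is a hypothetical helper name introduced here for illustration.

```shell
# list_classpath CLASSPATH
# Prints each non-empty entry of a colon-separated classpath on its own line.
list_classpath() {
  printf '%s\n' "$1" | tr ':' '\n' | grep -v '^$'
}
```

After sourcing hadoop-env.sh, running list_classpath "$HADOOP_CLASSPATH" should show the Tez conf directory and the two Tez jar wildcards among the entries.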

yarn-site.xml (add these properties to the yarn-site.xml file)

<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>

<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>

Note: Setting yarn.nodemanager.vmem-check-enabled to false stops YARN from checking each container's virtual memory usage, so a container is no longer killed when it exceeds the virtual memory limit. yarn.nodemanager.vmem-pmem-ratio sets that limit as a multiple of the container's physical memory; the YARN default is 2.1.

If we don’t do this, we may face an error like the one below when running Tez as the Hive execution engine.

ErrorCode: failed execution error return code 1 from org.apache.hadoop.hive.ql.exec.tez.teztask

Container [pid=22291,containerID=container_1521621072290_0018_02_000001] is running beyond virtual memory limits. Current usage: 62.0 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
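The 2.1 GB figure in this log follows from the ratio setting: the virtual memory limit is the container's physical memory multiplied by yarn.nodemanager.vmem-pmem-ratio. A small sketch of that arithmetic, assuming the YARN default ratio of 2.1 (vmem_limit_gb is a hypothetical helper name, not a YARN command):

```shell
# vmem_limit_gb PMEM_MB RATIO
# Prints the virtual memory limit in GB for a container with PMEM_MB of
# physical memory: limit = physical memory * vmem-pmem ratio.
vmem_limit_gb() {
  awk -v pmem_mb="$1" -v ratio="$2" 'BEGIN { printf "%.1f\n", pmem_mb * ratio / 1024 }'
}

# The 1 GB container from the log with the default ratio: prints 2.1
vmem_limit_gb 1024 2.1
```

The container in the log used 2.4 GB of virtual memory, which is above this 2.1 GB limit, so YARN killed it even though only 62 MB of physical memory was in use.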

Step 14: Add the Hive lib jars to Tez, both locally and in HDFS. (To use the Tez engine in Hive, the Hive lib jars must also be present in tez/lib.)

To add the Hive lib jars to the local Tez folder:

sudo cp /usr/local/hive/lib/*.jar /usr/local/apache-tez-0.9.0-bin/lib/

To add the Hive lib jars to the HDFS Tez folder:

hdfs dfs -copyFromLocal /usr/local/hive/lib/*.jar /app/tez/apache-tez-0.9.0-bin/lib/

The installation steps are now complete, and the Tez engine should work on top of Hadoop.

Now it’s time to test the Tez engine:

To check our Tez engine, we will test a query in Hive.

Follow these commands:

Step 1: Start the Hadoop environment

start-all.sh

Then check that everything is running with the jps command.

Command: jps

Output:

7074 RunJar

8805 NameNode

9766 Jps

9432 NodeManager

9146 SecondaryNameNode

9309 ResourceManager

8926 DataNode
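Reading the jps listing by eye is fine on a single machine, but the check can also be scripted. The sketch below assumes the five Hadoop daemons shown in the output above are the ones you expect on this node; missing_daemons is a hypothetical helper name.

```shell
# missing_daemons JPS_OUTPUT
# Takes text in the format printed by jps ("<pid> <name>" per line) and
# prints the name of every expected Hadoop daemon that is not listed.
missing_daemons() {
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Compare against the whole process name so that SecondaryNameNode
    # is not mistaken for NameNode.
    if ! printf '%s\n' "$1" | awk '{ print $2 }' | grep -qx "$daemon"; then
      echo "$daemon"
    fi
  done
}

# Only query jps when it is installed on this machine.
if command -v jps >/dev/null 2>&1; then
  missing_daemons "$(jps)"
fi
```

No output means all five daemons are running; any name that is printed should be started before moving on to Hive.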

Step 2: Start Hive from the command shell by typing hive.

hduser@rdharm:/home/dharm$ hive

Now run a query in MR mode (MapReduce as the Hive engine), and you will see the map and reduce steps.

Command:

Select category, count(*) as cnt from youtube.youtube_videos group by category order by cnt desc;

Step 3: Now set hive.execution.engine=tez and run the same query again. This time the query runs as a Tez job instead of as MapReduce stages.

Commands:

Set hive.execution.engine=tez;

Select category, count(*) as cnt from youtube.youtube_videos group by category order by cnt desc;
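The same comparison can be run without opening the interactive shell by passing the statements to hive -e. This is a sketch under the assumption that the hive command is on your PATH and the youtube.youtube_videos table from this post exists; engine_statement is a hypothetical helper name.

```shell
QUERY='select category, count(*) as cnt from youtube.youtube_videos group by category order by cnt desc;'

# engine_statement ENGINE SQL
# Prefixes SQL with the setting that selects the Hive execution engine.
engine_statement() {
  printf 'set hive.execution.engine=%s; %s' "$1" "$2"
}

# Run the query once per engine, but only if hive is installed.
if command -v hive >/dev/null 2>&1; then
  for engine in mr tez; do
    echo "== engine: $engine =="
    hive -e "$(engine_statement "$engine" "$QUERY")"
  done
fi
```

Running both engines back to back makes it easy to compare the job output of the two runs for the same query.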

If everything works fine, then your Tez engine is working correctly with Hive.

Sunday, September 9, 2018
