
Saturday, September 8, 2018

Big Data: Apache Hive installation with Hadoop and MySQL (Metastore)

Hive Overview:

Hive is open-source Apache software built on top of Hadoop for data warehousing. It was introduced in 2009
and has since gained enormous popularity in the Big Data world.




Writing MapReduce programs by hand can be complicated for users, so Hive makes it easy to run the same jobs with Hive queries. Hive was created to simplify complex MapReduce programs
into SQL-like queries.

Hive queries are written in a SQL-like language called HiveQL, which Hive translates into MapReduce jobs
that are executed on the Hadoop cluster. Even complex MapReduce jobs can be expressed in HiveQL, and users can develop UDFs (user-defined functions) in Java for custom logic.
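As a quick illustration, an aggregation that would otherwise take a full MapReduce program shrinks to a single HiveQL statement. This is just a sketch; the employees table and its columns are hypothetical.

# Hypothetical example: Hive compiles this one-line aggregation into a
# MapReduce job behind the scenes ("employees" is assumed to exist).
hive -e "SELECT department, COUNT(*) AS headcount FROM employees GROUP BY department;"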

Since its introduction in 2009, Hive has become much faster and better over time.
It now supports several execution engines, such as Tez and Spark, as alternatives to the MapReduce engine, which makes it faster.
There are also other tools in the Big Data world that use Hive's metadata to bring interactive SQL to Hadoop.


Hive Execution engines: 

Hive on Tez: With the introduction of YARN as an independent resource manager, Tez has emerged as a complementary high-performance execution engine. Hive is in the process of being modified to run on Tez, allowing queries
to run significantly faster.


Shark: The Shark project adds functionality to Hive to allow it to run on top of the Spark execution engine, 
optimizing workflows and offering in-memory processing, improving performance significantly.

Other engines:
Impala: While technically a different component from Hive, Impala leverages Hive's query language (HiveQL)
and metadata to bring interactive SQL to Hadoop.

Drill: Apache Drill, while offering ANSI SQL rather than HiveQL, can also leverage the metadata in the Hive metastore for querying. This is in addition to querying nested data with dynamic schemas.
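Which engine a Hive session uses is controlled by the hive.execution.engine property. A minimal sketch, assuming the chosen engine (here Tez) is already installed and configured on the cluster; some_table is a hypothetical table name.

# Switch the execution engine for this session (valid values: mr, tez, spark),
# then run a query on it. Assumes Tez is installed; "some_table" is hypothetical.
hive -e "SET hive.execution.engine=tez; SELECT COUNT(*) FROM some_table;"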


Installation: Here we are going to install Hive 2.2.0 on top of Hadoop 2.7.5. Follow the steps below to do so.


Step 1: Download Hive with the command below:
wget http://www-us.apache.org/dist/hive/hive-2.2.0/apache-hive-2.2.0-bin.tar.gz

The tarball will be downloaded to the current directory; untar it with the command below.

tar -xzf apache-hive-2.2.0-bin.tar.gz

Now create a new directory /usr/local/hive and move the extracted Hive directory into it.
sudo mkdir -p /usr/local/hive
sudo mv apache-hive-2.2.0-bin /usr/local/hive

Step 2: Update the ~/.bashrc file:
sudo gedit ~/.bashrc
The file will open; add the lines below to it.
export HIVE_HOME=/usr/local/hive/apache-hive-2.2.0-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$HIVE_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib/*:.
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*
Then reload the file so the changes take effect:
source ~/.bashrc
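A quick sanity check that the variables took effect (this assumes Hadoop's own variables such as HADOOP_HOME are already set up):

# Both commands should print something; if "hive" is not found, re-check PATH.
echo $HIVE_HOME
hive --version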

Step 3: Update the hive/conf/hive-env.sh file (copy it from conf/hive-env.sh.template if it does not exist yet) with the lines below.
export HIVE_CONF_DIR=/usr/local/hive/apache-hive-2.2.0-bin/conf
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5

Next, update hive/conf/hive-site.xml with the properties below.
<property>
    <name>hive.exec.scratchdir</name>
    <value>hdfs://localhost:9000/tmp/hive</value>
    <description>HDFS root scratch dir for Hive jobs, which gets created with write-all (733) permission. For each connecting user, an HDFS scratch dir ${hive.exec.scratchdir}/&lt;username&gt; is created with ${hive.scratch.dir.permission}.</description>
</property>

<property>
    <name>hive.repl.rootdir</name>
    <value>hdfs://localhost:9000/user/hive/repl/</value>
    <description>HDFS root dir for all replication dumps.</description>
</property>

<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/${user.name}</value>
    <description>Local scratch space for Hive jobs</description>
</property>

<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/tmp/${user.name}_resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
</property>

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://localhost:9000/user/hive/warehouse</value>
    <description>Location of the default database for the warehouse</description>
</property>

Step 4: Create the /user/hive/warehouse directory in HDFS and change its permissions
so that Hive can read and write to it.

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/warehouse
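Since hive.exec.scratchdir was pointed at hdfs://localhost:9000/tmp/hive above, it is worth creating that directory the same way (a small extra step, done here as a sketch):

# Create the HDFS scratch dir configured in hive-site.xml and open it up,
# mirroring the warehouse directory setup above.
hdfs dfs -mkdir -p /tmp/hive
hdfs dfs -chmod -R 777 /tmp/hive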

With that, the basic setup is done.

Step 5: Initialize the Hive metastore.

Note: if you want to use Derby as the metastore for Hive, just run the command below from the hive/bin directory; if you want to use MySQL instead, skip it and follow the MySQL metastore steps further down.

$HIVE_HOME/bin/schematool -initSchema -dbType derby
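To confirm the schema was created, schematool can also report the recorded metastore version (an optional check):

# Print the schema version that the command above just initialized.
$HIVE_HOME/bin/schematool -dbType derby -info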




Setting up MySQL for the Hive metastore:
The metastore holds everything related to tables, databases, views, and so on.
To set up MySQL as the metastore for Hive, follow the steps below.

Step 1: Install the MySQL server and set a root password when prompted.
sudo apt-get install mysql-server
Then install the Java connector for MySQL.
sudo apt-get install libmysql-java
Open a terminal and log in to MySQL with the command below, entering your password when asked.
mysql -u root -p

Step 2: Copy the mysql-connector-java-5.1.38.jar file (the exact version on your system may differ) from /usr/share/java/ to the $HIVE_HOME/lib/ directory.
sudo cp /usr/share/java/mysql-connector-java-5.1.38.jar $HIVE_HOME/lib/
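A quick check that the connector is now visible to Hive:

# The copy above should show up here; adjust the version if yours differs.
ls $HIVE_HOME/lib | grep mysql-connector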

Step 3: Update the properties below in hive/conf/hive-site.xml.
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
</property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>Password to use against metastore database</description>
</property>

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?useSSL=false&amp;createDatabaseIfNotExist=true</value>
</property>

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>

After updating properties, save hive-site.xml.

Step 4: Now we need to create a database in MySQL and load the matching schema version into it. Run the commands below in the MySQL shell. Note that MySQL's SOURCE command does not expand shell variables such as $HIVE_HOME, so use the full path to the script, and pick the schema file matching your Hive version (2.2.0 here).
CREATE DATABASE metastore;
USE metastore;
SOURCE /usr/local/hive/apache-hive-2.2.0-bin/scripts/metastore/upgrade/mysql/hive-schema-2.2.0.mysql.sql;
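Alternatively, the same schema can be initialized from the shell with Hive's schematool, using the JDBC settings already placed in hive-site.xml (createDatabaseIfNotExist=true even creates the database for you):

# Equivalent to the manual CREATE DATABASE + SOURCE above.
$HIVE_HOME/bin/schematool -dbType mysql -initSchema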

Step 5: Now that everything is set up, run hive in your terminal and create a table to check whether
Hive is connected to MySQL and Hadoop.
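A minimal smoke test might look like this; the table name test_hive and its columns are hypothetical:

# Create a table and list tables; any error here points at the
# metastore (MySQL) or HDFS configuration above.
hive -e "CREATE TABLE test_hive (id INT, name STRING); SHOW TABLES;"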



We can also check the created table in HDFS and in the MySQL metastore.
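Both checks can be run from the shell (assuming the hypothetical test_hive table from the previous step):

# The table appears as a directory under the warehouse path in HDFS...
hdfs dfs -ls /user/hive/warehouse
# ...and as a row in the metastore's TBLS table in MySQL.
mysql -u root -p -e "SELECT TBL_NAME FROM metastore.TBLS;"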



As we can see, the table we just created in Hive shows up in the HDFS web UI.



Congrats, your Hive setup is now working perfectly fine.
