
Monday, September 10, 2018

Big Data: Apache Spark with Hive Database

Making a Connection between Apache Spark and Apache Hive


To make the connection between Spark and Hive we need the setup below. If Apache Spark and Hadoop are not installed yet, set them up first.
1. Hadoop environment
2. Apache Spark
3. MySQL database
4. Hive environment


1. Installing Hive


1.1 Install Hive: go to /usr/local and download the Hive tarball
   commands:
      wget http://www-us.apache.org/dist/hive/hive-2.2.0/apache-hive-2.2.0-bin.tar.gz


      sudo tar xvzf apache-hive-2.2.0-bin.tar.gz -C /usr/local/
      sudo mv /usr/local/apache-hive-2.2.0-bin /usr/local/hive2.2
1.2 Add the Hive paths to the .bashrc file (make sure the Hive and Hadoop directory paths are correct):
  command:   sudo gedit ~/.bashrc
Add the lines below to the .bashrc file:
    export HIVE_HOME=/usr/local/hive2.2
    export HIVE_CONF_DIR=/usr/local/hive2.2/conf
    export PATH=$HIVE_HOME/bin:$PATH
    export CLASSPATH=$CLASSPATH:/usr/local/hadoop/hadoop-2.7.5/lib/*:.
    export CLASSPATH=$CLASSPATH:/usr/local/hive2.2/lib/*
Then reload the .bashrc file with the command below:
command: source ~/.bashrc
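To confirm Hive is now on the PATH, print its version (a quick sanity check; the exact output depends on the build you downloaded):
   command: hive --version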
1.3 Create Hadoop directories for Hive data storage:
The warehouse directory is the location where Hive stores table data, and the temporary directory /tmp/hive is where intermediate results of processing are kept.
Start the Hadoop environment with the start-all.sh command and check that all daemons are running with jps.
   Commands to run:
      start-all.sh
      hdfs dfs -mkdir -p /user/hive/warehouse
      hdfs dfs -mkdir -p /tmp/hive
      hdfs dfs -chmod -R 777 /user/hive/warehouse
      hdfs dfs -chmod -R 777 /tmp/hive


1.4 Edit the hive-env.sh and hive-site.xml files
  • hive-env.sh file
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/usr/local/hive2.2/conf
  • hive-site.xml
<property>
   <name>hive.metastore.warehouse.dir</name>
   <value>hdfs://localhost:9000/user/hive/warehouse</value>
   <description>location of default database for the warehouse</description>
</property>
<property>
   <name>hive.exec.scratchdir</name>
  <value>hdfs://localhost:9000/tmp/hive</value>
   <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
   <name>hive.repl.rootdir</name>
   <value>hdfs://localhost:9000/user/hive/repl/</value>
  <description>HDFS root dir for all replication dumps.</description>
</property>


<property>
   <name>hive.exec.local.scratchdir</name>
   <value>/tmp/${user.name}</value>
   <description>Local scratch space for Hive jobs</description>
</property>
<property>
   <name>hive.downloaded.resources.dir</name>
   <value>/tmp/${user.name}_resources</value>
   <description>Temporary local directory for added resources in the remote file system.</description>
</property>


2. Installing MySQL as the metastore for Hive


Step 1: Install MySQL:
            sudo apt-get install mysql-server


Step 2: Install the MySQL Java Connector:


           sudo apt-get install libmysql-java
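(Optional) On Debian/Ubuntu this package typically installs the jar at /usr/share/java/mysql-connector-java.jar; if you prefer to reuse it instead of downloading the connector separately in Step 6, you can symlink it into Hive's lib folder (a sketch; the path may vary on your system):
           sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/local/hive2.2/lib/mysql-connector-java.jar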


Step 3: Log in to MySQL and create a user which will access the MySQL databases from Hive:

        mysql -u root -p  (run this command to enter MySQL as the root user)

(Note: here we are creating a MySQL user for Hive; set the username and password as you prefer.)
mysql> CREATE USER 'hduser'@'%' IDENTIFIED BY 'hduserpassword';


        mysql> GRANT ALL ON *.* TO 'hduser'@'%' IDENTIFIED BY 'hduserpassword';
        mysql>  flush privileges;
Now hduser will also have all privileges over the databases.
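Before moving on, it is worth checking that the new user can log in (use the password you set above):
        mysql -u hduser -p
        mysql> show databases;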


Step 4: Change the following connection properties in hive-site.xml for the MySQL metastore, as shown below.


<property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
   <description>Driver class name for a JDBC metastore</description>
</property>


<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://localhost:3306/metastore?useSSL=false&amp;createDatabaseIfNotExist=true</value>
<description>
     JDBC connect string for a JDBC metastore.
     To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
     For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
   </description>
</property>


<property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>hduser</value>
   <description>Username to use against metastore database</description>
 </property>


<property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>hduserpassword</value>
   <description>password to use against metastore database</description>
 </property>


Step 5: Create a database in MySQL for the Hive metastore


mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/local/hive2.2/scripts/metastore/upgrade/mysql/hive-schema-0.14.0.mysql.sql;
Load the schema manually with the SOURCE command above (note that the mysql client does not expand $HIVE_HOME, so give the full path); otherwise Spark will report a schema-mismatch error.
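Alternatively, Hive 2.x ships a schematool utility that initializes the metastore schema matching your installed Hive release (note this may pick a newer schema version than the one chosen above):
    command: schematool -dbType mysql -initSchema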


Step 6: Download the appropriate mysql-connector-java-5.1.38.jar file and copy it into the Hive lib folder (commands sketched below).


file:  mysql-connector-java-5.1.38.jar
directory:  /usr/local/hive2.2/lib/
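If you do not have the jar yet, it can be fetched from Maven Central and copied into place (a sketch; the URL is the standard Maven Central path for this artifact and version):
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar
    sudo cp mysql-connector-java-5.1.38.jar /usr/local/hive2.2/lib/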


Now it is time to test your setup:


Step 1: Open hive in a terminal and create a table:


hive> create table emp(name string, age int, salary int);


Step 2: Check this table in MySQL with the commands below:


    mysql> use metastore;

    mysql> show tables;


    mysql> select * from TBLS;


Now you will see the table you created listed there (TBLS is the metastore table that records Hive tables).


You can also check your table in the Hadoop HDFS files


at http://localhost:50070, the Hadoop NameNode web UI port.
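You can also list the warehouse directory from the command line, assuming the warehouse path configured in hive-site.xml earlier:
    hdfs dfs -ls /user/hive/warehouse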


3. Setting up Spark for Hive


Step 1: Copy the hive-site.xml file from $HIVE_HOME/conf to the $SPARK_HOME/conf directory:


sudo cp $HIVE_HOME/conf/hive-site.xml  $SPARK_HOME/conf


Step 2: Copy the MySQL Java driver file into the Spark jars folder:


    /usr/local/spark2.2.1/jars/mysql-connector-java-5.1.38.jar


This is the same file we added to the $HIVE_HOME/lib folder before; a copy command is sketched below.
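A copy command along these lines should work (assuming Spark is installed under /usr/local/spark2.2.1, as used elsewhere in this post):
    sudo cp /usr/local/hive2.2/lib/mysql-connector-java-5.1.38.jar /usr/local/spark2.2.1/jars/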


Step 3: If the spark-env.sh file does not exist, create it from the spark-env.sh.template file in $SPARK_HOME/conf.
Then add the lines below to the spark-env.sh file:
 
    export SPARK_CLASSPATH="/usr/local/spark2.2.1/jars/mysql-connector-java-5.1.38.jar"
    export HADOOP_CONF_DIR="/usr/local/hadoop/hadoop-2.7.5/etc/hadoop"
    export YARN_CONF_DIR="/usr/local/hadoop/hadoop-2.7.5/etc/hadoop"

Now we will be able to access Hive tables in Spark and use Spark SQL, as in the example below.
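For example, a quick check from spark-shell (a sketch: it assumes a Spark build with Hive support, which the pre-built binaries include, and the emp table created earlier):

    spark-shell
    scala> spark.sql("show tables").show()
    scala> spark.sql("select * from emp").show()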

