
Monday, September 10, 2018

Big Data: Apache Spark with Hive Database

Making a Connection between Apache Spark and Apache Hive


To make the connection between Spark and Hive we need the setup below. If Apache Spark and Hadoop are not installed yet, set them up first.
1. Hadoop environment
2. Apache Spark
3. MySQL database
4. Hive environment


1. Installing Hive


1.1 Install Hive: go to /usr/local and download the Hive tarball
   commands:
      wget http://www-us.apache.org/dist/hive/hive-2.2.0/apache-hive-2.2.0-bin.tar.gz


      sudo tar xvzf apache-hive-2.2.0-bin.tar.gz -C /usr/local/
      sudo mv /usr/local/apache-hive-2.2.0-bin /usr/local/hive2.2
1.2 Add the Hive paths to the .bashrc file (make sure the Hive and Hadoop directory paths are correct):
  command:   sudo gedit ~/.bashrc
Add the lines below to the .bashrc file:
    export HIVE_HOME=/usr/local/hive2.2
    export HIVE_CONF_DIR=/usr/local/hive2.2/conf
    export PATH=$HIVE_HOME/bin:$PATH
    export CLASSPATH=$CLASSPATH:/usr/local/hadoop/hadoop-2.7.5/lib/*:.
    export CLASSPATH=$CLASSPATH:/usr/local/hive2.2/lib/*
Then reload the .bashrc file with the command below:
command: source ~/.bashrc
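To confirm Hive is now on the PATH, print its version (a quick sanity check; the exact output depends on the build you downloaded):
   command: hive --version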
1.3 Create Hadoop directories for Hive data storage:
The warehouse directory is the location where Hive stores table data, and the temporary directory /tmp/hive is where intermediate results of processing are kept.
Start the Hadoop environment with the start-all.sh command and check that all daemons are running with jps.
   Commands to run:
      start-all.sh
      hdfs dfs -mkdir -p /user/hive/warehouse
      hdfs dfs -mkdir -p /tmp/hive
      hdfs dfs -chmod -R 777 /user/hive/warehouse
      hdfs dfs -chmod -R 777 /tmp/hive


1.4 Edit the hive-env.sh and hive-site.xml files
  • hive-env.sh file
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/usr/local/hive2.2/conf
  • hive-site.xml
<property>
   <name>hive.metastore.warehouse.dir</name>
   <value>hdfs://localhost:9000/user/hive/warehouse</value>
   <description>location of default database for the warehouse</description>
</property>
<property>
   <name>hive.exec.scratchdir</name>
  <value>hdfs://localhost:9000/tmp/hive</value>
   <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
   <name>hive.repl.rootdir</name>
   <value>hdfs://localhost:9000/user/hive/repl/</value>
  <description>HDFS root dir for all replication dumps.</description>
</property>


<property>
   <name>hive.exec.local.scratchdir</name>
   <value>/tmp/${user.name}</value>
   <description>Local scratch space for Hive jobs</description>
</property>
<property>
   <name>hive.downloaded.resources.dir</name>
   <value>/tmp/${user.name}_resources</value>
   <description>Temporary local directory for added resources in the remote file system.</description>
</property>


2. Installing MySQL as the metastore for Hive


Step 1: Install MySQL:
            sudo apt-get install mysql-server


Step 2: Install the MySQL Java Connector:


           sudo apt-get install libmysql-java
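(Optional) On Debian/Ubuntu this package typically installs the jar at /usr/share/java/mysql-connector-java.jar; if you prefer to reuse it instead of downloading the connector separately in Step 6, you can symlink it into Hive's lib folder (a sketch; the path may vary on your system):
           sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/local/hive2.2/lib/mysql-connector-java.jar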


Step 3: Log in to MySQL and create a user which will access the MySQL databases from Hive:

        mysql -u root -p  (run this command to enter MySQL as the root user)

(Note: here we are creating a MySQL user for Hive; set the username and password as you prefer.)
mysql> CREATE USER 'hduser'@'%' IDENTIFIED BY 'hduserpassword';


        mysql> GRANT ALL ON *.* TO 'hduser'@'%' IDENTIFIED BY 'hduserpassword';
        mysql>  flush privileges;
Now hduser will also have all privileges over the databases.
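Before moving on, it is worth checking that the new user can log in (use the password you set above):
        mysql -u hduser -p
        mysql> show databases;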


Step 4: Change the following connection properties in hive-site.xml for the MySQL metastore, as shown below.


<property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
   <description>Driver class name for a JDBC metastore</description>
</property>


<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://localhost:3306/metastore?useSSL=false&amp;createDatabaseIfNotExist=true</value>
<description>
     JDBC connect string for a JDBC metastore.
     To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
     For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
   </description>
</property>


<property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>hduser</value>
   <description>Username to use against metastore database</description>
 </property>


<property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>hduserpassword</value>
   <description>password to use against metastore database</description>
 </property>


Step 5: Create a database in MySQL for the Hive metastore


mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/local/hive2.2/scripts/metastore/upgrade/mysql/hive-schema-0.14.0.mysql.sql;
Load the schema manually with the SOURCE command above (note that the mysql client does not expand $HIVE_HOME, so give the full path); otherwise Spark will report a schema-mismatch error.
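Alternatively, Hive 2.x ships a schematool utility that initializes the metastore schema matching your installed Hive release (note this may pick a newer schema version than the one chosen above):
    command: schematool -dbType mysql -initSchema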


Step 6: Download the appropriate mysql-connector-java-5.1.38.jar file and copy it into the Hive lib folder (commands sketched below).


file:  mysql-connector-java-5.1.38.jar
directory:  /usr/local/hive2.2/lib/
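If you do not have the jar yet, it can be fetched from Maven Central and copied into place (a sketch; the URL is the standard Maven Central path for this artifact and version):
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar
    sudo cp mysql-connector-java-5.1.38.jar /usr/local/hive2.2/lib/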


Now it is time to test your setup:


Step 1: Open hive in a terminal and create a table:


hive> create table emp(name string, age int, salary int);


Step 2: Check this table in MySQL with the commands below:


    mysql> use metastore;

    mysql> show tables;


    mysql> select * from TBLS;


Now you will see the table you created listed there (TBLS is the metastore table that records Hive tables).


You can also check your table in the Hadoop HDFS files


at http://localhost:50070, the Hadoop NameNode web UI port.
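You can also list the warehouse directory from the command line, assuming the warehouse path configured in hive-site.xml earlier:
    hdfs dfs -ls /user/hive/warehouse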


3. Setting up Spark for Hive


Step 1: Copy the hive-site.xml file from $HIVE_HOME/conf to the $SPARK_HOME/conf directory:


sudo cp $HIVE_HOME/conf/hive-site.xml  $SPARK_HOME/conf


Step 2: Copy the MySQL Java driver file into the Spark jars folder:


    /usr/local/spark2.2.1/jars/mysql-connector-java-5.1.38.jar


This is the same file we added to the $HIVE_HOME/lib folder before; a copy command is sketched below.
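A copy command along these lines should work (assuming Spark is installed under /usr/local/spark2.2.1, as used elsewhere in this post):
    sudo cp /usr/local/hive2.2/lib/mysql-connector-java-5.1.38.jar /usr/local/spark2.2.1/jars/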


Step 3: If the spark-env.sh file does not exist, create it from the spark-env.sh.template file in $SPARK_HOME/conf.
Then add the lines below to the spark-env.sh file:
 
    export SPARK_CLASSPATH="/usr/local/spark2.2.1/jars/mysql-connector-java-5.1.38.jar"
    export HADOOP_CONF_DIR="/usr/local/hadoop/hadoop-2.7.5/etc/hadoop"
    export YARN_CONF_DIR="/usr/local/hadoop/hadoop-2.7.5/etc/hadoop"

Now we will be able to access Hive tables in Spark and use Spark SQL, as in the example below.
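For example, a quick check from spark-shell (a sketch: it assumes a Spark build with Hive support, which the pre-built binaries include, and the emp table created earlier):

    spark-shell
    scala> spark.sql("show tables").show()
    scala> spark.sql("select * from emp").show()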

