Monday, September 10, 2018

Big Data: Installing HUE WebUI on top of Hadoop and Hive

HUE: Hue (Hadoop User Experience) is an open-source web UI for the Hadoop ecosystem. It provides connections to Hadoop ecosystem tools such as HDFS, Hive, Pig, Impala, and Spark through a single interactive web interface.






Hue depends on the following packages:

gcc, g++, libxml2-dev, libxslt-dev, libsasl2-dev, libsasl2-modules-gssapi-mit,
libmysqlclient-dev, python-dev, python-setuptools, libsqlite3-dev, ant,
libkrb5-dev, libtidy-0.99-0, libldap2-dev, libssl-dev, libgmp3-dev.


Installing the packages: to install all of these packages, run the commands given below.


sudo apt-get update
sudo apt-get install gcc g++ libxml2-dev libxslt-dev libsasl2-dev libsasl2-modules-gssapi-mit libmysqlclient-dev python-dev python-setuptools libsqlite3-dev ant libkrb5-dev libtidy-0.99-0 libldap2-dev libssl-dev libgmp3-dev


Installation and Configuration
Perform the installation as the hadoop user (if you have a dedicated Hadoop admin user, use that).
su - hduser
Download Hue from gethue.com (the link below is an example obtained from the Hue website).
wget https://dl.dropboxusercontent.com/u/730827/hue/releases/4.1.0/hue-4.1.0.tgz

Extract the downloaded tarball

tar -xvf hue-4.1.0.tgz

Execute install command

    cd hue-4.1.0
    make install   
Note: if you receive an error like c/_cffi_backend.c:15:17: fatal error: ffi.h: No such file or directory during make install


Run the commands below to resolve it:
    sudo apt-get update
    sudo apt-get install libffi-dev
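
Note: by default, make install typically places Hue under /usr/local/hue. If you would rather have it land where the HUE_HOME variable set later in this post points (/home/hadoop/hue), you can pass a PREFIX to the install. A minimal sketch, assuming the same hue-4.1.0 build directory as above:

    cd hue-4.1.0
    PREFIX=/home/hadoop make install    # installs Hue into /home/hadoop/hue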


Once the above process is completed, update the ~/.bashrc file:

export HUE_HOME=/home/hadoop/hue
export PATH=$PATH:$HUE_HOME/build/env/bin


Source the file after adding the entries:
    source ~/.bashrc



Configure Hue (hue.ini, plus three Hadoop conf files)

  • Make these changes in the hue.ini file (a sketch of how these keys are laid out inside hue.ini follows this list).
    Open the conf directory:
    cd $HUE_HOME/desktop/conf

[desktop]
    server_user=hduser
    server_group=hduser
    default_user=hduser
    default_hdfs_superuser=hduser


 app_blacklist=impala,security
 fs_defaultfs=hdfs://localhost:9000
 hadoop_conf_dir=$HADOOP_CONF_DIR
 resourcemanager_host=localhost
 resourcemanager_api_url=http://localhost:8088

 hive_server_host=localhost
 hive_server_port=10000
 hive_conf_dir=/usr/local/hive2.2/conf


 proxy_api_url=http://localhost:8088
 history_server_api_url=http://localhost:19888
 hbase_clusters=(Cluster|localhost:9090)
 oozie_url=http://localhost:11000/oozie


Make changes in the Hadoop conf files:
Open the conf directory and edit the files given below.
cd $HADOOP_CONF_DIR

  • core-site.xml
    Note: replace hduser with your own Hadoop username.


<property>
   <name>hadoop.proxyuser.hduser.hosts</name>
   <value>*</value>
</property>
<property>
   <name>hadoop.proxyuser.hduser.groups</name>
   <value>*</value>
</property>



  • hdfs-site.xml
    Note: this enables WebHDFS so that Hue can access HDFS.
<property>
   <name>dfs.webhdfs.enabled</name>
   <value>true</value>
</property>

  • httpfs-site.xml
    Note: here we give HDFS proxy access to the Hue user (hduser) that we created earlier.
<property>
 <name>httpfs.proxyuser.hduser.hosts</name>
 <value>*</value>
 <description>This property allows Hue to access HDFS.</description>
</property>


<property>
 <name>httpfs.proxyuser.hduser.groups</name>
 <value>*</value>
  <description>This property allows Hue to access HDFS.</description>
</property>

Starting Hadoop and hiveserver2 for hue:
Start Hadoop:
start-all.sh
  Start Hive server:
$HIVE_HOME/bin/hiveserver2 
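
Before opening Hue, it can help to sanity-check the two services it talks to: WebHDFS and HiveServer2. A couple of quick checks, assuming the default ports used in this post (HiveServer2 may take a minute before it starts listening):

    curl "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS"        # WebHDFS should return a JSON directory listing
    $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n hduser -e "show databases;"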


Start Hue:

nohup supervisor &
or
$HUE_HOME/build/env/bin/hue runserver

Login to Hue Web Interface: http://localhost:8888


Reference link for help:

Big Data: Apache Spark with Hive Database

Making Connection between Apache Spark and Apache Hive


To make a connection between Spark and Hive we need the configuration below. If Apache Spark and Hadoop are not installed yet, follow the links below and set them up first.
1. Hadoop environment
2. Spark environment
3. MySQL database
4. Hive environment


1. Installing Hive


1.1 Install Hive: go to /usr/local and create a directory for Hive.
   commands:
      wget http://www-us.apache.org/dist/hive/hive-2.2.0/apache-hive-2.2.0-bin.tar.gz


     sudo tar xvzf apache-hive-2.2.0-bin.tar.gz -C /usr/local/hive
   (Make sure the extracted location matches the HIVE_HOME path used below, /usr/local/hive2.2.)
1.2 Add the Hive paths to the bashrc file (make sure the Hive and Hadoop directory paths are correct):
  command:   sudo gedit ~/.bashrc
Add the lines below to the bashrc file:
    export HIVE_HOME=/usr/local/hive2.2
    export HIVE_CONF_DIR=/usr/local/hive2.2/conf
    export PATH=$HIVE_HOME/bin:$PATH
    export CLASSPATH=$CLASSPATH:/usr/local/hadoop/hadoop-2.7.5/lib/*:.
    export CLASSPATH=$CLASSPATH:/usr/local/hive2.2/lib/*
Then source the bashrc file with the command below:
command: source ~/.bashrc
1.3 Creating HDFS directories for Hive data storage:
The warehouse directory is the location where Hive stores table data, and the temporary directory tmp is where Hive stores intermediate results of processing.
Start the Hadoop environment with the start-all.sh command and check that all daemons are running with jps.
   Commands to run:
      start-all.sh
    hdfs dfs -mkdir -p /user/hive/warehouse
    hdfs dfs -mkdir -p /tmp/hive
    hdfs dfs -chmod -R 777 /user/hive/warehouse /tmp/hive


1.4 Edit the hive-env.sh and hive-site.xml files
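If $HIVE_CONF_DIR does not contain these files yet, they can be created from the templates that ship with the Hive tarball, for example:

    cd /usr/local/hive2.2/conf
    sudo cp hive-env.sh.template hive-env.sh
    sudo cp hive-default.xml.template hive-site.xml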
  • hive-env.sh file
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/usr/local/hive2.2/conf
  • hive-site.xml
<property>
   <name>hive.metastore.warehouse.dir</name>
   <value>hdfs://localhost:9000/user/hive/warehouse</value>
   <description>location of default database for the warehouse</description>
</property>
<property>
   <name>hive.exec.scratchdir</name>
  <value>hdfs://localhost:9000/tmp/hive</value>
   <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
   <name>hive.repl.rootdir</name>
   <value>hdfs://localhost:9000/user/hive/repl/</value>
  <description>HDFS root dir for all replication dumps.</description>
</property>


      <property>
       <name>hive.exec.local.scratchdir</name>
       <value>/tmp/${user.name}</value>
       <description>Local scratch space for Hive jobs</description>
    </property>
<property>
     <name>hive.downloaded.resources.dir</name>
     <value>/tmp/${user.name}_resources</value>
     <description>Temporary local directory for added resources in the remote file system.</description>
</property>


2. Installing MySQL as the metastore for Hive


Step1: Install mysql:
            sudo apt-get install mysql-server


Step2: Install the MySQL Java Connector:


           sudo apt-get install libmysql-java


Step3: Log in to MySQL and create a user which will access the MySQL database from Hive:

        mysql -u root -p  (enter this command to log in to MySQL as the root user)

(Note: here we are creating a MySQL user for Hive; set the username and password as you like.)
mysql> CREATE USER 'hduser'@'%' IDENTIFIED BY 'hduserpassword';


        mysql> GRANT ALL ON *.* TO 'hduser'@'localhost' IDENTIFIED BY 'hduserpassword';
        mysql> FLUSH PRIVILEGES;
Now hduser also has full privileges over the databases.


Step4: Change the following connection properties in hive-site.xml for the MySQL metastore, as shown below.


<property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
   <description>Driver class name for a JDBC metastore</description>
</property>


<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://localhost:3306/metastore?useSSL=false&amp;createDatabaseIfNotExist=true</value>
<description>
     JDBC connect string for a JDBC metastore.
     To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
     For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
   </description>
</property>


<property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>hduser</value>
   <description>Username to use against metastore database</description>
 </property>


<property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>hduserpassword</value>
   <description>password to use against metastore database</description>
 </property>


Step5: Create a database in mysql for hive metastore


mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE $HIVE_HOME/scripts/metastore/upgrade/mysql/hive-schema-0.14.0.mysql.sql;
Initialize the schema manually with the SOURCE command above; otherwise Spark will report a schema mismatch error.
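
As an alternative to sourcing a schema file by hand, Hive ships a schematool utility that initializes the metastore schema for the installed Hive version, using the JDBC connection properties from Step 4. A hedged sketch (note this creates the Hive 2.2 schema rather than the 0.14.0 one sourced above; if Spark then complains about a schema version mismatch, setting hive.metastore.schema.verification=false in hive-site.xml is a common workaround):

    $HIVE_HOME/bin/schematool -dbType mysql -initSchema
    $HIVE_HOME/bin/schematool -dbType mysql -info      # verify the schema version afterwards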


Step 6: Download a suitable MySQL connector JAR (for example mysql-connector-java-5.1.38.jar) and copy it into the Hive lib folder.


file:  mysql-connector-java-5.1.38.jar
directory:  /usr/local/hive2.2/lib/
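
The libmysql-java package installed in Step 2 usually drops a copy of this connector at /usr/share/java/mysql-connector-java.jar, so instead of downloading the JAR separately you can often just copy that file into the Hive lib directory:

    sudo cp /usr/share/java/mysql-connector-java.jar /usr/local/hive2.2/lib/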


Now it is time to test your setup:


Step1: Open Hive in a terminal and create a table:


hive> create table emp(name string, age int, salary int);
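
Optionally, insert a row so the table has some data to look at later from HDFS and Spark (a small example with arbitrary values; the insert launches a short MapReduce job):

    hive> insert into emp values ('john', 30, 50000);
    hive> select * from emp;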


Step2: Check this table in MySQL with the commands below:


    mysql> use metastore;

    mysql> show tables;


    mysql> select * from TBLS;


Now you will see the table you created listed here (in the TBLS table of the metastore database).


You can also check your table's directory in Hadoop HDFS


at http://localhost:50070. This is the NameNode web UI port for Hadoop; browse to /user/hive/warehouse to see the emp directory.


3. Setting up Spark for Hive


Step1: Copy hive-site.xml file from $HIVE_HOME/conf to $SPARK_HOME/conf directory


sudo cp $HIVE_HOME/conf/hive-site.xml  $SPARK_HOME/conf


Step2: Copy the MySQL Java driver file into the Spark jars folder:


    /usr/local/spark2.2.1/jars/mysql-connector-java-5.1.38.jar


This is the same file we added to the $HIVE_HOME/lib folder before (see the example below).
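
Since that JAR is already sitting in the Hive lib directory, a simple copy is enough, for example:

    sudo cp /usr/local/hive2.2/lib/mysql-connector-java-5.1.38.jar /usr/local/spark2.2.1/jars/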


Step3: Create the spark-env.sh file from spark-env.sh.template (in $SPARK_HOME/conf) if it is not there.
Add the lines below to the spark-env.sh file.
 
    export SPARK_CLASSPATH="/usr/local/spark2.2.1/jars/mysql-connector-java-5.1.38.jar"
    export HADOOP_CONF_DIR="/usr/local/hadoop/hadoop-2.7.5/etc/hadoop"
    export YARN_CONF_DIR="/usr/local/hadoop/hadoop-2.7.5/etc/hadoop"

Now we will be able to access Hive tables in Spark and use Spark SQL, as in the sketch below.
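
As a quick end-to-end check, here is a minimal PySpark sketch (run it with spark-submit, or paste the SQL lines into the pyspark shell where the spark session already exists) that reads the emp table created earlier through the shared Hive metastore:

from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark use the hive-site.xml copied into $SPARK_HOME/conf,
# so it talks to the same MySQL-backed metastore as the Hive CLI
spark = (SparkSession.builder
         .appName("spark-hive-test")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show tables").show()          # the emp table should appear here
spark.sql("select * from emp").show()    # rows inserted from the Hive CLI, if any

spark.stop()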
