Monday, September 10, 2018

Big Data: Installing HUE WebUI on top of Hadoop and Hive

HUE: Hue (Hadoop User Experience) is an open-source web UI for the Hadoop ecosystem. It provides connections to Hadoop ecosystem tools such as HDFS, Hive, Pig, Impala, and Spark through a single interactive web interface.






Hue depends on the following packages:

gcc, g++, libxml2-dev, libxslt-dev, libsasl2-dev, libsasl2-modules-gssapi-mit,
libmysqlclient-dev, python-dev, python-setuptools, libsqlite3-dev, ant,
libkrb5-dev, libtidy-0.99-0, libldap2-dev, libssl-dev, libgmp3-dev.


Installing the packages: to install all of these packages, run the commands given below.


sudo apt-get update
sudo apt-get install gcc g++ libxml2-dev libxslt-dev libsasl2-dev libsasl2-modules-gssapi-mit libmysqlclient-dev python-dev python-setuptools libsqlite3-dev ant libkrb5-dev libtidy-0.99-0 libldap2-dev libssl-dev libgmp3-dev


Installation and Configuration
Perform the installation as the hadoop user (if you have a dedicated Hadoop admin user, use that).
su - hduser
Download Hue from gethue.com (the link below is an example obtained from the Hue website).
wget https://dl.dropboxusercontent.com/u/730827/hue/releases/4.1.0/hue-4.1.0.tgz

Extract the downloaded tarball

tar -xvf hue-4.1.0.tgz

Execute install command

    cd hue-4.1.0
    make install   
Note: if you receive an error like c/_cffi_backend.c:15:17: fatal error: ffi.h: No such file or directory during make install


Run the commands below to resolve it:
    sudo apt-get update
    sudo apt-get install libffi-dev
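
Note: by default, make install typically places Hue under /usr/local/hue. If you would rather have it land where the HUE_HOME variable set later in this post points (/home/hadoop/hue), you can pass a PREFIX to the install. A minimal sketch, assuming the same hue-4.1.0 build directory as above:

    cd hue-4.1.0
    PREFIX=/home/hadoop make install    # installs Hue into /home/hadoop/hue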


Once the above process is completed, update the ~/.bashrc file:

export HUE_HOME=/home/hadoop/hue
export PATH=$PATH:$HUE_HOME/build/env/bin


Source the file after adding the entries:
    source ~/.bashrc



Configure Hue (hue.ini, plus three Hadoop conf files)

  • Make these changes in the hue.ini file (a sketch of how these keys are laid out inside hue.ini follows this list).
    Open the conf directory:
    cd $HUE_HOME/desktop/conf

[desktop]
    server_user=hduser
    server_group=hduser
    default_user=hduser
    default_hdfs_superuser=hduser


 app_blacklist=impala,security
 fs_defaultfs=hdfs://localhost:9000
 hadoop_conf_dir=$HADOOP_CONF_DIR
 resourcemanager_host=localhost
 resourcemanager_api_url=http://localhost:8088

 hive_server_host=localhost
 hive_server_port=10000
 hive_conf_dir=/usr/local/hive2.2/conf


 proxy_api_url=http://localhost:8088
 history_server_api_url=http://localhost:19888
 hbase_clusters=(Cluster|localhost:9090)
 oozie_url=http://localhost:11000/oozie


Make changes in the Hadoop conf files:
Open the conf directory and edit the files given below.
cd $HADOOP_CONF_DIR

  • core-site.xml
    Note: replace hduser with your own Hadoop username.


<property>
   <name>hadoop.proxyuser.hduser.hosts</name>
   <value>*</value>
</property>
<property>
   <name>hadoop.proxyuser.hduser.groups</name>
   <value>*</value>
</property>



  • hdfs-site.xml
    Note: this enables WebHDFS so that Hue can access HDFS.
<property>
   <name>dfs.webhdfs.enabled</name>
   <value>true</value>
</property>

  • httpfs-site.xml
    Note: here we give HDFS proxy access to the Hue user (hduser) that we created earlier.
<property>
 <name>httpfs.proxyuser.hduser.hosts</name>
 <value>*</value>
 <description>This property allows Hue to access HDFS.</description>
</property>


<property>
 <name>httpfs.proxyuser.hduser.groups</name>
 <value>*</value>
  <description>This property allows Hue to access HDFS.</description>
</property>

Starting Hadoop and hiveserver2 for hue:
Start Hadoop:
start-all.sh
  Start Hive server:
$HIVE_HOME/bin/hiveserver2 
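
Before opening Hue, it can help to sanity-check the two services it talks to: WebHDFS and HiveServer2. A couple of quick checks, assuming the default ports used in this post (HiveServer2 may take a minute before it starts listening):

    curl "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS"        # WebHDFS should return a JSON directory listing
    $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n hduser -e "show databases;"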


Start Hue:

nohup supervisor &
or
$HUE_HOME/build/env/bin/hue runserver

Login to Hue Web Interface: http://localhost:8888


Reference link for help:

Big Data: Apache Spark with Hive Database

Making Connection between Apache Spark and Apache Hive


To make a connection between Spark and Hive we need the configuration below. If Apache Spark and Hadoop are not installed yet, follow the links below and set them up first.
1. Hadoop environment
2. Spark environment
3. MySQL database
4. Hive environment


1. Installing Hive


1.1 Install Hive: go to /usr/local and create a directory for Hive.
   commands:
      wget http://www-us.apache.org/dist/hive/hive-2.2.0/apache-hive-2.2.0-bin.tar.gz


     sudo tar xvzf apache-hive-2.2.0-bin.tar.gz -C /usr/local/hive
   (Make sure the extracted location matches the HIVE_HOME path used below, /usr/local/hive2.2.)
1.2 Add the Hive paths to the bashrc file (make sure the Hive and Hadoop directory paths are correct):
  command:   sudo gedit ~/.bashrc
Add the lines below to the bashrc file:
    export HIVE_HOME=/usr/local/hive2.2
    export HIVE_CONF_DIR=/usr/local/hive2.2/conf
    export PATH=$HIVE_HOME/bin:$PATH
    export CLASSPATH=$CLASSPATH:/usr/local/hadoop/hadoop-2.7.5/lib/*:.
    export CLASSPATH=$CLASSPATH:/usr/local/hive2.2/lib/*
Then source the bashrc file with the command below:
command: source ~/.bashrc
1.3 Creating HDFS directories for Hive data storage:
The warehouse directory is the location where Hive stores table data, and the temporary directory tmp is where Hive stores intermediate results of processing.
Start the Hadoop environment with the start-all.sh command and check that all daemons are running with jps.
   Commands to run:
      start-all.sh
    hdfs dfs -mkdir -p /user/hive/warehouse
    hdfs dfs -mkdir -p /tmp/hive
    hdfs dfs -chmod -R 777 /user/hive/warehouse /tmp/hive


1.4 Edit the hive-env.sh and hive-site.xml files
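If $HIVE_CONF_DIR does not contain these files yet, they can be created from the templates that ship with the Hive tarball, for example:

    cd /usr/local/hive2.2/conf
    sudo cp hive-env.sh.template hive-env.sh
    sudo cp hive-default.xml.template hive-site.xml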
  • hive-env.sh file
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/usr/local/hive2.2/conf
  • hive-site.xml
<property>
   <name>hive.metastore.warehouse.dir</name>
   <value>hdfs://localhost:9000/user/hive/warehouse</value>
   <description>location of default database for the warehouse</description>
</property>
<property>
   <name>hive.exec.scratchdir</name>
  <value>hdfs://localhost:9000/tmp/hive</value>
   <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
   <name>hive.repl.rootdir</name>
   <value>hdfs://localhost:9000/user/hive/repl/</value>
  <description>HDFS root dir for all replication dumps.</description>
</property>


      <property>
       <name>hive.exec.local.scratchdir</name>
       <value>/tmp/${user.name}</value>
       <description>Local scratch space for Hive jobs</description>
    </property>
<property>
     <name>hive.downloaded.resources.dir</name>
     <value>/tmp/${user.name}_resources</value>
     <description>Temporary local directory for added resources in the remote file system.</description>
</property>


2. Installing MySQL as the metastore for Hive


Step1: Install mysql:
            sudo apt-get install mysql-server


Step2: Install the MySQL Java Connector:


           sudo apt-get install libmysql-java


Step3: Log in to MySQL and create a user which will access the MySQL database from Hive:

        mysql -u root -p  (enter this command to log in to MySQL as the root user)

(Note: here we are creating a MySQL user for Hive; set the username and password as you like.)
mysql> CREATE USER 'hduser'@'%' IDENTIFIED BY 'hduserpassword';


        mysql> GRANT ALL ON *.* TO 'hduser'@'localhost' IDENTIFIED BY 'hduserpassword';
        mysql> FLUSH PRIVILEGES;
Now hduser also has full privileges over the databases.


Step4: Change the following connection properties in hive-site.xml for the MySQL metastore, as shown below.


<property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
   <description>Driver class name for a JDBC metastore</description>
</property>


<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://localhost:3306/metastore?useSSL=false&amp;createDatabaseIfNotExist=true</value>
<description>
     JDBC connect string for a JDBC metastore.
     To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
     For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
   </description>
</property>


<property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>hduser</value>
   <description>Username to use against metastore database</description>
 </property>


<property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>hduserpassword</value>
   <description>password to use against metastore database</description>
 </property>


Step5: Create a database in mysql for hive metastore


mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE $HIVE_HOME/scripts/metastore/upgrade/mysql/hive-schema-0.14.0.mysql.sql;
Initialize the schema manually with the SOURCE command above; otherwise Spark will report a schema mismatch error.
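
As an alternative to sourcing a schema file by hand, Hive ships a schematool utility that initializes the metastore schema for the installed Hive version, using the JDBC connection properties from Step 4. A hedged sketch (note this creates the Hive 2.2 schema rather than the 0.14.0 one sourced above; if Spark then complains about a schema version mismatch, setting hive.metastore.schema.verification=false in hive-site.xml is a common workaround):

    $HIVE_HOME/bin/schematool -dbType mysql -initSchema
    $HIVE_HOME/bin/schematool -dbType mysql -info      # verify the schema version afterwards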


Step 6: Download a suitable MySQL connector JAR (for example mysql-connector-java-5.1.38.jar) and copy it into the Hive lib folder.


file:  mysql-connector-java-5.1.38.jar
directory:  /usr/local/hive2.2/lib/
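
The libmysql-java package installed in Step 2 usually drops a copy of this connector at /usr/share/java/mysql-connector-java.jar, so instead of downloading the JAR separately you can often just copy that file into the Hive lib directory:

    sudo cp /usr/share/java/mysql-connector-java.jar /usr/local/hive2.2/lib/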


Now it is time to test your setup:


Step1: Open Hive in a terminal and create a table:


hive> create table emp(name string, age int, salary int);
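
Optionally, insert a row so the table has some data to look at later from HDFS and Spark (a small example with arbitrary values; the insert launches a short MapReduce job):

    hive> insert into emp values ('john', 30, 50000);
    hive> select * from emp;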


Step2: Check this table in MySQL with the commands below:


    mysql> use metastore;

    mysql> show tables;


    mysql> select * from TBLS;


Now you will see the table you created listed here (in the TBLS table of the metastore database).


You can also check your table's directory in Hadoop HDFS


at http://localhost:50070. This is the NameNode web UI port for Hadoop; browse to /user/hive/warehouse to see the emp directory.


3. Setting up Spark for Hive


Step1: Copy hive-site.xml file from $HIVE_HOME/conf to $SPARK_HOME/conf directory


sudo cp $HIVE_HOME/conf/hive-site.xml  $SPARK_HOME/conf


Step2: Copy the MySQL Java driver file into the Spark jars folder:


    /usr/local/spark2.2.1/jars/mysql-connector-java-5.1.38.jar


This is the same file we added to the $HIVE_HOME/lib folder before (see the example below).
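
Since that JAR is already sitting in the Hive lib directory, a simple copy is enough, for example:

    sudo cp /usr/local/hive2.2/lib/mysql-connector-java-5.1.38.jar /usr/local/spark2.2.1/jars/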


Step3: Create the spark-env.sh file from spark-env.sh.template (in $SPARK_HOME/conf) if it is not there.
Add the lines below to the spark-env.sh file.
 
    export SPARK_CLASSPATH="/usr/local/spark2.2.1/jars/mysql-connector-java-5.1.38.jar"
    export HADOOP_CONF_DIR="/usr/local/hadoop/hadoop-2.7.5/etc/hadoop"
    export YARN_CONF_DIR="/usr/local/hadoop/hadoop-2.7.5/etc/hadoop"

Now we will be able to access Hive tables in Spark and use Spark SQL, as in the sketch below.
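
As a quick end-to-end check, here is a minimal PySpark sketch (run it with spark-submit, or paste the SQL lines into the pyspark shell where the spark session already exists) that reads the emp table created earlier through the shared Hive metastore:

from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark use the hive-site.xml copied into $SPARK_HOME/conf,
# so it talks to the same MySQL-backed metastore as the Hive CLI
spark = (SparkSession.builder
         .appName("spark-hive-test")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show tables").show()          # the emp table should appear here
spark.sql("select * from emp").show()    # rows inserted from the Hive CLI, if any

spark.stop()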
