Hadoop is a big data framework used for ETL on petabyte-scale data. Apache Hadoop
comes with the HDFS file system, the YARN resource manager and the MapReduce data-processing engine.
Hadoop processes data in parallel with the MapReduce algorithm, distributing the data across the nodes of the cluster.
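To get a feel for the MapReduce model before installing anything, here is a tiny word-count sketch that emulates the map, shuffle and reduce phases with plain shell pipes (purely illustrative; input.txt is a hypothetical sample file, and real Hadoop runs these phases in parallel across the cluster):
cat input.txt | tr -s ' ' '\n' | sed 's/$/ 1/' > mapped.txt #map: emit "word 1" for every word
sort mapped.txt > shuffled.txt #shuffle: group identical keys together
awk '{count[$1] += $2} END {for (w in count) print w, count[w]}' shuffled.txt #reduce: sum the counts per word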
The process for installation will be:
1. Installation
2. Update $HOME/.bashrc
3. Configuration: hadoop-env.sh and the *-site.xml files
4. Formatting the HDFS filesystem via the NameNode
5. Starting your single-node cluster
Hadoop installation steps:
1. Installing Java:
sudo apt-get update #Update the package lists known to apt-get
sudo apt-get install default-jdk #Install Java on the system
java -version #Now check the installed version
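The JAVA_HOME path used later in .bashrc and hadoop-env.sh depends on which JDK apt actually installed; one way to look it up (a quick check, so the java-7 path used below may differ on your machine):
readlink -f $(which javac) #prints e.g. /usr/lib/jvm/java-7-openjdk-amd64/bin/javac; JAVA_HOME is the part before /bin/javac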
2. Installing SSH:
ssh: The command we use to connect to remote machines - the client.
sshd: The daemon that is running on the server and allows clients to connect to the server.
The ssh client is pre-installed on most Linux systems, but in order to run the sshd daemon we need to install the ssh server package first.
Use this command to do that.
sudo apt-get install ssh
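After the install you can confirm the sshd daemon is actually running (a quick sanity check; on Ubuntu the service is simply called ssh):
sudo service ssh status #should report that the ssh/sshd service is running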
3. Create and Setup SSH Certificates:
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine.
For our single-node setup of Hadoop, therefore, we need to configure SSH access to localhost.
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The second command adds the newly created key to the list of authorized keys so that
Hadoop can use ssh without prompting for a password.
We can check whether ssh works:
ssh localhost
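If ssh localhost still prompts for a password, the usual culprit is the permissions on the .ssh directory; tightening them is a common fix (a suggestion only, your setup may already be fine):
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys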
4. Download Hadoop:
We will be installing Hadoop version 2.7.5 from the link below.
http://www-eu.apache.org/dist/hadoop/common/
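For example, the tarball can be fetched directly with wget (the exact mirror path is an assumption; older releases sometimes move to archive.apache.org, so adjust the URL if the download fails):
wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz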
Make a directory for Hadoop in /usr/local:
sudo mkdir /usr/local/hadoop
And copy the downloaded file into this directory with the command below:
sudo cp <path to hadoop tar.gz> /usr/local/hadoop
5. Installing Hadoop:
Now that you have the Hadoop archive in the /usr/local/hadoop folder, open a terminal, go to that directory and extract the file:
cd /usr/local/hadoop
sudo tar -xzvf hadoop-2.7.5.tar.gz
Now add the Hadoop paths to the .bashrc file on your system.
sudo gedit ~/.bashrc
The .bashrc file will open; add the lines below to it:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-2.7.5
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Reload the .bashrc file now:
source ~/.bashrc
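As a quick sanity check that the new environment variables took effect (assuming Hadoop was extracted to the path used above):
echo $HADOOP_INSTALL #should print /usr/local/hadoop/hadoop-2.7.5
hadoop version #should print the Hadoop 2.7.5 version banner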
6. Changing Files:
The configuration files live in the /usr/local/hadoop/hadoop-2.7.5/etc/hadoop directory.
Open the files below one by one and edit them.
1. hadoop-env.sh: open this file in an editor and add the line below to it.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
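If you prefer not to open an editor, the same change can be made non-interactively with sed (a sketch, assuming the stock hadoop-env.sh still contains its default export JAVA_HOME line):
sudo sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64|' /usr/local/hadoop/hadoop-2.7.5/etc/hadoop/hadoop-env.sh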
2. core-site.xml: create a directory named data in the hadoop-2.7.5 directory; the namenode and datanode directories will be created under it.
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop-2.7.5/data</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
3. mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9000</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
</configuration>
4. hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/hadoop-2.7.5/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/hadoop-2.7.5/data/datanode</value>
</property>
</configuration>
Note: if any of these files has a .template suffix (for example mapred-site.xml.template), copy or rename it to the plain .xml name and then edit it.
Now save every file you have edited and close them.
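The hdfs-site.xml values above point at data/namenode and data/datanode directories that do not exist yet; creating them now (and making sure your user can write to them) avoids a failed format later. A minimal sketch, assuming Hadoop will run as your current user:
sudo mkdir -p /usr/local/hadoop/hadoop-2.7.5/data/namenode
sudo mkdir -p /usr/local/hadoop/hadoop-2.7.5/data/datanode
sudo chown -R $USER:$USER /usr/local/hadoop/hadoop-2.7.5/data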
7. Time to run Hadoop:
Now the Hadoop file system needs to be formatted so that we can start using it.
The format command should be issued with write permission since it creates a current directory
under the /usr/local/hadoop/hadoop-2.7.5/data/namenode folder:
hadoop namenode -format
Now start the Hadoop system with
start-all.sh
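Note that start-all.sh is deprecated in Hadoop 2.x; if you prefer, the HDFS and YARN daemons can be started separately (same effect, assuming the sbin directory is on your PATH as set in .bashrc):
start-dfs.sh #starts NameNode, DataNode and SecondaryNameNode
start-yarn.sh #starts ResourceManager and NodeManager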
Now check whether everything is running properly with the jps command.
hduser@rdharm:/home/dharm$ jps
6016 Jps
5232 SecondaryNameNode
5461 ResourceManager
5048 DataNode
4924 NameNode
5596 NodeManager
Note: If you want to monitor the ResourceManager and the NameNode/DataNodes through the web UI, then check
http://localhost:8088/cluster for resource manager monitoring.
http://localhost:50070 for namenode monitoring.
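Optionally, submit one of the example MapReduce jobs that ship with the distribution to confirm the cluster accepts work (the jar path below assumes the standard 2.7.5 layout under $HADOOP_INSTALL):
hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar pi 2 5 #estimates pi with 2 map tasks and 5 samples each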
If everything works fine, then congrats, your Hadoop installation is working.