
Saturday, September 8, 2018

BigData: Apache Hadoop Installation on Ubuntu Machine (Ubuntu 16.04)


Hadoop is a big data tool used for ETL processing on petabyte-scale data. Apache Hadoop
ships with the HDFS file system, the YARN resource manager, and MapReduce data processing.
Hadoop processes data in parallel across the cluster's distributed storage using the MapReduce programming model.


The installation process will be:
1. Installation
2. Update $HOME/.bashrc
3. Excursus: Hadoop Distributed File System (HDFS)
4. Configuration: hadoop-env.sh and conf/*-site.xml
5. Formatting the HDFS filesystem via the NameNode
6. Starting your single-node cluster

    Hadoop Installation steps:




1. Installing Java:
To use Hadoop on any system we must have a Java JVM installed, so our first step is installing Java. We are using Ubuntu Linux for this example.
sudo apt-get update              # Update the apt package lists
sudo apt-get install default-jdk # Install Java on the system
java -version                    # Check the installed version
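Later steps need the path of the installed JDK for JAVA_HOME. A quick way to find it on Ubuntu (the exact path can differ between machines) is:
readlink -f $(which java)        # Resolve the real location of the java binary
update-alternatives --list java  # Or list all registered JDKs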

    2. Installing SSH:

    ssh: The command we use to connect to remote machines - the client. 
    sshd: The daemon that is running on the server and allows clients to connect to the server.
The ssh client is usually pre-installed on Linux, but in order to run the sshd daemon we need to install the ssh server package first.
Use this command to do that:
      sudo apt-get install ssh
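Once installed, you can confirm the sshd daemon is running (on Ubuntu the service is named ssh):
sudo service ssh status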
    3. Create and Setup SSH Certificates: 
    Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine.  
    For our single-node setup of Hadoop, therefore, we need to configure SSH access to Localhost.
      ssh-keygen -t rsa -P ""
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    The second command adds the newly created key to the list of authorized keys so that
    Hadoop can use ssh without prompting for a password.
We can check that passwordless ssh works:
    ssh localhost
    4. Download Hadoop:
We will be installing Hadoop version 2.7.5 from the link below.
               http://www-eu.apache.org/dist/hadoop/common/
    Make a directory for hadoop in your /usr/local
    sudo mkdir /usr/local/hadoop
Then copy the downloaded file into this directory with the command below:
sudo cp 'path to hadoop tarball' /usr/local/hadoop
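If you prefer to download from the terminal, a wget command along these lines should work (the exact mirror path is an assumption; match it to the file listed under the link above):
wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz
sudo cp hadoop-2.7.5.tar.gz /usr/local/hadoop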

    5. Installing hadoop:
Now that you have the Hadoop 2.7.5 tarball in the /usr/local/hadoop folder, open a terminal,
go to that directory and extract the archive.
cd /usr/local/hadoop
sudo tar -xzvf hadoop-2.7.5.tar.gz
Now add the Hadoop paths to your ~/.bashrc file.

    sudo gedit ~/.bashrc
The .bashrc file will open; add the lines below to it (adjust JAVA_HOME to the JDK path found earlier):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

export HADOOP_INSTALL=/usr/local/hadoop/hadoop-2.7.5
    export PATH=$PATH:$HADOOP_INSTALL/bin
    export PATH=$PATH:$HADOOP_INSTALL/sbin
    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_HOME=$HADOOP_INSTALL
    export HADOOP_HDFS_HOME=$HADOOP_INSTALL
    export YARN_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"


Now reload the .bashrc file:

    source ~/.bashrc
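To confirm the environment is set up, check that the hadoop binary is now on the PATH (this assumes the tarball was extracted to the path used in HADOOP_INSTALL above):
echo $HADOOP_INSTALL
hadoop version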


    6. Changing Files:
Go to the /usr/local/hadoop/hadoop-2.7.5/etc/hadoop configuration directory.
Open the files below one by one and edit them.
1. hadoop-env.sh: open this file in an editor and add the line below to it.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
2. core-site.xml: create a directory named data inside the hadoop-2.7.5 directory, where the
NameNode and DataNode directories will be created.

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/hadoop-2.7.5/data</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose scheme and authority
    determine the FileSystem implementation. The uri's scheme determines the config
    property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's
    authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
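Before moving on, create the data directory referenced above and give your user ownership of the Hadoop tree (the user/group below is an assumption; substitute your own login):
sudo mkdir -p /usr/local/hadoop/hadoop-2.7.5/data
sudo chown -R $USER:$USER /usr/local/hadoop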
      
    3. mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9000</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>
      4. hdfs-site.xml.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be
    specified when the file is created. The default is used if replication is not
    specified at create time.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/hadoop-2.7.5/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/hadoop-2.7.5/data/datanode</value>
  </property>
</configuration>
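The namenode and datanode directories referenced above must also exist before the NameNode is formatted; a minimal sketch, assuming the same install path as in the rest of this guide:
sudo mkdir -p /usr/local/hadoop/hadoop-2.7.5/data/namenode
sudo mkdir -p /usr/local/hadoop/hadoop-2.7.5/data/datanode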
Note: if any of these files ships only as a .template file (for example mapred-site.xml.template), copy it to a plain .xml file before editing, as shown below.
Now save every file you have edited and close them.
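A minimal sketch of that template copy, assuming the configuration directory used above:
cd /usr/local/hadoop/hadoop-2.7.5/etc/hadoop
sudo cp mapred-site.xml.template mapred-site.xml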

    7. Time to run hadoop:
Now the Hadoop file system needs to be formatted so that we can start using it.
The format command must be run by a user with write permission, since it creates a current directory
under the /usr/local/hadoop/hadoop-2.7.5/data/namenode folder:
hadoop namenode -format
Now start the Hadoop daemons with
    start-all.sh
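Note that start-all.sh is deprecated in Hadoop 2.x; the equivalent, which you can use instead, is:
start-dfs.sh
start-yarn.sh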

Now check whether everything is running properly with the jps command.

hduser@rdharm:/home/dharm$ jps
6016 Jps
5232 SecondaryNameNode
5461 ResourceManager
5048 DataNode
4924 NameNode
5596 NodeManager


Note: If you want to monitor the ResourceManager, NameNode and DataNodes in the web UI, then check

    http://localhost:8088/cluster for resource manager monitoring.


    http://localhost:50070 for namenode monitoring.


If everything works fine, then congratulations, your Hadoop installation is working.
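As an optional smoke test, you can run one of the bundled example MapReduce jobs (the jar path below assumes the standard 2.7.5 layout under the install directory used in this guide):
hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar pi 2 5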

