
Sunday, September 9, 2018

Big Data: Apache Spark Installation on Ubuntu 16.04 machine


Apache Spark Overview:
Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across
clustered computers. The Spark framework performs parallel computation like Hadoop MapReduce, but considerably faster.


Apache Spark works on the concept of the RDD (resilient distributed dataset). RDDs are kept in RAM, and all computation in Spark
is done in memory, which is why I/O is much faster.
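
As a quick illustration of this in-memory model, the snippet below (a minimal sketch, assuming the SparkContext sc that the pyspark shell provides) builds an RDD, caches it in RAM, and reuses it for two actions without re-reading the data:

Code:
nums = sc.parallelize(range(1, 1000001))   # distribute a Python range as an RDD
nums.cache()                               # keep the partitions in memory after the first use
print(nums.sum())                          # first action computes and caches the RDD
print(nums.count())                        # second action reads the cached partitions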




Apache Spark comes with a rich set of APIs that can do almost everything in the world of data: lightning-fast
computation on big data, streaming, machine learning, GraphX for graph processing, and Spark SQL for
structured data. A short Spark SQL example follows the list below.


Spark data APIs.


  • Spark Core (RDD)
  • Spark Streaming
  • Spark SQL
  • Spark MLlib
  • GraphX
  • Structured Streaming
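
As a quick taste of one of these APIs, here is a minimal Spark SQL sketch (it assumes a running SparkSession named spark, such as the one the pyspark shell creates after the installation steps below):

Code:
# create a tiny DataFrame, register it as a SQL view, and query it
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()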


All of these APIs are available in multiple programming languages to make Spark more user-friendly:
  • Scala,
  • Java, and
  • Python


Apache Spark also ships with support for the R language.


Follow the steps below to install Apache Spark.
Step 1:
Download the latest version of Apache Spark from https://spark.apache.org/downloads.html




Step 2:  
Extract the tar file, then move the extracted directory to /usr/local (or a directory under /home):

sudo tar -xvf spark-2.3.0-bin-hadoop2.7.tgz
sudo mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark


Step 3:
Add the lines below to your ~/.bashrc file.


#spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export PYSPARK_DRIVER_PYTHON='ipython'
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export HADOOP_HOME=/usr/local/hadoop                  # adjust to your Hadoop installation path
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64    # adjust to your Java installation path
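
After saving the file, reload your shell configuration so the new variables take effect, and optionally confirm that Spark is on the PATH (spark-submit --version prints the installed version):

Command: source ~/.bashrc
Command: spark-submit --version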


Note: Here we are using Jupyter Notebook for Spark coding. Make sure Anaconda, Java, and Hadoop are
installed on your machine before installing Spark.


Step 4:
Now it is time to start Spark. Open a command shell and run the command below to start Spark with
Jupyter Notebook.


Command: pyspark --master local[2]


This command starts PySpark in a Jupyter notebook, where we can start coding our program or application.




Note: The --master option specifies the master URL for a distributed cluster, or local to run locally
with one thread, or local[N] to run locally with N threads. You should start by using local for testing.
For a full list of options, run Spark shell with the --help option.
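
For example, the same shell can be started against different masters (the host name and port below are placeholders, not part of this setup):

Command: pyspark --master local               # run locally with one thread
Command: pyspark --master local[4]            # run locally with four threads
Command: pyspark --master spark://host:7077   # connect to a standalone cluster master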


Jupyter Notebook is the web-based Python IDE that ships with Anaconda; here we are using it for PySpark.




The Jupyter Notebook web interface for PySpark will open. Create a new folder in any directory of your choice for storing
your code.


Code:
from pyspark import SparkContext

# Reuse the SparkContext created by pyspark, or create one if none exists
sc = SparkContext.getOrCreate()

# Input: a list containing one long line of text
line = ['Apache Spark is a fast and general-purpose cluster computing system. '
        'It provides high-level APIs in Java, Scala, Python and R, and an optimized engine '
        'that supports general execution graphs. It also supports a rich set of higher-level '
        'tools including Spark SQL for SQL and structured data processing, MLlib for machine '
        'learning, GraphX for graph processing, and Spark Streaming.']

# Distribute the list as an RDD, split it into words, and count each word
initRdd = sc.parallelize(line)
words = initRdd.flatMap(lambda x: x.split(" "))
print(words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect())


Output:
[('Apache', 1), ('Spark', 3), ('is', 1), ('general-purpose', 1), ('It', 2),
('provides', 1), ('high-level', 1), ('APIs', 1), ('in', 1), ('Java,', 1),
('Scala,', 1), ('Python', 1), ('an', 1), ('optimized', 1), ('engine', 1),
('supports', 2), ('execution', 1), ('set', 1), ('of', 1), ('tools', 1),
('SQL', 2), ('processing,', 2), ('MLlib', 1), ('machine', 1),
('learning,', 1), ('GraphX', 1), ('graph', 1), ('Streaming.', 1),
('', 1), ('a', 2), ('fast', 1), ('and', 5), ('cluster', 1), ('computing', 1),
('system.', 1), ('R,', 1), ('that', 1), ('general', 1), ('graphs.', 1),
('also', 1), ('rich', 1), ('higher-level', 1), ('including', 1),
('for', 3), ('structured', 1), ('data', 1)]




Note: Spark also has spark-submit for batch file processing, so if we have a .py file we can run it directly
with spark-submit.
Command: spark-submit file_name.py
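
As a minimal sketch of such a file (the file name wordcount.py and the sample text are just for illustration), a standalone version of the word count above could look like this:

Code:
# wordcount.py -- run with: spark-submit wordcount.py
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

line = ['Apache Spark is a fast and general-purpose cluster computing system.']
words = sc.parallelize(line).flatMap(lambda x: x.split(" "))
print(words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect())

sc.stop()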


Congratulations, you can now start your big data analysis with Apache Spark.
