
Sunday, September 9, 2018

Big Data: Apache Spark Installation on Ubuntu 16.04 machine


Apache Spark Overview:
Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across
clustered computers. The Spark framework performs parallel computation like Hadoop MapReduce, but considerably faster.


Apache Spark works on the concept of the RDD (resilient distributed dataset). RDDs are kept in RAM, and all computation in Spark
is done in memory, which is why I/O is much faster.
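
As a quick illustration of this in-memory model, the snippet below (a minimal sketch, assuming the SparkContext sc that the pyspark shell provides) builds an RDD, caches it in RAM, and reuses it for two actions without re-reading the data:

Code:
nums = sc.parallelize(range(1, 1000001))   # distribute a Python range as an RDD
nums.cache()                               # keep the partitions in memory after the first use
print(nums.sum())                          # first action computes and caches the RDD
print(nums.count())                        # second action reads the cached partitions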




Apache Spark comes with a rich set of APIs that can do almost everything in the world of data: lightning-fast
computation on big data, streaming, machine learning, GraphX for graph processing, and Spark SQL for
structured data. A short Spark SQL example follows the list below.


Spark data APIs.


  • Spark Core (RDD)
  • Spark Streaming
  • Spark SQL
  • Spark MLlib
  • GraphX
  • Structured Streaming
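
As a quick taste of one of these APIs, here is a minimal Spark SQL sketch (it assumes a running SparkSession named spark, such as the one the pyspark shell creates after the installation steps below):

Code:
# create a tiny DataFrame, register it as a SQL view, and query it
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()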


All of these APIs are available in multiple programming languages to make Spark more user-friendly:
  • Scala,
  • Java, and
  • Python


Apache Spark also ships with support for the R language.


Follow the steps below to install Apache Spark.
Step 1:
Download the latest version of Apache Spark from https://spark.apache.org/downloads.html




Step 2:  
Extract the tar file, then move the extracted directory to /usr/local (or a directory under /home):

sudo tar -xvf spark-2.3.0-bin-hadoop2.7.tgz
sudo mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark


Step 3:
Add the lines below to your ~/.bashrc file.


#spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export PYSPARK_DRIVER_PYTHON='ipython'
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export HADOOP_HOME=/usr/local/hadoop                  # adjust to your Hadoop installation path
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64    # adjust to your Java installation path
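
After saving the file, reload your shell configuration so the new variables take effect, and optionally confirm that Spark is on the PATH (spark-submit --version prints the installed version):

Command: source ~/.bashrc
Command: spark-submit --version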


Note: Here we are using Jupyter Notebook for Spark coding. Make sure Anaconda, Java, and Hadoop are
installed on your machine before installing Spark.


Step 4:
Now it is time to start Spark. Open a command shell and run the command below to start Spark with
Jupyter Notebook.


Command: pyspark --master local[2]


This command starts PySpark in a Jupyter notebook, where we can start coding our program or application.




Note: The --master option specifies the master URL for a distributed cluster, or local to run locally
with one thread, or local[N] to run locally with N threads. You should start by using local for testing.
For a full list of options, run Spark shell with the --help option.
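
For example, the same shell can be started against different masters (the host name and port below are placeholders, not part of this setup):

Command: pyspark --master local               # run locally with one thread
Command: pyspark --master local[4]            # run locally with four threads
Command: pyspark --master spark://host:7077   # connect to a standalone cluster master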


Jupyter Notebook is the web-based Python IDE that ships with Anaconda; here we are using it for PySpark.




The Jupyter Notebook web interface for PySpark will open. Create a new folder in any directory of your choice for storing
your code.


Code:
from pyspark import SparkContext

# Reuse the SparkContext created by pyspark, or create one if none exists
sc = SparkContext.getOrCreate()

# Input: a list containing one long line of text
line = ['Apache Spark is a fast and general-purpose cluster computing system. '
        'It provides high-level APIs in Java, Scala, Python and R, and an optimized engine '
        'that supports general execution graphs. It also supports a rich set of higher-level '
        'tools including Spark SQL for SQL and structured data processing, MLlib for machine '
        'learning, GraphX for graph processing, and Spark Streaming.']

# Distribute the list as an RDD, split it into words, and count each word
initRdd = sc.parallelize(line)
words = initRdd.flatMap(lambda x: x.split(" "))
print(words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect())


Output:
[('Apache', 1), ('Spark', 3), ('is', 1), ('general-purpose', 1), ('It', 2),
('provides', 1), ('high-level', 1), ('APIs', 1), ('in', 1), ('Java,', 1),
('Scala,', 1), ('Python', 1), ('an', 1), ('optimized', 1), ('engine', 1),
('supports', 2), ('execution', 1), ('set', 1), ('of', 1), ('tools', 1),
('SQL', 2), ('processing,', 2), ('MLlib', 1), ('machine', 1),
('learning,', 1), ('GraphX', 1), ('graph', 1), ('Streaming.', 1),
('', 1), ('a', 2), ('fast', 1), ('and', 5), ('cluster', 1), ('computing', 1),
('system.', 1), ('R,', 1), ('that', 1), ('general', 1), ('graphs.', 1),
('also', 1), ('rich', 1), ('higher-level', 1), ('including', 1),
('for', 3), ('structured', 1), ('data', 1)]




Note: Spark also has spark-submit for batch file processing, so if we have a .py file we can run it directly
with spark-submit.
Command: spark-submit file_name.py
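
As a minimal sketch of such a file (the file name wordcount.py and the sample text are just for illustration), a standalone version of the word count above could look like this:

Code:
# wordcount.py -- run with: spark-submit wordcount.py
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

line = ['Apache Spark is a fast and general-purpose cluster computing system.']
words = sc.parallelize(line).flatMap(lambda x: x.split(" "))
print(words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect())

sc.stop()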


Congratulations, you can now start your big data analysis with Apache Spark.
