Tutorial: Quick overview of Spark 2.0.1 Core Functionality

In this blog we discuss the core functionality of Spark 2.0.1. It covers the basics of Spark 2.0.1 core functionality and SparkSession, and we also describe Spark SQL and the DataFrame API.

SparkSession:

SparkSession is the new entry point of Spark. In the previous version of Spark (1.6.x), SparkContext was the entry point; in Spark 2.0.1 it is SparkSession. A Spark session internally holds a Spark context for the actual computation. As we know, RDD was the main API, and it was created and manipulated using the context APIs. For every other API we needed a different context: for streaming a StreamingContext, for SQL a SQLContext, and for Hive a HiveContext. But as the Dataset and DataFrame APIs are becoming the new standard APIs, we need an entry point built for them. So in Spark 2.0.1 we have a new entry point for the Dataset and DataFrame APIs called SparkSession.

Creating SparkSession:

Here is how to create a SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local")
  .appName("spark example")
  .getOrCreate()
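Since the session internally wraps a Spark context (as noted above), existing RDD-based code can still reach it. This is only a minimal sketch using the session created above:

// the underlying SparkContext is still available for RDD-based code
val sc = spark.sparkContext
val numbers = sc.parallelize(1 to 5)
println(numbers.count()) // prints 5

// Hive support, previously provided by HiveContext, can be enabled on the builder:
// SparkSession.builder.enableHiveSupport().getOrCreate()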

Once we have created the Spark session, we can use it to read data.

Reading data using SparkSession:

It looks exactly like reading with SQLContext, so you can easily replace all of your SQLContext code with SparkSession now.

val spark = SparkSession.builder
  .master("local")
  .appName("spark example")
  .getOrCreate()
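As a minimal sketch of that replacement (the people.json path below is only an illustrative file name), a Spark 1.6 sqlContext.read call maps directly onto spark.read:

// Spark 1.6:   val df = sqlContext.read.json("people.json")
// Spark 2.0.1: the same call now hangs off the session
val df = spark.read.json("people.json")
df.show()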

Creating DataFrames:

With the help of a SparkSession, applications can create DataFrames from an existing RDD or from external data sources. As an example, the following creates a DataFrame based on the content of a CSV file:

val dataFrame = spark.read.option("header", "true")
  .csv("src/main/resources/team.csv")
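A quick way to verify the load is shown below; note that with only the header option set, every column is read as a string unless the inferSchema option is also enabled:

dataFrame.printSchema() // column names come from the header row of team.csv
dataFrame.show() // prints the first 20 rows as a table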


The following creates a DataFrame based on the content of a JSON file:

val dataFrame = spark.read.json("src/main/resources/cars_price.json")

dataFrame.show()
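Since this post also touches on Spark SQL, the same DataFrame can be queried with SQL by registering it as a temporary view. The price column below is only an assumed field of cars_price.json, used for illustration:

dataFrame.createOrReplaceTempView("cars")
// "price" is a hypothetical column in cars_price.json
val expensiveCars = spark.sql("SELECT * FROM cars WHERE price > 20000")
expensiveCars.show()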


Creating Datasets:

Dataset is a new abstraction in Spark, introduced as an alpha API in Spark 1.6, and it becomes a stable API in Spark 2.0.1. Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmission over the network. The major difference is that a Dataset is a collection of domain-specific objects, whereas an RDD is a collection of any objects. The domain-object part of the definition signifies the schema part of the Dataset, so the Dataset API is always strongly typed and optimized using the schema, where the RDD is not. The Dataset definition also covers the DataFrame API: a DataFrame is a special Dataset with no compile-time checks on the schema. This makes Dataset the new single abstraction replacing RDD from earlier versions of Spark. We read data using the read.text API, which is similar to the textFile API of RDD. The following creates a Dataset based on the content of a text file and runs a simple word count:

import spark.implicits._
// read the text file as a Dataset[String], one element per line
val rklickData = spark.read.text("src/main/resources/rklick.txt").as[String]
// split each line into words
val rklickWords = rklickData.flatMap(value => value.split("\\s+"))
// group by the lower-cased word and count the occurrences of each
val rklickGroupedWords = rklickWords.groupByKey(_.toLowerCase)
val rklickWordCount = rklickGroupedWords.count()
rklickWordCount.show()
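To illustrate the "collection of domain-specific objects" point above, a Dataset can also be built directly from a case class. The Team class and sample values here are made up for the sketch, and it relies on the spark.implicits._ import from the word-count example:

case class Team(name: String, wins: Int) // hypothetical domain class
val teams = Seq(Team("Red", 10), Team("Blue", 7)).toDS() // encoder derived via spark.implicits._
teams.filter(_.wins > 8).show() // typed, compile-time checked transformation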


We will keep looking at how to make these tutorials more useful and will be adding more content to them. If you have any suggestions, feel free to share them with us 🙂 Stay tuned.

 


Tutorial: DataFrame API Functionalities using Spark 1.6

In the previous tutorial, we explained SparkSQL and DataFrames operations using Spark 1.6. In this tutorial we cover DataFrame API functionalities, and we provide a running example of each functionality for better support. Let's begin the tutorial and discuss the DataFrame API operations using Spark 1.6.

DataFrame API Example Using Different types of Functionalities

The different types of DataFrame operations are:

1. Actions
2. Basic functions
3. Operations

Here we use a JSON document named cars.json and generate a table based on the schema in the JSON document, as sketched below.
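As a minimal sketch of that setup in the Spark 1.6 style (assuming sc is an existing SparkContext and cars.json sits under src/main/resources):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // Spark 1.6 entry point for DataFrames
val cars = sqlContext.read.json("src/main/resources/cars.json")
cars.printSchema() // schema generated from the JSON document
cars.registerTempTable("cars") // table that the DataFrame operations can query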


Tutorial: Spark SQL and DataFrames Operations using Spark 1.6

In the previous tutorial, we explained Spark Core and RDD functionalities. In this tutorial we cover Spark SQL and DataFrame operations on different sources such as JSON, text and CSV data files, and we provide a running example of each functionality for better support. Let's begin the tutorial and discuss the SparkSQL and DataFrames operations using Spark 1.6.

SparkSQL

Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Spark SQL executes SQL queries written using either a basic SQL syntax or HiveQL, and it can also be used to read data from an existing Hive installation. It provides a programming abstraction called DataFrame and can act as a distributed SQL query engine.
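As a minimal sketch of that idea in Spark 1.6 (the people.json file and its name and age fields are assumptions made for illustration):

val sqlContext = new org.apache.spark.sql.SQLContext(sc) // assumes an existing SparkContext named sc
val people = sqlContext.read.json("people.json")
people.registerTempTable("people")
// the DataFrame now acts as a table for the distributed SQL query engine
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()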