In this blog we discuss the core functionality of Spark 2.0.1 and demonstrate its basic usage. We cover the basics of SparkSession, Spark SQL, and the DataFrame and Dataset APIs.
SparkSession is the new entry point of Spark. In the previous version (1.6.x), SparkContext was the entry point; from Spark 2.0 onward, including 2.0.1, SparkSession takes that role. A SparkSession internally holds a SparkContext for the actual computation. RDD used to be the main API, and RDDs were created and manipulated through the context APIs. Every other API needed its own context: StreamingContext for streaming, SQLContext for SQL, and HiveContext for Hive. As the Dataset and DataFrame APIs are becoming the new standard APIs, we need an entry point built for them. So in Spark 2.0.1 we have a new entry point for the Dataset and DataFrame APIs, called SparkSession.
Here is how to create a SparkSession:
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .master("local")
      .appName("spark example")
      .getOrCreate()
Once we have created a SparkSession, we can use it to read data.
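Since a SparkSession holds a SparkContext internally, the older entry points remain reachable from it. The following is a minimal sketch of that relationship, not a definitive pattern. The object name, the app name, and the `local[*]` master are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object SessionEntryPoints {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")                 // assumption: run locally on all cores
      .appName("session entry points")    // assumption: illustrative app name
      .getOrCreate()

    // The underlying SparkContext is still available for RDD work.
    val sc = spark.sparkContext
    val doubled = sc.parallelize(Seq(1, 2, 3)).map(_ * 2)
    println(doubled.collect().mkString(","))

    // A SQLContext wrapper is also exposed for code written against 1.6.x.
    val sqlContext = spark.sqlContext
    println(sqlContext.sparkSession eq spark)

    spark.stop()
  }
}
```

Existing RDD code can keep using `spark.sparkContext` while new code moves to the session's read and SQL methods.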
Read data using Spark Session
Reading data with SparkSession looks exactly like reading with SQLContext. You can easily replace all of your SQLContext code with SparkSession now.
    val spark = SparkSession.builder
      .master("local")
      .appName("spark example")
      .getOrCreate()
How to create DataFrames
With the help of SparkSession, applications can create DataFrames from an existing RDD or from data sources. As an example, the following creates a DataFrame based on the content of a CSV file:
    val dataFrame = spark.read
      .option("header", "true")
      .csv("src/main/resources/team.csv")
The following creates a DataFrame based on the content of a JSON file:
    // "header" is a CSV option, so it is not needed when reading JSON
    val dataFrame = spark.read
      .json("src/main/resources/cars_price.json")
    dataFrame.show()
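The examples above read from data sources. For the other path mentioned earlier, creating a DataFrame from an existing RDD, a minimal sketch might look like the following. The `Team` case class, its fields, and the sample rows are hypothetical and are not taken from team.csv.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, defined at top level so Spark can derive an encoder.
case class Team(name: String, wins: Int)

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")              // assumption: local run
      .appName("rdd to dataframe")     // assumption: illustrative app name
      .getOrCreate()

    // Brings toDF() and the implicit encoders into scope.
    import spark.implicits._

    // An existing RDD of case-class instances (made-up sample data).
    val teamRdd = spark.sparkContext.parallelize(
      Seq(Team("alpha", 3), Team("beta", 5)))

    // Column names and types are inferred from the case class fields.
    val teamDf = teamRdd.toDF()
    teamDf.printSchema()
    teamDf.show()

    spark.stop()
  }
}
```

Using a case class lets Spark infer the schema by reflection, which avoids spelling out column names and types by hand.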
Dataset is a new abstraction in Spark, introduced as an alpha API in Spark 1.6 and becoming a stable API in Spark 2.0.1. Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or for transmission over the network. The major difference is that a Dataset is a collection of domain-specific objects, whereas an RDD is a collection of any objects. The domain-object part of the definition is what gives a Dataset its schema, so the Dataset API is always strongly typed and optimized using the schema, while the RDD API is not. The Dataset definition also covers the DataFrame API: a DataFrame is a special Dataset (Dataset[Row] in Scala) with no compile-time checks against the schema. This makes Dataset the new single abstraction, replacing RDD from earlier versions of Spark.
We read data using the read.text API, which is similar to the textFile API of RDD. The following creates a Dataset[String] from the content of a text file and counts the occurrences of each word:
    import spark.implicits._

    val rklickData = spark.read.text("src/main/resources/rklick.txt").as[String]
    val rklickWords = rklickData.flatMap(value => value.split("\\s+"))
    val rklickGroupedWords = rklickWords.groupByKey(_.toLowerCase)
    val rklickWordCount = rklickGroupedWords.count()
    rklickWordCount.show()
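To make the typed versus untyped distinction described above more concrete, here is a small sketch that filters the same data once as a Dataset and once as a DataFrame. The `Car` case class and its values are hypothetical and are not taken from cars_price.json.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain object; its fields define the Dataset's schema.
case class Car(make: String, price: Double)

object TypedVsUntyped {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")              // assumption: local run
      .appName("typed vs untyped")     // assumption: illustrative app name
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val cars: Dataset[Car] = Seq(Car("a", 1000.0), Car("b", 2500.0)).toDS()

    // Typed: _.price is checked by the compiler; a misspelled field
    // name would fail at compile time.
    val expensive: Dataset[Car] = cars.filter(_.price > 2000.0)

    // Untyped: the column name is a string resolved at runtime; a
    // misspelled name would only fail when the query is analyzed.
    val carsDf: DataFrame = cars.toDF()
    val expensiveDf: DataFrame = carsDf.filter($"price" > 2000.0)

    expensive.show()
    expensiveDf.show()

    spark.stop()
  }
}
```

Both filters select the same rows; the difference is when a mistake in a field or column name would be caught.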
We will keep looking at how to make these tutorials more useful and will add more content over time. If you have any suggestions, feel free to share them with us 🙂 Stay tuned.