Tutorial: How to load and save data from different data sources in Spark 2.0.2

In this blog we will discuss Spark 2.0.2 and demonstrate its basic functionality. We also describe how to load and save data in Spark 2.0.2, covering the basics of Spark 2.0.2 core functionality such as reading and writing data from different sources (CSV, JSON and text).

Loading and saving a CSV file

As an example, the following creates a DataFrame based on the content of a CSV file. Read a CSV document named team.csv with the following content and generate a table based on the schema in the CSV document.

id,name,age,location
1,Mahendra Singh Dhoni ,38,ranchi
2,Virat Kohli,25,delhi
3,Shikhar Dhawan,25,mumbai
4,Rohit Sharma,33,mumbai
5,Stuart Binny,22,chennai

 

import java.util.UUID
import org.apache.spark.sql.SparkSession

def main(args: Array[String]) {
  val spark = SparkSession
    .builder()
    .master("local")
    .appName("Spark2.0")
    .getOrCreate()
  // Read the CSV file, treating the first line as the header.
  val df = spark.read.option("header", "true")
    .csv("src/main/resources/team.csv")
  // Select two columns and write them back out as CSV with a header row.
  val selectedData = df.select("name", "age")
  selectedData.write.option("header", "true")
    .csv(s"src/main/resources/${UUID.randomUUID()}")
  println("OK")
}
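
To check what was loaded before saving, a minimal sketch (using the df value from the example above) prints the inferred schema and the rows:

// Without the inferSchema option every CSV column is read as a string.
df.printSchema()
// Displays the five rows of team.csv.
df.show()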

[Screenshots: csv1.png, csvsave.png]

Loading and saving a JSON file

Here we include some basic examples of structured data processing using DataFrames. As an example, the following creates a DataFrame based on the content of a JSON file. Read a JSON document named cars_price.json with the following content and generate a table based on the schema in the JSON document.

[{"itemNo" : 1, "name" : "Ferrari", "price" : 52000000 , "kph": 6.1},  {"itemNo" : 2, "name" : "Jaguar", "price" : 15000000 , "kph": 3.4},  {"itemNo" : 3, "name" : "Mercedes", "price" : 10000000, "kph": 3}, {"itemNo" : 4, "name" : "Audi", "price" : 5000000 , "kph": 3.6}, {"itemNo" : 5, "name" : "Lamborghini", "price" : 5000000 , "kph": 2.9}]
import java.util.UUID
import org.apache.spark.sql.SparkSession

def main(args: Array[String]) {
  val spark = SparkSession
    .builder()
    .master("local")
    .appName("Spark2.0")
    .getOrCreate()
  // Read the JSON file; the schema is inferred from the document itself,
  // so no header option is needed.
  val df = spark.read
    .json("src/main/resources/cars_price.json")
  // Select two columns and write them back out as JSON.
  val selectedData = df.select("name", "price")
  val outputPath = s"src/main/resources/${UUID.randomUUID()}"
  selectedData.write.json(outputPath)
  println("OK")
}
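
A quick way to confirm the write succeeded (a sketch, assuming the outputPath value from the example above) is to read the saved directory back:

// Spark writes a directory of part files; reading it back returns the same rows.
val savedData = spark.read.json(outputPath)
savedData.show()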

[Screenshot: savejson.png]

Loading and saving a text file

As an example, the following creates a Dataset based on the content of a text file. Read a text document named rklick.txt with the following content, split it into words and save the result.

Rklick creative technology company providing key digital services.
Focused on helping our clients to build a successful business on web and mobile.
We are mature and dynamic solution oriented IT full service product development and customized consulting services company. Since 2007 we're dedicated team of techno-whiz engaged in providing outstanding solutions to diverse group of business entities across North America.
Known as Professional Technologists, we help our customers enable latest technology stack or deliver quality cost effective products and applications

import java.util.UUID
import org.apache.spark.sql.SparkSession

def main(args: Array[String]) {
  val spark = SparkSession
    .builder()
    .master("local")
    .appName("Spark2.0")
    .getOrCreate()
  import spark.implicits._
  // Read the text file as a Dataset[String], one element per line.
  val rklickData = spark.read.text("src/main/resources/rklick.txt").as[String]
  // Split each line into individual words.
  val rklickWords = rklickData.flatMap(value => value.split("\\s+"))
  // Save the words as a text file (one word per line).
  val outputPath = s"src/main/resources/${UUID.randomUUID()}"
  rklickWords.write.text(outputPath)
  println("OK")
}
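
Before relying on the saved files, a quick sanity check on the words Dataset (a sketch using the rklickWords value from the example above):

// count() returns the total number of words; show() prints the first few.
println(s"Total words: ${rklickWords.count()}")
rklickWords.show(5, truncate = false)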

[Screenshots: txtload.png, txtsave.png]

We will keep looking at how to create more useful tutorials and will keep adding more content over time. If you have any suggestions, feel free to share them 🙂 Stay tuned.


Tutorial: Quick overview of Spark 2.0.1 Core Functionality

In this blog we will discuss Spark 2.0.1 core functionality. It demonstrates the basic functionality of Spark 2.0.1, and we also describe SparkSession, Spark SQL and the DataFrame API. We have tried to cover the basics of Spark 2.0.1 core functionality and SparkSession.

SparkSession:

SparkSession is the new entry point of Spark. In previous versions (1.6.x) of Spark, SparkContext was the entry point, and in Spark 2.0.1 SparkSession is the entry point. A SparkSession internally holds a SparkContext for the actual computation. As we know, RDD was the main API, and it was created and manipulated using the context APIs. For every other API we needed a different context: for streaming a StreamingContext, for SQL a SQLContext and for Hive a HiveContext. But as the Dataset and DataFrame APIs are becoming the new standard APIs, we need an entry point built for them. So in Spark 2.0.1 we have a new entry point for the Dataset and DataFrame APIs called SparkSession.

Creating SparkSession:

Here we describe how to create SparkSession.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local")
  .appName("spark example")
  .getOrCreate()
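
The session wraps the older entry points, so the underlying SparkContext is still available when you need it (a minimal sketch using the spark value created above):

// The SparkContext used for the actual computation lives inside the session.
val sc = spark.sparkContext
println(sc.appName) // prints "spark example"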

Once we have created a SparkSession, we can use it to read data.

Reading data using SparkSession

It looks exactly like reading using SQLContext; you can easily replace all your SQLContext code with SparkSession now.

val spark = SparkSession.builder
  .master("local")
  .appName("spark example")
  .getOrCreate()
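
For example (a sketch using the cars_price.json file from this post), the only change from 1.6-style code is the object the read is called on:

// Spark 1.6 style (for comparison):
//   val df = sqlContext.read.json("src/main/resources/cars_price.json")
// Spark 2.0 style, the same call on the session:
val df = spark.read.json("src/main/resources/cars_price.json")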

Creating DataFrames:

With the help of SparkSession, applications can create DataFrames from an existing RDD or from data sources. As an example, the following creates a DataFrame based on the content of a CSV file:

val dataFrame = spark.read.option("header", "true")
  .csv("src/main/resources/team.csv")

[Screenshot: csv.png]

The following creates a DataFrame based on the content of a JSON file:

val dataFrame = spark.read
  .json("src/main/resources/cars_price.json")

dataFrame.show()

[Screenshot: json.png]
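
The introduction also mentions Spark SQL; with a SparkSession the same DataFrame can be queried with SQL (a sketch, assuming the dataFrame loaded above and a made-up view name cars):

// Register the DataFrame as a temporary view and query it with SQL.
dataFrame.createOrReplaceTempView("cars")
val expensive = spark.sql("SELECT name, price FROM cars WHERE price > 10000000")
expensive.show()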

Creating Datasets:

Dataset is a new abstraction in Spark, introduced as an alpha API in Spark 1.6; it becomes a stable API in Spark 2.0.1. Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or for transmission over the network. The major difference is that a Dataset is a collection of domain-specific objects, whereas an RDD is a collection of arbitrary objects. The domain-object part of the definition signifies the schema part of the Dataset, so the Dataset API is always strongly typed and optimized using the schema, where the RDD API is not. The Dataset definition also covers the DataFrame API: a DataFrame is a special Dataset where there are no compile-time checks on the schema. This makes Dataset the new single abstraction, replacing RDD from earlier versions of Spark.

We read data using the read.text API, which is similar to the textFile API of RDD. The following creates a Dataset based on the content of a text file:

import spark.implicits._
// Read the text file as a Dataset[String], one element per line.
val rklickData = spark.read.text("src/main/resources/rklick.txt").as[String]
// Split each line into words.
val rklickWords = rklickData.flatMap(value => value.split("\\s+"))
// Group words case-insensitively and count the occurrences of each.
val rklickGroupedWords = rklickWords.groupByKey(_.toLowerCase)
val rklickWordCount = rklickGroupedWords.count()
rklickWordCount.show()

[Screenshot: dataset word count output]
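
Because a Dataset is a collection of domain-specific objects, it can also be built directly from a case class (a minimal sketch; the Car class and values are made up for illustration and assume the spark.implicits._ import above):

// Define the case class at top level (outside main) so Spark can derive an Encoder for it.
case class Car(name: String, price: Long)
val cars = Seq(Car("Ferrari", 52000000L), Car("Audi", 5000000L)).toDS()
// The filter is strongly typed: it operates on Car objects, not on Rows.
cars.filter(_.price > 10000000L).show()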

We will keep looking at how to create more useful tutorials and will keep adding more content over time. If you have any suggestions, feel free to share them 🙂 Stay tuned.