Tutorial: Quick overview of Spark 2.0.1 Core Functionality

This post gives a quick overview of Spark 2.0.1 core functionality. It demonstrates the basics and describes SparkSession, Spark SQL, and the DataFrame API.


SparkSession is the new entry point of Spark. In the previous version (1.6.x), SparkContext was the entry point; in Spark 2.0.1 it is SparkSession, which internally holds a SparkContext for the actual computation. When the RDD was the main API, it was created and manipulated through the context APIs, and every other API needed its own context: StreamingContext for streaming, SQLContext for SQL, and HiveContext for Hive. As the Dataset and DataFrame APIs become the new standard, they need an entry point built for them. So Spark 2.0.1 provides a new entry point for the Dataset and DataFrame APIs called SparkSession.

Creating SparkSession:

Here we describe how to create a SparkSession.

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local")
  .appName("spark example")
  .getOrCreate()
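As a minimal sketch of the single-entry-point idea (the app name and the sample values are illustrative assumptions), a session built this way exposes the underlying SparkContext for RDD work and runs SQL directly, without a separate SQLContext:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("entry point sketch") // hypothetical app name
  .getOrCreate()

// The SparkContext that the session wraps is still available for RDD work.
val sc = spark.sparkContext
val squares = sc.parallelize(1 to 5).map(n => n * n).collect()

// SQL runs on the same session; no separate SQLContext is created.
val one = spark.sql("SELECT 1 AS value").first().getInt(0)
```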

Once we have created the SparkSession, we can use it to read data.

Read data using Spark Session

It looks exactly like reading with SQLContext. You can now easily replace all your SQLContext code with SparkSession.

val spark = SparkSession.builder.master("local")
  .appName("spark example")
  .getOrCreate()

Creating DataFrames:

With the help of a SparkSession, applications can create DataFrames from an existing RDD or from data sources. As an example, the following creates a DataFrame based on the content of a CSV file:

val dataFrame =spark.read.option("header","true")
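A fuller sketch of the CSV case is shown here. The file contents and column names are made up for illustration, and the file is written to a temporary location so the example runs on its own; `inferSchema` is an extra option not used in the original snippet.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("csv sketch") // hypothetical app name
  .getOrCreate()

// Write a tiny CSV file (hypothetical data) so the sketch is self-contained.
val csvPath = Files.createTempFile("people", ".csv")
Files.write(csvPath, "name,age\nalice,30\nbob,25\n".getBytes("UTF-8"))

// The first line is treated as a header; column types are inferred from the data.
val people = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath.toString)

people.printSchema()
people.show()
```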


The following creates a DataFrame based on the content of a JSON file:

val dataFrame =spark.read.option("header","true")
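A fuller sketch of the JSON case, again with made-up records written to a temporary file. It assumes one JSON object per line. Column names come from the object keys, so this sketch sets no header option.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("json sketch") // hypothetical app name
  .getOrCreate()

// One JSON object per line (hypothetical data).
val jsonLines = Seq(
  """{"name":"alice","age":30}""",
  """{"name":"bob","age":25}"""
)
val jsonPath = Files.createTempFile("people", ".json")
Files.write(jsonPath, jsonLines.mkString("\n").getBytes("UTF-8"))

// The schema is inferred from the keys and values of the records.
val people = spark.read.json(jsonPath.toString)

people.printSchema()
people.show()
```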



Creating Datasets:

Dataset is a new abstraction introduced as an alpha API in Spark 1.6 and becomes a stable API in Spark 2.0.1. Datasets are similar to RDDs; however, instead of Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or transmission over the network. The major difference is that a Dataset is a collection of domain-specific objects, whereas an RDD is a collection of any objects. The domain-object part of the definition implies a schema, so the Dataset API is always strongly typed and optimized using that schema, while the RDD API is not. The Dataset definition also covers the DataFrame API: a DataFrame is a special Dataset with no compile-time checks on the schema. This makes Dataset the new single abstraction, replacing RDD from earlier versions of Spark.

We read data using the read.text API, which is similar to the textFile API of RDD. The following creates a Dataset of strings from the content of a text file and counts its words:

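import spark.implicits._
// Read each line of the file as a String
val rklickData = spark.read.text("src/main/resources/rklick.txt").as[String]
// Split each line into words on whitespace
val rklickWords = rklickData.flatMap(value => value.split("\\s+"))
// Group the words case-insensitively, then count each group
val rklickGroupedWords = rklickWords.groupByKey(_.toLowerCase)
val rklickWordCount = rklickGroupedWords.count()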
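To illustrate the "domain-specific objects" point, here is a small sketch that contrasts typed Dataset access with the untyped DataFrame view. The `Person` class and its sample values are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain class, used only for illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder
  .master("local[*]")
  .appName("dataset sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// A Dataset of domain objects: field access is checked at compile time.
val people: Dataset[Person] = Seq(Person("alice", 30), Person("bob", 25)).toDS()
val adultNames = people.filter((p: Person) => p.age >= 30).map(_.name).collect()

// The same data as a DataFrame: the column is named in a string,
// so a typo here would only surface at runtime.
val asDataFrame = people.toDF()
val adultRows = asDataFrame.filter("age >= 30").count()
```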


We will keep looking at how to create more useful tutorials and add more content over time. If you have any suggestions, feel free to share them 🙂 Stay tuned.


Tutorial: Spark SQL and DataFrames Operations using Spark 1.6

In the previous tutorial, we explained Spark Core and RDD functionality. This tutorial covers Spark SQL and DataFrame operations on different sources such as JSON, text, and CSV data files, with a running example of each. Let's begin the tutorial and discuss Spark SQL and DataFrame operations using Spark 1.6.


Spark SQL is a component on top of Spark Core that introduced a data abstraction originally called SchemaRDD (renamed DataFrame in Spark 1.3), which supports structured and semi-structured data. Spark SQL executes SQL queries written in either basic SQL syntax or HiveQL, and it can also read data from an existing Hive installation. It provides the DataFrame programming abstraction and can act as a distributed SQL query engine.
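As a rough sketch of running a SQL query over a DataFrame with the Spark 1.6 style API (the table name, column names, and rows are illustrative assumptions, not taken from the tutorial):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setMaster("local[*]").setAppName("spark sql sketch")
val sc = SparkContext.getOrCreate(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Build a DataFrame from an RDD of tuples (hypothetical data).
val people = sc.parallelize(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")

// Register it under a table name so it can be queried with SQL.
people.registerTempTable("people")
val names = sqlContext
  .sql("SELECT name FROM people WHERE age > 26")
  .collect()
  .map(_.getString(0))
```

In Spark 2.x, `registerTempTable` still exists but is deprecated in favour of `createOrReplaceTempView`.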

Tutorial: Quick overview of Spark 1.6 Core Functionality

This post discusses Spark 1.6 core functionality and gives a quick introduction to using Spark. It demonstrates the basic functionality of RDDs, and later on Spark SQL and the DataFrame API. We have tried to cover the basics of Spark 1.6 core functionality and its programming contexts.

Introduction to Apache Spark

Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It is a cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. Apache Spark is a lightning-fast cluster computing technology designed for fast computation, and a framework for general data analytics on a distributed computing cluster such as Hadoop. Its main feature is in-memory cluster computing, which increases the processing speed of an application compared with MapReduce. It runs on top of an existing Hadoop cluster and accesses the Hadoop data store (HDFS); it can also process structured data in Hive and streaming data from HDFS, Flume, Kafka, and Twitter.
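A small sketch of the in-memory idea using the RDD API (the numbers are arbitrary): caching an RDD keeps it in memory after the first action, so later actions reuse it instead of recomputing it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("rdd cache sketch")
val sc = SparkContext.getOrCreate(conf)

// cache() marks the RDD to be kept in memory once it has been computed.
val numbers = sc.parallelize(1 to 1000).cache()

// The first action materialises the RDD; the second reuses the cached data.
val evens = numbers.filter(_ % 2 == 0).count()
val total = numbers.map(_.toLong).reduce(_ + _)
```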


Modern functional application architecture requires a slew of siloed pieces working together in harmony in a “PnP (Plug n Play)” architecture. Enterprises lean towards a fluid, functional, and scalable application paradigm using “Micro Service Architecture (MSA)”. MSA implementations help break large monolithic application tiers into smaller, manageable services. Each microservice has its own dedicated, non-blocking process, communicates over REST, and can be written in a different language, with each performing its own allocated task. One service's failure has no impact on the others.

Micro-Services Characteristics:

  1.  Developed and deployed independently of the others.
  2.  Communicate using REST.
  3.  Each service can be written in a different programming language.
  4.  Each service can use a different data storage technology.
  5.  Ideally, each service should perform only one task.


Dribbling with Spark 1.6 GraphX Components

GraphX provides distributed in-memory computing. The GraphX API enables users to view data both as graphs and as collections (i.e., RDDs) without data movement or duplication.

In this example, we process a small social network with users as vertices and relations between users as edges, and find out the following:

  • Evaluate which users are the most important in the graph
  • Find all three-user subgraphs where every two users are connected
  • Find pairs of users with a connection in each direction between them
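A rough sketch of the first and third questions on a made-up four-user network follows. The user names, the edges, and the choice of PageRank as the importance measure are assumptions for illustration; the original post may have used a different approach.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val conf = new SparkConf().setMaster("local[*]").setAppName("graphx sketch")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical users (vertices) and "follows" relations (directed edges).
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol"), (4L, "dave")))
val follows = sc.parallelize(Seq(
  Edge(2L, 1L, "follows"), Edge(3L, 1L, "follows"), Edge(4L, 1L, "follows"),
  Edge(3L, 4L, "follows"), Edge(4L, 3L, "follows")
))
val graph = Graph(users, follows)

// Importance: run PageRank until convergence and pick the highest-ranked user.
val ranks = graph.pageRank(0.0001).vertices
val topUser = users.join(ranks).collect().maxBy(_._2._2)._2._1

// Mutual connections: keep edges whose reverse edge also exists,
// reporting each pair once with the smaller id first.
val pairs = graph.edges.map(e => (e.srcId, e.dstId))
val mutual = pairs.intersection(pairs.map(_.swap)).filter { case (a, b) => a < b }.collect()
```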


Play 2.4.x & RethinkDB: Classic CRUD application backed by RethinkDB

In this blog we have created a classic CRUD application using Play 2.4.x, Scala, and RethinkDB. Scala combines object-oriented and functional programming, Play is a high-velocity web framework for Java and Scala, and RethinkDB is the open-source, scalable database that makes building realtime apps dramatically easier.

