Tutorial : DataFrame API Functionalities using Spark 1.6

In previous tutorial, we  have explained  about the SparkSQL and DataFrames Operations using Spark 1.6. Now In this tutorial we have covered  DataFrame API Functionalities . And we have provided running example of each functionality for better support. Lets begin the tutorial and discuss about the DataFrame API  Operations using Spark 1.6 .

DataFrame API Example Using Different types of Functionalities

Different type of DataFrame operations are :-


Here we are using  JSON document named cars.json with the following content and generate a table based on the schema in the JSON document.

Continue reading

Tutorial : Spark SQL and DataFrames Operations using Spark 1.6

In previous tutorial, we  have explained about Spark Core and RDD functionalities. Now In this tutorial we have covered Spark SQL and DataFrame operation from different source like JSON, Text and CSV data files. And we have provided running example of each functionality for better support. Lets begin the tutorial and discuss about the SparkSQL and DataFrames Operations using Spark 1.6


Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. It can also be used to read data from an existing Hive installation.It provides a programming abstraction called DataFrame and can act as distributed SQL query engine. Continue reading

Tutorial : Quick overview of Spark 1.6 Core Functionality

In this blog we will discuss about Spark 1.6 Core Functionality and provides a quick introduction to using Spark. It demonstrates the basic functionality of RDDs. Later on we demonstrate Spark SQL and DataFrame API functionality. We have tried to cover basics of Spark 1.6  core functionality and  programming contexts.

Introduction to Apache Spark

 Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics.It is a cluster computing framework originally developed in the AMPLab at University of California, Berkeley but was later donated to the Apache Software Foundation where it remains today. Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is a framework for performing general data analytics on distributed computing cluster like Hadoop. The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application. It provides in memory computations for increase speed and data process over map reduce.It runs on top of existing Hadoop cluster and access Hadoop data store (HDFS), can also process structured data in Hive and Streaming data from HDFS, Flume, Kafka, Twitter. Continue reading

Dribbling with Spark 1.6 GraphX Components

GraphX provide distributed in-memory computing. The GraphX API enables users to view data both as graphs and as collections (i.e., RDDs) without data movement or duplication.

In this example, we have process a small social network with users as vertices’s and relation between users as edges and find out these details:

  • Evaluate what’s the most important users in the graph
  • Find all three users graph where every two users are connected
  • Find pair of users where connection in each direction between them

Continue reading

Play 2.4.x & RethinkDB: Classic CRUD application backed by RethinkDB

In this blog We have created Classic CRUD application using Play 2.4.x , Scala and RethinkDB. Where Scala meets Object-Oriented things in Functional way, Play is a High Velocity Web Framework For Java & Scala and RethinkDB is the open-source, scalable database that makes building realtime apps dramatically easier.


Continue reading