Tutorial : DataFrame API Functionalities using Spark 1.6

In the previous tutorial, we explained SparkSQL and DataFrame operations using Spark 1.6. In this tutorial we cover the DataFrame API functionalities, and we have provided a running example of each functionality for better understanding. Let's begin the tutorial and discuss the DataFrame API operations using Spark 1.6.

DataFrame API Example Using Different types of Functionalities

The different types of DataFrame operations are:

1. Actions
2. Basic DataFrame functions
3. DataFrame operations

Here we use a JSON document named cars.json with the following content, and generate a table based on the schema in the JSON document.

[{"itemNo" : 1, "name" : "Ferrari", "speed" : 259 , "weight": 800},
 {"itemNo" : 2, "name" : "Jaguar", "speed" : 274 , "weight":998},
 {"itemNo" : 3, "name" : "Mercedes", "speed" : 340 , "weight": 1800},
 {"itemNo" : 4, "name" : "Audi", "speed" : 345 , "weight": 875},
 {"itemNo" : 5, "name" : "Lamborghini", "speed" : 355 , "weight": 1490}]

Action:

Actions are operations (such as take, count, first, and so on) that return a value after running a computation on a DataFrame.

import org.apache.spark.sql.DataFrame

object DataFrameAPI {
  val sc = SparkCommon.sparkContext

  // Use the following command to create an SQLContext.
  val sqlContext = SparkCommon.sparkSQLContext

  val schemaOptions = Map("header" -> "true", "inferSchema" -> "true")

  import sqlContext.implicits._
  import org.apache.spark.sql.functions._

  def main(args: Array[String]) {
    val cars = "src/main/resources/cars.json"
    val carsPrice = "src/main/resources/cars_price.json"
    val carDataFrame: DataFrame = sqlContext.read.format("json").options(schemaOptions).load(cars)
    val carPriceDataFrame: DataFrame = sqlContext.read.format("json").options(schemaOptions).load(carsPrice)

    // If you want to see the top 20 rows of the DataFrame in tabular form, use the following command.
    carDataFrame.show()
  }
}

show()

If you want to see the top 20 rows of the DataFrame in tabular form, use the following command.

carDataFrame.show()


show(n)

If you want to see n rows of the DataFrame in tabular form, use the following command.

carDataFrame.show(2)


take()

take(n) returns the first n rows of the DataFrame as an array.

carDataFrame.take(2).foreach(println)


count()

Returns the number of rows. Here it is combined with groupBy() to count the rows in each speed group.

carDataFrame.groupBy("speed").count().show()
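The example above combines groupBy() with count(); a plain count(), by contrast, returns a single number for the whole DataFrame. A minimal sketch, assuming a local Spark 1.6 context (the object name and sample data here are illustrative, not part of the original code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PlainCountExample {
  // Count all rows of a DataFrame, independent of any grouping.
  def rowCount(sqlContext: SQLContext): Long = {
    import sqlContext.implicits._
    val df = sqlContext.sparkContext
      .parallelize(Seq((1, "Ferrari", 259), (2, "Jaguar", 274)))
      .toDF("itemNo", "name", "speed")
    df.count()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("count").setMaster("local[2]"))
    println(rowCount(new SQLContext(sc))) // prints 2
    sc.stop()
  }
}
```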


head()

head() returns the first row.

val resultHead = carDataFrame.head()
println(resultHead.mkString(","))


head(n)

head(n) returns the first n rows.

val resultHeadNo = carDataFrame.head(3)
println(resultHeadNo.mkString(","))


first()

Returns the first row.

val resultFirst = carDataFrame.first()
println("fist:" + resultFirst.mkString(","))


collect()

Returns an array that contains all of the Rows in this DataFrame.

val resultCollect = carDataFrame.collect()
println(resultCollect.mkString(","))


Basic DataFrame functions
printSchema()

If you want to see the Structure (Schema) of the DataFrame, then use the following command.

carDataFrame.printSchema()


toDF()

toDF() returns a new DataFrame with columns renamed. It is quite convenient when converting an RDD of tuples (or case classes) into a DataFrame with meaningful names.

// Hypothetical schema for fruits.txt: id, name, quantity
case class Fruit(id: Int, name: String, quantity: Int)

val fruitDataFrame = sc.textFile("src/main/resources/fruits.txt")
  .map(_.split(","))
  .map(f => Fruit(f(0).trim.toInt, f(1), f(2).trim.toInt))
  .toDF()
fruitDataFrame.show()
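Since the description above mentions converting an RDD of tuples, here is a minimal sketch of that variant, assuming a local Spark 1.6 context (the object name and sample data are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ToDFTupleExample {
  // An RDD of tuples has only default column names (_1, _2, ...);
  // toDF("...") lets us supply meaningful ones.
  def namedCars(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    sqlContext.sparkContext
      .parallelize(Seq((1, "Ferrari"), (2, "Jaguar")))
      .toDF("itemNo", "name")
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("toDF").setMaster("local[2]"))
    namedCars(new SQLContext(sc)).show()
    sc.stop()
  }
}
```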


dtypes()

Returns all column names and their data types as an array.

carDataFrame.dtypes.foreach(println)


columns()

Returns all column names as an array.

carDataFrame.columns.foreach(println)


DataFrame operations:

sort()

Returns a new DataFrame sorted by the given expressions.

carDataFrame.sort($"itemNo".desc).show()


orderBy()

Returns a new DataFrame sorted by the specified column(s).

carDataFrame.orderBy(desc("speed")).show()


groupBy()

Groups the DataFrame by the given column. Here we count the number of cars that have the same speed.

carDataFrame.groupBy("speed").count().show()


na()

Returns a DataFrameNaFunctions object for working with missing data; here, rows containing null values are dropped.

carDataFrame.na.drop().show()
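Besides na.drop(), the same DataFrameNaFunctions object offers na.fill() to replace nulls instead of discarding rows. A minimal sketch, assuming a local Spark 1.6 context (the object name and inline JSON records are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object NaFillExample {
  // Replace missing speed values with 0 instead of dropping the rows.
  def fillSpeeds(sqlContext: SQLContext): DataFrame = {
    val df = sqlContext.read.json(sqlContext.sparkContext.parallelize(Seq(
      """{"name": "Ferrari", "speed": 259}""",
      """{"name": "Mystery"}"""))) // speed is null in this record
    df.na.fill(0L, Seq("speed"))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("naFill").setMaster("local[2]"))
    fillSpeeds(new SQLContext(sc)).show()
    sc.stop()
  }
}
```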


as()

Returns a new DataFrame (or column) with an alias set; here the averaged column is renamed avg_speed.

carDataFrame.select(avg($"speed").as("avg_speed")).show()


alias()

Returns a new DataFrame (or column) with an alias set; alias() is the same as `as`.

carDataFrame.select(avg($"weight").alias("avg_weight")).show()


select()

Fetches the speed column from among all the columns of the DataFrame.

carDataFrame.select("speed").show()


filter()

Filters the cars whose speed is greater than 300 (speed > 300).

carDataFrame.filter(carDataFrame("speed") > 300).show()
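filter() also accepts a SQL-style predicate string, equivalent to the Column expression above. A minimal sketch, assuming a local Spark 1.6 context (the object name and sample data are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object FilterExprExample {
  // Keep only cars faster than 300, using a predicate string.
  def fastCars(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    val df = sqlContext.sparkContext
      .parallelize(Seq(("Ferrari", 259), ("Audi", 345), ("Lamborghini", 355)))
      .toDF("name", "speed")
    df.filter("speed > 300")
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter").setMaster("local[2]"))
    fastCars(new SQLContext(sc)).show()
    sc.stop()
  }
}
```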


For more details, see the official Spark DataFrame API documentation.

We will keep looking at how to make these tutorials more useful and will keep adding more content to them. If you have any suggestions, feel free to share them with us 🙂 Stay tuned.
