Introduction to Spark 2.0

Overview of the Dataset, DataFrame and RDD APIs:

A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
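A minimal sketch of working with an RDD (assuming a SparkContext named sc is already available, for example in the spark-shell):

// Create an RDD with four logical partitions
val rdd = sc.parallelize(1 to 100, numSlices = 4)

// Transformations are lazy; actions trigger the distributed computation
val doubled = rdd.map(_ * 2)
println(doubled.reduce(_ + _))   // 10100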

However, RDDs could not take advantage of Spark's advanced optimizations, which motivated the move to the DataFrame API.

DataFrames brought custom memory management and runtime code generation, which greatly improved performance, so over the last year most of the improvements went into the DataFrame API.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

Though the DataFrame API solved many issues, it was not a good enough replacement for the RDD API. Its major shortcomings were the lack of compile-time safety and the inability to work with domain objects, which held people back from using the DataFrame API everywhere. With the introduction of the Dataset API in Spark 1.6, that gap was filled.

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).

So in Spark 2.0, the Dataset API becomes a stable API. The Dataset API combined with the DataFrame API should cover most of the use cases where RDDs were used earlier. So as a Spark developer, you are advised to start embracing these two APIs over the RDD API from Spark 2.0 onwards.

Points to be discussed:

Datasets: Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. Both the typed methods (e.g. map, filter, groupByKey) and the untyped methods (e.g. select, groupBy) are available on the Dataset class. Also, this new combined Dataset interface is the abstraction used for Structured Streaming.

For a long time, RDD was the standard abstraction of Spark. From Spark 2.0, Dataset becomes the new abstraction layer for Spark. Though the RDD API will still be available, it becomes a low-level API, used mostly for runtime and library development. All user-land code will be written against the Dataset abstraction and its subset, the DataFrame API.

The Dataset API is a superset of the DataFrame API, which was released in Spark 1.3. Dataset together with DataFrame brings better performance and flexibility to the platform compared to the RDD API. Dataset will also replace RDD as the abstraction for streaming in future releases.

The major difference is that a Dataset is a collection of domain-specific objects, whereas an RDD is a collection of any type of object. The domain-object part of the definition signifies the schema of the Dataset. So the Dataset API is always strongly typed and optimized using the schema, where the RDD API is not.

The Dataset definition also covers the DataFrame API. A DataFrame is a special Dataset where there are no compile-time checks on the schema. This makes Dataset the new single abstraction replacing RDD from earlier versions of Spark.
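As a minimal illustrative sketch of the typed versus untyped views (assuming a SparkSession named spark is already in scope; SparkSession itself is introduced in the next section, and the Person class is just a hypothetical example):

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// Hypothetical domain class used only for illustration
case class Person(name: String, age: Int)

// Typed view: Dataset[Person] keeps compile-time checks and lambdas over domain objects
val people: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 17)).toDS()
val adults = people.filter(p => p.age >= 18)

// Untyped view: a DataFrame is just Dataset[Row], addressed by column names
val df: DataFrame = people.toDF()
df.select("name").show()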

SparkSession: A new entry point that replaces the old SQLContext and HiveContext. For users of the DataFrame API, a common source of confusion was which "context" to use. Now you can use SparkSession, which subsumes both, as a single entry point. Note that the old SQLContext and HiveContext are still kept for backward compatibility.

In earlier versions of Spark, SparkContext was the entry point for Spark. As RDD was the main API, it was created and manipulated using the context APIs. For every other API we needed a different context: for streaming we needed StreamingContext, for SQL SQLContext, and for Hive HiveContext. But as the Dataset and DataFrame APIs are becoming the new standard, we need an entry point built for them. So in Spark 2.0, we have a new entry point for the Dataset and DataFrame APIs called SparkSession.

SparkSession is essentially the combination of SQLContext, HiveContext and, in the future, StreamingContext. All the APIs available on those contexts are available on SparkSession as well. A SparkSession internally holds a SparkContext for the actual computation.

 

val sparkSession = SparkSession.builder
.master("local")
.appName("spark session example")
.getOrCreate()
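Once created, the session is the single handle for reading data and running SQL. A small sketch of usage (the JSON path below is only a placeholder):

// Read a DataFrame through the session (the path is a placeholder)
val df = sparkSession.read.json("/path/to/people.json")

// Register it as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
sparkSession.sql("SELECT count(*) FROM people").show()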

New Accumulator API: A new Accumulator API has been designed that has a simpler type hierarchy and supports specialization for primitive types. The old Accumulator API has been deprecated but retained for backward compatibility.
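A minimal sketch of the new API, reusing the sparkSession created above (the accumulator name and the counting logic are just for illustration):

// New-style accumulator (Spark 2.0), registered through the SparkContext
val errorCount = sparkSession.sparkContext.longAccumulator("errorCount")

sparkSession.sparkContext.parallelize(1 to 100).foreach { n =>
  if (n % 10 == 0) errorCount.add(1)
}

println(errorCount.value)   // prints 10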

 

Catalog API: In Spark 2.0, a new Catalog API is introduced to access metadata. You can fetch the list of databases as well as the list of tables using this API.

To list all databases:

val sparkSession = SparkSession.builder
           .master("local")
           .appName("catalog example")
           .getOrCreate()

val catalog = sparkSession.catalog
catalog.listDatabases().select("name").show()
To list all tables:
val sparkSession = SparkSession.builder
           .master("local")
           .appName("example")
           .getOrCreate()

val catalog = sparkSession.catalog

catalog.listTables().select("name").show()
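The Catalog also exposes other metadata helpers. A small, hedged addition (the table name below is just a placeholder and must already exist as a table or temporary view):

// Show the current database and the columns of a (placeholder) table
println(catalog.currentDatabase)
catalog.listColumns("my_table").select("name", "dataType").show()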

This blog provides a quick introduction to using Spark 2.0 and demonstrates the basic functionality of the new APIs. This is the start of using Spark with Scala; from next week onwards we will keep working on this tutorial to make it grow, adding more functionality and more modules to it together. If you have any changes, feel free to send them.
If you have any suggestions, feel free to share them with us 🙂 Stay tuned.
You can access the full code from here

 


Hadoop on Multi Node Cluster

Step 1: Installing Java:

Java is the primary requirement for running Hadoop, so make sure Java is installed on your system by checking with the following command:

$ java -version

If you don’t have Java installed on your system, install it first.

Step 2: Creating a Hadoop User:

We recommend creating a normal (non-root) account for Hadoop. Create a system account using the following commands:

$ adduser hadoop
$ passwd hadoop

Step 2.1: Generate SSH Keys

After creating the account, we also need to set up key-based SSH access to the account itself. To do this, execute the following commands.

[root@rklick01 ~]# su hadoop
[hadoop@rklick01 root]$ cd
[hadoop@rklick01 ~]$
[hadoop@rklick01 ~]$ ssh-keygen -t rsa

We would see these types of logs; follow the instructions shown:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
/home/hadoop/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
f2:fb:77:5a:e3:e3:9d:b6:03:40:04:ad:2a:be:c9:37 hadoop@rklick01
The key's randomart image is:
+--[ RSA 2048]----+
|         .+.     |
|           o     |
|          o      |
|         . .     |
|      . S   .    |
|     . +     .   |
|    . . .     +  |
|    ...E .  .oo=.|
|     +o o....++++|
+-----------------+
To access the worker nodes via SSH without providing a password, copy the SSH key to the first node:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick01

We would see these types of logs

Now try logging into the machine, with "ssh 'hadoop@rklick01'", and 
check in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
Copy the SSH key to the second node:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick02

We would see these types of logs

hadoop@rklick02's password: 
Now try logging into the machine, with "ssh 'hadoop@rklick02'", and
check in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
Copy the SSH key to the third node:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick03

We would see these types of logs

hadoop@rklick03's password: 
Now try logging into the machine, with "ssh 'hadoop@rklick03'", and check 
in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

How to check that the SSH keys are working:

[hadoop@rklick01 ~]$ ssh 'hadoop@rklick02'

We would see these types of logs:

Last login: Thu May  5 05:22:56 2016 from rklick01
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ exit
logout
Connection to rklick02 closed.
[hadoop@rklick01 ~]$ 

That is, once SSH is set up successfully, you can reach rklick02 without a password. After the exit command, you are back on rklick01.

Step 3. Downloading Hadoop 2.6.0

$ wget http://apache.claz.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar xzf hadoop-2.6.0.tar.gz

Now rename the hadoop-2.6.0 directory to hadoop:

$ mv hadoop-2.6.0 hadoop

Step 4. Configure Hadoop Pseudo-Distributed Mode

4.1. Setup Environment Variables

4.1.1. Edit the bashrc file

First we need to set the environment variables used by Hadoop. Edit the ~/.bashrc file and append the following values at the end of the file.

$ vi ~/.bashrc

Add these lines to .bashrc:

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

 

Now apply the changes to the current running environment:

$ source ~/.bashrc

4.1.2. Edit the Hadoop Env. file

Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable. Change the Java path to match the installation on your system.

$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add this line to the file:

export JAVA_HOME=/usr/

4.2. Edit Configuration Files

Hadoop has many configuration files, which need to be configured according to the requirements of your Hadoop infrastructure. Let's start with the basic cluster configuration. First navigate to the location below:

$ cd $HADOOP_HOME/etc/hadoop

4.2.1 Edit the core-site.xml File

[hadoop@rklick01 hadoop]$ vi core-site.xml

Edit it like this:

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://23.227.167.180:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

After updating, the file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://23.227.167.180:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

4.2.2 Edit the hdfs-site.xml File

[hadoop@rklick01 hadoop]$ vi hdfs-site.xml

Edit it like this:

<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>

<property>
  <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
  <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>

After updating, the file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>

<property>
  <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
  <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>

4.2.3 Edit the mapred-site.xml File

[hadoop@rklick01 hadoop]$ vi mapred-site.xml

(If mapred-site.xml does not exist yet, copy it from mapred-site.xml.template first.) Edit it like this:

<configuration>
 <property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>

After updating, the file looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>

4.2.4 Edit the yarn-site.xml File

[hadoop@rklick01 hadoop]$ vi yarn-site.xml

Edit it like this:

<configuration>
 <property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
</configuration>

After updating, the file looks like this:

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
<!-- Site specific YARN configuration properties -->

</configuration>                   

4.3. Copy the Configuration to All Other Nodes

Copy all configuration to rklick02:

[hadoop@rklick01 ~]$ scp -r hadoop rklick02:/home/hadoop/

Copy all configuration to rklick03:

[hadoop@rklick01 ~]$ scp -r hadoop rklick03:/home/hadoop/

We would see these types of logs

LICENSE.txt                                                                                                   100%   15KB  15.1KB/s   00:00    
README.txt                                                                                                    100% 1366     1.3KB/s   00:00    
libhadoop.so                                                                                                  100%  787KB 787.1KB/s   00:00    
....     
ETC

4.4. Copy the SSH Key to All Nodes

Copy the key to rklick01:

[hadoop@rklick01 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick01

We would see these types of logs

Now try logging into the machine, with "ssh 'hadoop@rklick01'", and
check in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.


Copy the key to rklick02:

[hadoop@rklick01 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick02

We would see these types of logs

Now try logging into the machine, with "ssh 'hadoop@rklick02'", and
check in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

Copy the key to rklick03:

[hadoop@rklick01 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick03

We would see these types of logs

Now try logging into the machine, with "ssh 'hadoop@rklick03'", and
check in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

4.5. Set Permissions on the authorized_keys File

[hadoop@rklick01 ~]$ chmod 0600 ~/.ssh/authorized_keys

How to test that SSH is set up successfully:
[hadoop@rklick01 ~]$ ssh 'hadoop@rklick02'
Last login: Thu May  5 05:22:56 2016 from rklick01
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ exit
logout
Connection to rklick02 closed.
[hadoop@rklick01 ~]$

4.6. Format Namenode

Now format the NameNode using the following command. Make sure the storage directories configured in hdfs-site.xml (here, under /home/hadoop/hadoopdata) are empty before formatting.

[hadoop@rklick01 hadoop]$ hdfs namenode -format

We would see these types of logs

16/05/05 05:35:08 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = rklick01/24.111.123.456
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.0
STARTUP_MSG:   classpath = /home/hadoop/hadoop/etc/hadoop:
/home/hadoop/hadoop/share/hadoop/common/lib/htrace-core-3.0.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jasper-runtime-5.5.23.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-logging-1.1.3.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/hamcrest-core-1.3.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jetty-util-6.1.26.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/zookeeper-3.4.6.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jersey-core-1.9.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/xmlenc-0.52.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/asm-3.2.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jets3t-0.9.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/log4j-1.2.17.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-api-1.7.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/mockito-all-1.8.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/httpclient-4.2.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/avro-1.7.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/activation-1.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jersey-json-1.9.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jaxb-api-2.2.2.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jsr305-1.3.9.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/junit-4.11.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/hadoop-auth-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/hadoop-annotations-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jackson-xc-1.9.13.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jetty-6.1.26.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jsp-api-2.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/paranamer-2.3.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-lang-2.6.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/curator-recipes-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/netty-3.6.2.Final.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-collections-3.2.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-compress-1.4.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/httpcore-4.2.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/gson-2.2.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-io-2.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-digester-1.8.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/protobuf-java-2.5.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/curator-client-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/stax-api-1.0-2.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jasper-compiler-5.5.23.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/xz-1.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/api-util-1.0.0-M20.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-httpclient-3.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/
home/hadoop/hadoop/share/hadoop/common/lib/jersey-server-1.9.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jettison-1.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jsch-0.1.42.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-el-1.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/curator-framework-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-net-3.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-math3-3.1.1.jar:
/home/hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/hadoop-nfs-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0-tests.jar:
/home/hadoop/hadoop/share/hadoop/hdfs:/home/hadoop/hadoop/share/hadoop/hdfs/lib/htrace-core-3.0.4.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jasper-runtime-5.5.23.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jetty-util-6.1.26.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jersey-core-1.9.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/xmlenc-0.52.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/guava-11.0.2.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/asm-3.2.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/servlet-api-2.5.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jsr305-1.3.9.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jetty-6.1.26.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jsp-api-2.1.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/xml-apis-1.3.04.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/netty-3.6.2.Final.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-io-2.4.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/xercesImpl-2.9.1.jar:
/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-codec-1.4.jar:/home/hadoop/hadoop/share/ha

ETC 

Step 5. Start Hadoop Cluster

Let's start the Hadoop cluster using the scripts provided by Hadoop. Navigate to the Hadoop home directory and execute the scripts one by one.

$ cd $HADOOP_HOME

Now run the start-dfs.sh script:

[hadoop@rklick01 hadoop]$ sbin/start-dfs.sh

Now run the start-yarn.sh script:

[hadoop@rklick01 hadoop]$ sbin/start-yarn.sh

Step 6. Access Hadoop Services in Browser

The Hadoop NameNode listens on port 50070 by default. Access your server on port 50070 in your favorite web browser:

http://24.111.123.456:50070/ 

Now access port 8088 to get information about the cluster and all applications:

http://24.111.123.456:8088/ 

Access port 50090 to get details about the secondary NameNode:

http://24.111.123.456:50090/ 

Access port 50075 to get details about the DataNode:

http://24.111.123.456:50075/ 

EXTRA POINTS

1. How to change Hadoop user password

[root@rklick01 ~]# passwd hadoop
Changing password for user hadoop.
New password: 
BAD PASSWORD: it is based on a dictionary word
BAD PASSWORD: is too simple
Retype new password: 
passwd: all authentication tokens updated successfully.

2. Extra setup step after recreating the hadoop user

Delete the hadoopdata folder before formatting the NameNode.

That is, before running this command:

[hadoop@rklick01 hadoop]$ hdfs namenode -format

you should delete the hadoopdata folder first.

For reference, a line like this builds an HDFS URL in Scala (for reading from or writing to the cluster):

val hPath = s"hdfs://$host:$port$path/$filename.$format"
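As a hedged example of how such a path might be used from Spark (all values below are placeholders, and a SparkSession named sparkSession is assumed to be available):

val host = "rklick01"
val port = 9000
val path = "/user/hadoop"
val filename = "cars"
val format = "json"

val hPath = s"hdfs://$host:$port$path/$filename.$format"

// Read the file from HDFS into a DataFrame and display it
val df = sparkSession.read.format(format).load(hPath)
df.show()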

We will keep working to make this tutorial more useful and will be adding more content to it over time. If you have any suggestions, feel free to share them with us :) Stay tuned.

Tutorial: DataFrame API Functionalities using Spark 1.6

In the previous tutorial, we explained Spark SQL and DataFrame operations using Spark 1.6. In this tutorial we cover DataFrame API functionality, and we provide a running example of each feature for better support. Let's begin the tutorial and discuss the DataFrame API operations using Spark 1.6.

DataFrame API Examples Using Different Types of Functionality

The different types of DataFrame operations are:

1. Action
2. Basic
3. Operations

Here we use a JSON document named cars.json and generate a table based on the schema in the JSON document, as sketched below.
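A minimal sketch of loading such a file in Spark 1.6 and registering it as a table (the file path is a placeholder and the schema is inferred from the JSON):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("cars-example"))
val sqlContext = new SQLContext(sc)

// Infer the schema from the JSON document and register it as a table
val cars = sqlContext.read.json("cars.json")
cars.printSchema()
cars.registerTempTable("cars")
sqlContext.sql("SELECT * FROM cars").show()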


Tutorial: Spark SQL and DataFrame Operations using Spark 1.6

In the previous tutorial, we explained Spark Core and RDD functionality. In this tutorial we cover Spark SQL and DataFrame operations on different sources such as JSON, text and CSV data files, and we provide a running example of each feature for better support. Let's begin the tutorial and discuss Spark SQL and DataFrame operations using Spark 1.6.

SparkSQL

Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Spark SQL executes SQL queries written using either basic SQL syntax or HiveQL, and it can also be used to read data from an existing Hive installation. It provides a programming abstraction called DataFrame and can act as a distributed SQL query engine.
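As a hedged sketch of what such an operation can look like in Spark 1.6, here is a DataFrame built from a plain text source (the file name, separator and Person schema are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical schema for a comma-separated text file of people
case class Person(name: String, age: Int)

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("sparksql-example"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Parse each line of the text file and convert the RDD to a DataFrame
val peopleDF = sc.textFile("people.txt")
  .map(_.split(","))
  .map(parts => Person(parts(0), parts(1).trim.toInt))
  .toDF()

peopleDF.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 20").show()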

ElasticSearch Basic

Elasticsearch:

 


ElasticSearch can be described as follows:

  1. A distributed real-time storage system.
  2. Every field is indexed and searchable.
  3. A distributed search engine with real-time analytics.
  4. Capable of scaling to hundreds of servers and petabytes of structured and unstructured data.

And it packages up all this functionality into a standalone server that your application can talk to via a simple RESTful API, using a web client from your favorite programming language, or even from the command line.

Installing ElasticSearch :

Step 1 — Installing Java

sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get install openjdk-8-jdk
(or sudo apt-get -y install oracle-java8)

To verify your JRE is installed and can be used, run the command:

java -version

Step 2 — Downloading and Installing ElasticSearch

ElasticSearch can be downloaded directly from elastic.co in zip, tar.gz, deb, or rpm packages. For Ubuntu, it’s best to use the deb (Debian) package which will install everything you need to run ElasticSearch.


wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.1.0/elasticsearch-2.1.0.deb

Then install it in the usual Ubuntu way with the dpkg command like this:


sudo dpkg -i elasticsearch-2.1.0.deb

Elasticsearch is now installed.

ElasticSearch Configuration :

The Elasticsearch configuration files are in the /elasticsearch/config directory. There are two files:

elasticsearch.yml:

Configures the Elasticsearch server settings. This is where all server options are stored.

logging.yml:

This file configures logging only. To begin with, you can leave the default logging options.

Edit the configuration file as follows:

(a) How to change the network host:

Open the elasticsearch.yml file.

Change the network host to localhost like this :


# network.host: 192.168.0.1 (Before)
network.host: localhost (After Changing)

This restricts outside users from accessing the Elasticsearch instance, so they cannot use the HTTP API to read your data or shut down your Elasticsearch cluster.

(b) How to change the cluster name:

Change the cluster name according to your project. Here, the cluster name is akshay_es_blog:

cluster.name: akshay_es_blog

 

How to run Elasticsearch:

Elasticsearch is now ready to run. You can start it up in the foreground with this:


./bin/elasticsearch

How to test that Elasticsearch is running:

You can open another terminal window and run the following:

curl 'http://localhost:9200/?pretty'

You should see a response like this:


{
  "name" : "Mad Jack",
  "cluster_name" : "akshay_es_blog",
  "version" : {
    "number" : "2.1.0",
    "build_hash" : "72cd1f1a3eee09505e036106146dc1949dc5dc87",
    "build_timestamp" : "2015-11-18T22:40:03Z",
    "build_snapshot" : false,
    "lucene_version" : "5.3.1"
  },
  "tagline" : "You Know, for Search"
}

This means that your Elasticsearch cluster is up and running, and we can start experimenting with it.

Key Points:

We will discuss these points one by one:

  • What is a Node?
  • How to shut down ES?
  • What is the Node Client?
  • What is the Transport Client?
  • What is an HTTP method or verb?
  • Example of a complete Elasticsearch request.
  • CRUD operations using Elasticsearch.
  • Searching using Elasticsearch.

What is a Node?

A node is a running instance of Elasticsearch. A cluster is a group of nodes with the same cluster.name that are working together to share data and to provide failover and scale, although a single node can form a cluster all by itself.

Shutdown ES:

When Elasticsearch is running in the foreground, you can stop it by pressing Ctrl-C; otherwise, you can shut it down with the shutdown API:


curl -XPOST 'http://localhost:9200/_shutdown'

Talking to Elasticsearch:

How you talk to Elasticsearch depends on whether you are using Java.

Java API:

If you are using Java, Elasticsearch comes with two built-in clients that you can use in your code:

Node client:

The node client joins a local cluster as a non-data node. In other words, it doesn't hold any data itself, but it knows what data lives on which node in the cluster, and can forward requests directly to the correct node.

Transport client:

The lighter-weight transport client can be used to send requests to a remote cluster. It doesn't join the cluster itself, but simply forwards requests to a node in the cluster. Both Java clients talk to the cluster over port 9300, using the native Elasticsearch transport protocol. The nodes in the cluster also communicate with each other over port 9300. If this port is not open, your nodes will not be able to form a cluster.
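For illustration, a rough sketch of using the transport client from Scala with the ES 2.x Java API (host, port, index, type and document ID are assumptions, and exact builder calls may differ between versions):

import java.net.InetAddress
import org.elasticsearch.client.transport.TransportClient
import org.elasticsearch.common.transport.InetSocketTransportAddress

// Connect to a node's transport port (9300); localhost is an assumption
val client = TransportClient.builder().build()
  .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300))

// Fetch a document previously indexed over HTTP (index/type/id are assumptions)
val response = client.prepareGet("rklick", "user_details", "1").get()
println(response.getSourceAsString)

client.close()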

RESTful API with JSON over HTTP :

All other languages can communicate with Elasticsearch over port 9200 using a RESTful API, accessible with your favorite web client. In fact, as you have seen, you can even talk to Elasticsearch from the command line by using the curl command.

HTTP method or Verb:

The appropriate HTTP method or verb: GET, POST, PUT, HEAD, or DELETE.

 

A Complete Elasticsearch Request with All Components:

A request to Elasticsearch consists of the same parts as any HTTP request:

curl -X<VERB> '<PROTOCOL>://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>'

The parts marked with < > above are:

VERB

The appropriate HTTP method or verb: GET, POST, PUT, HEAD, or DELETE.

PROTOCOL

Either http or https (if you have an https proxy in front of Elasticsearch.)

HOST

The hostname of any node in your Elasticsearch cluster, or localhost for a node on your local machine.

PORT

The port running the Elasticsearch HTTP service, which defaults to 9200 .

QUERY_STRING

Any optional query-string parameters (for example ?pretty will pretty-print the
JSON response to make it easier to read.)

BODY

A JSON-encoded request body (if the request needs one.)

For instance, to count the number of documents in the cluster, we could use this:


curl -XGET 'http://localhost:9200/_count?pretty' -d '
{
"query": {
"match_all": {}
}
}
'

Elasticsearch returns an HTTP status code like 200 OK and (except for HEAD requests)
a JSON-encoded response body. The preceding curl request would respond with a
JSON body like the following:

{
  "count" : 1562,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  }
}

Elasticsearch is document oriented, meaning that it stores entire objects or documents.
It not only stores them, but also indexes the contents of each document in order to
make them searchable. In Elasticsearch, you index, search, sort, and filter documents
—not rows of columnar data. This is a fundamentally different way of thinking about
data and is one of the reasons Elasticsearch can perform complex full-text search.

JSON

Elasticsearch uses JavaScript Object Notation, or JSON, as the serialization format for
documents. JSON serialization is supported by most programming languages, and
has become the standard format used by the NoSQL movement. It is simple, concise,
and easy to read.

NOTE :

Relational DB ⇒ Databases ⇒ Tables ⇒ Rows⇒ Columns
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields

An Elasticsearch cluster can contain multiple indices (databases), which in turn contain multiple types (tables). These types hold multiple documents (rows), and each document has multiple fields (columns).

CRUD Operations using Elasticsearch:

Indexing user details:

The act of storing data in Elasticsearch is called indexing. To index a document, we need to decide where to store it.

To store user details, we need a type (a 'table' in RDBMS terms) and an index (a 'database' in RDBMS terms).

So first we decide on the type name and the index name.

Suppose index name => rklick

type => user_details

The following command stores user details in the user_details type of the rklick index:


curl -X PUT 'http://localhost:9200/rklick/user_details/1' -d  '{
"first_name" : "Akshay",
"last_name" : "Saxena",
"age" :24,
"address" :"MG road,Gurgaon"
}'

 

Notice that the path /rklick/user_details/1 contains three pieces of information:
rklick => the index name
user_details => the type name
1 => the ID of this particular user
The request body, the JSON document, contains all the information about this user. His name is Akshay, he is 24, and he lives on MG road.

We can add details for the user Himanshu to the rklick index like this:


curl -X PUT 'http://localhost:9200/rklick/user_details/2' -d  '{
"first_name" : "Himanshu",
"last_name" : "Saxena",
"age" :24,
"address" :"MG road,Gurgaon"
}'

Another user is Anand; we can add his information like this:


curl -X PUT 'http://localhost:9200/rklick/user_details/3' -d  '{
"first_name" : "Anand",
"last_name" : "Kumar",
"age" :26,
"address" :"Chatterpur,Gurgaon"
}'

Retrieving a Document(User):

We can get a user's data by ID; the Elasticsearch GET request looks like this:

curl -X GET 'http://localhost:9200/rklick/user_details/1?pretty'

Now we get a response like this:

{
  "_index" : "rklick",
  "_type" : "user_details",
  "_id" : "1",
  "_version" : 3,
  "found" : true,
  "_source":{
"first_name" : "Akshay",
"last_name" : "Saxena",
"age" :24,
"address" :"MG road,Gurgaon"
}
}

Searching Using the Query DSL (Domain-Specific Language):

a) Searching by Last Name:


curl -X GET 'http://localhost:9200/rklick/user_details/_search?pretty' -d '{
"query" : {
"term" : {
"last_name" : "kumar"
}
}
}'

Result would be like this :


{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "rklick",
      "_type" : "user_details",
      "_id" : "3",
      "_score" : 0.30685282,
      "_source":{
"first_name" : "Anand",
"last_name" : "Kumar",
"age" :26,
"address" :"Chatterpur,Gurgaon"
}
    } ]
  }
}

b) Searching Using Text :

curl -X GET 'http://localhost:9200/rklick/user_details/_search?pretty' -d '{
"query" : {
"match" : {
"address" : "Chatterpur"
}
}
}
'

Result would be like this :

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "rklick",
      "_type" : "user_details",
      "_id" : "3",
      "_score" : 0.19178301,
      "_source":{
"first_name" : "Anand",
"last_name" : "Kumar",
"age" :26,
"address" :"Chatterpur,Gurgaon"
}
    } ]
  }
}


c) Phrase Search:

Sometimes we want to match an exact sequence of words, i.e. a phrase. For example, to match the words "MG" and "road" in exactly that order ("MG road"), we can use a match_phrase query.

i) Searching for the "MG road" phrase:

In this case, the query would be like this:


curl -X GET 'http://localhost:9200/rklick/user_details/_search?pretty' -d '{
"query" : {
"match_phrase" : {
"address" : "MG road"
}
}
}'

As we know, two users have the "MG road" sequence in their address field, so the result would be like this:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "rklick",
      "_type" : "user_details",
      "_id" : "2",
      "_score" : 0.30685282,
      "_source":{
"first_name" : "Himanshu",
"last_name" : "Saxena",
"age" :24,
"address" :"MG road,Gurgaon"
}
    }, {
      "_index" : "rklick",
      "_type" : "user_details",
      "_id" : "1",
      "_score" : 0.30685282,
      "_source":{
"first_name" : "Akshay",
"last_name" : "Saxena",
"age" :24,
"address" :"MG road,Gurgaon"
}
    } ]
  }
}


ii) Searching for the "road MG" phrase:

In this case, the query would be like this:


curl -X GET 'http://localhost:9200/rklick/user_details/_search?pretty' -d '{
"query" : {
"match_phrase" : {
"address" : "road MG"
}
}
}'



As we know, no user has the "road MG" sequence in their address field, so the result would be like this:


{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}


We are continuously working to create more useful tutorials. If you have any suggestions, feel free to share them with us 🙂 Stay tuned.

Play 2.4.x & RethinkDB: Classic CRUD application backed by RethinkDB

In this blog we have created a classic CRUD application using Play 2.4.x, Scala and RethinkDB. Scala meets the object-oriented world in a functional way, Play is a high-velocity web framework for Java & Scala, and RethinkDB is an open-source, scalable database that makes building realtime apps dramatically easier.

