Tutorial: Quick overview of Spark 2.0.1 Core Functionality

In this blog we will discuss Spark 2.0.1 core functionality. We demonstrate the basics of SparkSession, Spark SQL, and the DataFrame API.

SparkSession:

SparkSession is the new entry point of Spark. In previous versions (1.6.x), SparkContext was the entry point for Spark; in Spark 2.0.1 it is SparkSession. A SparkSession internally holds a SparkContext for the actual computation. As we know, RDD was the main API, created and manipulated using the context APIs. For every other API, we needed a different context: for streaming, StreamingContext; for SQL, SQLContext; and for Hive, HiveContext. But as the Dataset and DataFrame APIs are becoming the new standard APIs, we need an entry point built for them. So in Spark 2.0.1 we have a new entry point for the Dataset and DataFrame APIs, called SparkSession.

Creating SparkSession:

Here is how to create a SparkSession:

val spark = SparkSession.builder
  .master("local")
  .appName("spark example")
  .getOrCreate()

Once we have created a SparkSession, we can use it to read data.

Reading data using SparkSession:

It looks exactly like reading with SQLContext, and you can easily replace all your SQLContext code with SparkSession now. First, build the session as before:

val spark = SparkSession.builder
  .master("local")
  .appName("spark example")
  .getOrCreate()
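
For example (a small illustrative sketch; the JSON file path is only a placeholder), code that used SQLContext maps directly onto the session:

// Spark 1.6.x style:
//   val df = sqlContext.read.json("path/to/people.json")
// Spark 2.0.1 style -- same reader API, new entry point:
val df = spark.read.json("path/to/people.json")
df.show()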

Creating DataFrames:

With the help of SparkSession, applications can create DataFrames from an existing RDD or from external data sources. As an example, the following creates a DataFrame based on the content of a CSV file:

val dataFrame = spark.read.option("header", "true")
  .csv("src/main/resources/team.csv")
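
To quickly inspect what was loaded, you can, for example, print the schema and the first few rows (with only the header option set, every column is read as a string):

dataFrame.printSchema()
dataFrame.show(5)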

(Screenshot: dataFrame.show() output for the CSV file)

The following creates a DataFrame based on the content of a JSON file:

val dataFrame = spark.read.option("header", "true")
  .json("src/main/resources/cars_price.json")

dataFrame.show()

(Screenshot: dataFrame.show() output for the JSON file)

Creating Datasets:

Dataset is a new abstraction in Spark, introduced as an alpha API in Spark 1.6 and becoming a stable API in Spark 2.0.1. Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmission over the network. The major difference is that a Dataset is a collection of domain-specific objects, whereas an RDD is a collection of any objects. The domain-object part of the definition signifies the schema part of the Dataset, so the Dataset API is always strongly typed and optimized using the schema, whereas RDD is not. The Dataset definition also covers the DataFrame API: a DataFrame is a special Dataset with no compile-time checks for the schema. This makes Dataset the new single abstraction replacing RDD from earlier versions of Spark. We read data using the read.text API, which is similar to the textFile API of RDD. The following creates a Dataset based on the content of a text file and counts the words in it:

import spark.implicits._
val rklickData = spark.read.text("src/main/resources/rklick.txt").as[String]
val rklickWords = rklickData.flatMap(value => value.split("\\s+"))
val rklickGroupedWords = rklickWords.groupByKey(_.toLowerCase)
val rklickWordCount = rklickGroupedWords.count()
rklickWordCount.show()

(Screenshot: word-count Dataset output)

We would look at how we can create more useful tutorials and grow this one, adding more content to it over time. If you have any suggestions, feel free to share them 🙂 Stay tuned.

 

ElasticSearch Character Filter

In this post, I am going to explain how the Elasticsearch character filter works. There are the following steps to do this.
Step 1: Set the mapping for your index. Suppose our index name is 'testindex' and the type is 'testtype'. Now we are going to set the analyzer and filter.


curl -XPUT 'localhost:9200/testindex' -d '
{
  "settings": {
    "analysis": {
      "char_filter": {
        "quotes": {
          "type": "mapping",
          "mappings": [
            "&=>and"
          ]
        }
      },
      "analyzer": {
        "gramAnalyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": ["quotes"],
          "filter": [
            "lowercase"
          ]
        },
        "whitespaceAnalyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "testtype": {
      "_all": {
        "analyzer": "gramAnalyzer",
        "search_analyzer": "gramAnalyzer"
      },
      "properties": {
        "Name": {
          "type": "string",
          "include_in_all": true,
          "analyzer": "gramAnalyzer",
          "search_analyzer": "gramAnalyzer",
          "store": true
        }
      }
    }
  }
}'

"Character filters are used to preprocess the string of characters before it is passed to the tokenizer. A character filter may be used to strip out HTML markup, or to convert '&' characters to the word 'and'." As you have seen above, we have set a filter to convert '&' to 'and'.

Step 2: Index your data. Now we are going to index data containing the '&' character, but we want to be able to fetch it by searching for 'and'.


curl -XPOST 'localhost:9200/testindex/testtype/1' -d '{
"Name" :"karra&john"
}'

Step 3: Check how the analyzer works on your indexed data:


curl 'http://localhost:9200/testindex/_analyze?analyzer=gramAnalyzer' \
  -d 'karra&john'

Step 4: Fetch data with a match query. Now I want to fetch data with a match query. As you have seen above, the indexed data is 'karra&john', but we will fetch it by searching for 'karraandjohn'. See the query below.


curl -XGET 'http://localhost:9200/testindex/testtype/_search' -d '
{
  "query": {
    "match": {
      "Name": {
        "query": "karraandjohn",
        "analyzer": "gramAnalyzer"
      }
    }
  }
}'

You can also do it like this:


curl -XGET 'http://localhost:9200/testindex/testtype/_search' -d '
{
  "query": {
    "match": {
      "Name": {
        "query": "karraandjohn"
      }
    }
  }
}'

Step 5: Fetch data with a filter:

curl -XGET 'http://localhost:9200/testindex/testtype/_search' -d '
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "Name": "karraandjohn" }
      }
    }
  }
}'

The result would look like this:


{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "testtype",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "Name" : "karra&john"
      }
    } ]
  }
}

So we have learned how the character filter works.

This is the start of Elasticsearch; from next week onwards we will be working on a new topic. If you have any suggestions, feel free to share them 🙂 Stay tuned.

Introduction to Spark 2.0

Overview of the Dataset, DataFrame and RDD APIs:

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

But due to issues with advanced optimization, the focus moved to DataFrames.

DataFrame brought custom memory management and runtime code generation, which greatly improved performance. So over the last year most of the improvements went into the DataFrame API.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

Though the DataFrame API solved many issues, it was not a good enough replacement for the RDD API. The major issues with the DataFrame API were the lack of compile-time safety and the inability to work with domain objects, which held people back from using the DataFrame API everywhere. With the introduction of the Dataset API in 1.6, that gap was filled.

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
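
As a small sketch (assuming a SparkSession named sparkSession, built as shown later in this post, with its implicits imported), a Dataset can be created from case class instances and transformed with plain lambdas:

import sparkSession.implicits._

case class Person(name: String, age: Long)
val people = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()   // Dataset[Person]
val adults = people.filter(p => p.age >= 21)                        // compile-time checked lambda
adults.show()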

So in Spark 2.0, the Dataset API becomes a stable API. The Dataset API combined with the DataFrame API should be able to cover most of the use cases where RDD was used earlier. So as Spark developers we are advised to start embracing these two APIs over the RDD API from Spark 2.0.

Points to be discussed :

Datasets : Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. Both the typed methods (e.g. map, filter, groupByKey) and the untyped methods (e.g. select, groupBy) are available on the Dataset class. Also, this new combined Dataset interface is the abstraction used for Structured Streaming.
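
A minimal sketch of the two styles on the same object (again assuming the sparkSession built below, with its implicits imported):

import sparkSession.implicits._

val ds = sparkSession.range(1, 6)              // Dataset[java.lang.Long]
val doubled = ds.map(n => n * 2)               // typed method: the lambda is checked at compile time
val df = ds.toDF("id")                         // a DataFrame is just Dataset[Row]
df.select("id").groupBy("id").count().show()   // untyped methods: columns are resolved at runtime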

For a long time, RDD was the standard abstraction of Spark. But from Spark 2.0, Dataset becomes the new abstraction layer for Spark. Though the RDD API will still be available, it becomes a low-level API, used mostly for runtime and library development. All user-land code will be written against the Dataset abstraction and its subset, the DataFrame API.

Dataset is a superset of the DataFrame API, which was released in Spark 1.3. Dataset together with DataFrame brings better performance and flexibility to the platform compared to the RDD API. Dataset will also replace RDD as the abstraction for streaming in future releases.

The major difference is that a Dataset is a collection of domain-specific objects, whereas an RDD is a collection of any objects. The domain-object part of the definition signifies the schema part of the Dataset. So the Dataset API is always strongly typed and optimized using the schema, whereas RDD is not.

The Dataset definition also covers the DataFrame API: a DataFrame is a special Dataset with no compile-time checks for the schema. This makes Dataset the new single abstraction replacing RDD from earlier versions of Spark.

SparkSession: A new entry point that replaces the old SQLContext and HiveContext. For users of the DataFrame API, a common source of confusion for Spark is which “context” to use. Now you can use SparkSession, which subsumes both, as a single entry point, as demonstrated in this notebook. Note that the old SQLContext and HiveContext are still kept for backward compatibility.

In earlier versions of Spark, SparkContext was the entry point for Spark. As RDD was the main API, it was created and manipulated using the context APIs. For every other API, we needed a different context: for streaming, StreamingContext; for SQL, SQLContext; and for Hive, HiveContext. But as the Dataset and DataFrame APIs are becoming the new standard APIs, we need an entry point built for them. So in Spark 2.0, we have a new entry point for the Dataset and DataFrame APIs called SparkSession.

SparkSession is essentially a combination of SQLContext, HiveContext and, in the future, StreamingContext. All the APIs available on those contexts are available on SparkSession as well. A SparkSession internally holds a SparkContext for the actual computation.

 

val sparkSession = SparkSession.builder
.master("local")
.appName("spark session example")
.getOrCreate()

New Accumulator API: A new Accumulator API has been designed that has a simpler type hierarchy and supports specialization for primitive types. The old Accumulator API has been deprecated but is retained for backward compatibility.
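
For instance, a minimal sketch of the new API (assuming the sparkSession created above):

val negatives = sparkSession.sparkContext.longAccumulator("negatives")
sparkSession.sparkContext.parallelize(Seq(1, -2, 3, -4, 5)).foreach { n =>
  if (n < 0) negatives.add(1)   // updated on the executors
}
println(negatives.value)        // merged value read back on the driver: 2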

 

Catalog API: In Spark 2.0, a new Catalog API is introduced to access metadata. You can fetch the list of all databases as well as the list of all tables using this API.

To access the list of all databases:

val sparkSession = SparkSession.builder
           .master("local")
           .appName("catalog example")
           .getOrCreate()

val catalog = sparkSession.catalog
catalog.listDatabases().select("name").show()

To access the list of all tables:

val sparkSession = SparkSession.builder
           .master("local")
           .appName("example")
           .getOrCreate()

val catalog = sparkSession.catalog

catalog.listTables().select("name").show()

This blog provides a quick introduction to using Spark 2.0 and demonstrates the basic functionality of the new APIs. This is the start of using Spark with Scala; from next week onwards we will keep working on this tutorial to make it grow. We will look at how we can add more functionality and more modules to it over time. If you have any changes, feel free to send them.
If you have any suggestions, feel free to share them 🙂 Stay tuned.
You can access the full code from here.

 

Elasticsearch Graph capabilities

Step 1 — Downloading and Installing Elasticsearch :


a) Download Elasticsearch using the following command:


wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.2/elasticsearch-2.3.2.tar.gz

b) After downloading, untar it using this command:

tar -xzf elasticsearch-2.3.2.tar.gz

c) Go to the Elasticsearch directory:

cd elasticsearch-2.3.2

Step 2 – Install Graph into Elasticsearch :

a) For Graph license :

bin/plugin install license

b) For installing Graph plugin :

bin/plugin install graph

Step 3: Install Graph into Kibana :

bin/kibana plugin --install elasticsearch/graph/latest

Step 4 : Run Elasticsearch :

bin/elasticsearch

Step 5 : Run kibana :

bin/kibana

Step 6 : Ingest data into Elasticsearch :
Download the CSV from the following link and ingest it into Elasticsearch, either using curl or by following my last blog to insert spreadsheet data into Elasticsearch directly.

https://drive.google.com/file/d/0BxeFnKIg5Lg_UEFqTkdNMmt6SzQ/view

Step 7 : Go to Graph UI :

After ingesting, go to:

http://HOST:PORT/app/graph 

You would see this screen

(Screenshot: initial Graph UI)

Step 8 : Select your index and fields :

In this step, select your index from the drop-down box and also select Item, Region and Rep from the field column.

Step 9 : Type your data in search box to get relational graph :

Suppose we type "central pencil" in the search box.

You would see like this :

(Screenshot: graph for "central pencil")

 

Click on the link between central and pencil; you would then see a link summary on the right-hand side.

The link summary explains that:

a) 104 documents contain pencil.

b) 192 documents contain central.

c) 72 documents contain both pencil and central.

Hadoop on Multi Node Cluster

Step 1: Installing Java:

Java is the primary requirement for running Hadoop, so make sure you have Java installed on your system using the following command:

$ java -version

If you don’t have Java installed on your system, use one of the following links to install it first.

Step 2: Creating Hadoop User :

We recommend creating a normal (not root) account for Hadoop to work under. So create a system account using the following commands:

$ adduser hadoop
$ passwd hadoop

Step 3 : Generate SSH Keys

After creating the account, it is also required to set up key-based SSH to its own account. To do this, execute the following commands.

[root@rklick01 ~]# su hadoop
[hadoop@rklick01 root]$ cd
[hadoop@rklick01 ~]$
[hadoop@rklick01 ~]$ ssh-keygen -t rsa

We would see these types of logs; follow the instructions:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
/home/hadoop/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
f2:fb:77:5a:e3:e3:9d:b6:03:40:04:ad:2a:be:c9:37 hadoop@rklick01
The key's randomart image is:
+--[ RSA 2048]----+
|         .+.     |
|           o     |
|          o      |
|         . .     |
|      . S   .    |
|     . +     .   |
|    . . .     +  |
|    ...E .  .oo=.|
|     +o o....++++|
+-----------------+
To access the worker nodes via SSH without providing a password, copy the SSH key to the first node:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick01

We would see these types of logs

Now try logging into the machine, with "ssh 'hadoop@rklick01'", and 
check in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
Copy the SSH key to the second node:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick02

We would see these types of logs

hadoop@rklick02's password: 
Now try logging into the machine, with "ssh 'hadoop@rklick02'", and
check in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
Copy the SSH key to the third node:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick03

We would see these types of logs

hadoop@rklick03's password: 
Now try logging into the machine, with "ssh 'hadoop@rklick03'", and check 
in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

How to check that the SSH keys are working

[hadoop@rklick01 ~]$ ssh 'hadoop@rklick02'

We would see these types of logs:

Last login: Thu May  5 05:22:56 2016 from rklick01
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ exit
logout
Connection to rklick02 closed.
[hadoop@rklick01 ~]$ 

i.e., after SSH is successfully set up, you can reach rklick02 without a password. After the exit command, you are back on rklick01.

Step 4. Downloading Hadoop 2.6.0

$ wget http://apache.claz.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar xzf hadoop-2.6.0.tar.gz

Now rename Hadoop 2.6.0 to Hadoop

$ mv hadoop-2.6.0 hadoop

Step 5. Configure Hadoop Pseudo-Distributed Mode

5.1. Setup Environment Variables

5.1.1. Edit the bashrc file

First we need to set the environment variables used by Hadoop. Edit the ~/.bashrc file and append the following values at the end of the file.

$ vi ~/.bashrc

Add these lines to .bashrc:

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

 

Now apply the changes in the current running environment:

$ source ~/.bashrc

5.1.2. Edit the Hadoop Env. file

Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable. Change the Java path as per the installation on your system.

$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add this line to the file:

export JAVA_HOME=/usr/

5.2. Edit Configuration Files

Hadoop has many configuration files, which need to be configured as per the requirements of your Hadoop infrastructure. Let's start with the basic cluster setup configuration. First navigate to the location below:

$ cd $HADOOP_HOME/etc/hadoop

5.2.1 Edit core-site Files

[hadoop@rklick01 hadoop]$ vi core-site.xml

Edit like this

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://23.227.167.180:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

After updating, the file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://23.227.167.180:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

5.2.2 Edit hdfs-site Files

[hadoop@rklick01 hadoop]$ vi hdfs-site.xml

Edit like this

<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>

<property>
  <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
  <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>

After updating, the file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>

<property>
  <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>

<property>
  <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>

5.2.3 Edit mapred-site Files

[hadoop@rklick01 hadoop]$ vi mapred-site.xml

Edit like this

<configuration>
 <property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>

After updating, the file looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>

5.2.4 Edit yarn-site Files

[hadoop@rklick01 hadoop]$ vi yarn-site.xml

Edit like this

<configuration>
 <property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
</configuration>

After updating, the file looks like this:

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
<!-- Site specific YARN configuration properties -->

</configuration>                   

5.3. Copy the configuration files to all other nodes

Copy all config to rklick02

[hadoop@rklick01 ~]$ scp -r hadoop rklick02:/home/hadoop/

Copy all config to rklick03

[hadoop@rklick01 ~]$ scp -r hadoop rklick03:/home/hadoop/

We would see these types of logs

LICENSE.txt                                                                                                   100%   15KB  15.1KB/s   00:00    
README.txt                                                                                                    100% 1366     1.3KB/s   00:00    
libhadoop.so                                                                                                  100%  787KB 787.1KB/s   00:00    
....     
ETC

5.4. Copy the SSH key to all nodes

Copy the key to rklick01:

[hadoop@rklick01 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick01

We would see these types of logs

Now try logging into the machine, with "ssh 'hadoop@rklick01'", and
check in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.


Copy the key to rklick02:

[hadoop@rklick01 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick02

We would see these types of logs

Now try logging into the machine, with "ssh 'hadoop@rklick02'", and
check in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

Copy the key to rklick03:

[hadoop@rklick01 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@rklick03

We would see these types of logs

Now try logging into the machine, with "ssh 'hadoop@rklick03'", and
check in:
  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

5.5. Set permissions on the authorized_keys file

[hadoop@rklick01 ~]$ chmod 0600 ~/.ssh/authorized_keys

How to test that SSH is set up successfully:
[hadoop@rklick01 ~]$ ssh 'hadoop@rklick02'
Last login: Thu May  5 05:22:56 2016 from rklick01
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ 
[hadoop@rklick02 ~]$ exit
logout
Connection to rklick02 closed.
[hadoop@rklick01 ~]$

5.6. Format the Namenode

Now format the Namenode using the following command; make sure the hadoopdata storage directory is empty (see the extra points at the end of this post).

[hadoop@rklick01 hadoop]$ hdfs namenode -format

We would see these types of logs

16/05/05 05:35:08 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = rklick01/24.111.123.456
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.0
STARTUP_MSG:   classpath = /home/hadoop/hadoop/etc/hadoop:
/home/hadoop/hadoop/share/hadoop/common/lib/htrace-core-3.0.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jasper-runtime-5.5.23.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-logging-1.1.3.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/hamcrest-core-1.3.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jetty-util-6.1.26.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/zookeeper-3.4.6.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jersey-core-1.9.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/xmlenc-0.52.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/asm-3.2.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jets3t-0.9.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/log4j-1.2.17.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-api-1.7.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/mockito-all-1.8.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/httpclient-4.2.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/avro-1.7.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/activation-1.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jersey-json-1.9.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jaxb-api-2.2.2.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jsr305-1.3.9.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/junit-4.11.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/hadoop-auth-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/hadoop-annotations-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jackson-xc-1.9.13.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jetty-6.1.26.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jsp-api-2.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/paranamer-2.3.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-lang-2.6.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/curator-recipes-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/netty-3.6.2.Final.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-collections-3.2.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-compress-1.4.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/httpcore-4.2.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/gson-2.2.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-io-2.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-digester-1.8.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/protobuf-java-2.5.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/curator-client-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/stax-api-1.0-2.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jasper-compiler-5.5.23.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/xz-1.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/api-util-1.0.0-M20.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-httpclient-3.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/
home/hadoop/hadoop/share/hadoop/common/lib/jersey-server-1.9.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jettison-1.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/jsch-0.1.42.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-el-1.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/curator-framework-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-net-3.1.jar:
/home/hadoop/hadoop/share/hadoop/common/lib/commons-math3-3.1.1.jar:
/home/hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/hadoop-nfs-2.6.0.jar:
/home/hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0-tests.jar:
/home/hadoop/hadoop/share/hadoop/hdfs:/home/hadoop/hadoop/share/hadoop/hdfs/lib/htrace-core-3.0.4.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jasper-runtime-5.5.23.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jetty-util-6.1.26.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jersey-core-1.9.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/xmlenc-0.52.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/guava-11.0.2.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/asm-3.2.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/servlet-api-2.5.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jsr305-1.3.9.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jetty-6.1.26.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jsp-api-2.1.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/xml-apis-1.3.04.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/netty-3.6.2.Final.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-io-2.4.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/home/hadoop/hadoop/share/hadoop/hdfs/lib/xercesImpl-2.9.1.jar:
/home/hadoop/hadoop/share/hadoop/hdfs/lib/commons-codec-1.4.jar:/home/hadoop/hadoop/share/ha

ETC 

Step 6. Start the Hadoop Cluster

Let's start your Hadoop cluster using the scripts provided by Hadoop. Just navigate to your Hadoop directory and execute the scripts one by one.

$ cd $HADOOP_HOME

Now run start-dfs.sh script.

[hadoop@rklick01 hadoop]$ sbin/start-dfs.sh

Now run start-yarn.sh script.

[hadoop@rklick01 hadoop]$ sbin/start-yarn.sh

Step 7. Access Hadoop Services in the Browser

The Hadoop NameNode starts on port 50070 by default. Access your server on port 50070 in your favorite web browser.

http://24.111.123.456:50070/ 

Now access port 8088 to get information about the cluster and all applications:

http://24.111.123.456:8088/ 

Access port 50090 for getting details about secondary namenode.

http://24.111.123.456:50090/ 

Access port 50075 to get details about DataNode

http://24.111.123.456:50075/ 

EXTRA POINTS

1. How to change Hadoop user password

[root@rklick01 ~]# passwd hadoop
Changing password for user hadoop.
New password: 
BAD PASSWORD: it is based on a dictionary word
BAD PASSWORD: is too simple
Retype new password: 
passwd: all authentication tokens updated successfully.

2. Extra point: setting up again after deleting the hadoop user

Delete the hadoopdata folder before formatting the Namenode.

i.e., before running this command:

[hadoop@rklick01 hadoop]$ hdfs namenode -format

you should delete the hadoopdata folder.

Here is the line that works for building HDFS paths:

val hPath = s"hdfs://$host:$port$path/$filename.$format"
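
For example (a hedged sketch: the host, port, path and file name below are placeholders, and a SparkSession named spark is assumed, as in the Spark posts above), such a path can be passed straight to Spark's reader:

val host = "rklick01"
val port = "9000"
val path = "/user/hadoop/data"
val filename = "team"
val format = "csv"

val hPath = s"hdfs://$host:$port$path/$filename.$format"
val df = spark.read.option("header", "true").csv(hPath)   // read the file from HDFS
df.show()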

We would look at how we can make this more useful and grow it, adding more content to it over time. If you have any suggestions, feel free to share them :) Stay tuned.

Data ingestion from Google spreadsheet to Elasticsearch

In this blog we elaborate on how to ingest data from a Google spreadsheet into Elasticsearch.

There are a few steps to ingest data from a Google spreadsheet into Elasticsearch. Please follow the steps below:

Step 1: Log in to your Google account.

Step 2: Open the spreadsheet and follow the steps below.

Open the spreadsheet, click on Add-ons, and type elasticsearch in the search box. You would see the screen below.

(Screenshot: searching the add-ons store for elasticsearch)

 

Now click to add the elasticsearch plugin. After adding it, you have to give it permission. After giving permission, the elasticsearch plugin is added to your account.

 

Step 3: Add the elasticsearch plugin.

Now click on Add-ons; you would see the screen below.

(Screenshot: the Add-ons menu)

 

Step 4: Fill in the cluster information.

Click on Send to cluster. You would now see the screen below.

(Screenshot: host, port and credentials form)

Here, on the right-hand side, you have to type the host and port along with the username and password.

Step 5: Test the connection.

Test the connection with Elasticsearch: after filling in everything, click on Test. You would see the message "Successfully connected to your cluster". Click Save and then click Edit Data Details.

Step 6: Edit details.

After clicking Edit Data Details, select the id column and type the index name and type name into which you want to ingest this spreadsheet data. You would see the screen below.

(Screenshot: Edit Data Details form)

 

Step 7: Push to cluster.

After filling in everything, click on Push to Cluster. You would see the screen below.

(Screenshot: successful ingestion into Elasticsearch)

 

After pushing the data into the cluster, you would see the message "Success! The data is accessible here".

Now click the here link in the message you receive to see your ingested data in ES.

 

Kafka & ZooKeeper | Multi Node Cluster Setup


In this blog we explain the setup of a Kafka & ZooKeeper multi-node cluster in a distributed environment.

What is Apache Kafka?

Kafka is a high-throughput distributed messaging system designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.

What is ZooKeeper?

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

Learn more about ZooKeeper on the ZooKeeper Wiki.

Prerequisites

  1. Install Java if you do not have it already. You can get it from here
  2. Kafka Binary files : http://kafka.apache.org/downloads.html

Installation

  • Now, first download the Kafka tarball or binaries on all your instances and extract them:
$ tar -xzvf kafka_2.11-0.9.0.1.tgz
$ mv kafka_2.11-0.9.0.1 kafka
  • On all the instances, only two properties files need to be changed, i.e. zookeeper.properties & server.properties

Let's start by editing "zookeeper.properties" on all the instances:

$ vi ~/kafka/config/zookeeper.properties
# The number of milliseconds of each tick
tickTime=2000
 
# The number of ticks that the initial synchronization phase can take
initLimit=10
 
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5

# zoo servers
server.1=x.x.x.x:2888:3888
server.2=x.x.x.x:2888:3888
server.3=x.x.x.x:2888:3888
#add here more servers if you want

Now edit "server.properties" on all instances and update the following:

$ vi ~/kafka/config/server.properties
broker.id=1 //Increase by one as per node count
host.name=x.x.x.x //Current node IP
zookeeper.connect=x.x.x.x:2181,x.x.x.x:2181,x.x.x.x:2181
  • After this, go to /tmp on every instance and create the following:
$ cd /tmp/
$ mkdir zookeeper #Zookeeper temp dir
$ cd zookeeper
$ touch myid  #Zookeeper temp file
$ echo '1' >> myid #Add Server ID for Respective Instances i.e. "server.1 and server.2 etc"
  • Now that everything is done, we need to start the ZooKeeper and Kafka servers on all instances:

$ bin/zookeeper-server-start.sh ~/kafka/config/zookeeper.properties

$ bin/kafka-server-start.sh ~/kafka/config/server.properties
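
To smoke-test the cluster you could, for example, publish a message from Scala with the kafka-clients library (a hedged sketch: the broker address and the topic name are placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "x.x.x.x:9092")   // any broker of the cluster
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("test-topic", "hello from the multi-node cluster"))
producer.close()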

We would look at how we can provide more useful tutorials and grow this one, adding more content to it over time. If you have any suggestions, feel free to share them 🙂 Stay tuned.