Tutorial: Quick Overview of Spark 1.6 Core Functionality

In this blog we discuss Spark 1.6 core functionality and provide a quick introduction to using Spark. We demonstrate the basic functionality of RDDs, and later the Spark SQL and DataFrame APIs. We have tried to cover the basics of Spark 1.6 core functionality and its programming contexts.

Introduction to Apache Spark

Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. It is a cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. Apache Spark is a lightning-fast cluster computing technology, designed for performing general data analytics on a distributed computing cluster such as Hadoop. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application compared to MapReduce. It runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS), process structured data in Hive, and handle streaming data from HDFS, Flume, Kafka, and Twitter. Continue reading


ElasticSearch Basic

Elasticsearch:

 


ElasticSearch can be described as follows:

  1. A distributed real-time storage system.
  2. Every field is indexed and searchable.
  3. A distributed search engine with real-time analytics.
  4. Capable of scaling to hundreds of servers and petabytes of structured and unstructured data.

And it packages up all this functionality into a standalone server that your application can talk to via a simple RESTful API, using a web client from your favorite programming language, or even from the command line.

Installing ElasticSearch :

Step 1 — Installing Java

sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get install openjdk-8-jdk
(or, to use Oracle Java from the PPA added above: sudo apt-get -y install oracle-java8-installer)

To verify your JRE is installed and can be used, run the command:

java -version

Step 2 — Downloading and Installing ElasticSearch

ElasticSearch can be downloaded directly from elastic.co in zip, tar.gz, deb, or rpm packages. For Ubuntu, it’s best to use the deb (Debian) package, which will install everything you need to run ElasticSearch.


wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.1.0/elasticsearch-2.1.0.deb

Then install it in the usual Ubuntu way with the dpkg command like this:


sudo dpkg -i elasticsearch-2.1.0.deb

Elasticsearch is now installed.

ElasticSearch Configuration :

The Elasticsearch configuration files are in the /etc/elasticsearch directory (when installed from the deb package). There are two files:

elasticsearch.yml :

Configures the Elasticsearch server settings. This is where all options are stored.

logging.yml :

Contains only the logging configuration. In the beginning, you can leave all the default logging options.

Edit the configuration file like this :

(a) How to change the network host :

Open the elasticsearch.yml file.

Change the network host to localhost like this :


# network.host: 192.168.0.1 (Before)
network.host: localhost (After Changing)

This restricts outside access to the Elasticsearch instance, so outside users cannot read your data or shut down your Elasticsearch cluster through the HTTP API.

(b) How to change the cluster name :

Change the cluster name according to your project. Here, the cluster name is akshay_es_blog :

cluster.name: akshay_es_blog

 

How to run Elasticsearch :

Elasticsearch is now ready to run. With a deb install you can start it as a service (sudo service elasticsearch start), or run it in the foreground from the installation directory (/usr/share/elasticsearch) with this:


./bin/elasticsearch

How to test that Elasticsearch is running :

Open another terminal window and run the following:

curl 'http://localhost:9200/?pretty'

You should see a response like this:


{
  "name" : "Mad Jack",
  "cluster_name" : "akshay_es_blog",
  "version" : {
    "number" : "2.1.0",
    "build_hash" : "72cd1f1a3eee09505e036106146dc1949dc5dc87",
    "build_timestamp" : "2015-11-18T22:40:03Z",
    "build_snapshot" : false,
    "lucene_version" : "5.3.1"
  },
  "tagline" : "You Know, for Search"
}

This means that your Elasticsearch cluster is up and running, and we can start experimenting with it.

Key Points :

We will discuss these points one by one :

  • What is a Node ?
  • How to shut down ES ?
  • What is the Node Client ?
  • What is the Transport Client ?
  • What is an HTTP method or verb ?
  • Example of a complete Elasticsearch request .
  • CRUD operations using Elasticsearch.
  • Searching using Elasticsearch.

What is a Node :

A node is a running instance of Elasticsearch. A cluster is a group
of nodes with the same cluster.name that are working together
to share data and to provide failover and scale, although a single
node can form a cluster all by itself.
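To see the cluster name, its status, and the number of nodes currently in the cluster, you can query the cluster health API (a minimal check, assuming Elasticsearch is running on localhost:9200 as set up above):

```shell
# Ask the local node for the overall cluster health,
# including cluster name, status (green/yellow/red), and node count
curl 'http://localhost:9200/_cluster/health?pretty'
```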

Shutdown ES :

When Elasticsearch is running in the foreground, you can stop it by pressing Ctrl-C. Otherwise, in Elasticsearch 1.x you could shut it down with the shutdown API:


curl -XPOST 'http://localhost:9200/_shutdown'

Note that the _shutdown API was removed in Elasticsearch 2.0, so for the 2.1.0 version installed here, stop the process directly instead (for example with sudo service elasticsearch stop).

Talking to Elasticsearch:

How you talk to Elasticsearch depends on whether you are using Java.

Java API :

If you are using Java, Elasticsearch comes with two built-in clients that you can use in your code:

Node client:

The node client joins a local cluster as a non-data node. In other words, it doesn’t hold any data itself, but it knows what data lives on which node in the cluster, and can forward requests directly to the correct node.

Transport client :

The lighter-weight transport client can be used to send requests to a remote cluster. It doesn’t join the cluster itself, but simply forwards requests to a node in the cluster.Both Java clients talk to the cluster over port 9300, using the native Elasticsearch transport protocol. The nodes in the cluster also communicate with each other over port 9300. If this port is not open, your nodes will not be able to form a cluster.

RESTful API with JSON over HTTP :

All other languages can communicate with Elasticsearch over port 9200 using a RESTful API, accessible with your favorite web client. In fact, as you have seen, you can even talk to Elasticsearch from the command line by using the curl command.

HTTP method or Verb :

The appropriate HTTP method or verb: GET , POST , PUT , HEAD , or DELETE .
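For example, a HEAD request is a lightweight way to check whether a document exists without fetching its body (a sketch, using the rklick index that is created later in this post):

```shell
# HEAD returns only headers: HTTP status 200 if the document exists, 404 if not
curl -i -XHEAD 'http://localhost:9200/rklick/user_details/1'
```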

 

Complete Elastic Search Request with all Component :

A request to Elasticsearch consists of the same parts as any HTTP request:

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

The parts marked with < > above are:

VERB

The appropriate HTTP method or verb: GET , POST , PUT , HEAD , or DELETE .

PROTOCOL

Either http or https (if you have an https proxy in front of Elasticsearch.)

HOST

The hostname of any node in your Elasticsearch cluster, or localhost for a node on your local machine.

PORT

The port running the Elasticsearch HTTP service, which defaults to 9200 .

QUERY_STRING

Any optional query-string parameters (for example ?pretty will pretty-print the
JSON response to make it easier to read.)

BODY

A JSON-encoded request body (if the request needs one.)

For instance, to count the number of documents in the cluster, we could use this:


curl -XGET 'http://localhost:9200/_count?pretty' -d '
{
  "query": {
    "match_all": {}
  }
}'

Elasticsearch returns an HTTP status code like 200 OK and (except for HEAD requests)
a JSON-encoded response body. The preceding curl request would respond with a
JSON body like the following:

{
  "count" : 1562,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  }
}

Elasticsearch is document oriented, meaning that it stores entire objects or documents.
It not only stores them, but also indexes the contents of each document in order to
make them searchable. In Elasticsearch, you index, search, sort, and filter documents
—not rows of columnar data. This is a fundamentally different way of thinking about
data and is one of the reasons Elasticsearch can perform complex full-text search.

JSON

Elasticsearch uses JavaScript Object Notation, or JSON, as the serialization format for
documents. JSON serialization is supported by most programming languages, and
has become the standard format used by the NoSQL movement. It is simple, concise,
and easy to read.

NOTE :

Relational DB ⇒ Databases ⇒ Tables ⇒ Rows⇒ Columns
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields

An Elasticsearch cluster can contain multiple indices (databases), which in turn contain multiple types (tables). These types hold multiple documents (rows), and each document has multiple fields (columns).
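To see the "databases" (indices) in your cluster along with their document counts, you can use the cat API (a minimal sketch, assuming Elasticsearch is running locally):

```shell
# List all indices; ?v adds a header row showing
# health, status, index name, document count, store size, etc.
curl 'http://localhost:9200/_cat/indices?v'
```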

CRUD Operation using Elasticsearch :

Indexing user details :

The act of storing data in Elasticsearch is called indexing. Before indexing a document, we need to decide where to store it.

To store user details, we need a type (i.e., a ‘table name’ in RDBMS terms) and an index (a ‘database name’ in RDBMS terms).

So first we decide on the index name and the type name. Note that index names must be lowercase:

index name => rklick

type => user_details

The following command stores user details in the user_details type of the rklick index :


curl -X PUT 'http://localhost:9200/rklick/user_details/1' -d  '{
"first_name" : "Akshay",
"last_name" : "Saxena",
"age" :24,
"address" :"MG road,Gurgaon"
}'

 

Notice that the path /rklick/user_details/1 contains three pieces of information:
rklick => the index name
user_details => the type name
1 => the ID of this particular user
The request body (the JSON document) contains all the information about this user. His name is Akshay, he is 24, and he lives on MG Road, Gurgaon.

We can add the user details for Himanshu to rklick like this:


curl -X PUT 'http://localhost:9200/rklick/user_details/2' -d  '{
"first_name" : "Himanshu",
"last_name" : "Saxena",
"age" :24,
"address" :"MG road,Gurgaon"
}'

Another user is Anand; we can add his information like this :


curl -X PUT 'http://localhost:9200/rklick/user_details/3' -d  '{
"first_name" : "Anand",
"last_name" : "Kumar",
"age" :26,
"address" :"Chatterpur,Gurgaon"
}'
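After indexing the three users above, we can verify how many documents the user_details type holds (a quick check, assuming all three PUT requests succeeded; newly indexed documents become visible to search within about one second, once the index refreshes):

```shell
# Count all documents in the user_details type of the rklick index
curl -XGET 'http://localhost:9200/rklick/user_details/_count?pretty' -d '
{
  "query": {
    "match_all": {}
  }
}'
```

Once the index has refreshed, the response should report a count of 3.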

Retrieving a Document (User) :

We can get a user’s data by ID with a GET request like this :

curl -X GET 'http://localhost:9200/rklick/user_details/1?pretty'

Now we get a response like this :

{
  "_index" : "rklick",
  "_type" : "user_details",
  "_id" : "1",
  "_version" : 3,
  "found" : true,
  "_source":{
"first_name" : "Akshay",
"last_name" : "Saxena",
"age" :24,
"address" :"MG road,Gurgaon"
}
}
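If you only want the document body without the metadata fields (_index, _type, _version, and so on), you can ask for the _source endpoint directly (a minimal sketch against the same document):

```shell
# Fetch only the stored JSON document, without the Elasticsearch metadata wrapper
curl -XGET 'http://localhost:9200/rklick/user_details/1/_source?pretty'
```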

Searching Using Domain-Specific Language:

a) Searching using last name :

Note that a term query is not analyzed, so we search for the lowercase token "kumar", which is how the analyzed last_name field is stored in the index.


curl -X GET 'http://localhost:9200/rklick/user_details/_search?pretty' -d '
{
  "query" : {
    "term" : {
      "last_name" : "kumar"
    }
  }
}'

Result would be like this :


{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "rklick",
      "_type" : "user_details",
      "_id" : "3",
      "_score" : 0.30685282,
      "_source":{
"first_name" : "Anand",
"last_name" : "Kumar",
"age" :26,
"address" :"Chatterpur,Gurgaon"
}
    } ]
  }
}

b) Searching using text :

curl -X GET 'http://localhost:9200/rklick/user_details/_search?pretty' -d '
{
  "query" : {
    "match" : {
      "address" : "Chatterpur"
    }
  }
}'

Result would be like this :

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "rklick",
      "_type" : "user_details",
      "_id" : "3",
      "_score" : 0.19178301,
      "_source":{
"first_name" : "Anand",
"last_name" : "Kumar",
"age" :26,
"address" :"Chatterpur,Gurgaon"
}
    } ]
  }
}


c) Phrase search :

When we want to match an exact sequence of words, we can use a match_phrase query. For example, to match the words “MG” and “road” in that exact order, as in “MG road”:

i) Searching for the “MG road” phrase :

In this case, the query would be like this :


curl -X GET 'http://localhost:9200/rklick/user_details/_search?pretty' -d '
{
  "query" : {
    "match_phrase" : {
      "address" : "MG road"
    }
  }
}'

As we know, two users have the “MG road” sequence in their address field, so the result would be like this :

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "rklick",
      "_type" : "user_details",
      "_id" : "2",
      "_score" : 0.30685282,
      "_source":{
"first_name" : "Himanshu",
"last_name" : "Saxena",
"age" :24,
"address" :"MG road,Gurgaon"
}
    }, {
      "_index" : "rklick",
      "_type" : "user_details",
      "_id" : "1",
      "_score" : 0.30685282,
      "_source":{
"first_name" : "Akshay",
"last_name" : "Saxena",
"age" :24,
"address" :"MG road,Gurgaon"
}
    } ]
  }
}


ii) Searching for the “road MG” phrase :

In this case, the query would be like this :


curl -X GET 'http://localhost:9200/rklick/user_details/_search?pretty' -d '
{
  "query" : {
    "match_phrase" : {
      "address" : "road MG"
    }
  }
}'



As we know, no user has the “road MG” sequence in the address field, so the result would be like this :


{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
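If you do want “road MG” to match despite the transposed words, match_phrase accepts a slop parameter that allows the terms to sit further apart or out of order. Swapping two adjacent terms costs two position moves, so a slop of 2 should make the reversed phrase match here (the slop value is our assumption for this data, not from the original queries above):

```shell
# Same phrase query, but slop 2 lets the two adjacent terms swap positions,
# so "road MG" can now match addresses containing "MG road"
curl -X GET 'http://localhost:9200/rklick/user_details/_search?pretty' -d '
{
  "query" : {
    "match_phrase" : {
      "address" : {
        "query" : "road MG",
        "slop"  : 2
      }
    }
  }
}'
```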


We are continuously working to create more useful tutorials. If you have any suggestions, feel free to share them with us 🙂 Stay tuned. …

MICROSERVICES : BASIC

Modern application architectures require a slew of siloed pieces working together in harmony, based on a “plug and play” (PnP) architecture. Enterprises lean toward a fluid, functional, and scalable application paradigm using the Microservice Architecture (MSA). MSA implementations help break large monolithic application tiers into smaller, manageable services. Each microservice has its own dedicated, non-blocking process, communicates over REST, and a system can contain a mixed bag of languages, with each service performing its own allocated tasks. One service’s failure has no impact on the others.

Micro-Services Characteristics:

  1. Developed and deployed independently of the others.
  2. Communicate using REST.
  3. Each service can be written in a different programming language.
  4. Each service can use a different data storage technology.
  5. Ideally, each service should perform only one task.

Continue reading

Dribbling with Spark 1.6 GraphX Components

GraphX provides distributed in-memory graph computing. The GraphX API enables users to view data both as graphs and as collections (i.e., RDDs) without data movement or duplication.

In this example, we process a small social network with users as vertices and relations between users as edges, and find out these details:

  • Which are the most important users in the graph
  • All groups of three users where every two users are connected
  • Pairs of users with a connection in each direction between them

Continue reading

Play 2.4.x & RethinkDB: Classic CRUD application backed by RethinkDB

In this blog we have created a classic CRUD application using Play 2.4.x, Scala, and RethinkDB, where Scala meets object-oriented concepts in a functional way. Play is a high-velocity web framework for Java & Scala, and RethinkDB is an open-source, scalable database that makes building realtime apps dramatically easier.


Continue reading