ES for Hadoop Cluster

ES Nodes

We have 4 nodes.

  • 1 Master
  • 3 Slaves

Node   IP
pem01  172.28.11.50
pes01  172.28.11.51
pes02  172.28.11.52
pes03  172.28.11.53

Accounts

root: OMAN@123
admin: OMAN@123

Software Added

RHEL Development

Using the original RHEL installation media as the yum source

  • yum groupinstall 'Development Tools'
  • yum install zlib zlib-devel

Python 3.6

Python 3.6 Built from Source

- Installed in /usr/local/bin
- A private Python environment was built on top of it
- Extra modules were installed into that private environment

Using the Private Python Environment

   Faker==0.8.6
   python-dateutil==2.6.1
   six==1.11.0
   text-unidecode==1.0    
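
A minimal sketch of how the private environment could have been created and the pinned modules installed, assuming Python 3.6's built-in venv module and a hypothetical environment path of /usr/local/pyenv:

 # create the private environment with the Python 3.6 built from source
 /usr/local/bin/python3.6 -m venv /usr/local/pyenv

 # activate it and install the pinned modules
 source /usr/local/pyenv/bin/activate
 pip install Faker==0.8.6 python-dateutil==2.6.1 six==1.11.0 text-unidecode==1.0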

Java 8

ES needs Java 8.

I pulled the latest RPM from the Oracle site and installed it like this

rpm -i jdk-8u152-linux-x64.rpm
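
To confirm the JDK is on the path

java -version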

ElasticSearch

Using the ES supplied rpm

rpm -i elasticsearch-5.6.3.rpm

Auto Start

Using the suggested start scripts

sudo systemctl daemon-reload
sudo systemctl enable elasticsearch.service
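
To start the service now and check the node responds (assuming the default localhost:9200 binding before any network.host change)

sudo systemctl start elasticsearch.service
curl http://localhost:9200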

GNU Parallel

I like GNU Parallel because it lets me make full use of multiple CPUs and cores.

This is how to install it.

 wget gnu_parallel.tar.bz2
 tar -xjf gnu_parallel.tar.bz2
 cd gnu_parallel_20171102
 ./configure
 make
 su root
 make install

 Ctrl-D

 parallel --citation
 will cite

At this point we should have GNU Parallel installed and ready to use in a script.
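
As a quick sanity check that jobs really are spread across cores, a small hypothetical example:

 # run 8 dummy jobs, 4 at a time
 seq 1 8 | parallel -j 4 'echo "processing chunk {}"'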

Test Loading

Building data

To build some test data I used my custom script called DataGen.py

./DataGen.py 10 500000

Starting from index position 10, this creates 1 million records (500000 * 2: half English, half Arabic).
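
A sketch of how the generated data could be bulk-loaded with GNU Parallel; the file names (data_*.json) and the assumption that DataGen.py emits bulk-format JSON are hypothetical:

 # push each generated bulk file into the eia index, 3 uploads at a time
 ls data_*.json | parallel -j 3 "curl -s -XPOST 'pem01:9200/eia/type1/_bulk' -H 'Content-Type: application/x-ndjson' --data-binary @{}"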

Apache

Mount the RHEL image and issue the command

yum install httpd

That's it.

ElasticSearch Head

I downloaded elasticsearch-head from GitHub and then tried to open the web page.

However, inside Firefox I was getting an error

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://127.0.0.1:9200/_all. This can be fixed by moving the resource to the same domain or enabling CORS.

We can apparently allow access by creating a .htaccess file which looks like this

Header set Access-Control-Allow-Origin "*"

This can also be placed in a Directory section of the server config file (httpd.conf usually).
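
For example, inside httpd.conf it might look like this (assuming mod_headers is enabled and the head plugin is served from the default document root):

<Directory "/var/www/html">
    Header set Access-Control-Allow-Origin "*"
</Directory>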

I think this is best fixed from the Elasticsearch side, however

cd /etc/elasticsearch
vi elasticsearch.yml

At the bottom of this file I placed

http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-headers: "X-Requested-With, Content-Type, Content-Length, X-User"

Restart Elasticsearch

service elasticsearch restart
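
A quick way to check the CORS header is now being returned (the Origin value is just an example); the response should include an access-control-allow-origin header

curl -I -H "Origin: http://pem01" http://127.0.0.1:9200/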

RHEL 7.1 Firewall Config

To see the firewall status

firewall-cmd --state

To see the current settings

 firewall-cmd --list-all-zones

To allow HTTP (port 80) and HTTPS

 firewall-cmd --permanent --add-service=http
 firewall-cmd --permanent --add-service=https
 firewall-cmd --reload

To allow ports 9200 and 9300 (Elasticsearch)

 firewall-cmd --permanent --add-port=9200/tcp
 firewall-cmd --permanent --add-port=9300/tcp
 firewall-cmd --reload
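
To confirm the services and ports are open after the reload

 firewall-cmd --list-services
 firewall-cmd --list-ports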

Elasticsearch Slave Nodes

These are the steps

  • mkdir /media/dvd
  • mount -t iso9660 file /media/dvd
  • vi /etc/yum.repos.d/media.repo (sample repo file below)
  • yum groupinstall 'Development Tools'
  • yum install zlib zlib-devel
  • rpm -i jdk-8u152-linux-x64.rpm
  • rpm -i elasticsearch-5.6.3.rpm
  • mkdir /esdata
  • mkdir /eslog
  • chown elasticsearch:elasticsearch /esdata
  • chown elasticsearch:elasticsearch /eslog
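
A sample repo file pointing yum at the mounted DVD; the exact repo id and RHEL media layout may differ, so treat this as a sketch:

[rhel-dvd]
name=RHEL 7 DVD
baseurl=file:///media/dvd
enabled=1
gpgcheck=0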

Elasticsearch Config File

The host IP address (and the node name) needs changing on each slave node

cluster.name: eq 
node.name: pes01
node.master: false
node.data: true
node.ingest: false
path.data: /esdata
path.logs: /eslog
network.host: 172.28.11.51
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-headers: "X-Requested-With, Content-Type, Content-Length, X-User"

Performance

Generally it has been excellent, with fast loading and very fast retrieval. However, at around 1.4B records I am starting to notice that things are slowing down a little.

Shard Time?

So with 1.4B records in an index called eia, I will now start to send data into a new index (eia2) with more shards.

I created the new index like this

curl -XPUT 'pem01:9200/eia2?pretty' -H 'Content-Type: application/json' --data-binary @eia.json

And the data definition (eia.json) is

{
    "settings" : {
        "number_of_shards" : 3
    },
    "mappings" : {
        "type1" : {
            "properties" : {
                "name" : { "type" : "text" },
                "from" : { "type" : "integer" },
                "to"   : { "type" : "integer" },
                "msg"  : { "type" : "text" }
            }
        }
    }
}

You will notice this is exactly the same data definition file as I created earlier - I just changed the URL in the curl command.
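
To confirm the new index exists and see how its shards are spread across the nodes, the _cat APIs are handy

curl 'pem01:9200/_cat/indices/eia2?v'
curl 'pem01:9200/_cat/shards/eia2?v'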