tfc - tim

What follows is a set of mini courses/tutorials that the STC staff have asked me to present to them. They are listed in order of priority.

# SMS - how does it all work?

The SMS system is split into five phases:

* acquisition
* converting
* aggregating
* loading
* mining

Each phase requires data from the previous phase. For example, unless the acquisition process gets data there is nothing for the converting process to convert.

These processes all share one thing in common: a **message queue**, specifically RabbitMQ.

We will discuss RabbitMQ in a later tutorial, but for the moment you just need to understand that the MQ (message queue) is how process synchronisation is achieved.
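
As a minimal, hedged illustration of queue-based synchronisation (using the generic amqp-tools command line clients and a made-up queue name, not the real SMS scripts), one process publishes a file name and another blocks until it can consume it:

```bash
# Producer: announce that a new file is ready (queue name and URL are placeholders)
amqp-publish --url=amqp://guest:guest@localhost -r DemoQ -b "/data/new/file1.utftxt"

# Consumer: wait for a message, then hand its body to a command (here just cat)
amqp-consume --url=amqp://guest:guest@localhost -q DemoQ cat
```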

Let us look at each of those phases in some more detail.

# Acquisition

Acquisition means "getting data"; in this particular case that means getting five different types of data from a server which is owned and controlled by another department.

Because we do not have control over the server we are rather limited in how we may connect to it.

The code that carries this out is actually a crontab job

```bash
crontab -l | grep Process
```

You should see:

```bash
1,10,20,30,40,50 * * * * source /home/biadmin/.myenv; cd /home/biadmin/Injest;./Process.sh
```

This is the command that is executed every 10 minutes.

## Process.sh

We need to look at this file...

```bash
cd /home/biadmin/Injest
cat Process.sh
```

Here you can see that **Process.sh** itself splits into three phases:

* 1_GetDataFiles.sh
* 2_Gen_File_List.sh
* 3_AddtoQ.sh

I would hope that no further explanation of what each phase does is necessary.

### 1_GetDataFiles.sh

This process is normally called without ANY parameters, in which case the script calculates the current year and the current month. **This is VERY important to note.**

With the current year and month, the script then logs into 5 FTP accounts and does a **wget** on the target directory.

#### Why wget for year and month?

The data delivery to the machine downstairs seems to be rather inconsistent, and if I just request data for today or this hour we will sometimes miss data. However, by requesting data for the whole year and month combination, I do not care how reliable the data service downstairs is.

You should note that wget only gets **new or missing files**, and wget puts the results into a log file.
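
As a hedged sketch of what **1_GetDataFiles.sh** does for one of the five accounts (the host name, credentials and directory layout below are illustrative, not the real ones):

```bash
#!/bin/bash
# Default to the current year and month unless they are passed as parameters
YEAR=${1:-$(date +%Y)}
MONTH=${2:-$(date +%m)}

# Mirror the year/month directory from one FTP account.
#  -nc : only fetch new or missing files, never re-download existing ones
#  -r  : recurse into the remote directory
#  -nd : do not recreate the remote directory tree locally
#  -a  : append wget's report to a log file for 2_Gen_File_List.sh
wget -nc -r -nd -a "wget_${YEAR}_${MONTH}.log" \
     "ftp://user:password@feedserver/data/${YEAR}/${MONTH}/"
```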

### 2_Gen_File_List.sh

This process looks at the log file that the previous wget produced and simply generates statistics for **ganglia**. In doing so, a file called **LatestFiles.txt** is generated; this is used in part three of this process.

As we will come back and look at ganglia again later, I will not discuss this module any further.
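
For reference, the basic idea can be sketched like this (the log file name, metric name and log format are assumptions about a typical wget log, not the real script):

```bash
#!/bin/bash
LOG="wget_$(date +%Y)_$(date +%m).log"

# Completed downloads appear in the wget log as lines ending "... 'file' saved [bytes]"
grep -o "'[^']*' saved" "$LOG" | cut -d"'" -f2 > LatestFiles.txt

# Push a simple "new files this run" count to Ganglia
gmetric --name sms_new_files --type uint32 --value "$(wc -l < LatestFiles.txt)"
```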

### 3_AddToQ.sh

This is a very simple module

It "greps" for each of the input files - does some filtering and then loads in parallel to the MQ.


## Acquisition Overview

* wget the data and log it
* generate statistics from the logs
* using LatestFiles.txt, load new files into specific queues


# Converting

Converting is a complex and tricky operation; however, it can be simplified into two parts:

* Perl converting
* ASN1 converting

Because both of these technologies require good programming skills and an understanding of the underlying technology, I cannot cover them in a one-hour briefing.

The converting is called by a crontab job: **ClusterStart.sh**.

## ClusterStart.sh

ClusterStart.sh runs the converting processes on 10 machines at the same time.

### How do you run on 10 machines?

By using **ssh** you call **ProcessQ.sh** on each machine.
The MQ is available via an IP address.
The data is hosted on an **NFS share** (the NFS share has to be mounted on all 10 machines).

*This is similar to the HDFS concept.*
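
A minimal sketch of that fan-out (the host names and the ProcessQ.sh path are placeholders):

```bash
#!/bin/bash
# Start one consumer per machine; the MQ address and the NFS share are the
# same everywhere, so ProcessQ.sh behaves identically on every host.
for host in node01 node02 node03 node04 node05 node06 node07 node08 node09 node10
do
    ssh "$host" "/home/May2014/code/ProcessQ.sh" &
done
wait   # block until every remote ProcessQ.sh has finished
```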

### What does ProcessQ.sh do?

It calls **2_Cons.py**, a process that *consumes* a queue.

### What does 2_Cons.py do?

This is the Perl version of **ReadQ** (a Python script some of you may know).

It reads a queue, and if the queue name is "OM" it calls a specific OM process; otherwise it calls a *Perl converting* process (discussed later).
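
The dispatch is roughly this shape (shown as a shell sketch rather than the actual Python, with placeholder names for the two converter scripts):

```bash
#!/bin/bash
# QNAME and FILE come from the consumed message in the real 2_Cons.py
QNAME=$1        # name of the queue being consumed, e.g. "OM"
FILE=$2         # path of the file referenced by the message

if [ "$QNAME" = "OM" ]; then
    ./ConvertOm.sh "$FILE"       # the OM-specific (ASN1) converter - placeholder name
else
    ./ConvertPerl.sh "$FILE"     # the Perl converting process - placeholder name
fi
```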

### What is output?

Both the *Om* and *Perl converting* processes process the file that is supplied from the message queue and copy the output back to **/home/May2014/Run**...

Each output file has a name built from the following components (see the sketch below):

* Node
* Year
* Month
* Day
* Hour
* Min
* Millisecond
* .utftxt

The file is left in this directory for the next phase.
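
For illustration only, a name built from those components might be assembled like this (the exact separators and the node naming are assumptions):

```bash
# Build an output name such as node7_2014_05_21_13_40_123.utftxt
NODE="node7"
STAMP=$(date +%Y_%m_%d_%H_%M)
MS=$(( $(date +%s%3N) % 1000 ))     # millisecond part (GNU date)
echo "${NODE}_${STAMP}_${MS}.utftxt"
```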

## Data Formats

The SMS files come in a number of different formats.

## ASN1
There is only one input format that uses ASN1, and whilst ASN1 is difficult to understand initially, once it starts to work it is very reliable, because the output is either perfect or rubbish.

This is only used by "Om" - **not OMInt**

## Perl converting

The Perl converting process was produced over several months, as each of the encoding schemes inside the SMS format was uncovered.

This has led to the Perl code being rather over-complex.


There are two Perl data formats - Nw, and **all** the "Int" types.

## SMS data types

There are four SMS datatypes, listed in order of processing difficulty:

* ASCII
* UTF8
* UCS-2
* Binary

We are currently able to process the first three.

Before you try and do any work with the converting process you should understand how these types work.
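
A quick way to see the difference between three of these encodings is to dump the same short message with iconv and xxd:

```bash
echo -n "Hello" | xxd                                # ASCII: one byte per character
echo -n "héllo" | xxd                                # UTF-8: the é becomes two bytes
echo -n "Hello" | iconv -f UTF-8 -t UCS-2BE | xxd    # UCS-2: two bytes per character
```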


# Aggregating
The aggregating process combines all the smaller files from the converting process into a larger file.

## Why?

When it comes to data loading, it is more efficient to process one large file than lots of small files.

Which of these is better?

```bash
# One connect/authenticate/disconnect per row
for a in $(seq 100)
do
   mysql -u X --password=Y -e "INSERT INTO Junk VALUES ($a);"
done
```

Or:

```bash
# One connection for all the rows
mysql -u X --password=Y -e "INSERT INTO Junk VALUES (1); INSERT INTO Junk VALUES (2);"
```

The second way will be MUCH quicker, as we only do one connect/authenticate/disconnect; the first mechanism does this for each record.

## PostProcess.sh

This aggregation process is called **PostProcess.sh**; it also runs every 10 minutes.

Of course it is not just as simple as "make the files bigger"

### Logic

* Find at most 1000 utftxt files more than 10 mins old
* Put these files into a TimeStamp file
* Find TimeStamp files modified more than 10 mins ago
* Compress TimeStamp to TimeStamp.gz
* Find TimeStamp.gz files modified more than 10 mins ago
* Move to /data/ToLoad
* Load to MQ
    * HBaseQ
    * SearchQ
    * DataMineQ
    * SMSWordWeb (Q)

This logic is there to prevent files from being copied as they are being compressed.
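
A simplified, hedged sketch of that logic (the aggregate naming, the broker URL and the exact find expressions are assumptions; the real PostProcess.sh has more checks):

```bash
#!/bin/bash
cd /home/May2014/Run || exit 1

TS="$(date +%Y%m%d%H%M).agg"        # timestamp-named aggregate (illustrative suffix)

# 1. Concatenate up to 1000 .utftxt files that are at least 10 minutes old
find . -maxdepth 1 -name '*.utftxt' -mmin +10 | head -n 1000 | xargs -r cat > "$TS"

# 2. Compress aggregates that have themselves been idle for 10 minutes
find . -maxdepth 1 -name '*.agg' -mmin +10 -exec gzip {} \;

# 3. Move settled .gz files to the load area and announce them on each queue
for f in $(find . -maxdepth 1 -name '*.agg.gz' -mmin +10)
do
    mv "$f" /data/ToLoad/
    for q in HBaseQ SearchQ DataMineQ SMSWordWeb
    do
        amqp-publish --url=amqp://guest:guest@localhost \
                     -r "$q" -b "/data/ToLoad/$(basename "$f")"
    done
done
```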

# Loading

Loading takes a file referenced by an MQ message (from HBaseQ) and then uses a Python script to push the data into HBase.

This is done using this combination:

```bash
cd /home/May2014/code
QRead.sh HBaseQ ./Load.sh
```

## Load.sh

Load.sh expects one parameter, the load file. It then:

* expands the GZ file
* runs LoadOm.py
* removes the temporary file

Please note: despite the name LoadOm.py, this is the ONLY loading format.
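
A minimal sketch of Load.sh's job (the temporary file handling and the Python invocation are assumptions):

```bash
#!/bin/bash
GZFILE=$1                         # the .gz aggregate handed over by QRead.sh
TMP=$(mktemp /tmp/load.XXXXXX)

gunzip -c "$GZFILE" > "$TMP"      # expand the GZ file
python LoadOm.py "$TMP"           # push the rows into HBase
rm -f "$TMP"                      # remove the temporary file
```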

## LoadOm.py

A Python script using HappyBase. It:

* generates the CurrentShardName
* checks the table exists
* loads the file into the table
* generates some Ganglia stats

You need some Python and HBase knowledge to modify this code.

# Mining

The mining phase tries to find things like:

* Car Reg
* Bank Details
* Missed Calls
* Salary

## MineData.sh

This code is in a different area; please look at **/home/biadmin/TkbIngest**.

Using several combinations of:

* grep
* regex
* Perl
* sed

the required data can be extracted.
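
As an illustration of the approach only (the input directory is a placeholder and the pattern is a rough UK-style car registration regex, not the production rule set):

```bash
#!/bin/bash
INPUT_DIR=${1:-/tmp/mine}         # placeholder for wherever the decompressed text lives

# Pull anything that looks like a car registration, tagged with its source file
grep -EHo '[A-Z]{2}[0-9]{2} ?[A-Z]{3}' "$INPUT_DIR"/*.utftxt > carreg_hits.txt

# Similar grep/sed/perl passes run for bank details, missed calls and salary
wc -l < carreg_hits.txt           # a simple count that can feed a Ganglia stat
```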

These data files are then:

* loaded as a Fat Row
* used to generate Ganglia stats