Back Loading

To back load messages into the EQ system you must first understand how messages are loaded in the normal manner. I cannot stress this point strongly enough!

So I am now going to assume that you are familiar with the normal loading procedure.

You will recall that the normal loading procedure has five phases to it.

  • acquisition
  • converting
  • aggregating
  • loading
  • mining

For back loading data we need to slightly modify four of these processes.

  • acquisition
  • converting
  • aggregating
  • loading

I shall therefore be showing you how to modify each process.

Before we start, however, we must disable the "Real time" message processing. The easiest way to do this is to comment out the relevant lines in the crontab file.
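Something like this (the cron entry shown is just an illustration - check the real crontab for the actual lines):

   crontab -l      # see what is currently scheduled
   crontab -e      # then comment out the real-time lines, e.g.
   # */5 * * * * /home/biadmin/Injest/Process.sh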

Acquisition

You will recall that the acquisition process connects to a server run by another area of the company. You will hopefully also remember that in normal mode the main script generates today's year and month by default.

You will also hopefully remember that the acquisition phase has 3 parts to it (check Process.sh in /home/biadmin/Injest):

  • 1_GetDataFiles.sh
  • 2_Gen_File_List.sh
  • 3_AddtoQ.sh

We need to do these steps manually - but as back loading is not a common operation, I do not see an issue with this.

1_GetDataFiles.sh

We need to call this script in a special manner.

The script was written so that if two parameters (year and month) are passed to it, it uses those parameters instead of generating today's values.

So to load January 2011 you would call the script like this:

   cd /home/biadmin/Injest
   ./1_GetDataFiles.sh 2011 01

NB - the value 01 is not the same as 1! You are talking to a rather dumb Windows FTP server :)
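If you ever wrap this call in a script of your own, zero-pad the month rather than passing a bare number:

   printf "%02d\n" 1     # prints 01 - the form the FTP server expects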

Because all of the files for that month should already exist, this command will take several hours.

2_Gen_File_List.sh

You do not have to do anything special here - just run the script.

It generates LatestFiles.txt.
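For example (the wc check is my own quick sanity test, not part of the official procedure):

   ./2_Gen_File_List.sh
   wc -l LatestFiles.txt     # roughly one line per data file found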

3_AddtoQ.sh

Again, you just run this script - it will add files, lots and lots of files, to the queues.

Convert Phase

When you see data starting to appear in all the queues, you can start the next part.
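How you watch the queues depends on how they are stored - assuming they sit under the Injest directory with names starting with Q (an assumption, adjust to your layout), something like this will do:

   watch -n 30 'ls /home/biadmin/Injest/Q* | wc -l'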

This again does not need to be modified, so just run it:

ClusterStart.sh

You may want to run this command 2 times - and get 30+ threads running per node. Do not go stupid, however...
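A sketch of what I mean (the grep pattern is an assumption - match whatever your converter process is actually called):

   ./ClusterStart.sh
   ./ClusterStart.sh                  # a second run adds more worker threads
   ps -eLf | grep -c '[C]onvert'      # rough thread count on this node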

Aggregating

If you have run ClusterStart a few times, then you will need to run the Aggregator process in a different manner.

Normal Mode

You just run it from the command line:

PostProcess.sh

That processes files using a 10 minute rule - it only picks up utftxt files that are at least 10 minutes old.

BackLoad Mode

You need to process/compact files as fast as possible - you do not have enough disk space (500 GB will fill up very fast) - because you have 200+ cores all firing data at the HInjest disk.

for a in $(seq 1000)
do
  ./PostProcess.sh 1
done

Which means: run PostProcess 1000 times with the value 1.

What does 1 mean? The value 1 means only wait for the utftxt files to be 1 minute old (instead of the normal 10 minutes).

This will still compress, gzip and move the files, and also load the Final queues.
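While the loop runs you can keep an eye on the utftxt backlog - the directory here is an assumption based on the paths above:

   ls /home/biadmin/Injest/*.utftxt 2>/dev/null | wc -l     # should stay low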

Loading

The loading script is called LoadOm.py - the easiest way to load into a specific table is to HACK the script and hard-code the table name.
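Whatever you hard-code, take a pristine copy first so that getting back to normal later is trivial (the .orig suffix is just my convention):

   cp LoadOm.py LoadOm.py.orig     # revert later with: cp LoadOm.py.orig LoadOm.py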

You obviously need to make sure that the HBase table exists.
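You can check from the HBase shell - the table name here is only an example, use your real one:

   echo "list" | hbase shell | grep my_backload_table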

Again, you may want to run this as a loop - NOT in parallel.

cd /home/May2014/code
for a in $(seq 30)
do
    QRead.sh HBaseQ ./Load.sh
done

You can see the IO/sec on the screen and check that the number of entries in the HBaseQ is dropping.
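A quick hedged way to watch it drain - this assumes HBaseQ is a queue file with one entry per line, so adjust to however your queues are actually stored:

   watch -n 30 'wc -l HBaseQ'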

All over?

When your back load is all done - and you should easily be able to process 6 months of data per day - you need to revert the system to regular operations:

  • Make sure LoadOm.py is back to normal (see below)
  • Un-comment the crontab jobs
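If you kept the pristine copy suggested in the Loading section, reverting is just:

   cp LoadOm.py.orig LoadOm.py
   crontab -e      # and un-comment the real-time lines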

If this back loading has taken you several days, there could be a LARGE amount of data waiting - so just watch the disk space and process levels.
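For example (paths and process names as used earlier in this document):

   watch -n 60 'df -h /home/biadmin; ps -ef | grep -c "[P]ostProcess"'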

Very quickly, however, the system will catch up.

If there are still several HBaseQ files, you can manually load them by running:

QRead.sh HBaseQ ./Load.sh