Text searching

Text searching is, in theory, quite easy, and you have probably been doing it for some time without really thinking about the options that you have.

What I will try to do here is present options and solutions so that you can come up with a practical framework for adding text searching.

Text files

If you have a large amount of data in text files and are running on a Linux/UNIX system, then it is essential that you learn how to use grep and egrep.

The basic syntax is as follows:

grep tim <file>

This will search for the string tim in the specified file. Please be aware that this is case-sensitive.

To find matches such as

  • Tim
  • tiM
  • TiM
  • TIM

you would slightly modify your command to be:

grep -i tim <file>
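
Since egrep has also been mentioned, it is worth knowing that it supports extended regular expressions, so an OR-style search across several terms is a one-liner. For example, to match lines containing either tim or bob in any letter case:

egrep -i 'tim|bob' <file>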

Problems with grep

Unfortunately, users are unlikely to want to learn this syntax, and you would need to give them access to wherever the text files are stored.

Solution for grep

  • develop a front end masking the complexities of grep
    • Qt
      • the directory will probably need to be accessed via NFS
      • the Qt program can search and display results on the user's desktop
    • Web
      • a single web service needs to be developed (a minimal sketch follows this list)
      • the web service should be hosted on the machine with the source files
      • no deployment software is required, just a URL
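
Here is a minimal sketch of the web option in Python, using Flask as the framework (my choice - any small web framework would do). The directory path, host, and port are placeholders; the service simply wraps grep with subprocess on the machine holding the files.

# A minimal sketch of a web front end for grep, assuming Flask.
# TEXT_DIR, host and port are placeholder values.
import subprocess
from flask import Flask, request

app = Flask(__name__)
TEXT_DIR = "/data/textfiles"  # assumed location of the source files

@app.route("/search")
def search():
    term = request.args.get("q", "")
    if not term:
        return "missing query parameter q", 400
    # -i: case-insensitive, -r: recurse through the directory.
    # Passing arguments as a list avoids shell-injection problems.
    result = subprocess.run(
        ["grep", "-i", "-r", term, TEXT_DIR],
        capture_output=True, text=True,
    )
    # grep exits 1 when nothing matches - that is not an error here
    return "<pre>" + result.stdout + "</pre>"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

The user then needs nothing more than a URL such as http://server:8080/search?q=tim (a hypothetical host and port).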

That solution sounds as though it would work well; however, it may be quite slow, and it will require the user's session to be maintained.

Off-line searching

In order for searches to continue when the user is not logged on, a batch process needs to be created. There are many ways to do this; however, building on some previous RabbitMQ discussions, I would submit the search parameters to a queue.

  • user enters the search via a web interface
  • the web interface loads the message queue (see the sketch after this list)
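
A sketch of that first part in Python, assuming the pika RabbitMQ client; the queue name "searches" and the JSON message shape are my own choices for illustration.

# Sketch of the web side loading the queue, assuming the pika client.
import json
import pika

def submit_search(user, term):
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="searches", durable=True)
    message = json.dumps({"user": user, "term": term})
    channel.basic_publish(exchange="", routing_key="searches", body=message)
    connection.close()

submit_search("tim", "some search term")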

The second part should then also be familiar to you:

  • a cron job checks the queue (a sketch follows this list)
  • if a query is waiting
    • carries out the query
    • saves the output
      • to a common folder
      • or to the user's directory (needs security privileges)
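
A matching sketch of the cron side, again assuming pika; the paths and queue name are placeholders chosen to match the producer above.

# Sketch of the cron job: drain the queue, run each query, save the output.
import json
import subprocess
import pika

OUTPUT_DIR = "/shared/search-results"  # the "common folder" (placeholder)
TEXT_DIR = "/data/textfiles"           # placeholder source directory

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="searches", durable=True)

while True:
    method, properties, body = channel.basic_get(queue="searches")
    if method is None:  # queue is empty - nothing left to do
        break
    job = json.loads(body)
    result = subprocess.run(
        ["grep", "-i", "-r", job["term"], TEXT_DIR],
        capture_output=True, text=True,
    )
    outfile = f"{OUTPUT_DIR}/{job['user']}-{method.delivery_tag}.txt"
    with open(outfile, "w") as f:
        f.write(result.stdout)
    channel.basic_ack(method.delivery_tag)

connection.close()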

This option is also easy to implement, but it will not scale very well. Users may also be worried about the security of their search results.

Big data solutions

I can think of three solutions that are quite easy to implement; however, there can easily be more than this.

Hive & HDFS

Place the text files in an HDFS directory, and then map a Hive table definition onto the text files.

This would allow you to query the data using an SQL-like query language (HiveQL).
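
As a sketch, here is what that might look like from Python using the PyHive client (my choice of client; the table name, column name, and path are placeholders):

# Sketch of the Hive approach, assuming the PyHive client.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# Map an external Hive table over the text files already in HDFS;
# each line of text becomes a single string column.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS messages (line STRING)
    LOCATION '/data/textfiles'
""")

# AND / OR / NOT searching via an SQL-like query.
cursor.execute("""
    SELECT line FROM messages
    WHERE line LIKE '%tim%' AND NOT line LIKE '%test%'
""")
for (line,) in cursor.fetchall():
    print(line)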

The good points

  • simple
  • expandable
  • uses SQL-like queries
    • would allow AND, OR, and NOT
  • output is placed in a specific directory
  • queries could again be handled by a web front end and submitted as HiveQL

The bad points

  • rather slow
  • all users would be able to see all query results, unless each user had a separate account
    • no LDAP integration
    • a hassle to administer

HBase & reverse index

If a fast search is required, then nothing will beat HBase (possibly not true).

By taking the messages and inverting their storage - a reverse index, mapping each word to the messages that contain it - a very fast lookup mechanism can be used.

However, the text that we wish to index potentially has some words which are so repetitive that they would fill up the HBase row allocation of 2 GB very quickly - I estimate within two weeks!

The solution, therefore, would be to shard the rows on a two-week basis.
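
Here is a minimal sketch of that layout in Python, assuming the happybase HBase client; the table name, column family, and row-key scheme are all my own choices for illustration.

# Sketch of a reverse index with two-week sharding, assuming happybase.
from datetime import date
import happybase

connection = happybase.Connection("localhost")
table = connection.table("word_index")  # placeholder table name

def shard(day):
    # A two-week shard id: the year plus the ISO week number halved,
    # so heavily repeated words roll over to a fresh row every
    # fortnight instead of filling one row forever.
    year, week, _ = day.isocalendar()
    return f"{year}-{week // 2:02d}"

def index_word(word, message_id, day):
    # Row key is the word plus its shard; one column per occurrence.
    row_key = f"{word}:{shard(day)}".encode()
    table.put(row_key, {f"m:{message_id}".encode(): b""})

def lookup(word, day):
    # A simple, fast query: fetch the single row for word + shard.
    row = table.row(f"{word}:{shard(day)}".encode())
    return [col.decode() for col in row.keys()]

index_word("tim", "msg-42", date(2014, 6, 1))
print(lookup("tim", date(2014, 6, 1)))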

The good points

  • good performance
  • simple queries
  • a Qt front end would be simple and quick to build

The bad points

  • limited search capability
    • no "contains" or LIKE-style matching
  • will consume lots of memory
  • it may be more complex to join the output results

Apache Spark

Apache Spark is one of the newer additions to the Big Data/Hadoop ecosystem - it offers

  • speed
  • multiple development languages
  • libraries

I am not going to spend ages trying to tell you how to do this in Spark - but it will require you to load the text files into HDFS and then, using the cluster, develop a Spark program.

Spark should be up to 100 times faster than MapReduce - but care is needed in how you develop this.

The suggested Spark job would (although there are several options here)

  • create an SQL representation of the data in HDFS
  • execute the query on many nodes at the same time
  • send the output to a Pandas DataFrame (a sketch follows this list)
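
A sketch of that flow in PySpark; the HDFS path, view name, and query are placeholders.

# Sketch of the suggested flow, assuming PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-search").getOrCreate()

# 1. An SQL representation of the raw text files in HDFS -
#    spark.read.text yields one row per line, in a column named "value".
lines = spark.read.text("hdfs:///data/textfiles")
lines.createOrReplaceTempView("messages")

# 2. The query is executed in parallel across the cluster's nodes.
hits = spark.sql(
    "SELECT value AS line FROM messages WHERE lower(value) LIKE '%tim%'"
)

# 3. Collect the (hopefully small) result set into a Pandas DataFrame.
result_df = hits.toPandas()
print(result_df.head())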

Returning results to the users

Unless your query is synchronous (linked to the user's session) - and for this you need real-time (or at least very fast) queries - you will have to somehow connect the user to the output of a MapReduce-type job.

HDFS Viewer

You could connect the user via a web interface which then reads from the HDFS output directory.

This will probably require HDFS security to be set up - which, due to the lack of LDAP, means separate account/user systems.

Read HDFS Output - and relocate

A possibly easier way is to take the MapReduce output and store it in a MySQL database. As the user who requested the search is known, the table should be defined something like this (a sketch follows the list):

  • DateTime - datetime
  • Requesting_User - varchar(20)
  • PartNo - int
  • Read - bool
  • Data - varchar(65535)
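
Here is a sketch of creating that table and loading one result row, assuming the mysql-connector-python client; the connection details are placeholders, and TEXT stands in for varchar(65535), since a varchar that large exceeds MySQL's practical row size.

# Sketch of storing MapReduce output against the requesting user,
# assuming mysql-connector-python. Connection details are placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="search", password="secret", database="results"
)
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE IF NOT EXISTS search_results (
        DateTime        DATETIME,
        Requesting_User VARCHAR(20),
        PartNo          INT,
        `Read`          BOOL,   -- backticks: READ is a reserved word
        Data            TEXT    -- stand-in for varchar(65535)
    )
""")

# One row per output part; Read starts false until the user views it.
cursor.execute(
    "INSERT INTO search_results VALUES (NOW(), %s, %s, %s, %s)",
    ("tim", 1, False, "first chunk of MR output"),
)
conn.commit()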