Omaima suggested plan

I would strongly suggest you attach your Spark issue like this

Basic Skills

The first 2 sets of skills you will need are


Please make sure you use Python3. I have included lots of Python books in your data folder.

You download Python from

Python Lambda Functions

You need to be proficient in writing functions this way. A lambda is a small unnamed function written inline, usually so it can be handed to another function. Spark code is functions calling functions calling functions, which can be a little difficult to understand to start with.
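To get a feel for this before touching Spark, here is a small sketch in plain Python3 (no Spark needed). The word list is made up for illustration. It shows lambdas being passed into map, filter and reduce, and then the same thing chained, which is the "function calling function calling function" style.

```python
from functools import reduce

# Made-up sample data, just for illustration.
words = ["spark", "python", "hadoop", "java"]

# A lambda is an unnamed function written inline and passed to another function.
lengths = list(map(lambda w: len(w), words))
long_words = list(filter(lambda w: len(w) > 4, words))
total = reduce(lambda a, b: a + b, lengths)

# The same idea chained together: filter, then map, then reduce.
chained = reduce(
    lambda a, b: a + b,
    map(lambda w: len(w), filter(lambda w: len(w) > 4, words)),
)

print(lengths)     # [5, 6, 6, 4]
print(long_words)  # ['spark', 'python', 'hadoop']
print(total)       # 21
print(chained)     # 17
```

PySpark has methods with the same names (map, filter, reduce) on its datasets, so practising here carries over directly.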


PySpark is a sublanguage/module for Python which you need to learn.

Suggested Plan

  • Python

    • You need to learn the basics of the language
    • Lambda Functions
    • These are slightly more complex to understand
    • BUT you NEED this for PySpark
  • Spark Environment

    • Beginners
    • Just install the following Products

      • Python3
      • Java
      • Maven
      • Spark

      You can now run Spark in a simple manner (1 machine, no cluster) with a SMALL amount of data. This will allow you to do development without worrying about the whole Hadoop thing.
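As a rough sketch of what "1 machine, no cluster" can look like in PySpark: the code below assumes the pyspark package is installed and Java is available, and the function name, application name and sample input are made up for illustration. The master setting "local[*]" asks Spark to run on this machine only, using all its cores.

```python
# Sketch only: assumes pyspark is installed and Java is available.
LOCAL_MASTER = "local[*]"  # run Spark on this one machine, using all CPU cores


def count_words_locally(lines):
    """Count words in a small list of strings using Spark in local mode."""
    # Imported inside the function so this file loads even before Spark is installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(LOCAL_MASTER)
        .appName("small-data-dev")  # hypothetical application name
        .getOrCreate()
    )
    try:
        counts = (
            spark.sparkContext.parallelize(lines)  # small in-memory data set
            .flatMap(lambda line: line.split())    # one record per word
            .map(lambda word: (word, 1))           # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b)       # add up the counts per word
            .collect()
        )
    finally:
        spark.stop()  # always release the local Spark session
    return dict(counts)
```

You would call it with a handful of lines, for example count_words_locally(["a b a", "b c"]), and check the counts by hand before moving to real data.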

  • Full Hadoop Thing

    At this point I would expect you to have finished the following:

       - You have a small amount of real data
       - You can process your data (using pyspark) and you can get the answers that you are expecting.


    Well done, so you now need to download a Hadoop distribution. You can use either IBM BigInsights or Cloudera; I personally would use Cloudera. It is now a case of getting your Hadoop environment up and running (single or multiple nodes?), then bringing in your small data, and then running the same queries you did on your small/starter system. If the output is the same, then carry on.
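One simple way to do the "is the output the same" check is sketched below in plain Python. It assumes you have collected the rows from each system into a list of tuples; the function name and the sample rows are made up for illustration. Row order can differ between runs unless you sort, so the comparison ignores order.

```python
from collections import Counter


def same_results(small_system_rows, cluster_rows):
    """Return True if both runs produced the same rows, ignoring row order."""
    # Counter compares the rows as multisets, so duplicates still have to match.
    return Counter(small_system_rows) == Counter(cluster_rows)


# Hypothetical outputs, e.g. collected from each system.
small_run = [("spark", 2), ("python", 1)]
cluster_run = [("python", 1), ("spark", 2)]

print(same_results(small_run, cluster_run))
```

This only works while the results are small enough to fit in memory, which should be the case for your starter data.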

From Omaima

Look for the function called weka ?