I would strongly suggest you approach your Spark problem like this.
The first two sets of skills you will need are:
Please make sure you use Python 3. I have included lots of Python books in your data folder.
You can download Python from http://www.python.org
Python Lambda Functions
You need to be proficient with this style of function call. Spark code is functions calling functions calling functions - which can be a little difficult to understand to start with.
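A minimal plain-Python sketch of what that looks like (no Spark needed): a lambda is just an unnamed function, and the "function calling function" style passes lambdas into `filter`, `map`, and `reduce`, each feeding the next:

```python
from functools import reduce

# A lambda is an anonymous function: these two definitions are equivalent.
square = lambda x: x * x
def square_def(x):
    return x * x

# Spark-style chaining: each function takes a lambda and feeds the next one.
numbers = [1, 2, 3, 4, 5]
result = reduce(lambda a, b: a + b,                       # sum the survivors
                map(square,                               # square each number
                    filter(lambda n: n % 2 == 1, numbers)))  # keep the odd ones
print(result)  # 1 + 9 + 25 = 35
```

Reading the chain inside-out (filter, then map, then reduce) is exactly the habit you need for PySpark's `rdd.filter(...).map(...).reduce(...)` style.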
PySpark is a module (almost a sublanguage) for Python which you need to learn.
- You need to learn the basics of the language
- Lambda functions
  - These are slightly more complex to understand
  - BUT you NEED them for PySpark
Just install the following products.
You can now run Spark in a simple manner (one machine, no cluster) - but with a SMALL amount of data. This will allow you to do development without worrying about the whole Hadoop thing.
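As a sketch of that single-machine setup (assuming you have run `pip install pyspark`; the input lines here are made-up stand-ins for your small data), a local-mode job looks like this. A plain-Python fallback is included so the sketch still runs where PySpark or the JVM it needs is missing:

```python
# Sketch: word count in Spark local mode (one machine, no cluster).
# Assumes `pip install pyspark`; falls back to plain Python if PySpark
# (or the JVM it needs) is unavailable, producing the same answer.
lines = ["spark is fun", "spark is fast"]   # stand-in for your SMALL data

try:
    from pyspark import SparkContext
    sc = SparkContext("local[*]", "starter")             # use all local cores
    counts = (sc.parallelize(lines)
                .flatMap(lambda line: line.split())      # lambdas everywhere...
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b)
                .collectAsMap())
    sc.stop()
except Exception:
    # Fallback so the sketch runs without a Spark installation.
    from collections import Counter
    counts = dict(Counter(w for line in lines for w in line.split()))

print(counts)   # e.g. {'spark': 2, 'is': 2, 'fun': 1, 'fast': 1}
```

Note how every transformation takes a lambda - this is why the lambda practice above matters.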
Full Hadoop Thing
At this point I would expect you to have done and finished the following:
- You have a small amount of real data
- You can process your data (using PySpark) and you can get the answers that you are expecting
IF YOU HAVE NOT GOT THESE 2 THINGS - STOP
Well done - so you now need to download a Hadoop distribution. You can use either IBM BigInsights or Cloudera - I personally would use Cloudera. It is now a case of getting your Hadoop environment up and running (single or multiple nodes?), then bringing in your small data, and then running the same queries you ran on your small/starter system. If the output is the same, then carry on.
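The "same output" check can be as simple as comparing the two result sets. The values below are hypothetical placeholders - in practice they would come from your local run and your cluster run of the same queries:

```python
# Hypothetical results from running the SAME query in both environments.
local_counts   = {"spark": 2, "is": 2, "fun": 1, "fast": 1}   # starter system
cluster_counts = {"spark": 2, "is": 2, "fun": 1, "fast": 1}   # Hadoop cluster

if local_counts == cluster_counts:
    print("outputs match - carry on")
else:
    print("outputs differ - STOP and investigate")
```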
Look at the toolkit called Weka?