This blog series is my attempt to collect the useful resources needed to learn big data technologies. I searched the internet but couldn’t find one consolidated place that lists the steps. I have extensive programming experience, so I don’t need very basic steps and tutorials. I want to learn these technologies by working through practice problems at home, and I want to focus on the following topics:
- Setting up an HDFS cluster
- Running a map-reduce problem on a Hadoop cluster
- Running the same problem on a Spark cluster
- Running the same problem using Pig / Hive?
- Mesos vs YARN
- Eventually running ML problems on this cluster
I want to run map-reduce problems on large, real data sets. After some analysis, I decided to use the Stack Overflow data dump. I also want to run these problems on a real cluster rather than on a single node. I am linking a few posts that can help you set up your own cluster and load the Stack Overflow data dump. My language of choice is Python. I have not done any Python programming so far. I have solid experience in Java, but I believe Python is the right choice for machine learning and big data. One could run the big data problems in Java, but machine learning needs the support of ML libraries. Currently scikit-learn and TensorFlow are, in my view, the best ML libraries, and both are designed to be used from Python. I am not sure how this experience will unfold; let’s see.
- Hadoop cluster setup
- Map-Reduce in Python
- Data analysis on the Stack Exchange dump
- Stack Exchange data dump
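To give a flavour of the kind of map-reduce problem I have in mind, here is a minimal sketch in plain Python that counts how often each tag appears across posts. It runs locally and only imitates the map, shuffle and reduce phases; it is not Hadoop code. The function names (`map_tags`, `reduce_counts`, `count_tags`) are made up for illustration, and the tag format `<python><hadoop>` is an assumption about how tags are stored in the dump, so check it against the actual files before relying on it.

```python
import re
from itertools import groupby

# Assumption: each post stores its tags in one string such as "<python><hadoop>".
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def map_tags(tags_field):
    """Map step: emit a (tag, 1) pair for every tag found in one post."""
    for tag in TAG_PATTERN.findall(tags_field):
        yield tag, 1


def reduce_counts(pairs):
    """Reduce step: sort the pairs by key (a stand-in for the shuffle),
    then sum the counts of each group of identical keys."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {
        tag: sum(count for _, count in group)
        for tag, group in groupby(ordered, key=lambda kv: kv[0])
    }


def count_tags(tag_fields):
    """Run the map step over every post, then reduce all emitted pairs."""
    pairs = [pair for field in tag_fields for pair in map_tags(field)]
    return reduce_counts(pairs)


if __name__ == "__main__":
    # Hypothetical sample data standing in for rows of the real dump.
    sample = ["<python><hadoop>", "<python><apache-spark>", "<hadoop>"]
    print(count_tags(sample))
    # {'apache-spark': 1, 'hadoop': 2, 'python': 2}
```

On a real cluster the map and reduce steps would run as separate processes on different nodes, with the framework doing the sorting and grouping in between. Keeping the two steps as separate functions here should make that move easier later.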