Photo: Somnath Banerjee
Someone even made this really cool infographic.
Choose the Building Blocks
Learn a few key Python modules:
- NumPy - adds support for large, multi-dimensional arrays and matrices, along with a large library of math functions. The best way to think about it is that it turns slow, dynamic Python into an extremely fast number cruncher. Jake Vanderplas has a really helpful video that illustrates the point: Numerical computing with Numpy
- Pandas - a BSD-licensed library providing high-performance, easy-to-use data structures and analysis tools. There is plenty of material online, but personally I like Manish Amde's Top-10 Python and Pandas Techniques.
- Matplotlib - a plotting library that comes with an object-oriented API and is distributed under a BSD-compatible license. It is packed with handy visualization tools that make your Exploratory Data Analysis (EDA) simple!
There are many, many more valuable modules in the Python ecosystem, but these make a solid starter kit.
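To make the three modules a little more concrete, here is a minimal sketch of how they fit together. The data is synthetic and made up purely for illustration (the height values, column names and bin count are my own assumptions, not from any real data set): NumPy does the vectorized arithmetic, pandas wraps the arrays in a labelled table and summarizes it, and Matplotlib's object-oriented API draws a quick histogram.

```python
import numpy as np
import pandas as pd
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up data: 1,000 synthetic height measurements in centimetres.
rng = np.random.default_rng(seed=0)
heights_cm = rng.normal(loc=170.0, scale=10.0, size=1_000)

# NumPy: one vectorized expression replaces an explicit Python loop.
heights_m = heights_cm / 100.0

# pandas: put the arrays in a labelled table and summarize each column.
df = pd.DataFrame({"height_cm": heights_cm, "height_m": heights_m})
summary = df.describe()
print(summary)

# Matplotlib (object-oriented API): a histogram as a quick EDA check.
fig, ax = plt.subplots()
ax.hist(df["height_cm"], bins=30)
ax.set_xlabel("height (cm)")
ax.set_ylabel("count")
ax.set_title("Distribution of synthetic heights")
# fig.savefig("heights.png")  # hypothetical file name; uncomment to write the plot
```

In a notebook you would normally drop the `matplotlib.use("Agg")` line and let the plot render inline.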
There are two possible routes. The simplest starter is scikit-learn!
Simple and easy to get started with, scikit-learn even provides a roadmap, in the form of a flow chart, for choosing among different techniques.
Here is a very detailed post from Ben Lorica on his reasons to use scikit-learn.
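To give a feel for how little code a first scikit-learn model takes, here is a minimal sketch. The data set (the bundled iris flowers) and the classifier (k-nearest neighbours with `n_neighbors=5`) are my own picks for illustration, not a recommendation from the roadmap; almost any other scikit-learn estimator follows the same `fit` / `predict` pattern.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small data set that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the classifier on the training rows only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predict the held-out rows and measure the fraction classified correctly.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in a different technique is usually a one-line change to the `model = ...` line, which is a big part of what makes scikit-learn such a friendly starting point.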
The other route, which is suitable for very large data sets hosted on cluster computers, is Apache Spark. I will cover Apache Spark in a later post.
Somnath's best wishes for making your machine learn faster than you do :-)