
Friday, June 19, 2015

So You Want to Do Machine Learning?

Photo: Somnath Banerjee
"The noise and chatter of Big Data and Machine Learning can confuse anyone and keep them wondering on how to get started. The chaos of tools, libraries, frameworks and technology stacks make it overwhelming for a beginner."

Someone even made this really cool infographic. 

Choose the Building Blocks
Learn a few key Python modules:
- NumPy - adds support for large, multi-dimensional arrays and matrices, along with a large library of math functions. The best way to think about it: NumPy hands the heavy arithmetic to compiled code, turning slow, dynamically typed Python loops into extremely fast number crunching. Jake Vanderplas has a really helpful video that illustrates the point: Numerical computing with Numpy.
- pandas - a BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools. There is plenty of material online, but personally I like Manish Amde's Top-10 Python and Pandas Techniques.
- Matplotlib - a BSD-licensed plotting library with an object-oriented API, packed with handy visualization tools that make your Exploratory Data Analysis (EDA) simple!
There are many, many more valuable modules in the Python ecosystem, but these make a solid starter set.
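To make the "number cruncher" point concrete, here is a minimal sketch comparing a plain Python loop with the equivalent NumPy expression, followed by a tiny pandas summary. The data, the column names (`group`, `score`) and the variable names are all made up for illustration; treat this as one possible starting point rather than a recommended workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 evenly spaced values, made up for illustration.
values = np.linspace(0.0, 10.0, 1000)

# Pure-Python approach: visit every element in an explicit loop.
loop_result = [v * v + 1.0 for v in values]

# NumPy approach: the same arithmetic applied to the whole array at once.
vector_result = values ** 2 + 1.0

# Both approaches should produce the same numbers.
same = np.allclose(loop_result, vector_result)

# pandas: put columns in a DataFrame and summarise them by group.
# The column names "group" and "score" are assumptions for this example.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "score": [1.0, 2.0, 3.0, 4.0],
})
mean_by_group = df.groupby("group")["score"].mean()

print(same)
print(mean_by_group)
```

On large arrays the vectorized line is typically much faster than the loop, because the per-element work happens in compiled code instead of the Python interpreter.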

Machine Learning
There are two possible routes. The simplest starting point is scikit-learn!
Simple and easy to get started with, scikit-learn even provides a roadmap, in the form of a flow chart, for choosing among different techniques.

Here is a very detailed post from Ben Lorica on his reasons for using scikit-learn.
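To give a feel for how little code a first model takes, here is a minimal sketch using scikit-learn's bundled iris dataset. The choice of a k-nearest-neighbours classifier, the 25% test split and the `random_state` value are assumptions made for this example, not recommendations from the post or from the roadmap.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small, built-in dataset: 150 flowers, 4 measurements each.
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the rows to check the model on unseen data.
# random_state is fixed only so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a simple classifier; k=5 is an arbitrary choice for this sketch.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# score() reports the fraction of test rows classified correctly.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same `fit` and `score` (or `predict`) pattern, so swapping in a different technique usually means changing only the line that creates the model.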

The other route, which is suitable for VERY LARGE data sets hosted on cluster computers, is Apache Spark. I will cover Apache Spark in a later post.
Somnath's best wishes for making your machine learn faster than you do :-) 

Source: Pulse