## Wednesday, July 06, 2016

### The difference between Statistical Modeling and Machine Learning, as I see it | Pulse - LinkedIn
Oliver Schabenberger, Ph.D., a Senior Research Statistician at SAS Institute who has been using SAS software since 1991, is frequently asked about the differences between Statistics (statistical modeling in particular), Machine Learning and Artificial Intelligence. There is indeed overlap in goals, technologies and algorithms. Confusion arises not only from this overlap, but also from the buzzword salad we are being fed in non-scientific articles.

Statistical Modeling
The basic goal of Statistical Modeling is to answer the question, “Which probabilistic model could have generated the data I observed?” So you:
• Select a candidate model from a reasonable family of models
• Estimate its unknown quantities (the parameters; aka fit the model to data)
• Compare the fitted model to alternative models
For example, if your data represent counts, such as the number of customers churned or cells divided, then a model from the Poisson family, or the Negative Binomial family, or a zero-inflated model might be appropriate.
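The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's method: the counts are made up, the function names (`fit_poisson`, `fit_negbin`, `aic`) are hypothetical, the Negative Binomial dispersion is found with a crude grid search rather than a proper optimizer, and AIC stands in for whatever comparison criterion you would actually prefer.

```python
import math

def poisson_loglik(counts, lam):
    # log P(k) = k*log(lam) - lam - log(k!)
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

def negbin_loglik(counts, mu, r):
    # Negative Binomial parameterized by mean mu and size (dispersion) r
    return sum(
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu))
        for k in counts
    )

def fit_poisson(counts):
    # The maximum likelihood estimate of the Poisson rate is the sample mean
    lam = sum(counts) / len(counts)
    return lam, poisson_loglik(counts, lam)

def fit_negbin(counts):
    # In this parameterization the MLE of mu is the sample mean;
    # r is chosen by a simple grid search from 0.01 to 10^4 (an assumption)
    mu = sum(counts) / len(counts)
    r_grid = [10 ** (e / 20) for e in range(-40, 81)]
    r_hat = max(r_grid, key=lambda r: negbin_loglik(counts, mu, r))
    return mu, r_hat, negbin_loglik(counts, mu, r_hat)

def aic(loglik, n_params):
    # Lower AIC means a better trade-off between fit and complexity
    return 2 * n_params - 2 * loglik

# Made-up, overdispersed counts (e.g. churned customers per store)
counts = [0, 0, 1, 0, 7, 2, 0, 12, 1, 0, 3, 0, 0, 9, 1, 0, 0, 5, 0, 2]

lam, ll_pois = fit_poisson(counts)       # 1. candidate model: Poisson
mu, r_hat, ll_nb = fit_negbin(counts)    # 2. alternative: Negative Binomial
aic_pois = aic(ll_pois, 1)               # 3. compare the fitted models
aic_nb = aic(ll_nb, 2)

print(f"Poisson: lambda={lam:.2f}, AIC={aic_pois:.1f}")
print(f"NegBin:  mu={mu:.2f}, r={r_hat:.2f}, AIC={aic_nb:.1f}")
```

Because the made-up counts vary far more than a Poisson model allows (their variance is several times their mean), the Negative Binomial fit should come out with the lower AIC here.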

Once a statistical model has been chosen, the estimated model serves as the device for inquiries: testing hypotheses, creating predicted values, and computing measures of confidence. The estimated model becomes the lens through which we interpret the data. We never claim that the selected model generated the data but view it as a reasonable approximation of the stochastic process on which confirmatory inference is based...

Classical Machine Learning
Classical machine learning is a data-driven effort, focused on algorithms for regression and classification, and motivated by pattern recognition. The underlying stochastic mechanism is often secondary and not of immediate interest. Of course, many machine learning techniques can be framed through stochastic models and processes, but the data are not thought of as having been generated by that model. Instead, the primary concern is to identify the algorithm or technique (or ensemble thereof) that performs the specific task: Are customers best segmented by k-means clustering, or DBSCAN, or a decision tree, or random forest, or SVM?

In a nutshell, for the Statistician the model comes first; for the Machine Learner the data come first. Because the emphasis in machine learning is on the data, not the model, validation techniques that separate data into training and test sets are very important. The quality of a solution lies not in a p-value, but in proving how well the solution performs on previously unseen data. Fitting a statistical model to a set of data and training a decision tree on a set of data both involve estimation of unknown quantities. The best split points of the tree are determined from the data, as are the estimates of the parameters of the conditional distribution of the dependent variable...
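To make the two ideas in this paragraph concrete, here is a minimal sketch of a one-split regression tree (a "stump") whose split point is estimated from training data and whose quality is judged only on held-out test data. Everything in it is an assumption for illustration: the data are synthetic, the helper names (`best_split`, `predict`, `mse`) are hypothetical, and a real project would use a full tree implementation and a more careful validation scheme.

```python
import random

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]

def predict(stump, x):
    thr, left_mean, right_mean = stump
    return left_mean if x <= thr else right_mean

def mse(predictions, ys):
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)

# Synthetic data: the response jumps from about 1 to about 5 at x = 4
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [(1.0 if x < 4 else 5.0) + rng.gauss(0, 0.5) for x in xs]

# Separate the data into a training set and a test set
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

# The split point and leaf means are estimated from the training data only
stump = best_split(train_x, train_y)

# Quality is measured on data the stump has never seen
test_mse = mse([predict(stump, x) for x in test_x], test_y)
train_mean = sum(train_y) / len(train_y)
baseline_mse = mse([train_mean] * len(test_y), test_y)

print(f"estimated split point: {stump[0]:.2f}")
print(f"test MSE (stump): {test_mse:.3f}, test MSE (mean only): {baseline_mse:.3f}")
```

The threshold and the two leaf means play the same role here as estimated parameters in a statistical model; what differs is that the verdict comes from the test-set error rather than from a hypothesis test.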

Modern Machine Learning
A machine learning system is truly a learning system if it is not programmed to perform a task, but is programmed to learn to perform the task. I refer to this as Modern Machine Learning. Like the classical variant, it is a data-driven exercise. Unlike the classical variant, modern machine learning does not rely on a rich set of algorithmic techniques. Almost all applications of this form of machine learning are based on deep neural networks.

Source: Pulse - LinkedIn