[#4] Machine Learning versus Statistics
How do machine learning practitioners differ from statisticians in their approach to a problem?
In this post, I would like to highlight one article that I found really interesting. For some context, in my very first post, I gave a brief history of data science and the different fields that have converged into it. Two big components of modern data science are statistics and machine learning. Understanding the difference between the two fields, and between machine learning practitioners and statisticians, helps ensure that we are all speaking the same language when we discuss these topics. It also helps in understanding different people's approaches to a problem: a statistician will likely approach a problem differently from a machine learning practitioner facing that same problem.
The article I am featuring here is called Machine Learning vs. Statistics, and was authored by Tom Fawcett, a machine learning practitioner, and Drew Hardin, a statistician. I recommend going ahead and reading the article, but I include some quotes from the article and an overview here.
In statistics, the goal of modeling is approximating and then understanding the data-generating process, with the goal of answering the question you actually care about.
[T]he Statistician is concerned primarily with model validity, accurate estimation of model parameters, and inference from the model. However, prediction of unseen data points, a major concern of Machine Learning, is less of a concern to the statistician. Statisticians have the techniques to do prediction, but these are just special cases of inference in general.
When I think of statistics, I think first of descriptive statistics, which summarize the data at hand in some way (for example, the mean or the variance of the data). As mentioned in the quote above, statisticians have techniques for making predictions from data, but statistics has other goals as well, often involving either understanding a data set in depth or comparing two data sets to see whether there is a significant difference between them.
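To make those two goals concrete, here is a minimal sketch using only Python's standard library. The two lists of selling prices are made up for illustration, and the helper names `describe` and `welch_t` are my own. The sketch describes each sample with its mean and variance, then computes Welch's t statistic as one common way to compare two samples.

```python
import math
from statistics import mean, variance

# Hypothetical selling prices (in thousands of dollars) for two neighborhoods.
# These numbers are invented purely for illustration.
neighborhood_a = [310, 295, 330, 340, 305, 320, 315, 298]
neighborhood_b = [355, 340, 372, 365, 350, 348, 360, 358]


def describe(sample):
    """Descriptive statistics: sample size, mean, and sample variance."""
    return {"n": len(sample), "mean": mean(sample), "variance": variance(sample)}


def welch_t(sample_1, sample_2):
    """Welch's t statistic: difference in means divided by its standard error."""
    s1, s2 = describe(sample_1), describe(sample_2)
    standard_error = math.sqrt(s1["variance"] / s1["n"] + s2["variance"] / s2["n"])
    return (s1["mean"] - s2["mean"]) / standard_error


summary_a = describe(neighborhood_a)
summary_b = describe(neighborhood_b)
t_statistic = welch_t(neighborhood_a, neighborhood_b)

print("A:", summary_a)
print("B:", summary_b)
print(f"Welch t statistic: {t_statistic:.2f}")
```

A t statistic far from zero suggests that the two means differ by more than the noise in the samples would explain. Turning it into a p-value requires the t distribution, which the standard library does not provide; a package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) handles that step.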
In Machine Learning, the predominant task is predictive modeling: the creation of models for the purpose of predicting labels of new examples.
The model does not represent a belief about or a commitment to the data generation process. Its purpose is purely functional.
ML practitioners are freed from worrying about difficult cases where assumptions are violated, yet the model may work anyway.
In summary, the difference between statistics and machine learning, as the authors see it:
is not one of algorithms or practices but of goals and strategies. Neither field is a subset of the other, and neither lays exclusive claim to a technique.
Furthermore, the way one trains as either a statistician or a machine learning practitioner is different, as:
Computer scientists are taught to design real-world algorithms that will be used as part of software packages, while statisticians are trained to provide the mathematical foundation for scientific research.
To summarize all of this with an example, suppose I have a simple data set that contains the square footage of a list of houses, along with their selling prices. In this simplified world, where the selling price of a house is correlated only with its size, I want to fit a model to the relationship between square footage and selling price, so that I can predict the selling price when a new house goes on the market with a previously unseen value of square footage.
A purely machine learning approach would take the sample data, split it into a training and a hold-out test set, and try different models, with different sets of hyperparameters, to see what gives the best performance (i.e., the most accurate predictions) on the test set.
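Here is a rough sketch of that workflow in plain Python, so that it runs without any extra packages. Everything in it is an assumption for illustration: the data are synthetic (price is a linear function of square footage plus noise), the candidate models are a least-squares line and a k-nearest-neighbors average with a few values of k, and all function names are my own. In practice, a library such as scikit-learn provides these pieces.

```python
import random


def make_synthetic_houses(n=200, seed=0):
    """Synthetic (square_feet, price) pairs; the linear-plus-noise rule is an assumption."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        sqft = rng.uniform(500, 4000)
        price = 50_000 + 150 * sqft + rng.gauss(0, 20_000)
        data.append((sqft, price))
    return data


def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and hold out a fraction of it as the test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(train):
    """Least-squares straight line; returns a prediction function."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(train, k):
    """k-nearest-neighbors regression; k is the hyperparameter being tuned."""
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict


def mean_absolute_error(predict, test):
    """Average absolute gap between predicted and actual prices."""
    return sum(abs(predict(x) - y) for x, y in test) / len(test)


houses = make_synthetic_houses()
train, test = train_test_split(houses)

# Candidate models: one line, plus k-NN with several hyperparameter values.
candidates = {"line": fit_line(train)}
for k in (1, 5, 25):
    candidates[f"knn(k={k})"] = fit_knn(train, k)

# Score every candidate on the hold-out set and keep the most accurate one.
scores = {name: mean_absolute_error(model, test) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {score:,.0f}")
print("selected:", best_name)
```

Notice that nothing here asks why one model wins; the hold-out score is the only judge. One caveat: because the test set is used to pick the winner, the winning score is a slightly optimistic estimate of future performance, which is why a separate validation set or cross-validation is often used for the selection step.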
A more statistical approach might be to try to understand the intrinsic relationship between size and selling price, to understand the assumptions that each model makes about the provided data, and to try to justify the specific model that you report as the one that best fits the data, beyond just the performance or accuracy of the model.
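As a contrast, here is a minimal sketch of what that might look like for the housing example, again in plain Python and again on synthetic data that I made up (a linear relationship with constant-variance noise, which happens to match the assumptions of ordinary least squares). The focus shifts from the prediction score to the estimated slope, its uncertainty, and the residuals. The 95% interval uses the normal approximation, which I am assuming is adequate for 200 points; a t distribution would be the more careful choice for small samples.

```python
import math
import random
from statistics import NormalDist, mean

# Synthetic data, invented for illustration: price = 50,000 + 150 * sqft + noise.
rng = random.Random(1)
sqft = [rng.uniform(500, 4000) for _ in range(200)]
price = [50_000 + 150 * x + rng.gauss(0, 20_000) for x in sqft]

# Ordinary least squares estimates of the slope and intercept.
n = len(sqft)
mean_x, mean_y = mean(sqft), mean(price)
sxx = sum((x - mean_x) ** 2 for x in sqft)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals are what the model leaves unexplained.
residuals = [y - (intercept + slope * x) for x, y in zip(sqft, price)]
rss = sum(r ** 2 for r in residuals)

# Residual standard error (n - 2 degrees of freedom) and slope standard error.
sigma_hat = math.sqrt(rss / (n - 2))
se_slope = sigma_hat / math.sqrt(sxx)

# Approximate 95% confidence interval for the slope (normal approximation).
z = NormalDist().inv_cdf(0.975)
ci_low, ci_high = slope - z * se_slope, slope + z * se_slope

# R-squared: the share of price variance explained by square footage.
tss = sum((y - mean_y) ** 2 for y in price)
r_squared = 1 - rss / tss

print(f"price per extra square foot: {slope:.1f} (95% CI {ci_low:.1f} to {ci_high:.1f})")
print(f"intercept: {intercept:,.0f}")
print(f"residual standard error: {sigma_hat:,.0f}")
print(f"R-squared: {r_squared:.3f}")
```

The slope and its interval answer a question about the data-generating process (how much does an extra square foot add to the price, and how sure are we?), and a natural next step would be plotting the residuals against square footage to check that the linearity and constant-variance assumptions look reasonable.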
Ultimately, statistics and machine learning are two separate fields, with some overlap, that complement each other well. For a data scientist, statistics provides the tools to investigate and understand a given data set, and machine learning provides solid tools for taking that data and making predictions. In other words, statistics provides the tools for engineering high-quality features, which are then fed into machine learning models to make accurate and useful predictions on previously unseen data.
Since many data scientists, myself included, use Python, I will occasionally share some tips and tricks or interesting packages.
If you use Python to train machine learning models, how do you keep track of the models you have run, the hyperparameters you have experimented with, etc.? The Replicate Python package brings version control to training machine learning models. I just found out about this package, so I have not used it yet, but it's definitely worth checking out.
Enjoy the rest of your week!