[#5] The Top Statistics Ideas of the Last 50 Years
The most important developments in statistics since 1970 and their connection to modern data science
While we are all inundated with the usual end-of-the-year, "best of 2020" lists, this article [original on arXiv.org] looks at the top statistical ideas of the past half-century. It is written by a statistics professor, Andrew Gelman, from Columbia University, and a computer science professor, Aki Vehtari, from Aalto University. I will start here by summarizing their list.
The Most Important Statistical Ideas of the Past 50 Years (according to Gelman and Vehtari)
1. Counterfactual Causal Inference
2. Bootstrapping and Simulation-Based Inference
3. Overparameterized Models and Regularization
4. Multilevel, or Hierarchical, Models
5. Generic Computation Algorithms
6. Adaptive Decision Analysis
7. Robust Inference
8. Exploratory Data Analysis
Connection to Machine Learning and Deep Learning
One common thread among many of these ideas is how they take advantage of the advances in computing over the past 50 years. Iterative algorithms, like bootstrapping, become practical when a computer can relatively quickly run multiple iterations of the same experiment.
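To make the point about repeated iterations concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a sample mean, using only the Python standard library. The function name, the number of resamples, and the sample values are my own illustrative choices, not something taken from the paper.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic (illustrative)."""
    rng = random.Random(seed)
    n = len(data)
    # Each iteration resamples the data with replacement and recomputes the statistic.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up sample, for illustration only.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
low, high = bootstrap_ci(sample)
print(f"sample mean = {statistics.mean(sample):.2f}, 95% CI ~ ({low:.2f}, {high:.2f})")
```

The 2,000 resamples run almost instantly on a modern laptop, which is exactly the kind of computation that was impractical before cheap computing.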
In last week's post, I discuss differences between a pure statistics approach and a pure machine learning approach to tackling a problem. One of the points there is that the main difference "is not one of algorithms or practices but of goals and strategies." A lot of the statistics ideas here have found applications in machine learning. Number 3 on the list above, overparameterized models and regularization, is central to machine learning and deep learning. According to Gelman and Vehtari, "a major change in statistics since the 1970s, coming from many different directions, is the idea of fitting a model with a large number of parameters — sometimes more parameters than data points — using some regularization procedure to get stable estimates and good predictions." One of the biggest challenges in machine learning and deep learning is overfitting, where we train a model that performs amazingly well on our training dataset, but performs poorly on our unseen holdout dataset. Regularization, of different varieties, is used widely by data scientists to address overfitting.
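As a rough illustration of fitting more parameters than data points, the sketch below uses ridge (L2) regularization, which is only one of the many regularization procedures the quote could refer to. It assumes NumPy is available; the data are simulated and the penalty values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 50  # more parameters than data points

# Simulated data: only the first three features actually matter.
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=n_samples)


def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 using the closed-form solution."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# A larger penalty shrinks the coefficients toward zero and stabilizes the fit.
for lam in (0.01, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:>6}: ||w|| = {np.linalg.norm(w):.3f}")
```

Without the penalty term, the matrix being inverted would be singular here, since 20 data points cannot pin down 50 coefficients on their own.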
Furthermore, bootstrapping, which is number 2 on Gelman and Vehtari's list, is an important component of random forest models. In a random forest, an ensemble of trees is built, where each tree is trained on a bootstrap sample drawn with replacement from the original training set. This helps to control overfitting, as each tree in the ensemble sees a slightly different version of the original dataset.
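To see how different those per-tree datasets are, here is a small sketch that only draws the bootstrap row indices (it does not build any trees). The row and tree counts are arbitrary. Because sampling is with replacement, each tree ends up seeing roughly 63% of the distinct rows (about 1 - 1/e), and the rest are left out of that tree's sample.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Row indices for one tree: n_rows draws with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(0)
n_rows, n_trees = 1000, 5  # arbitrary sizes, for illustration only

unique_fractions = []
for tree in range(n_trees):
    idx = bootstrap_indices(n_rows, rng)
    frac = len(set(idx)) / n_rows  # share of distinct rows this tree sees
    unique_fractions.append(frac)
    print(f"tree {tree}: {frac:.1%} of rows in its sample, {1 - frac:.1%} left out")
```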
In terms of number 6 on the list, a well-known present-day example of adaptive decision analysis is reinforcement learning.
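A full reinforcement learning example is beyond a short post, but the core idea of adapting decisions as data arrive can be sketched with an epsilon-greedy multi-armed bandit, a deliberately simplified stand-in that I am choosing here for illustration. The success probabilities, step count, and exploration rate below are all made up.

```python
import random


def run_epsilon_greedy(success_probs, n_steps=10_000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: mostly pick the best-looking arm, sometimes explore."""
    rng = random.Random(seed)
    n_arms = len(success_probs)
    pulls = [0] * n_arms
    value_estimates = [0.0] * n_arms
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick an arm at random
        else:
            # exploit: pick the arm with the highest estimated reward so far
            arm = max(range(n_arms), key=lambda a: value_estimates[a])
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean reward for this arm.
        value_estimates[arm] += (reward - value_estimates[arm]) / pulls[arm]
    return pulls, value_estimates


# Hypothetical arms with success rates the algorithm does not know in advance.
pulls, estimates = run_epsilon_greedy([0.2, 0.5, 0.8])
print("pulls per arm:", pulls)
print("estimated success rates:", [round(v, 2) for v in estimates])
```

The decision rule changes as evidence accumulates, so most of the pulls should end up on whichever arm looks best.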
I was a little surprised to see exploratory data analysis on Gelman and Vehtari's list (coming in at number 8), but it certainly falls within the realm of statistics. Here again, advances in computing and the proliferation of personal computers allow the average researcher or analyst to generate a range of plots far more easily than with pen and paper. The ease with which plots can be made with various tools means there is little excuse not to perform a rigorous EDA on your dataset before going to the modeling phase.
Other Statistical Methods
Of the other ideas on the list, the one that I see mentioned often is causal inference. In particular, Gelman and Vehtari mention counterfactual causal inference. According to the Stanford Encyclopedia of Philosophy, "the basic idea of counterfactual theories of causation is that the meaning of causal claims can be explained in terms of counterfactual conditionals of the form 'If A had not occurred, C would not have occurred'". I hope to have a post or two about causal inference in the not-too-distant future, but until then, here is a fun post on the topic.
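In statistics, the counterfactual idea is often expressed through potential outcomes: each unit has an outcome with treatment and an outcome without it, but we only ever observe one of the two. The toy simulation below is a sketch under strong assumptions (a constant treatment effect and coin-flip assignment, both my own choices). It shows why randomization lets a simple difference in means recover the average effect even though no individual counterfactual is observed.

```python
import random
import statistics

rng = random.Random(0)
true_effect = 2.0  # assumed constant effect, for illustration only
n_units = 10_000

treated_outcomes, control_outcomes = [], []
for _ in range(n_units):
    y_without = rng.gauss(0.0, 1.0)   # potential outcome if untreated
    y_with = y_without + true_effect  # potential outcome if treated
    # Random assignment: only one potential outcome is ever observed per unit.
    if rng.random() < 0.5:
        treated_outcomes.append(y_with)
    else:
        control_outcomes.append(y_without)

estimated_effect = statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
print(f"true effect = {true_effect}, estimated effect = {estimated_effect:.2f}")
```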
If you're interested in some of the other statistical ideas on Gelman and Vehtari's list, take a look at the original paper for more details and references.
As many data scientists, including myself, are using Python, I will occasionally include some tips and tricks or interesting packages to share.
Since exploratory data analysis is on the list above of the most important statistical ideas of the past 50 years, I thought I would include here a couple of useful packages for performing this important task.
Seaborn: This package for creating statistical data visualizations is built on the popular matplotlib package, and it works seamlessly with pandas DataFrames. In a couple of lines, one can create professional-looking statistical graphics that can be used both for exploration and for presentations. Of course, these things can be done with matplotlib itself, but seaborn makes it much easier.
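As a quick sketch of what "a couple of lines" looks like, the example below builds a small made-up DataFrame and draws a grouped scatter plot. It assumes seaborn, pandas, NumPy, and matplotlib are installed; the column names and the output filename are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file, so no display is needed

import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: two groups with the same slope but different offsets.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b"], 100),
    "x": rng.normal(size=200),
})
offset = (df["group"] == "b").astype(float)
df["y"] = 2.0 * df["x"] + offset + rng.normal(size=200)

# The plot itself is one call: columns are referenced by name.
ax = sns.scatterplot(data=df, x="x", y="y", hue="group")
ax.figure.savefig("eda_scatter.png")
```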
Bokeh: Bokeh is great for making interactive data visualizations that you can easily share with others. Plots that you create in Bokeh can be exported as HTML files, which you can view in any web browser and interactively zoom, pan, and so on. And, since the plot lives in an HTML file, you can easily send the file to someone else to view and interact with, or even publish the plots online.
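Here is a minimal sketch of that workflow, assuming Bokeh is installed. The data are made up and the filename is arbitrary. Opening the resulting HTML file in a browser should give a scatter plot with pan and zoom tools.

```python
import os
import random

from bokeh.plotting import figure, output_file, save

# Made-up data, for illustration only.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [2 * xi + rng.gauss(0, 1) for xi in x]

# Tell Bokeh where the HTML output should go.
output_file("interactive_scatter.html", title="Interactive scatter plot")

plot = figure(
    title="Made-up data",
    tools="pan,wheel_zoom,box_zoom,reset",
    x_axis_label="x",
    y_axis_label="y",
)
plot.scatter(x, y, size=6, alpha=0.6)

saved_path = save(plot)  # writes the HTML file and returns its path
print(f"wrote {saved_path} ({os.path.getsize(saved_path)} bytes)")
```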
Enjoy the rest of your week!