[#1] A Brief History of Data Science
Statistics, Data Analysis, and Coding converge to form a new career for the 21st century
Welcome to my new newsletter "Featuring Data"! This is a newsletter about all things data science. I seek to filter through the vast amounts of material, articles, and resources out there on data science, and distill it into a weekly, easily digestible format.
I really wanted to kick off this newsletter with a (brief) look at the history and origin of data science, both as a field and as a profession. However, it turns out that merely defining data science is fairly complex in and of itself.
My first step is to, in a paragraph or two, summarize what data science is, based on various experts in the field. Then, I want to go into some of the history of the different fields that have essentially converged into what we now call "data science".
So, what is data science? Is it just a rebranding of statistics, as some statisticians like Nate Silver claim [1], or is it really something new and different [2]?
In terms of an "official" definition, several definitions of data science have been offered over the last few decades. In his 1974 book Concise Survey of Computer Methods, the computer scientist Peter Naur defined data science as "the science of dealing with data" [3]. The implication here is that data science is not just about "analyzing" data, which is in the realm of statistics, but rather has to do with "dealing" with data - as in using a computer to clean, process, store, and manipulate data - as well as analyzing the data [2].
The term did not catch on at the time. Two decades later, in a 1998 publication, Chikio Hayashi defined "data science" as a "concept to unify statistics, data analysis, and their related methods" [4]. More recently, in an introductory lecture on data science, Steve Brunton describes "data-intensive" science or "data-driven" inquiry, which is, essentially, performing scientific research based on data. He describes data science as an "emerging scientific discipline which is motivated by data-intensive science."
In this view, data science is the "science of how you handle data - collect, clean, store, visualize, and model with data."
This modeling of the data in the modern context of data science is often "machine learning” [5].
Therefore, let us first dive into a brief history of the main components - statistics and data analysis. After that, we will touch on the emergence of machine learning and neural networks, two methods that have become major components of many data scientists' toolkits.
Statistics (~700s to present)
Some of the earliest writings on statistical inference appeared over 1200 years ago, in the 8th century [6]. The two major branches of statistics, Bayesian and frequentist statistics, originated in the past ~200 years.
The origin of Bayesian statistics can be traced back to a paper by Thomas Bayes, which was published (posthumously) in 1763. From the late 1700s to early 1800s, Laplace built on Bayes' work, and, in 1812, presented Bayes' theorem in his book Théorie Analytique des Probabilités [7].
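For readers unfamiliar with it, Bayes' theorem in modern notation (which, to be clear, is not how Bayes or Laplace wrote it) reads:

```
P(A | B) = P(B | A) × P(A) / P(B)
```

In words: the probability of a hypothesis A given evidence B is proportional to how likely the evidence is under that hypothesis, weighted by the prior probability of the hypothesis.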
Ronald Fisher and his contemporaries objected to the Bayesian approach and thus started the frequentist branch of statistics in the early 1900s [8]. In fact, it was Fisher, in 1950, who first used the term Bayesian, although he did so to criticize the Bayesian approach [9].
Some important concepts we use today in statistics, such as regression and correlation, originate from the work of Francis Galton in the late 1800s; the correlation coefficient was later formalized by Karl Pearson [10]. In 1935, Ronald Fisher published the book The Design of Experiments, in which he introduces the term null hypothesis, a critical component of hypothesis testing and scientific experiments [11].
Data Analysis
In 1961, the statistician John Tukey gave the following definition for "data analysis": "procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data" [12].
As Tukey was a statistician, clearly statistics is a central part of his definition of data analysis. This definition, however, is getting very close to the aforementioned definitions of data science from Hayashi and Brunton.
In the second half of the 20th century, computational tools for data analysis began to appear, such as SPSS and SAS, which were first developed in the 1960s [13] [14]. In 1993, Ross Ihaka and Robert Gentleman announced the R programming language [15]. Perhaps the most famous data analysis tool, Microsoft Excel, was first released in 1985, with a Windows version following in 1987. All of these tools are still used by data professionals today.
Machine Learning (1950s to present)
If you look at job descriptions of data analysts versus data scientists, the biggest difference is that almost all data scientist job descriptions include machine learning in some way.
The idea of machine learning and AI dates back to 1950, when the famous computer scientist Alan Turing proposed the idea of a "learning machine" [16]. Then, in 1951, the first neural network was built by Marvin Minsky and Dean Edmonds, followed by the invention of the perceptron (an element of most modern neural networks) by Frank Rosenblatt in 1957 [17] [18]. For more on the history of neural networks, see this video from Duke's Coursera course [19].
In parallel to the development of modern neural networks, other very popular machine learning algorithms were developed. In particular, papers describing random forests (by Tin Kam Ho) and support vector machines (SVMs; by Corinna Cortes and Vladimir Vapnik) were both published in 1995 [20] [21].
Bringing it all Together (~2008 to present)
While statistics has been advancing over the course of at least a millennium, there was an acceleration in the development of concepts and techniques that we still rely upon today, starting around the late 19th century and continuing through the first half of the 20th century. Then, during the second half of the 20th century and into the 21st, various software tools were introduced for the statistical analysis of data, and, at the same time, machine learning and neural network techniques were developed.
In the inaugural episode of the DataFramed podcast, data scientist Hilary Mason explains that the profession of data scientist appeared on the scene around the year 2008. Around this time, "technology had progressed to the point where the multiple things that a data scientist does could be combined in one professional role." These three elements are (1) the ability to write code and build models, (2) the increasing amount of data that was becoming more easily available for analysis and building these models, and (3) "a set of problems and processes and ways of thinking about the world that" allows one to put all these pieces together. These different elements were, by themselves, not necessarily new (except for the relative explosion in the amounts of data available), but
"it was newly affordable and newly so easy that one person could take on everything from the problem formulation, through to the analysis, to the visualizations and communications..., in a way that it just hadn't been before." This convergence "opened the door to the creation of this new job role of" data scientist [22].
Consistent with this timeline is the emergence of Python as one of the most widely used tools in data science. While Python, the programming language, emerged in the early 1990s, it was only in 2010 that the popular machine learning package scikit-learn was introduced. With tools like scikit-learn, a data scientist can train a machine learning model in just a few lines of code [23].
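As a rough illustration of what "a few lines of code" means in practice (this sketch is mine, not from the scikit-learn documentation; the dataset and model choices are arbitrary), here is a complete workflow using scikit-learn's built-in iris dataset and the random forest algorithm mentioned earlier:

```python
# A minimal scikit-learn workflow: load data, split, train, evaluate.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the classic iris dataset: 150 flower measurements, 3 species.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train a random forest - one of the 1995-era algorithms discussed above.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Report accuracy on the held-out test set.
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```

This brevity is exactly the "newly affordable, newly easy" quality Mason describes: the data loading, the train/test methodology, and the model itself are all one import away.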
This role of data scientist continues to evolve, and in later posts, we will discuss some of the different flavors of data scientists and other roles such as machine learning engineer.
Book Recommendation
The Signal and the Noise by Nate Silver.
Whether or not you are a data scientist, this book is essential reading. It explains how people in fields from meteorology to finance make predictions, and it introduces concepts that are fundamental to data science, such as overfitting. Silver also covers some of the history of statistics, in particular the divergence between frequentist and Bayesian statistics.
Notes and References
[2] https://priceonomics.com/whats-the-difference-between-data-science-and/
[3] http://www.naur.com/Conc.Surv.html
[4] https://www.springer.com/gp/book/9784431702085
[5] Steve Brunton on YouTube - video 1 and 2
[6] Broemeling 2012 - https://www.tandfonline.com/doi/abs/10.1198/tas.2011.10191
[7] LaPlace 1812 - https://books.google.fr/books?id=BkTVqwQW6loC
[8] https://openlibrary.org/works/OL16700318W/The_Signal_and_the_Noise
[9] http://users.stat.ufl.edu/~aa/articles/agresti_hitchcock_2005.pdf
[10] http://jse.amstat.org/v9n3/stanton.html
[11] Fisher 1935 - https://archive.org/details/in.ernet.dli.2015.502684
[12] Tukey 1961 - https://projecteuclid.org/download/pdf_1/euclid.aoms/1177704711
[13] SPSS - https://www.ibm.com/products/spss-statistics, https://www-03.ibm.com/press/us/en/pressrelease/27936.wss
[14] SAS - https://www.sas.com/en_us/home.html, https://web.archive.org/web/20131023182559/http://www.sas.com/company/about/history.html
[15] R - https://bookdown.org/rdpeng/rprogdatascience/history-and-overview-of-r.html
[16] https://academic.oup.com/mind/article/LIX/236/433/986238
[17] https://www.webofstories.com/play/marvin.minsky/136
[18] https://www.newyorker.com/magazine/1958/12/06/rival-2
[19] https://www.coursera.org/learn/machine-learning-duke/lecture/vHefr/early-history-of-neural-networks
[20] https://ieeexplore.ieee.org/document/598994
[21] https://link.springer.com/article/10.1007/BF00994018
[22] https://www.datacamp.com/community/podcast/data-science-past-present-and-future
[23] Python - https://docs.python.org/2.0/ref/node92.html; scikit-learn - https://scikit-learn.org/stable/about.html