[#14] Occam's Razor

Sometimes the simplest answer is the best answer

Mar 23, 2021

In a world where there is a lot of hype around machine learning, deep learning, and AI, there is a tendency to run towards the latest, most sophisticated algorithms and throw them at any problem.

However, looking at the last four years of Kaggle surveys, I found that linear and logistic regression are still actively used by 80% of data scientists in their day-to-day work. In fact, traditional dense neural networks seem to be falling out of favor relative to more specialized neural networks like CNNs and RNNs.[1]

One area where machine learning and AI could have a large positive impact is medicine. I found several academic papers that directly compared logistic regression models with neural networks on the same data for a classification task, such as predicting whether cancer patients will survive the next 30 days. Interestingly, the results were mixed: some studies found better results with a neural network (as measured by accuracy or the AUC score),[2] while another found that the two approaches performed the same.[3]

A paper that reviewed 71 studies, each of which considered both logistic regression and some machine learning or neural network model, found “no performance benefit of machine learning over logistic regression for clinical prediction models.”[4]

In yet another paper that looked specifically at recommender models, the authors found that:

None of the computationally complex neural methods was actually consistently better than already existing learning-based techniques, e.g., using matrix factorization or linear models.[5]

Start Simple, then Increase the Complexity

One interesting point from the paper on recommender models is the question of which baselines a complex model is compared against before concluding that it outperforms simpler alternatives.

For some insight into choosing baselines for evaluating a model’s performance, take a look at this video from a machine learning course. Essentially, if you jump straight to the neural network approach, how will you know whether it performs better than a simpler model with well-thought-out feature engineering?

Another issue is that there are other considerations besides just the accuracy of the model. Many data scientists don’t come from a software engineering background, but it is important to keep in mind how our models will be implemented in production. From a technical perspective, it is much easier, and much more efficient, to run a regression model in production than a complex neural network model. Even if a neural network model gives better predictions, one needs to weigh whether the improvement in accuracy is enough to warrant implementing a much more complicated model in production.
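To make that trade-off a bit more concrete, here is a minimal sketch of one way to frame the decision. It assumes you already have per-fold scores (for example, accuracy) for a simple and a complex model from the same cross-validation splits. The function name `worth_the_complexity`, the `min_gain` threshold, and the "gain must exceed its own fold-to-fold spread" check are all illustrative assumptions on my part, not an established rule.

```python
from statistics import mean, stdev


def worth_the_complexity(simple_scores, complex_scores, min_gain=0.01):
    """Rough check of whether a complex model's gain justifies deploying it.

    Both arguments are per-fold scores (e.g. accuracy) from the same CV splits.
    `min_gain` is a hypothetical threshold: the smallest average improvement
    you would accept in exchange for the extra production complexity.
    """
    if len(simple_scores) != len(complex_scores) or len(simple_scores) < 2:
        raise ValueError("need matching per-fold scores from at least two folds")

    # Improvement of the complex model over the simple one, fold by fold.
    gains = [c - s for s, c in zip(simple_scores, complex_scores)]
    avg_gain = mean(gains)
    noise = stdev(gains)  # fold-to-fold spread of the improvement

    # Assumed heuristic: the gain must be large enough to matter
    # and larger than its own fold-to-fold variation.
    return avg_gain >= min_gain and avg_gain > noise


# Made-up scores, for illustration only.
regression_scores = [0.81, 0.79, 0.80, 0.82, 0.80]
small_gain_scores = [0.82, 0.79, 0.80, 0.83, 0.80]
large_gain_scores = [0.86, 0.85, 0.84, 0.88, 0.85]

print(worth_the_complexity(regression_scores, small_gain_scores))
print(worth_the_complexity(regression_scores, large_gain_scores))
```

In practice the threshold would depend on what an extra point of accuracy is worth for your problem and on how costly the more complex model is to run and maintain.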

Therefore, the recommendation here is to start with a regression model, where you carefully examine your input data and decide how to engineer your features to get a good result. After that, move to a more advanced machine learning (ML) or deep learning (DL) model.

In terms of choosing which model to try next after a regression model, a general rule of thumb is to use ML models (like random forest or XGBoost) for structured, tabular data and DL models for unstructured data (such as images or text documents).[6]

It is worth keeping in mind, though, that there are many datasets for which either an ML or a DL model would give good results. So, it is probably worth trying a model like XGBoost first, with the appropriate feature engineering, and then moving on to a neural network and comparing the results.
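As a rough illustration of this "start simple, then add complexity" workflow, here is a short sketch using scikit-learn. A few assumptions to flag: the data is synthetic (generated with `make_classification`), scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example needs only one library, and the helper name `compare_models` is my own. A neural network could be added as one more entry in the dictionary and scored the same way.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, cv=5):
    """Return the mean cross-validated ROC AUC for models of increasing complexity."""
    models = {
        # Trivial baseline: always predicts the class prior.
        "majority_class": DummyClassifier(strategy="prior"),
        # Simple model: scaled features plus logistic regression.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        # More complex model: gradient-boosted trees (stand-in for XGBoost).
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
        for name, model in models.items()
    }


# Synthetic binary classification data, for illustration only.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

scores = compare_models(X, y)
for name, auc in scores.items():
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```

Scoring every model on the same folds with the same metric is what makes the comparison meaningful: the more complex model only earns its place if it clearly beats the regression baseline, not just the trivial one.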

In summary, while there has been a lot of amazing progress in the world of neural networks and DL, it is important to make sure you are using the best solution for your problem at hand. While it is tempting to go after the latest and greatest approach, sometimes the best solution is the simplest one.

Python Corner

This article points out a potential security issue with Python packages, or rather with the process of installing and updating them. According to the article, there are a staggering ~300,000 packages hosted on PyPI, the Python Package Index! With such a large number of packages, it is easy for someone to sneak in a fake or altered version of a popular package, such as a package called beauitfulsoup4 instead of the popular beautifulsoup4.

At this point, it seems that most of the fake packages were submitted to PyPI to make a point about potential security issues. But it is worth keeping this in mind and making sure that, when you install a new package, you are getting the correct one.
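One lightweight way to catch this kind of slip is to compare a package name against a list of names you trust before installing it. The sketch below uses only the standard library's `difflib`. The `KNOWN_PACKAGES` list, the `possible_typosquat` helper, and the 0.85 similarity threshold are all hypothetical choices for illustration; this is not something pip or PyPI does for you.

```python
from difflib import SequenceMatcher

# Hypothetical allow-list of package names you actually intend to use.
KNOWN_PACKAGES = ["beautifulsoup4", "numpy", "pandas", "requests", "scikit-learn"]


def possible_typosquat(name, known=KNOWN_PACKAGES, threshold=0.85):
    """Return the known package that `name` closely resembles, or None.

    An exact match is treated as safe. A near miss (similarity at or above
    `threshold`, an assumed cut-off) is flagged as a possible typo or fake.
    """
    candidate = name.lower()
    if candidate in known:
        return None

    def similarity(pkg):
        return SequenceMatcher(None, candidate, pkg).ratio()

    best = max(known, key=similarity)
    return best if similarity(best) >= threshold else None


for requested in ["beauitfulsoup4", "beautifulsoup4", "flask"]:
    lookalike = possible_typosquat(requested)
    if lookalike:
        print(f"'{requested}' looks suspiciously like '{lookalike}'")
    else:
        print(f"'{requested}' does not resemble a different known package")
```

A check like this only covers names on your own list, so it complements, rather than replaces, reading the package name carefully before running the install command.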

I also want to quickly mention the PyCoders newsletter, which is a weekly newsletter about Python. This newsletter is where I found this article about Python packages. It is worth subscribing to if you are an active Python user.

https://towardsdatascience.com/data-science-trends-based-on-4-years-of-kaggle-surveys-60878d68551f

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7382770/

https://pubmed.ncbi.nlm.nih.gov/25081718/

https://pubmed.ncbi.nlm.nih.gov/30328607/

https://pubmed.ncbi.nlm.nih.gov/30763612/

https://arxiv.org/abs/1911.07698

https://www.rtinsights.com/machine-learning-vs-deep-learning-which-is-best/

Featuring Data