Most people recognize that statistics and data science have a lot in common, but it helps to understand the differences. Data science can’t be done without statistics, yet it also involves a great deal more: statistics plays a significant part in data science’s much larger undertaking.
I intend to update and expand on this post over time. For now, allow me to highlight a helpful post that develops this idea and begins to clarify the nature of data science’s “larger undertaking.”
The Difference Between Statistics and Machine Learning
In his post, The Actual Difference Between Statistics and Machine Learning, Matthew Stewart helpfully explains how statistics differs from another key part of the data science toolkit: machine learning. Data science is still larger than machine learning, but the relationship between the two is much like the one described above: data science can’t be done without machine learning.
Both statistics and machine learning are part and parcel of the data science toolkit. And each plays a somewhat different role. Explaining the difference is helpful.
Stewart summarizes the difference like this:
- Statistical modeling aims first and foremost for understanding and explaining relationships between variables. Predictive power is a secondary consideration.
- Machine learning aims first and foremost for effective prediction. Some machine learning algorithms are easy to interpret, and some are not.
Thus, if you are writing a scientific paper that needs to explain the relationships between variables, statistical modeling is probably the best route.
However, if the point of your work is to produce actionable results that translate into greater efficiency and effectiveness in achieving your organization’s mission, machine learning is often the better route.
In Stewart’s own words:
Machine learning is all about results, it is likely working in a company where your worth is characterized solely by your performance. Whereas, statistical modeling is more about finding relationships between variables and the significance of those relationships, whilst also catering for prediction.
He goes on to develop a helpful example:
By day, I am an environmental scientist and I work primarily with sensor data. If I am trying to prove that a sensor is able to respond to a certain kind of stimuli (such as a concentration of a gas), then I would use a statistical model to determine whether the signal response is statistically significant. I would try to understand this relationship and test for its repeatability so that I can accurately characterize the sensor response and make inferences based on this data. Some things I might test are whether the response is, in fact, linear, whether the response can be attributed to the gas concentration and not random noise in the sensor, etc.
Statistical analysis is great in such a case. It’s the right tool for the job.
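As a minimal sketch of the kind of check Stewart describes, the Python below fits a straight line to sensor signal versus gas concentration and uses a permutation test to ask whether the fitted slope could plausibly be random noise. The data is synthetic, and `fit_line` and `permutation_p_value` are hypothetical helpers invented for this illustration. The permutation test is only one simple option, not a method Stewart’s post prescribes.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation test of the null hypothesis that the slope is zero."""
    rng = random.Random(seed)
    observed = abs(fit_line(x, y)[0])
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real link between x and y.
        rng.shuffle(shuffled)
        if abs(fit_line(x, shuffled)[0]) >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Synthetic data, made up for illustration: signal = 0.5 * concentration + noise.
rng = random.Random(42)
concentration = [float(c) for c in range(1, 31)]
signal = [0.5 * c + rng.gauss(0.0, 1.0) for c in concentration]

slope, intercept = fit_line(concentration, signal)
p_value = permutation_p_value(concentration, signal)
print(f"slope={slope:.3f} intercept={intercept:.3f} p={p_value:.4f}")
```

A small p-value here suggests the response tracks gas concentration rather than noise, and the slope and intercept are themselves the interpretable output: they describe how the sensor responds.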
But what if the nature of the problem is slightly different, and the goals are different?
In contrast, I can also get an array of 20 different sensors, and I can use this to try and predict the response of my newly characterized sensor. This may seem a bit strange if you do not know much about sensors, but this is currently an important area of environmental science. A model with 20 different variables predicting the outcome of my sensor is clearly all about prediction, and I do not expect it to be particularly interpretable. This model would likely be something a bit more esoteric like a neural network due to non-linearities arising from chemical kinetics and the relationship between physical variables and gas concentrations. I would like the model to make sense, but as long as I can make accurate predictions I would be pretty happy.
This drives the point home nicely. In the case of machine learning, our interest is in the results: How can we make the most accurate predictions? And moreover, do these predictions yield benefits for the mission of our organization?
Put another way, statistics is more about understanding. It helps answer the question, What’s really happening here? Machine learning is more about driving action. It helps answer the question, What can we anticipate next? By extension, it enables efficient and effective responses.
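The prediction-first approach from Stewart’s 20-sensor example can be sketched in the same spirit. The Python below is a rough illustration only: it simulates the readings, swaps a simple k-nearest-neighbors regressor in for the neural network Stewart mentions, and judges the model purely by its error on held-out data against a predict-the-mean baseline. All data, constants, and names (`knn_predict`, `rmse`, `simulate_sample`) are assumptions made up for this sketch.

```python
import math
import random

N_SENSORS = 20  # size of the sensor array in Stewart's example

def knn_predict(train_X, train_y, query, k=10):
    """Average the targets of the k training rows closest to the query."""
    nearest = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )[:k]
    return sum(target for _, target in nearest) / len(nearest)

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

def simulate_sample(rng):
    """Hypothetical data: 20 noisy readings and a nonlinear target response."""
    gas = rng.uniform(0.0, 1.0)  # hidden gas concentration
    readings = [gas + rng.gauss(0.0, 0.05) for _ in range(N_SENSORS)]
    target = math.sin(3.0 * gas) + rng.gauss(0.0, 0.05)
    return readings, target

rng = random.Random(0)
samples = [simulate_sample(rng) for _ in range(400)]
train, test = samples[:300], samples[300:]
train_X = [x for x, _ in train]
train_y = [y for _, y in train]
test_X = [x for x, _ in test]
test_y = [y for _, y in test]

# Score only on data the model has never seen.
model_preds = [knn_predict(train_X, train_y, x) for x in test_X]
baseline_preds = [sum(train_y) / len(train_y)] * len(test_y)

model_rmse = rmse(model_preds, test_y)
baseline_rmse = rmse(baseline_preds, test_y)
print(f"kNN RMSE={model_rmse:.3f} baseline RMSE={baseline_rmse:.3f}")
```

Nothing in this model explains how any single sensor relates to the target. The only question asked of it is whether its held-out error beats the baseline, which is the results-first stance described above.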
So that’s a good start on understanding the differences between statistics and data science. There’s more to be said about that …
And I hope to return to develop the rest of this reflection one day soon.