On the Differences between Statistics and Machine Learning

In his post, The Actual Difference Between Statistics and Machine Learning, Matthew Stewart helpfully explains how statistical analysis differs from machine learning. Data science is still a larger than machine learning. But it’s very appropriate to say something very similar about the relationship between the two as we said above: Data science can’t be done without machine learning.

Both statistics and machine learning are part and parcel of the data science toolkit. And each plays a somewhat different role. Explaining the difference is helpful.

Stewart summarizes the difference like this:

  • Statistical modeling aims first and foremost for understanding and explaining relationships between variables. Predictive power is a secondary consideration.
  • Machine learning aims first and foremost for effective prediction. Some machine learning algorithms are easy to interpret, and some are not.

Thus, if you are writing a scientific paper that needs to explain the relationships between variables, statistical modeling is probably the best route.

However, if the point of your work is to produce actionable results that translate into greater efficiency and effectiveness achieving the mission of your organization — machine learning is often the better route.

In Stewart’s own words:

Machine learning is all about results, it is likely working in a company where your worth is characterized solely by your performance. Whereas, statistical modeling is more about finding relationships between variables and the significance of those relationships, whilst also catering for prediction.

He goes further to develop a helpful analogy:

By day, I am an environmental scientist and I work primarily with sensor data. If I am trying to prove that a sensor is able to respond to a certain kind of stimuli (such as a concentration of a gas), then I would use a statistical model to determine whether the signal response is statistically significant. I would try to understand this relationship and test for its repeatability so that I can accurately characterize the sensor response and make inferences based on this data. Some things I might test are whether the response is, in fact, linear, whether the response can be attributed to the gas concentration and not random noise in the sensor, etc.

Statistical analysis is great in such a case. It’s the right tool for the job.

But what if the nature of the problem is slightly different, and the goals are different?

In contrast, I can also get an array of 20 different sensors, and I can use this to try and predict the response of my newly characterized sensor. This may seem a bit strange if you do not know much about sensors, but this is currently an important area of environmental science. A model with 20 different variables predicting the outcome of my sensor is clearly all about prediction, and I do not expect it to be particularly interpretable. This model would likely be something a bit more esoteric like a neural network due to non-linearities arising from chemical kinetics and the relationship between physical variables and gas concentrations. I would like the model to make sense, but as long as I can make accurate predictions I would be pretty happy.

That brings it home nicely. In the case of machine learning, our interest is in the results: How can we make the most accurate predictions? And moreover, do these predictions yield benefits for the mission of our organization?

Perhaps said otherwise, statistics is more about understanding — helping to answer the question, What’s really happening here? Machine learning is more about driving action — helping to answer the question, What can we anticipate next? — and by extension enabling efficient and effective responses.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s