Statistics for Data Science

This post is just a stub at the moment. I intend to return and develop it in the coming months. 

For now, let me feature a good quote and then provide a recommended source.

The quote is from Vincent Granville, who wrote this in 2014:

Data science barely uses statistical science and techniques.

He goes on to clarify:

The truth is actually more nuanced …

And in the ensuing post he lists a series of new statistical concepts that are frequently useful in data science, followed by a series of old statistical concepts that are also often useful.

But then he follows up with this sentence: 

From a typical 600-pages textbook on statistics, about 20 pages are relevant to data science, and these 20 pages can be compressed in 0.25 page.

Granville’s post is worth a read, as he goes into some reasons why old-school statistics proper is increasingly less useful in the world most of us live and work in, while machine learning techniques are becoming much more useful.

And I will add: Granville’s post, while framed somewhat controversially, fairly well summarizes the realities of data science work. There are several statistical concepts that are indeed useful when doing this work. But we have powerful software tools ready at hand to do much of that work — often using new techniques that yield better predictive results than older statistical approaches do.

Yes, we still often need to understand the meaning and implications of a range of statistical insights in relation to our data. But we can get lots of great work done with an intuitional understanding of those concepts. Thus, we can start with statistical fundamentals, use them as needed, and then expand our knowledge when the situation calls for it.

One last sentence from Granville summarizes this data-sciencey attitude toward stats:

I believe that you can explain the concept of random variable and distribution (at least what you need to understand to practice data science) in about 4 lines, rather than 150 pages. The idea is to explain it in plain English with a few examples.

Thus, Granville suggests that when time allows he may write a “statistics cheat sheet for data scientists,” and do it in a single page. (Turns out he wrote a Machine Learning Cheat Sheet that covers many data sciencey things, but not statistics.)

Meanwhile, in 2017, O’Reilly published a nice handbook of 318 well-organized, succinct and readable pages to fill the gap: Practical Statistics for Data Scientists, by Peter Bruce and Andrew Bruce. I recommend it:

Practical Statistics for Data Science cover

In my view, a work like this does a great job of bridging the gap for those who are coming to data science from a variety of fields.

So there you are. Obviously there are many opportunities for disagreement on these points. But the fact remains, data science uses statistics differently than many other sciences. And there are a huge range of things that we need to know and be good at that lie outside of statistics proper.

That’s it for now.

If time and occasion allow, I’ll circle back and beef this up later. 

On the Differences between Statistics and Machine Learning

Most everyone realizes that statistics and data science share a lot in common. Sometimes it is helpful to understand the differences. While it’s true that data science can’t be done without statistics, it is also true that data science involves a great deal more — statistics plays a significant part in data science’s much larger undertaking.

I intend to update and expand on this post over time. But for now allow me to point to a helpful post that develops this point — and begins to clarify the nature of data science’s “larger undertaking.”

The Difference Between Statistics and Machine Learning

In his post, The Actual Difference Between Statistics and Machine Learning, Matthew Stewart helpfully explains how statistics differs from another key part of the data science toolkit: machine learning. Data science is still a larger than machine learning. But it’s very appropriate to say something very similar about the relationship between the two as we said above: Data science can’t be done without machine learning.

Both statistics and machine learning are part and parcel of the data science toolkit. And each plays a somewhat different role. Explaining the difference is helpful.

Stewart summarizes the difference like this:

  • Statistical modeling aims first and foremost for understanding and explaining relationships between variables. Predictive power is a secondary consideration.
  • Machine learning aims first and foremost for effective prediction. Some machine learning algorithms are easy to interpret, and some are not.

Thus, if you are writing a scientific paper that needs to explain the relationships between variables, statistical modeling is probably the best route.

However, if the point of your work is to produce actionable results that translate into greater efficiency and effectiveness achieving the mission of your organization — machine learning is often the better route.

In Stewart’s own words:

Machine learning is all about results, it is likely working in a company where your worth is characterized solely by your performance. Whereas, statistical modeling is more about finding relationships between variables and the significance of those relationships, whilst also catering for prediction.

He goes further to develop a helpful analogy:

By day, I am an environmental scientist and I work primarily with sensor data. If I am trying to prove that a sensor is able to respond to a certain kind of stimuli (such as a concentration of a gas), then I would use a statistical model to determine whether the signal response is statistically significant. I would try to understand this relationship and test for its repeatability so that I can accurately characterize the sensor response and make inferences based on this data. Some things I might test are whether the response is, in fact, linear, whether the response can be attributed to the gas concentration and not random noise in the sensor, etc.

Statistical analysis is great in such a case. It’s the right tool for the job.

But what if the nature of the problem is slightly different, and the goals are different?

In contrast, I can also get an array of 20 different sensors, and I can use this to try and predict the response of my newly characterized sensor. This may seem a bit strange if you do not know much about sensors, but this is currently an important area of environmental science. A model with 20 different variables predicting the outcome of my sensor is clearly all about prediction, and I do not expect it to be particularly interpretable. This model would likely be something a bit more esoteric like a neural network due to non-linearities arising from chemical kinetics and the relationship between physical variables and gas concentrations. I would like the model to make sense, but as long as I can make accurate predictions I would be pretty happy.

This nails it home nicely. In the case of machine learning, our interest is in the results: How can we make the most accurate predictions? And moreover, do these predictions yield benefits for the mission of our organization?

Perhaps said otherwise, statistics is more about understanding — helping to answer the question, What’s really happening here? Machine learning is more about driving action — helping to answer the question, What can we anticipate next? — and by extension enabling efficient and effective responses.

So that’s a good start on understanding the differences between statistics and data science. There’s more to be said about that …

And I hope to return to develop the rest of this reflection one day soon.