Statistics for Data Science

This post is just a stub at the moment. I intend to return and develop it in the coming months. 

For now, let me feature a good quote and then provide a recommended source.

The quote is from Vincent Granville, who wrote this in 2014:

Data science barely uses statistical science and techniques.

He goes on to clarify:

The truth is actually more nuanced …

And in the ensuing post he lists a series of new statistical concepts that are frequently useful in data science, followed by a series of old statistical concepts that are also often useful.

But then he follows up with this sentence: 

From a typical 600-pages textbook on statistics, about 20 pages are relevant to data science, and these 20 pages can be compressed in 0.25 page.

Granville’s post is worth a read, as he goes into some reasons why old-school statistics proper is increasingly less useful in the world most of us live and work in, while machine learning techniques are becoming much more useful.

And I will add: Granville’s post, while framed somewhat controversially, fairly well summarizes the realities of data science work. There are several statistical concepts that are indeed useful when doing this work. But we have powerful software tools ready at hand to do much of that work — often using new techniques that yield better predictive results than older statistical approaches do.

Yes, we still often need to understand the meaning and implications of a range of statistical insights in relation to our data. But we can get lots of great work done with an intuitional understanding of those concepts. Thus, we can start with statistical fundamentals, use them as needed, and then expand our knowledge when the situation calls for it.

One last sentence from Granville summarizes this data-sciencey attitude toward stats:

I believe that you can explain the concept of random variable and distribution (at least what you need to understand to practice data science) in about 4 lines, rather than 150 pages. The idea is to explain it in plain English with a few examples.

Thus, Granville suggests that when time allows he may write a “statistics cheat sheet for data scientists,” and do it in a single page. (Turns out he wrote a Machine Learning Cheat Sheet that covers many data sciencey things, but not statistics.)

Meanwhile, in 2017, O’Reilly published a nice handbook of 318 well-organized, succinct and readable pages to fill the gap: Practical Statistics for Data Scientists, by Peter Bruce and Andrew Bruce. I recommend it:

Practical Statistics for Data Science cover

In my view, a work like this does a great job of bridging the gap for those who are coming to data science from a variety of fields.

So there you are. Obviously there are many opportunities for disagreement on these points. But the fact remains, data science uses statistics differently than many other sciences. And there are a huge range of things that we need to know and be good at that lie outside of statistics proper.

That’s it for now.

If time and occasion allow, I’ll circle back and beef this up later. 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s