Statistics for Data Science

Statistical knowledge is immensely valuable to our work in data science. Indeed, the field of statistics has helped shape the realities we work in, including the software tools and algorithms we have available. Those with deep statistical knowledge play key roles in shaping the future of the field.

However, the individual practitioner in data science need not have a PhD in statistics or mathematics to be successful. Indeed, our everyday use of statistics proper is often strategic, empowered by software, and requires more of an intuitive grasp of key statistical concepts than deep knowledge.

As Vincent Granville wrote in 2014:

“Data science barely uses statistical science and techniques.”

He goes on to clarify:

“The truth is actually more nuanced …”

In the ensuing post he lists a series of new statistical concepts that are frequently useful in data science, followed by a series of old statistical concepts that are also often useful.

Then he follows up with this sentence:

“From a typical 600-pages textbook on statistics, about 20 pages are relevant to data science, and these 20 pages can be compressed in 0.25 page.”

Granville’s post is worth a read. He goes into some reasons why old-school statistics proper is becoming less useful in the world most of us live and work in, while machine learning techniques are becoming much more useful.

I will add that Granville’s post, while framed somewhat controversially, summarizes the realities of data science work fairly well. There are several statistical concepts that are indeed useful when doing this work. But we have powerful software tools ready at hand to do much of that work, often using new techniques that yield better predictive results than older statistical approaches do.

Yes, we still often need to understand the meaning and implications of a range of statistical insights in relation to our data. But we can get lots of great work done with an intuitive understanding of those concepts. Thus, we can start with statistical fundamentals, use them as needed, and then expand our knowledge when the situation calls for it.

One last sentence from Granville summarizes this data-sciencey attitude toward stats:

“I believe that you can explain the concept of random variable and distribution (at least what you need to understand to practice data science) in about 4 lines, rather than 150 pages. The idea is to explain it in plain English with a few examples.”
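To give a feel for how compact such an explanation can be, here is a minimal Python sketch of my own (not Granville's) using only the standard library. It treats the outcome of a die roll as a random variable and estimates its distribution by counting how often each face shows up. The helper name `empirical_distribution`, the seed, and the number of rolls are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def empirical_distribution(draws):
    """Return the share of draws that landed on each observed value."""
    counts = Counter(draws)
    total = len(draws)
    return {value: counts[value] / total for value in sorted(counts)}


# Fixed seed (an arbitrary choice) so the run is repeatable.
rng = random.Random(0)

# A random variable: the outcome of one roll of a fair six-sided die.
rolls = [rng.randint(1, 6) for _ in range(10_000)]

# Its distribution: how often each outcome occurs across many rolls.
dist = empirical_distribution(rolls)

for face, share in dist.items():
    print(face, round(share, 3))
```

Each share should land close to 1/6, which is the plain-English point: a random variable is a quantity whose value depends on chance, and its distribution tells you how likely each value is.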

Granville expressed an intention to draft a “statistics cheat sheet for data scientists,” and to do it in a single page. If he ever wrote that, I’ve not found it. It turns out he wrote a Machine Learning Cheat Sheet that covers many data sciencey things, but not statistics.

Meanwhile, in 2017, O’Reilly published a nice handbook of 318 well-organized, succinct and readable pages to fill the gap: Practical Statistics for Data Scientists, by Peter Bruce and Andrew Bruce. I recommend it.

For most of us coming to data science from a variety of fields, a work like this does a great job of bridging the gap.
