Getting a Grasp on Data Science: David Donoho’s Six-Part Definition

The discipline of data science is notoriously difficult to define, and yet it is perhaps not impossible. I am currently working to gear more of my thought and instruction around Stanford statistician David Donoho’s definition of Greater Data Science (GDS). Here I’ll provide a brief summary of three key contributions of Donoho’s argument, ending with his six-part definition of the discipline.

First, Donoho appropriately connects the current expansive discipline of Data Science to its roots in 50+ years of work by statisticians, beginning with John Tukey in the 1960s. On Donoho’s telling, contemporary Data Science should understand itself not in opposition to the discipline of statistics but as an outgrowth and extension of the long tradition of statistics. More than historical recognition, this extends to appreciating the contemporary relevance of both traditional statistical analysis and predictive modeling with machine learning. Just as traditional statisticians must acknowledge and embrace the importance and relevance of contemporary machine learning methods, “cutting edge” data scientists must appropriately recognize the continued role and relevance of traditional statistical analysis.

Second, Donoho helpfully identifies the Common Task Framework methodology that undergirds the successes of contemporary predictive modeling. This methodology includes (a) a publicly available training dataset, (b) a competitive multi-party approach to predictive modeling, and (c) a scoring referee or system for evaluating the competing models against a test dataset unavailable to the competitors. He cites the Netflix Challenge as a famous example of this approach.
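To make those three ingredients concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything taken from Donoho’s article: the synthetic data, the two toy competitors, and the choice of mean squared error as the referee’s score.

```python
import random
from statistics import mean

random.seed(0)

# Synthetic data (an assumption for illustration): y = 3x + noise.
def make_rows(n):
    rows = []
    for _ in range(n):
        x = random.uniform(0, 10)
        rows.append((x, 3 * x + random.gauss(0, 1)))
    return rows

train = make_rows(200)        # (a) shared with every competitor
hidden_test = make_rows(100)  # held back; only the referee sees it

# (b) Two competing entries, each fit on the training data only.
def fit_mean(rows):
    avg = mean(y for _, y in rows)
    return lambda x: avg

def fit_line(rows):
    mx = mean(x for x, _ in rows)
    my = mean(y for _, y in rows)
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return lambda x: my + slope * (x - mx)

entries = {"mean_baseline": fit_mean(train), "least_squares": fit_line(train)}

# (c) The referee scores each entry on the hidden test set.
def referee_score(model):
    return mean((model(x) - y) ** 2 for x, y in hidden_test)

leaderboard = sorted((referee_score(m), name) for name, m in entries.items())
for score, name in leaderboard:
    print(f"{name}: MSE = {score:.2f}")
```

The point of the sketch is the separation of roles: competitors never touch hidden_test, so the leaderboard reflects performance on data they could not tune against.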

Third, and most importantly, Donoho builds on the work of John Chambers and Bill Cleveland to outline a definition of Greater Data Science (GDS), which includes the following six sub-fields:

Data Exploration and Preparation
Data Representation and Transformation
Computing with Data
Data Modeling
Data Visualization and Presentation
Science about Data Science

This definition is so apt that it seems common sense to those who practice in the field. But the complexity of the work involved in data science has meant that reaching the clarity offered by this definition has not been easy. Not content simply to define, Donoho devotes the remainder of his piece to discussing and illustrating some of the key practices included under each sub-field. I will number each sub-field as he does, using GDS for Greater Data Science.

GDS1: Data Exploration and Preparation. Frequently requiring upwards of 80% of the work involved in data science, this sub-field is too often neglected in the teaching of data science and merits greater attention in the future. It includes the many steps of curating data, dealing with anomalies, and pulling it into the shape needed for analysis.

GDS2: Data Representation and Transformation. This includes the problem of data storage and requires that a data scientist be fluent in current database technologies. As of 2021 that includes SQL and NoSQL databases, distributed (cloud) systems, etc.

GDS3: Computing with Data. This includes necessary knowledge of languages like Python or R and related current software used in preparation, analysis, and modeling, as well as understanding of the workflows used in the development of an analytical process.

GDS4: Data Visualization and Presentation. This sub-field addresses the importance of visual analysis methods, from standard plots used in Exploratory Data Analysis (EDA) to advanced charts used to crystallize understanding of specific important features to interactive data dashboards.

GDS5: Data Modeling. This sub-field should rightfully include both traditional statistical approaches and contemporary predictive modeling with machine learning.

GDS6: Science about Data Science. Key to making data science a true science, science of data science investigates the real-world work of data scientists “in the wild” and contributes to the documentation, description, analysis, and evaluation of those real-world practices, with the express aim of discerning the more fruitful practices that show merit for leading the discipline of data science forward to greater promise and productivity.

Given the sheer complexity of data science and the astounding speed of its ongoing development, it is difficult to overstate the value of Donoho’s six-part definition of the discipline. For myself, I will be contemplating and engaging the implications of this definition for months and years to come.

Allow me simply to recommend Donoho’s article as a read of incredible value, and recommend his helpful discussions of these six sub-fields as a beginning point for others as we work together to take the discipline forward.

SettingWithCopyWarning? Try using .copy()

What’s the deal with the SettingWithCopyWarning?

You may have noticed this popping up on occasion, usually with a pink background:

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

This warning can be a strange one, since it can crop up unexpectedly and sometimes seems (or is) nearly random when it does. Indeed, it is especially confounding when it happens even when we are using the .loc accessor.

Now, the good thing about it is that it is only a warning. The operation you desired to perform most likely worked just fine. But you may be tired of the pink-hued warning cropping up all the time. And in certain circumstances there’s a chance that things may not work as they should.

Here’s a way to grasp the problem and fix it.

What’s Happening

The culprit is typically in a prior step. In my recent experience, these steps have sometimes resulted in the behavior occurring a few steps later:

df = df[['title','date','budget','revenue']]

Or:

df = df[df['budget'] > 0]

It seems rather simple: I want to update the dataframe itself so that it has fewer columns or only a filtered set of records. And I’m overwriting the original dataframe with the new one, assigning it to become the new df.

And then, at a later step, I sometimes start getting the dreaded SettingWithCopyWarning.

Why is this Happening?

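Under certain circumstances, when we select columns or filter rows and save the result over the original variable, pandas may hand back either a *view* of the original data or a copy of it, and it cannot always tell which one we intended. Either way, it keeps track of the fact that the new dataframe was derived from the dataframe as it was before. When we later assign values to it, pandas warns that we may be modifying, in the words of the warning, “a copy of a slice” of the original dataframe rather than the original itself.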

What to Do About It

Here’s a quick and effective way to deal with it. When you store a new version of the dataframe to a variable, chain the .copy() method on the end of the operation. This severs the connection to the original dataframe and makes it an entirely new object.

For example:

df = df[['title','date','budget','revenue']].copy()

Or:

df = df[df['budget'] > 0].copy()

When we use .copy(), pandas builds an entirely new dataframe with its own data and assigns it to df, with no connection to the prior version. (The old dataframe is not wiped immediately; it is released from memory once nothing else refers to it.)

Operations you perform after that point should no longer provoke the dreaded SettingWithCopyWarning.
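Here is a small, self-contained sketch of the pattern. The movie data and the profit column are made up for illustration, and I’ve used separate variable names (movies and funded) only so that the original stays available for comparison. Note too that whether the warning appears without .copy() depends on your pandas version and settings, so the sketch shows only the safe form.

```python
import pandas as pd

# Made-up data for illustration.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [0, 100, 250],
    "revenue": [10, 300, 200],
})

# Filter the rows, then chain .copy() so the result is an independent dataframe.
funded = movies[movies["budget"] > 0].copy()

# Adding a column to the independent copy is unambiguous.
funded["profit"] = funded["revenue"] - funded["budget"]

print(funded)
print("profit" in movies.columns)  # False: the original is untouched
```

In a notebook you would more likely keep the single name df, as in the one-liners above; the effect is the same.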

Try it for yourself. It should help!

References

This article discusses the oddness of the behavior and hopes it will be changed: Views and Copies in pandas — Practical Data Science
pandas.DataFrame.copy() — pandas Documentation

Statistics for Data Science

Statistical knowledge is immensely valuable to our work in data science. Indeed, the field of statistics has helped shape the realities we work in, including the software tools and algorithms we have available. Those with deep statistical knowledge play key roles in shaping the future of the field.

However, the individual practitioner in Data Science need not have a PhD in statistics or mathematics to be successful. Indeed, our everyday use of statistics proper is often strategic, empowered by software, and requires more of an intuitional grasp of key statistical concepts than deep knowledge.

As Vincent Granville wrote in 2014:

Data science barely uses statistical science and techniques.

He goes on to clarify:

The truth is actually more nuanced …

In the ensuing post he lists a series of new statistical concepts that are frequently useful in data science, followed by a series of old statistical concepts that are also often useful.

Then he follows up with this sentence:

From a typical 600-pages textbook on statistics, about 20 pages are relevant to data science, and these 20 pages can be compressed in 0.25 page.

Granville’s post is worth a read, as he goes into some reasons why old-school statistics proper is increasingly less useful in the world most of us live and work in, while machine learning techniques are becoming much more useful.

And I will add: Granville’s post, while framed somewhat controversially, fairly well summarizes the realities of data science work. There are several statistical concepts that are indeed useful when doing this work. But we have powerful software tools ready at hand to do much of that work — often using new techniques that yield better predictive results than older statistical approaches do.

Yes, we still often need to understand the meaning and implications of a range of statistical insights in relation to our data. But we can get lots of great work done with an intuitional understanding of those concepts. Thus, we can start with statistical fundamentals, use them as needed, and then expand our knowledge when the situation calls for it.

One last sentence from Granville summarizes this data-sciencey attitude toward stats:

I believe that you can explain the concept of random variable and distribution (at least what you need to understand to practice data science) in about 4 lines, rather than 150 pages. The idea is to explain it in plain English with a few examples.

Granville expressed an intention to draft a “statistics cheat sheet for data scientists,” and do it in a single page. If he ever wrote that, I’ve not found it. Turns out he wrote a Machine Learning Cheat Sheet that covers many data sciencey things, but not statistics.

Meanwhile, in 2017, O’Reilly published a nice handbook of 318 well-organized, succinct and readable pages to fill the gap: Practical Statistics for Data Scientists, by Peter Bruce and Andrew Bruce. I recommend it.

A work like this does a great job of bridging the gap for those of us who are coming to data science from a variety of fields.

On the Differences between Statistics and Machine Learning

In his post, The Actual Difference Between Statistics and Machine Learning, Matthew Stewart helpfully explains how statistical analysis differs from machine learning. Data science is still larger than machine learning. But it’s appropriate to say of machine learning something very similar to what we said above about statistics: data science can’t be done without it.

Both statistics and machine learning are part and parcel of the data science toolkit. And each plays a somewhat different role. Explaining the difference is helpful.

Stewart summarizes the difference like this:

Statistical modeling aims first and foremost for understanding and explaining relationships between variables. Predictive power is a secondary consideration.
Machine learning aims first and foremost for effective prediction. Some machine learning algorithms are easy to interpret, and some are not.

Thus, if you are writing a scientific paper that needs to explain the relationships between variables, statistical modeling is probably the best route.

However, if the point of your work is to produce actionable results that translate into greater efficiency and effectiveness in achieving the mission of your organization, machine learning is often the better route.

In Stewart’s own words:

Machine learning is all about results, it is likely working in a company where your worth is characterized solely by your performance. Whereas, statistical modeling is more about finding relationships between variables and the significance of those relationships, whilst also catering for prediction.

He goes further to develop a helpful analogy:

By day, I am an environmental scientist and I work primarily with sensor data. If I am trying to prove that a sensor is able to respond to a certain kind of stimuli (such as a concentration of a gas), then I would use a statistical model to determine whether the signal response is statistically significant. I would try to understand this relationship and test for its repeatability so that I can accurately characterize the sensor response and make inferences based on this data. Some things I might test are whether the response is, in fact, linear, whether the response can be attributed to the gas concentration and not random noise in the sensor, etc.

Statistical analysis is great in such a case. It’s the right tool for the job.

But what if the nature of the problem is slightly different, and the goals are different?

In contrast, I can also get an array of 20 different sensors, and I can use this to try and predict the response of my newly characterized sensor. This may seem a bit strange if you do not know much about sensors, but this is currently an important area of environmental science. A model with 20 different variables predicting the outcome of my sensor is clearly all about prediction, and I do not expect it to be particularly interpretable. This model would likely be something a bit more esoteric like a neural network due to non-linearities arising from chemical kinetics and the relationship between physical variables and gas concentrations. I would like the model to make sense, but as long as I can make accurate predictions I would be pretty happy.

That brings it home nicely. In the case of machine learning, our interest is in the results: How can we make the most accurate predictions? And moreover, do these predictions yield benefits for the mission of our organization?

Perhaps said otherwise, statistics is more about understanding — helping to answer the question, What’s really happening here? Machine learning is more about driving action — helping to answer the question, What can we anticipate next? — and by extension enabling efficient and effective responses.

Tom Khabaza’s Nine Laws of Data Mining

Those who work in data mining or predictive analytics are familiar with the CRISP-DM process. Metaphorically, if not literally, that process description is taped to our wall. Tom Khabaza’s Nine Laws of Data Mining should be taped up right next to it.

Khabaza has published those laws as a series of blog posts, here. For each law, he has provided a short name, followed by a one-sentence summary, supported by a few paragraphs of explanation.

The value of these laws is that they help prepare us for what to expect as we do the work — and then they remind us of what we should have expected if we occasionally forget!

As I am a fan of brevity, I’m creating this post as a list of the single-sentence summaries. Occasionally I’ll add a short clarifying note. Here they are:

Tom Khabaza’s Nine Laws of Data Mining

Business objectives are the origin of every data mining solution.
Business knowledge is central to every step of the data mining process.
Data preparation is more than half of every data mining process.
The right model for a given application can only be discovered by experiment (aka “There is No Free Lunch for the Data Miner,” or NFL-DM).
There are always patterns (aka “Watkins’ Law”).
Data mining amplifies perception in the business domain.
Prediction increases information locally by generalization.
The value of data mining results is not determined by the accuracy or stability of predictive models. (Rather, their value is found in more effective action and improved business strategy.)
All patterns are subject to change. (Thus, data mining is not a once-and-done kind of undertaking.)

These laws, as Khabaza points out, are not telling us what we should do. Rather they are “simple truths,” describing brute facts that give shape to the landscape in which data mining is done. Their truth is empirical, discovered and verified by those who’ve been doing the work. So it’s best to keep these truths in mind and adapt our efforts accordingly, lest we pay the price for failing to acknowledge reality as it is.

If you’re intrigued, and want to read further, view Khabaza’s full post here. His exposition of these points is more than worth the time!

Two Cheers for Penguins Data!!

I’m excited about this penguins data set, which has just been made publicly available. This will be much more fun for student projects than the old standard iris data set.

The data is from a published study on Antarctic penguins. It offers great opportunities for regression analysis, cluster analysis, etc. Here are two sample charts from the GitHub Readme:

What’s a culmen, you may ask? It’s the upper ridge of a bird’s bill, and they’ve illustrated that nicely:

Links and Credits

The data set is available on GitHub here: https://github.com/allisonhorst/penguins

A CSV file of the full data set is available in the data-raw sub-directory. Here is a direct link. (You can view the raw version and then save it as a CSV file from your browser.)

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The data were used in the published study, which is freely available here:

Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

Why does Excel keep mangling my date formats! When my date range spans multiple centuries …

When working with a date range that spans multiple centuries (for instance, late 1800s to present), it’s important to know a few things before viewing or saving the data in Excel. (I’m currently working with Excel 365 for Mac and Excel 2016 for Windows.)

Suppose you’re working with data stored in a CSV file and want to examine it in Excel. Here is a short list of things to watch out for:

Excel for Mac automatically formats dates in m/d/yy format, shortening years to two digits in the process. (Thus 1915-02-08 becomes 2/8/15!) If you then save back to CSV, it will overwrite the four-digit years with two-digit years, thereby ruining your date fields, as there will be no record of which century each date belongs to. You’ll need to go back and recover four-digit years from your source. This is bad.
Excel for Windows defaults to m/d/yyyy format. This is not so bad, as the full four-digit year values are maintained.
Neither Excel for Mac nor Excel for Windows recognizes dates before 1900, instead treating such dates as text. (Thus 1898-01-01 remains ‘1898-01-01’, as text.) On the plus side, it does not change the formatting of these dates.
For the above reasons, if you view date fields in Excel for Mac or Windows, it makes good sense to immediately format your dates to yyyy-mm-dd (following the international standard for date formats: ISO 8601). This requires using custom formatting in Excel. But it’s effective and can save your bacon. (Plus, it jibes with Python pandas and R.)
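To see why the two-digit year is so damaging, here is a short sketch using only Python’s standard library. It imitates the m/d/yy shortening by hand (it does not call Excel), then reads the shortened text back in. Python’s %y directive follows the common convention of mapping 00–68 to the 2000s and 69–99 to the 1900s, so the 1915 date comes back a century late, while the ISO 8601 text survives the round trip.

```python
from datetime import date, datetime

original = date(1915, 2, 8)

# ISO 8601 text keeps the full four-digit year.
iso_text = original.isoformat()                                # '1915-02-08'

# Imitate the m/d/yy display format, which drops the century.
short_text = f"{original.month}/{original.day}/{original:%y}"  # '2/8/15'

# Reading the short form back in requires guessing the century.
reparsed = datetime.strptime(short_text, "%m/%d/%y").date()

print(iso_text, "->", date.fromisoformat(iso_text))  # 1915-02-08 -> 1915-02-08
print(short_text, "->", reparsed)                    # 2/8/15 -> 2015-02-08
```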

To reformat dates in ISO 8601 format in Excel for Windows:

Go to Format Cells and select the Number tab.
Then use the Custom category, and type in the format code: yyyy-mm-dd

Reformat dates to ISO 8601 yyyy-mm-dd in Excel for Windows

In Excel for Mac, the process is similar, but the option we need is (currently) available under the Date category:

Go to Format Cells and select the Number tab.
Then use the Date category, and select the option starting with a four-digit year, followed by a two-digit month and two-digit day, with hyphen separators. (Excel for Mac currently displays this with the sample date: 2012-03-14.)
Alternatively, do as in Windows Excel, and enter it as your own Custom format: yyyy-mm-dd.

Reformat dates to ISO 8601 yyyy-mm-dd in Excel for Mac

For Further Reading

The data revolution is now transforming the world of finance

This article from TechRepublic is worth a read. In summary: The data revolution is now transforming the world of finance. A recent Deloitte survey reveals that traditional roles are being automated. To be a human working in finance, you need skills in data science, analytics, and visualization. More than manipulating spreadsheets, you need to create business value with data-informed innovations.

The finance robots are coming — TechRepublic.com

Once again, the NY Times demonstrates the value of interactive data visualization

This impressive interactive data visualization demonstrates the value of the format. More than merely interesting, or intriguing, or even fun — it massively amplifies the communicative power of its subject matter.

Check it out:

How to Cut U.S. Emissions Faster? Do What These Countries Are Doing.
By Brad Plumer and Blacki Migliozzi — FEB. 13, 2019

Bitcoin: Birth Growth and Rise — a data visualization

I’ve just published a Tableau story on the birth, growth, and rise of Bitcoin.

I would love feedback and recommendations, as I intend to develop this project over time.

Bitcoin: Growth, and Rise – Tableau Public

data enhanced

website of David Cochran, dean at Newman University, data science nerd, etc.

Category: Data Science
