On the dubious (mis)use of data

This is a point well made:

What’s up hinge lover? from r/BrandNewSentence

On the Differences between Statistics and Machine Learning

In his post, The Actual Difference Between Statistics and Machine Learning, Matthew Stewart helpfully explains how statistical analysis differs from machine learning. Data science is still a larger than machine learning. But it’s very appropriate to say something very similar about the relationship between the two as we said above: Data science can’t be done without machine learning.

Both statistics and machine learning are part and parcel of the data science toolkit. And each plays a somewhat different role. Explaining the difference is helpful.

Stewart summarizes the difference like this:

Statistical modeling aims first and foremost for understanding and explaining relationships between variables. Predictive power is a secondary consideration.
Machine learning aims first and foremost for effective prediction. Some machine learning algorithms are easy to interpret, and some are not.

Thus, if you are writing a scientific paper that needs to explain the relationships between variables, statistical modeling is probably the best route.

However, if the point of your work is to produce actionable results that translate into greater efficiency and effectiveness achieving the mission of your organization — machine learning is often the better route.

In Stewart’s own words:

Machine learning is all about results, it is likely working in a company where your worth is characterized solely by your performance. Whereas, statistical modeling is more about finding relationships between variables and the significance of those relationships, whilst also catering for prediction.

He goes further to develop a helpful analogy:

By day, I am an environmental scientist and I work primarily with sensor data. If I am trying to prove that a sensor is able to respond to a certain kind of stimuli (such as a concentration of a gas), then I would use a statistical model to determine whether the signal response is statistically significant. I would try to understand this relationship and test for its repeatability so that I can accurately characterize the sensor response and make inferences based on this data. Some things I might test are whether the response is, in fact, linear, whether the response can be attributed to the gas concentration and not random noise in the sensor, etc.

Statistical analysis is great in such a case. It’s the right tool for the job.

But what if the nature of the problem is slightly different, and the goals are different?

In contrast, I can also get an array of 20 different sensors, and I can use this to try and predict the response of my newly characterized sensor. This may seem a bit strange if you do not know much about sensors, but this is currently an important area of environmental science. A model with 20 different variables predicting the outcome of my sensor is clearly all about prediction, and I do not expect it to be particularly interpretable. This model would likely be something a bit more esoteric like a neural network due to non-linearities arising from chemical kinetics and the relationship between physical variables and gas concentrations. I would like the model to make sense, but as long as I can make accurate predictions I would be pretty happy.

That brings it home nicely. In the case of machine learning, our interest is in the results: How can we make the most accurate predictions? And moreover, do these predictions yield benefits for the mission of our organization?

Perhaps said otherwise, statistics is more about understanding — helping to answer the question, What’s really happening here? Machine learning is more about driving action — helping to answer the question, What can we anticipate next? — and by extension enabling efficient and effective responses.

Tom Khabaza’s Nine Laws of Data Mining

Those who work in data mining or predictive analytics are familiar with the CRISP-DM process. Metaphorically, if not literally, that process description is taped to our wall. Tom Khabaza’s Nine Laws of Data Mining should be taped up right next to it.

Khabaza has published those laws as a series of blog posts, here. For each law, he has provided a short name, followed by a one-sentence summary, supported by a few paragraphs of explanation.

The value of these laws is that they help prepare us for what to expect as we do the work — and then they remind us of what we should have expected if we occasionally forget!

As I am a fan of brevity, I’m creating this post as a list of the single-sentence summaries. Occasionally I’ll add a short clarifying note. Here they are:

Tom Khabaza’s Nine Laws of Data Mining

Business objectives are the origin of every data mining solution.
Business knowledge is central to every step of the data mining process.
Data preparation is more than half of every data mining process.
The right model for a given application can only be discovered by experiment (aka “There is No Free Lunch for the Data Miner” NFL-DM).
There are always patterns (aka “Watkin’s Law).
Data mining amplifies perception in the business domain.
Prediction increases information locally by generalization.
The value of data mining results is not determined by the accuracy or stability of predictive models. (Rather, their value is found in more effective action and improved business strategy.)
All patterns are subject to change. (Thus, data mining is not a once-and-done kind of undertaking.)

These laws, as Khabaza points out, are not telling us what we should do. Rather they are “simple truths,” describing brute facts that give shape to the landscape in which data mining is done. Their truth is empirical, discovered and verified by those who’ve been doing the work. So it’s best to keep these truths in mind and adapt our efforts accordingly, lest we pay the price for failing to acknowledge reality as it is.

If you’re intrigued, and want to read further, view Khabaza’s full post here. His exposition of these points is more than worth the time!

data enhanced

website of David Cochran, dean at Newman University, data science nerd, etc.

Month: July 2020

On the dubious (mis)use of data

On the Differences between Statistics and Machine Learning

Tom Khabaza’s Nine Laws of Data Mining

Tom Khabaza’s Nine Laws of Data Mining