This hit my funny bone.
Image Credit: TayTayFan13 – Reddit
Statistical knowledge is immensely valuable to our work in data science. Indeed, the field of statistics has helped shape the realities we work in, including the software tools and algorithms we have available. Those with deep statistical knowledge play key roles in shaping the future of the field.
However, the individual practitioner in Data Science need not have a PhD in statistics or mathematics to be successful. Indeed, our everyday use of statistics proper is often strategic, empowered by software, and requires more of an intuitional grasp of key statistical concepts than deep knowledge.
Data science barely uses statistical science and techniques.
He goes on to clarify:
The truth is actually more nuanced …
In the ensuing post he lists a series of new statistical concepts that are frequently useful in data science, followed by a series of old statistical concepts that are also often useful.
Then he follows up with this sentence:
From a typical 600-pages textbook on statistics, about 20 pages are relevant to data science, and these 20 pages can be compressed in 0.25 page.
Granville’s post is worth a read, as he goes into some reasons why old-school statistics proper is increasingly less useful in the world most of us live and work in, while machine learning techniques are becoming much more useful.
And I will add: Granville’s post, while framed somewhat controversially, fairly well summarizes the realities of data science work. There are several statistical concepts that are indeed useful when doing this work. But we have powerful software tools ready at hand to do much of that work — often using new techniques that yield better predictive results than older statistical approaches do.
Yes, we still often need to understand the meaning and implications of a range of statistical insights in relation to our data. But we can get lots of great work done with an intuitional understanding of those concepts. Thus, we can start with statistical fundamentals, use them as needed, and then expand our knowledge when the situation calls for it.
One last sentence from Granville summarizes this data-sciencey attitude toward stats:
I believe that you can explain the concept of random variable and distribution (at least what you need to understand to practice data science) in about 4 lines, rather than 150 pages. The idea is to explain it in plain English with a few examples.
Granville expressed an intention to draft a “statistics cheat sheet for data scientists,” and to do it in a single page. If he ever wrote it, I’ve not found it. It turns out he wrote a Machine Learning Cheat Sheet that covers many data-sciencey things, but not statistics.
Meanwhile, in 2017, O’Reilly published a nice handbook of 318 well-organized, succinct and readable pages to fill the gap: Practical Statistics for Data Scientists, by Peter Bruce and Andrew Bruce. I recommend it:
For most of us, a work like this does a great job of bridging the gap for those who are coming to data science from a variety of fields.
Great visual analytics involves a sequence of steps which may be understood as both a science and an art. That sequence includes:
Each of these steps merits its own extended discussion. In this post, I’d like to draw attention to the seventh step: leveraging elements of good narrative to lead the audience to action.
To my knowledge, no one has discussed this more effectively than Cole Nussbaumer Knaflic. Her book, Storytelling with Data, is the leading book on the topic, hitting many of the preceding steps, and then driving on to leverage the power of narrative in the presentation of the story.
Among the narrative elements she discusses are:
Her irrefutable point: You may have the best, most insightful, most beautifully designed analysis. But if you fail to effectively communicate that analysis, its sum total value is exactly ZERO. For at the end of the day, the sole point and purpose of analysis is to inform and generate action.
This is where Knaflic’s work is so valuable. If you’re short on time to read a book, Knaflic has presented this seventh point in the form of an entertaining and informative short video. Indeed, in this video she takes the discussion a step further to discuss the transformative power of the narrative arc:
Only 15 minutes in length, her presentation is a work of art.
Properly received, her presentation should provoke you, as a data professional, to put the lesson into practice. May your future presentations be more focused, more meaningful, and much more effective at inspiring data-driven action.
Sharing this purely for grins and giggles — and a bit of admiration for what’s currently possible with some machine learning mojo.
If you’re fairly new to Tableau, chances are you’ll find Tableau’s repository of training videos (free with registration) to be very helpful. Indeed, there’s enough there to help you go from zero to serious just about as fast as you dare to do it. The tutorials are really pretty great.
BUT their organization scheme needs a little help.
Here’s a helpful list of the best, most useful videos to get started with. Once you’ve worked through those, check out my page with a more extensive organized index including Tableau’s more in-depth training videos.
The following seven videos will get you up and running quickly.
From the Getting Started section
From the Connecting to Data section
From the Visual Analytics section
From the Why is Tableau Doing That? section
From this point forward, the best path depends on your needs.
In his post, The Actual Difference Between Statistics and Machine Learning, Matthew Stewart helpfully explains how statistical analysis differs from machine learning. Data science is still larger than machine learning. But it’s appropriate to say something very similar about this relationship to what we said above about statistics: data science can’t be done without machine learning.
Both statistics and machine learning are part and parcel of the data science toolkit. And each plays a somewhat different role. Explaining the difference is helpful.
Stewart summarizes the difference like this:
Thus, if you are writing a scientific paper that needs to explain the relationships between variables, statistical modeling is probably the best route.
However, if the point of your work is to produce actionable results that translate into greater efficiency and effectiveness achieving the mission of your organization — machine learning is often the better route.
In Stewart’s own words:
Machine learning is all about results, it is likely working in a company where your worth is characterized solely by your performance. Whereas, statistical modeling is more about finding relationships between variables and the significance of those relationships, whilst also catering for prediction.
He goes further to develop a helpful analogy:
By day, I am an environmental scientist and I work primarily with sensor data. If I am trying to prove that a sensor is able to respond to a certain kind of stimuli (such as a concentration of a gas), then I would use a statistical model to determine whether the signal response is statistically significant. I would try to understand this relationship and test for its repeatability so that I can accurately characterize the sensor response and make inferences based on this data. Some things I might test are whether the response is, in fact, linear, whether the response can be attributed to the gas concentration and not random noise in the sensor, etc.
Statistical analysis is great in such a case. It’s the right tool for the job.
But what if the nature of the problem is slightly different, and the goals are different?
In contrast, I can also get an array of 20 different sensors, and I can use this to try and predict the response of my newly characterized sensor. This may seem a bit strange if you do not know much about sensors, but this is currently an important area of environmental science. A model with 20 different variables predicting the outcome of my sensor is clearly all about prediction, and I do not expect it to be particularly interpretable. This model would likely be something a bit more esoteric like a neural network due to non-linearities arising from chemical kinetics and the relationship between physical variables and gas concentrations. I would like the model to make sense, but as long as I can make accurate predictions I would be pretty happy.
That brings it home nicely. In the case of machine learning, our interest is in the results: How can we make the most accurate predictions? And moreover, do these predictions yield benefits for the mission of our organization?
Perhaps said otherwise, statistics is more about understanding — helping to answer the question, What’s really happening here? Machine learning is more about driving action — helping to answer the question, What can we anticipate next? — and by extension enabling efficient and effective responses.
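The contrast can be sketched in a few lines of Python. The snippet below is my own illustration, not Stewart’s: one least-squares fit on made-up data, read first as a statistician would (is the slope significantly different from zero?) and then as a machine learner would (how accurate are the predictions?).

```python
import numpy as np

# Made-up data for illustration: y depends linearly on x, plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 50)

# One ordinary least-squares fit of y = b0 + b1 * x.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
n, p = X.shape

# The statistics question: is the slope real?
# (t-statistic for testing slope != 0)
sigma2 = resid @ resid / (n - p)
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_slope = beta[1] / se_slope

# The machine-learning question: how well do we predict?
# (R-squared of the fitted values)
r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(f"slope={beta[1]:.2f}  t={t_slope:.1f}  R^2={r2:.3f}")
```

Same model, two different questions: the t-statistic speaks to understanding, the R-squared to anticipating what comes next.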
Those who work in data mining or predictive analytics are familiar with the CRISP-DM process. Metaphorically, if not literally, that process description is taped to our wall. Tom Khabaza’s Nine Laws of Data Mining should be taped up right next to it.
Khabaza has published those laws as a series of blog posts, here. For each law, he has provided a short name, followed by a one-sentence summary, supported by a few paragraphs of explanation.
The value of these laws is that they help prepare us for what to expect as we do the work — and then they remind us of what we should have expected if we occasionally forget!
As I am a fan of brevity, I’m creating this post as a list of the laws with their single-sentence summaries, lightly paraphrased. Here they are:
1. Business Goals Law: Business objectives are the origin of every data mining solution.
2. Business Knowledge Law: Business knowledge is central to every step of the data mining process.
3. Data Preparation Law: Data preparation is more than half of every data mining process.
4. No Free Lunch for the Data Miner (NFL-DM): The right model for a given application can only be discovered by experiment.
5. Watkins’ Law: There are always patterns.
6. Insight Law: Data mining amplifies perception in the business domain.
7. Prediction Law: Prediction increases information locally by generalization.
8. Value Law: The value of data mining results is not determined by the accuracy or stability of predictive models.
9. Law of Change: All patterns are subject to change.
These laws, as Khabaza points out, are not telling us what we should do. Rather they are “simple truths,” describing brute facts that give shape to the landscape in which data mining is done. Their truth is empirical, discovered and verified by those who’ve been doing the work. So it’s best to keep these truths in mind and adapt our efforts accordingly, lest we pay the price for failing to acknowledge reality as it is.
If you’re intrigued, and want to read further, view Khabaza’s full post here. His exposition of these points is more than worth the time!
If you’re an educator (or student) interested in leveraging Amazon Web Services through AWS Educate to host a cloud database that allows student connections — this post is for you. In what follows, I’ll document the process to create a relational database, set a security rule that allows inbound connections, and connect to the database from Python.
The resulting database will be friendly for student projects that include database interactions such as querying, reading, and writing.
I’m going to assume that you’ve already created your Amazon Educate Account and are logged into your AWS Console. Thus, we’ll begin by creating a relational database.
Once you have logged into your AWS Console, these are the steps to set up a relational database.
Congratulations! You’ve created your database!
Now we need to set a security rule to allow interactions with the database.
In these next few steps, we’ll set a connection security rule to allow inbound traffic to interact with the database.
After creating the database, you can connect to it using an application or database management package. You’ll simply need a few key items of information, supplied when you first created the database: the Endpoint (host), the Port, the DB name, the Master username, and the Master user password.
If you need to recover these later, you can do so by selecting the database from the RDS database list, and then looking under the two tabs: Connectivity & security, and Configuration.
The Endpoint (aka host or hostname) and Port can be found under the database Connectivity & Security details:
The DB name and Master username can be found under the Configuration tab:
As for the user password, you will hopefully have recorded or remembered it!
Those provide the essential credentials you’ll need to connect to the database.
In the next section, I’ll illustrate using these credentials to connect to the database, create a table, insert records, and query the table using Python.
Python provides modules for connecting to any number of database engines. Since I selected a PostgreSQL engine, I’ll be using the psycopg2 module to interact with it. (For a MySQL database, you can use pymysql. And so on …)
If the module is not currently installed on your system, you’ll need to install it. In Python, this may be done easily using pip or conda:
pip install psycopg2
conda install -c anaconda psycopg2
Once the module is installed, simply add import psycopg2 at the top of your Python application or Jupyter notebook.
Next we’ll establish the connection, using the psycopg2.connect() method, and providing the database information and login credentials, such as follows:
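For example, the connection might look like the sketch below. Every value here is a placeholder; substitute the Endpoint, Port, DB name, Master username, and password from your own RDS instance’s Connectivity & security and Configuration tabs.

```python
import psycopg2

# Placeholder credentials -- replace each value with your own details.
conn = psycopg2.connect(
    host="mydb-instance.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com",  # Endpoint
    port=5432,                # Port (the PostgreSQL default)
    dbname="mydatabase",      # DB name
    user="postgres",          # Master username
    password="my-password"    # Master user password
)
```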
Then you can use lines such as follows to interact with it. (See the psycopg2 docs for guidance.)
# Start the cursor to enable SQL operations
cur = conn.cursor()
# Create a table
cur.execute("CREATE TABLE test1 (id serial PRIMARY KEY, num integer, data varchar);")
# Insert a record
cur.execute("INSERT INTO test1 (num, data) VALUES (%s, %s)", (101, "abcdefg"))
# Query the table
cur.execute("SELECT * FROM test1;")
# Output the query results
cur.fetchall()
# Commit the changes
conn.commit()
# Close the connection
conn.close()
That’s it! Your database is ready to roll.
AWS Educate provides a fantastic opportunity to equip students with cloud resources. Both educators and students can follow these steps: if a professor wants students to create their own cloud databases for their projects, the steps above will serve them just as well.
I hope this resource proves helpful. Please comment with feedback, suggestions, and recommendations!
I’m excited about this penguins data set which has just been made publicly available. This will be much more fun for student projects than the old standard iris data set.
The data is from a published study on Antarctic penguins. It offers great opportunities for regression analysis, cluster analysis, etc. Here are two sample charts from the Github Readme:
What’s a culmen, you may ask? They’ve illustrated that nicely:
The data set is available at Github here: https://github.com/allisonhorst/penguins
The data was used in the published study freely available here:
Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081