Statistical knowledge is immensely valuable to our work in data science. Indeed, the field of statistics has helped shape the realities we work in, including the software tools and algorithms we have available. Those with deep statistical knowledge play key roles in shaping the future of the field.
However, the individual practitioner in Data Science need not have a PhD in statistics or mathematics to be successful. Indeed, our everyday use of statistics proper is often strategic, empowered by software, and requires more of an intuitional grasp of key statistical concepts than deep knowledge.
Data science barely uses statistical science and techniques.
He goes on to clarify:
The truth is actually more nuanced …
In the ensuing post he lists a series of new statistical concepts that are frequently useful in data science, followed by a series of old statistical concepts that are also often useful.
Then he follows up with this sentence:
From a typical 600-pages textbook on statistics, about 20 pages are relevant to data science, and these 20 pages can be compressed in 0.25 page.
Granville’s post is worth a read, as he goes into some reasons why old-school statistics proper is increasingly less useful in the world most of us live and work in, while machine learning techniques are becoming much more useful.
And I will add: Granville’s post, while framed somewhat controversially, fairly well summarizes the realities of data science work. There are several statistical concepts that are indeed useful when doing this work. But we have powerful software tools ready at hand to do much of that work — often using new techniques that yield better predictive results than older statistical approaches do.
Yes, we still often need to understand the meaning and implications of a range of statistical insights in relation to our data. But we can get lots of great work done with an intuitional understanding of those concepts. Thus, we can start with statistical fundamentals, use them as needed, and then expand our knowledge when the situation calls for it.
One last sentence from Granville summarizes this data-sciencey attitude toward stats:
I believe that you can explain the concept of random variable and distribution (at least what you need to understand to practice data science) in about 4 lines, rather than 150 pages. The idea is to explain it in plain English with a few examples.
If you’re fairly new to Tableau, chances are you’ll find Tableau’s repository of free training videos (free with registration) to be very helpful. Indeed, there’s enough there to help you go from zero to serious just about as fast as you dare to do it. The tutorials are really pretty great.
BUT their organization scheme needs a little help.
In his post, The Actual Difference Between Statistics and Machine Learning, Matthew Stewart helpfully explains how statistical analysis differs from machine learning. Data science is still a larger than machine learning. But it’s very appropriate to say something very similar about the relationship between the two as we said above: Data science can’t be done without machine learning.
Both statistics and machine learning are part and parcel of the data science toolkit. And each plays a somewhat different role. Explaining the difference is helpful.
Stewart summarizes the difference like this:
Statistical modeling aims first and foremost for understanding and explaining relationships between variables. Predictive power is a secondary consideration.
Machine learning aims first and foremost for effective prediction. Some machine learning algorithms are easy to interpret, and some are not.
Thus, if you are writing a scientific paper that needs to explain the relationships between variables, statistical modeling is probably the best route.
However, if the point of your work is to produce actionable results that translate into greater efficiency and effectiveness achieving the mission of your organization — machine learning is often the better route.
In Stewart’s own words:
Machine learning is all about results, it is likely working in a company where your worth is characterized solely by your performance. Whereas, statistical modeling is more about finding relationships between variables and the significance of those relationships, whilst also catering for prediction.
He goes further to develop a helpful analogy:
By day, I am an environmental scientist and I work primarily with sensor data. If I am trying to prove that a sensor is able to respond to a certain kind of stimuli (such as a concentration of a gas), then I would use a statistical model to determine whether the signal response is statistically significant. I would try to understand this relationship and test for its repeatability so that I can accurately characterize the sensor response and make inferences based on this data. Some things I might test are whether the response is, in fact, linear, whether the response can be attributed to the gas concentration and not random noise in the sensor, etc.
Statistical analysis is great in such a case. It’s the right tool for the job.
But what if the nature of the problem is slightly different, and the goals are different?
In contrast, I can also get an array of 20 different sensors, and I can use this to try and predict the response of my newly characterized sensor. This may seem a bit strange if you do not know much about sensors, but this is currently an important area of environmental science. A model with 20 different variables predicting the outcome of my sensor is clearly all about prediction, and I do not expect it to be particularly interpretable. This model would likely be something a bit more esoteric like a neural network due to non-linearities arising from chemical kinetics and the relationship between physical variables and gas concentrations. I would like the model to make sense, but as long as I can make accurate predictions I would be pretty happy.
That brings it home nicely. In the case of machine learning, our interest is in the results: How can we make the most accurate predictions? And moreover, do these predictions yield benefits for the mission of our organization?
Perhaps said otherwise, statistics is more about understanding — helping to answer the question, What’s really happening here? Machine learning is more about driving action — helping to answer the question, What can we anticipate next? — and by extension enabling efficient and effective responses.
Those who work in data mining or predictive analytics are familiar with the CRISP-DM process. Metaphorically, if not literally, that process description is taped to our wall. Tom Khabaza’s Nine Laws of Data Mining should be taped up right next to it.
Khabaza has published those laws as a series of blog posts, here. For each law, he has provided a short name, followed by a one-sentence summary, supported by a few paragraphs of explanation.
The value of these laws is that they help prepare us for what to expect as we do the work — and then they remind us of what we should have expected if we occasionally forget!
As I am a fan of brevity, I’m creating this post as a list of the single-sentence summaries. Occasionally I’ll add a short clarifying note. Here they are:
Tom Khabaza’s Nine Laws of Data Mining
Business objectives are the origin of every data mining solution.
Business knowledge is central to every step of the data mining process.
Data preparation is more than half of every data mining process.
The right model for a given application can only be discovered by experiment (aka “There is No Free Lunch for the Data Miner” NFL-DM).
There are always patterns (aka “Watkin’s Law).
Data mining amplifies perception in the business domain.
Prediction increases information locally by generalization.
The value of data mining results is not determined by the accuracy or stability of predictive models. (Rather, their value is found in more effective action and improved business strategy.)
All patterns are subject to change. (Thus, data mining is not a once-and-done kind of undertaking.)
These laws, as Khabaza points out, are not telling us what we should do. Rather they are “simple truths,” describing brute facts that give shape to the landscape in which data mining is done. Their truth is empirical, discovered and verified by those who’ve been doing the work. So it’s best to keep these truths in mind and adapt our efforts accordingly, lest we pay the price for failing to acknowledge reality as it is.
If you’re an educator (or student) interested in leveraging Amazon Web Services through AWS Educate to host a cloud database that allows student connections — this post is for you. In what follows, I’ll document the process to:
Configure and create a relational database instance from the AWS Management Console.
Set a security profile that will allow students to read and write to the database remotely — such as from a database client, from a program they’ve written, from a Jupyter Notebook, etc.
I’ll illustrate the process by creating a PostgreSQL database instance. Then I’ll provide illustrative code snippets for interacting with the database using Python.
The resulting database will be friendly for student projects that include database interactions such as querying, reading, and writing.
I’m going to assume that you’ve already created your Amazon Educate Account and are logged into your AWS Console. Thus, we’ll begin by creating a relational database.
Creating a Relational Database in the AWS Console
Once you have logged into your AWS Console, these are the steps to set up a relational database.
Use the search field under “Find Services” to search for “RDS.” You should see RDS: Managed Relational Database Service appear in the results. Select andnavigate to the RDS page.
Once you’ve arrived at the Amazon RDS page, select Databases in the left-hand sidebar, and then Create Database.
From the Create Database page, select your desired options for creation method and configuration. You are free to choose differently, of course, but I have chosen these options:
Standard Create — This will allow me to optimize the resources my database will use.
PostgreSQL — A favored option among data science types. But choose what’s best for you!
Select your desired Template (the labels here may depend on the engine you choose) according to your needed system resources and the size of your budget. I currently have access to the free tier, which I’ll use now. If that were not available, Dev/Test is the next least resource-intensive option I currently have.
If you desire, edit the database name (identifier), master username, and password.
Choose the DB instance size and storage (if relevant) that suits your needs and budget. I’ve chosen the least resource-intensive options, as these will be plenty for my intended use: basic CRUD operations performed by my students.
Under Connectivity, select “Additional connectivity configuration” and then “Yes” under Publicly accessible.
You’ll be given the option to create a new security group. You can keep the default, or create a new group. I’ve created one named “students.” You’ll also see the database port settings.
Depending on your selected database engine, you may (or may not) be given the option to choose Database authentication. With the PostgreSQL database I’ve chosen, I have these options, and I’ll choose Password authentication.
Depending on your selected database engine, you may (or may not) be given Additional configuration options. If you’re unsure about these options, click the handy Info link to read more about them. I’ve deselected automatic backups in order to conserve resources.
If all looks good, click Create database!
After clicking create, you may be given a message to go back and adjust a configuration option. If so, go back and do that. If all went well, you’ll be taken to a confirmation page. Here’s what mine looked like:
Notice that you can click to View credential details — a handy way to get the login information and save it for future reference!
Congratulations! You’ve created your database!
Now we need to set a security rule to allow interactions with the database.
Allowing Inbound Traffic
In these next few steps, we’ll set a connection security rule to allow inbound traffic to interact with the database.
Beginning at the RDS > Database page, click the database identifier.
Select the Connectivity & security tab.
Scroll down the page to Security group rules, and click to edit the Inbound rules.
If necessary, select Actions, and Edit inbound rules.
There should be an initial rule begun for you. Notice that the Type and Port range are already set to match your database settings. Now we need to allow a range of IP addresses. Configure this according to your needs. In my case, I’ll be working with online students. And since the database will not contain sensitive data, I’ll simply pull down the box under Source and select “Anywhere,” to allow traffic from any IP address.Once that’s been selected, I then see the result as two rules, allowing a full range of IP address options:
Click Save rules!
Connecting to the Database
After creating the database, you can connect to the database using an application or database management package. You’ll simply need a few key items of information. These were supplied when you first created the database. The items include:
Endpoint (aka host or hostname)
If you need to recover these later, you can do so by selecting the database from the RDS database list, and then looking under the two tabs: Connectivity & security, and Configuration.
The Endpoint (aka host or hostname) and Port can be found under the database Connectivity & Security details:
The DB name and Master username can be found under the Configuration tab:
As for the user password, you will hopefully have recorded or remembered it!
Those provide the essential credentials you’ll need to connect to the database.
In the next section, I’ll illustrate using these credentials to connect to the database, create a table, insert records, and query the table using Python.
Interacting with the Database Using Python
Python provides modules for connecting to any number of database engines. Since I selected a PostgreSQL engine, I’ll be using the psycopg2 module to interact with it. (For a MySQL database, you can use pymysql. And so on …)
If the module is not currently installed on your system, you’ll need to install it. In Python, this may be done easily using pip or conda:
# Start the cursor to enable SQL operations cur = conn.cursor()
# Create a table cur.execute("CREATE TABLE test1 (id serial PRIMARY KEY, num integer, data varchar);")
# Insert a record cur.execute("INSERT INTO test1 (num, data) VALUES (%s, %s)", (101, "abcdefg"))
# Query the table cur.execute("SELECT * FROM test1;")
# Output the query results cur.fetchall()
# Commit the changes conn.commit()
# Close the connection conn.close()
That’s it! Your database is ready to roll.
AWS Educate provides a fantastic opportunity to equip students with cloud resources. In fact, it’s worth pointing out that both educators and students can follow these steps. If a professor should want students to create their own cloud databases for their projects, the above steps will serve them just as well.
I hope this resource proves helpful. Please comment with feedback, suggestions, and recommendations!
In a recent Tableau project, I wanted to divide a long span of years into decades, as this would provide a more visually effective way to grasp the growth of revenue from top movies (data from The Movie Database) over time. With a little searching, I found the pieces I needed. Below I’ll include a description of my process, followed by links to the helpful sources of insight I found on this topic.
First, here is the visualization with total revenues year by year. Notice that despite its current width you still have to scroll left to reach the early 1900s. Meanwhile, the difference year to year is not in itself that interesting.
Now here is the visualization when years are chunked into decades. Much more effective!
DISCLAIMER: These charts use revenue numbers as entered in The Movie Database by contributors based on publicly reported figures. Thus, the data includes only a portion of all movies. I’ve as yet made no adjustments for inflation.
Getting to Decades from Dates
Now for the process of getting decades from dates. I broke my approach into two steps:
When working with a date range that spans multiple centuries (for instance, late 1800s to present), it’s important to know a few things before viewing or saving the data in Excel. (I’m currently working with Excel 365 for Mac and Excel 2016 for Windows.)
Suppose you’re working with data stored in a CSV file and want to examine it in Excel. Here is a short list of things to watch out for:
Excel for Mac automatically formats dates in m/d/yy format, shortening years to two digits in the process. (Thus 1915-02-08 becomes 2/8/15!) If you then save back to CSV, it will overwrite four-digit years to two, thereby ruining your date fields — as there will be no record of which century it’s from. You’ll need to go back and recover four-digit years from your source. This is bad.
Excel for Windows defaults to m/d/yyyy format. This is not so bad, as the full four-digit year values are maintained.
For the above reasons, if you view date fields in Excel for Mac or Windows, it makes good sense to immediately format your dates to yyyy-mm-dd (following the international standard for data formats: ISO 8601). This requires using custom formatting in Excel. But it’s effective and can save your bacon. (Plus, it jibes with Python pandas and R.)
To reformat dates in ISO 8601 format in Excel for Windows:
Go to Format Cells and select the Number tab.
Then use the Custom category, and type in the formula: yyyy-mm-dd
In Excel for Mac, the process is similar, but the option we need is (currently) available under the Date category:
Go to Format Cells and select the Number tab.
Then use the Date category, and select the option starting with a four-digit year, followed by a two-digit month and two-digit day, with hyphen separators. (Excel for Mac currently displays this with the sample date: 2012-03-14.)
Alternatively, do as in Windows Excel, and enter it as your own Custom format: yyyy-mm-dd.
This article from Tech Republic is worth a read. In summary: The data revolution is now transforming the world of finance. A recent Deloitte survey reveals that traditional roles are being automated. To be a human working in finance, you need skills in data science, analytics, and visualization. More than manipulating spreadsheets, you need to create business value with data-informed innovations.