Two Cheers for Penguins Data!!

I’m excited about the penguins data set, which has just been made publicly available. It will be much more fun for student projects than the old standard iris data set.

Penguins cartoon to illustrate the penguins data set

The data is from a published study on Antarctic penguins. It offers great opportunities for regression analysis, cluster analysis, etc. Here are two sample charts from the GitHub README:

Histogram of penguin flipper lengths colored by species


Scatterplot of culmen length and depth clustered by species

What’s a culmen you may ask? They’ve illustrated that nicely:

Illustration of a penguin culmen

Links and Credits

The data set is available on GitHub here:

A CSV file of the full data set is available in the data-raw sub-directory. Here is a direct link. (You can view the raw version and then save it as a CSV file from your browser.)

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

The data was used in the published study freely available here:

Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081




Tableau Tip: Group Years into Decades Using Calculated Fields

In a recent Tableau project, I wanted to divide a long span of years into decades, as this would provide a more visually effective way to grasp the growth of revenue from top movies (data from The Movie Database) over time. With a little searching, I found the pieces I needed. Below I’ll include a description of my process, followed by links to the helpful sources of insight I found on this topic.

First, here is the visualization with total revenues year by year. Notice that despite its current width you still have to scroll left to reach the early 1900s. Meanwhile, the difference year to year is not in itself that interesting.

Top TMDB movie revenue totals by year (a partial view)

Now here is the visualization when years are chunked into decades. Much more effective!

Top TMDB movies total revenue by decade

DISCLAIMER: These charts use revenue numbers as entered in The Movie Database by contributors based on publicly reported figures. Thus, the data includes only a portion of all movies. I’ve as yet made no adjustments for inflation.

Getting to Decades from Dates

Now for the process of getting decades from dates. I broke my approach into two steps:

  1. I first created a calculated field to pull the Year from the Release Dates field, using Tableau’s DATEPART function.

    Calculated field to get only the Year from the date information in Release Date

    Once that field was created, I moved it from Measures to Dimensions, where it belongs.

  2. Then I created a Decades dimension as an additional calculated field. This calculation uses Year and the modulo operator to round each year down to the nearest multiple of 10.

    Calculated field to round each year down to its decade using modulo

    Once this field was created, I likewise made it a Dimension.
    That’s all it took!
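For readers who prefer code to screenshots, here is a minimal Python sketch of the same two-step logic. It is not the Tableau calculation itself (those fields appear only as images above). The function name `decade_of` and the sample release dates are made up for illustration.

```python
from datetime import date


def decade_of(year: int) -> int:
    """Round a year down to the nearest multiple of 10 using modulo."""
    return year - (year % 10)


# Hypothetical release dates, invented purely for illustration.
release_dates = [date(1915, 2, 8), date(1987, 6, 19), date(2004, 11, 5)]

# Step 1: pull the year out of each date (the role DATEPART plays in Tableau).
years = [d.year for d in release_dates]

# Step 2: round each year down to its decade.
decades = [decade_of(y) for y in years]

print(decades)  # [1910, 1980, 2000]
```

Subtracting the remainder keeps the result an integer, so a year such as 2000 that already starts a decade is left unchanged.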

Many thanks to Nick Parsons and Erik Bokobza for their helpful replies in the Tableau Community Forums. Links below.

Recommended Reads

Why does Excel keep mangling my date formats! When my date range spans multiple centuries …

When working with a date range that spans multiple centuries (for instance, late 1800s to present), it’s important to know a few things before viewing or saving the data in Excel. (I’m currently working with Excel 365 for Mac and Excel 2016 for Windows.)

Suppose you’re working with data stored in a CSV file and want to examine it in Excel. Here is a short list of things to watch out for:

  1. Excel for Mac automatically formats dates in m/d/yy format, shortening years to two digits in the process. (Thus 1915-02-08 becomes 2/8/15!) If you then save back to CSV, Excel overwrites the four-digit years with two-digit ones, ruining your date fields, since no record remains of which century each date belongs to. You’ll need to go back and recover the four-digit years from your source. This is bad.
  2. Excel for Windows defaults to m/d/yyyy format. This is not so bad, as the full four-digit year values are maintained.
  3. Neither Excel for Mac nor Excel for Windows recognizes dates before 1900, instead treating such dates as text. (Thus 1898-01-01 remains ‘1898-01-01’, as text.) On the plus side, Excel does not change the formatting of these dates.
  4. For the above reasons, if you view date fields in Excel for Mac or Windows, it makes good sense to immediately format your dates as yyyy-mm-dd (following the international standard for date formats: ISO 8601). This requires using custom formatting in Excel. But it’s effective and can save your bacon. (Plus, it jibes with Python pandas and R.)

To reformat dates in ISO 8601 format in Excel for Windows:

  • Go to Format Cells and select the Number tab.
  • Then use the Custom category, and type in the format code: yyyy-mm-dd

Reformat dates to ISO 8601 yyyy-mm-dd in Excel for Windows

In Excel for Mac, the process is similar, but the option we need is (currently) available under the Date category:

  • Go to Format Cells and select the Number tab.
  • Then use the Date category, and select the option starting with a four-digit year, followed by a two-digit month and two-digit day, with hyphen separators. (Excel for Mac currently displays this with the sample date: 2012-03-14.)
  • Alternatively, do as in Windows Excel, and enter it as your own Custom format: yyyy-mm-dd.

Reformat dates to ISO 8601 yyyy-mm-dd in Excel for Mac
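To see why the two-digit year is so damaging, here is a small Python sketch using only the standard library. It imitates the m/d/yy round trip described above rather than Excel's actual behavior, and the variable names are my own. The point is that once the century is gone, any parser has to guess it.

```python
from datetime import date, datetime

# A date stored safely in ISO 8601 form, as in the example above.
original = date.fromisoformat("1915-02-08")

# Imitate an m/d/yy display format: the century is dropped.
mangled_text = f"{original.month}/{original.day}/{original.year % 100:02d}"
print(mangled_text)  # 2/8/15

# Parsing the two-digit year forces a guess. Python's %y maps 00-68 to
# 2000-2068 and 69-99 to 1969-1999, so 1915 silently becomes 2015.
reparsed = datetime.strptime(mangled_text, "%m/%d/%y").date()
print(reparsed.isoformat())  # 2015-02-08

# ISO 8601 text round-trips cleanly, including dates before 1900.
early = date.fromisoformat("1898-01-01")
print(early.isoformat())  # 1898-01-01
```

Keeping dates as yyyy-mm-dd text in the CSV avoids the guesswork entirely.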


For Further Reading

The data revolution is now transforming the world of finance

This article from TechRepublic is worth a read. In summary: The data revolution is now transforming the world of finance. A recent Deloitte survey reveals that traditional roles are being automated. To be a human working in finance, you need skills in data science, analytics, and visualization. More than manipulating spreadsheets, you need to create business value with data-informed innovations.

The finance robots are coming —

Balancing Clarity and Creativity in Data Visualization

I’ve been reflecting on Elijah Meeks’ provocative essay, “3rd Wave Data Visualization”. In this post, I want to reflect on the tension between his first and third “waves.” I’ll refer to these as attitudes. (Meeks himself acknowledges that none of his “waves” have washed away. Each lives on.) He refers to them as Wave 1: Clarity and Wave 3: Convergence.

Upon re-reading his argument a few times, I believe we may usefully understand the contrast Meeks highlights as the tension between these two imperatives:

Attitude 1: Design with Clarity. (Make sure we don’t miss the message.)

Attitude 2: Bring back the Creativity and Fun. (Give us some enjoyment.)

I’ll talk about these attitudes in more detail in a later post.

For now, I’m going to spend some time going out and evaluating a number of data visualizations bearing in mind questions such as these:

  1. How clear is this visualization? How easy is it to understand and interpret? Is that a good or a bad thing?
  2. How creative and fun is this visualization? Am I motivated to explore it further? Why or why not?
  3. Are there times, places, and audiences for whom clarity is more important than creativity? And vice versa?

The Tableau Public Gallery is a good place to start. And there are many others.

I’d be interested in your responses below. Include a link to a relevant data visualization.

I’ll report back with an update to this post.

Once again, the NY Times demonstrates the value of interactive data visualization

This impressive interactive data visualization demonstrates the value of the format. More than merely interesting, or intriguing, or even fun — it massively amplifies the communicative power of its subject matter.

Check it out:

How to Cut U.S. Emissions Faster? Do What These Countries Are Doing.
By Brad Plumer and Blacki Migliozzi — FEB. 13, 2019


Hal Varian on the Need for Data Interpreters

Hal Varian, Google’s chief economist, gave a nice summary of a major need of our era.

Emphasis added:

“The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.

“I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. … being able to access, understand, and communicate the insights you get from data analysis —are going to be extremely important.”

Hal Varian, Google’s Chief Economist, 2009

KazAnova on Stacking: leveraging multiple machine learning algorithms for better predictive models

Machine learning can be a powerful tool in the creation of predictive models. But it doesn’t provide a magic bullet. In the end, effective machine learning works very much like other high-value human endeavors. It requires experimentation, evaluation, lots of work, and a measure of hard-earned wisdom.

As Kaggle Competitions Grandmaster Marios Michailidis (AKA KazAnova) explains:

No model is perfect. Almost every time the models make mistakes. Plus, each model has different advantages and disadvantages and they tend to seize the data from different angles. Leveraging the uniqueness of each model is of the essence for building very predictive models.

To help with this process, David H. Wolpert introduced the concept of stacked generalization in a 1992 paper.

Michailidis explains the process as follows:

Stacking or Stacked Generalization … normally involves a four-stage process. Consider 3 datasets A, B, C. For A and B we know the ground truth (or in other words the target variable y). We can use stacking as follows:

  1. We train various machine learning algorithms (regressors or classifiers) in dataset A.
  2. We make predictions for each one of the algorithms for datasets B and C and we create new datasets B1 and C1 that contain only these predictions. So if we ran 10 models then B1 and C1 have 10 columns each.
  3. We train a new machine learning algorithm (often referred to as Meta learner or Super learner) using B1.
  4. We make predictions using the Meta learner on C1.
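The four stages above can be sketched in plain Python. This is a toy illustration under loud assumptions, not StackNet and not Michailidis's code: the data is synthetic, the two base models are one-feature least-squares lines, and the meta learner is a deliberately simple weighted blend chosen by grid search. All function and variable names are made up for the example.

```python
import random
from statistics import mean

random.seed(0)


def make_rows(n):
    """Synthetic rows (x1, x2, y); invented data for illustration only."""
    rows = []
    for _ in range(n):
        x1, x2 = random.uniform(0, 10), random.uniform(0, 10)
        rows.append((x1, x2, 2 * x1 + 3 * x2 + random.gauss(0, 1)))
    return rows


def fit_line(xs, ys):
    """Fit a one-feature least-squares line and return a predict function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return mean((p - y) ** 2 for p, y in zip(preds, ys))


# Datasets A and B have known targets; C's targets are treated as unknown.
A, B, C = make_rows(200), make_rows(200), make_rows(50)

# Stage 1: train base models on A (one model per feature).
base_models = [fit_line([r[i] for r in A], [r[2] for r in A]) for i in (0, 1)]


# Stage 2: base-model predictions on B and C become new datasets B1 and C1.
def predict_all(rows):
    return [[m(r[i]) for i, m in enumerate(base_models)] for r in rows]


B1, C1 = predict_all(B), predict_all(C)


# Stage 3: train the meta learner on B1 against B's known targets.
def blend(row, w):
    return w * row[0] + (1 - w) * row[1]


y_B = [r[2] for r in B]
weights = [i / 100 for i in range(101)]
best_w = min(weights, key=lambda w: mse([blend(row, w) for row in B1], y_B))

# Stage 4: use the meta learner to predict on C1.
C_pred = [blend(row, best_w) for row in C1]
```

With two base models, B1 and C1 have two columns each, matching the "10 models, 10 columns" description above. In practice the meta learner would usually be a proper regressor or classifier rather than a single blend weight.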

As part of his own PhD work, Michailidis developed a software stack, named StackNet, to speed up the process.

Marios Michailidis describes StackNet in this way:

StackNet is a computational, scalable and analytical framework implemented with a software implementation in Java that resembles a feedforward neural network and uses Wolpert’s stacked generalization in multiple levels to improve accuracy in classification problems. In contrast to feedforward neural networks, rather than being trained through back propagation, the network is built iteratively one layer at a time (using stacked generalization), each of which uses the final target as its target.

StackNet is available on GitHub under the MIT license.

Be sure to read the interview with Michailidis about stacking and StackNet on the Kaggle blog, here.