If you’re fairly new to Tableau, chances are you’ll find Tableau’s repository of free training videos (free with registration) to be very helpful. Indeed, there’s enough there to help you go from zero to serious just about as fast as you dare to do it. The tutorials are really pretty great.
BUT their organization scheme needs a little help.
If you’re an educator (or student) interested in leveraging Amazon Web Services through AWS Educate to host a cloud database that allows student connections — this post is for you. In what follows, I’ll document the process to:
Configure and create a relational database instance from the AWS Management Console.
Set a security profile that will allow students to read and write to the database remotely — such as from a database client, from a program they’ve written, from a Jupyter Notebook, etc.
I’ll illustrate the process by creating a PostgreSQL database instance. Then I’ll provide illustrative code snippets for interacting with the database using Python.
The resulting database will be friendly for student projects that include database interactions such as querying, reading, and writing.
I’m going to assume that you’ve already created your Amazon Educate Account and are logged into your AWS Console. Thus, we’ll begin by creating a relational database.
Creating a Relational Database in the AWS Console
Once you have logged into your AWS Console, these are the steps to set up a relational database.
Use the search field under “Find Services” to search for “RDS.” You should see RDS: Managed Relational Database Service appear in the results. Select it to navigate to the RDS page.
Once you’ve arrived at the Amazon RDS page, select Databases in the left-hand sidebar, and then Create Database.
From the Create Database page, select your desired options for creation method and configuration. You are free to choose differently, of course, but I have chosen these options:
Standard Create — This will allow me to optimize the resources my database will use.
PostgreSQL — A favored option among data science types. But choose what’s best for you!
Select your desired Template (the labels here may depend on the engine you choose) according to your needed system resources and the size of your budget. I currently have access to the free tier, which I’ll use here. If that weren’t available, Dev/Test would be the next least resource-intensive option.
If you desire, edit the database name (identifier), master username, and password.
Choose the DB instance size and storage (if relevant) that suits your needs and budget. I’ve chosen the least resource-intensive options, as these will be plenty for my intended use: basic CRUD operations performed by my students.
Under Connectivity, select “Additional connectivity configuration” and then “Yes” under Publicly accessible.
You’ll be given the option to create a new security group. You can keep the default, or create a new group. I’ve created one named “students.” You’ll also see the database port settings.
Depending on your selected database engine, you may (or may not) be given the option to choose Database authentication. With the PostgreSQL database I’ve chosen, I have these options, and I’ll choose Password authentication.
Depending on your selected database engine, you may (or may not) be given Additional configuration options. If you’re unsure about these options, click the handy Info link to read more about them. I’ve deselected automatic backups in order to conserve resources.
If all looks good, click Create database!
After clicking create, you may be given a message to go back and adjust a configuration option. If so, go back and do that. If all went well, you’ll be taken to a confirmation page. Here’s what mine looked like:
Notice that you can click to View credential details — a handy way to get the login information and save it for future reference!
Congratulations! You’ve created your database!
Now we need to set a security rule to allow interactions with the database.
Allowing Inbound Traffic
In these next few steps, we’ll set a connection security rule to allow inbound traffic to interact with the database.
Beginning at the RDS > Database page, click the database identifier.
Select the Connectivity & security tab.
Scroll down the page to Security group rules, and click to edit the Inbound rules.
If necessary, select Actions, and Edit inbound rules.
There should be an initial rule begun for you. Notice that the Type and Port range are already set to match your database settings. Now we need to allow a range of IP addresses. Configure this according to your needs. In my case, I’ll be working with online students, and since the database will not contain sensitive data, I’ll simply pull down the box under Source and select “Anywhere” to allow traffic from any IP address. Once that’s selected, I see the result as two rules, allowing a full range of IP address options:
Click Save rules!
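For reference, the same inbound rule can also be added from the AWS CLI. This is a sketch: the security-group ID below is a placeholder, and you should adjust the port if you chose a different engine.

```shell
# Allow inbound TCP on the PostgreSQL port from any IPv4 address
# (sg-0123456789abcdef0 is a placeholder; use your own group ID)
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 5432 \
    --cidr 0.0.0.0/0
```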
Connecting to the Database
After creating the database, you can connect to it using an application or database management package. You’ll simply need a few key items of information, all supplied when you first created the database:
Endpoint (aka host or hostname)
Port
DB name
Master username
Password
If you need to recover these later, you can do so by selecting the database from the RDS database list, and then looking under the two tabs: Connectivity & security, and Configuration.
The Endpoint (aka host or hostname) and Port can be found under the database Connectivity & Security details:
The DB name and Master username can be found under the Configuration tab:
As for the user password, you will hopefully have recorded or remembered it!
Those provide the essential credentials you’ll need to connect to the database.
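One low-friction way to keep those credentials out of student code is to read them from environment variables. This is a sketch; the variable names are my own convention, not anything AWS sets, and the angle-bracket defaults are placeholders.

```python
import os

# Hypothetical variable names: export these in your shell
# rather than hard-coding credentials in source files.
db_params = {
    "host": os.environ.get("DB_HOST", "<endpoint>"),
    "port": int(os.environ.get("DB_PORT", "5432")),
    "dbname": os.environ.get("DB_NAME", "<dbname>"),
    "user": os.environ.get("DB_USER", "<master username>"),
    "password": os.environ.get("DB_PASSWORD", "<password>"),
}
```

The dictionary can then be passed straight to a connector, e.g. psycopg2.connect(**db_params).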
In the next section, I’ll illustrate using these credentials to connect to the database, create a table, insert records, and query the table using Python.
Interacting with the Database Using Python
Python provides modules for connecting to any number of database engines. Since I selected a PostgreSQL engine, I’ll be using the psycopg2 module to interact with it. (For a MySQL database, you can use pymysql. And so on …)
If the module is not currently installed on your system, you’ll need to install it. This is easily done from the command line using pip or conda:
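For example (psycopg2-binary is the precompiled distribution, usually the easiest route for students; plain psycopg2 builds from source):

```shell
pip install psycopg2-binary
# or, via conda:
conda install -c conda-forge psycopg2
```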
Once installed, you can import psycopg2, connect with your credentials (the angle-bracket values below are placeholders), and work with the database:

```python
import psycopg2

# Connect using your own credentials from the AWS console
conn = psycopg2.connect(
    host="<endpoint>", port=5432, dbname="<dbname>",
    user="<master username>", password="<password>"
)

# Start the cursor to enable SQL operations
cur = conn.cursor()

# Create a table
cur.execute("CREATE TABLE test1 (id serial PRIMARY KEY, num integer, data varchar);")

# Insert a record
cur.execute("INSERT INTO test1 (num, data) VALUES (%s, %s)", (101, "abcdefg"))

# Query the table
cur.execute("SELECT * FROM test1;")

# Output the query results
print(cur.fetchall())

# Commit the changes
conn.commit()

# Close the connection
conn.close()
```
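psycopg2 follows Python’s DB-API 2.0, so students can prototype the same connect/cursor pattern locally with the standard-library sqlite3 module before pointing at the cloud database. (Note that the placeholder style differs: sqlite3 uses ? where psycopg2 uses %s.)

```python
import sqlite3

# An in-memory stand-in for the remote database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same connect/cursor pattern, with sqlite3's ? placeholders
cur.execute("CREATE TABLE test1 (id INTEGER PRIMARY KEY, num INTEGER, data TEXT)")
cur.execute("INSERT INTO test1 (num, data) VALUES (?, ?)", (101, "abcdefg"))
cur.execute("SELECT num, data FROM test1")
rows = cur.fetchall()
print(rows)  # [(101, 'abcdefg')]

conn.commit()
conn.close()
```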
That’s it! Your database is ready to roll.
AWS Educate provides a fantastic opportunity to equip students with cloud resources. In fact, both educators and students can follow these steps: if a professor wants students to create their own cloud databases for their projects, the steps above will serve them just as well.
I hope this resource proves helpful. Please comment with feedback, suggestions, and recommendations!
In a recent Tableau project, I wanted to divide a long span of years into decades, as this would provide a more visually effective way to grasp the growth of revenue from top movies (data from The Movie Database) over time. With a little searching, I found the pieces I needed. Below I’ll include a description of my process, followed by links to the helpful sources of insight I found on this topic.
First, here is the visualization with total revenues year by year. Notice that despite its current width you still have to scroll left to reach the early 1900s. Meanwhile, the difference year to year is not in itself that interesting.
Now here is the visualization when years are chunked into decades. Much more effective!
DISCLAIMER: These charts use revenue numbers as entered in The Movie Database by contributors based on publicly reported figures. Thus, the data includes only a portion of all movies. I’ve as yet made no adjustments for inflation.
Getting to Decades from Dates
Now for the process of getting decades from dates. I broke my approach into two steps:
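Whatever the tool, the core arithmetic for bucketing a year into a decade is integer division by ten. Here is a quick Python sketch of the idea (the function name is my own):

```python
def decade_label(year: int) -> str:
    """Bucket a year into its decade label, e.g. 1994 -> '1990s'."""
    return f"{(year // 10) * 10}s"

print(decade_label(1994))  # 1990s
print(decade_label(2007))  # 2000s
```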
This article from Tech Republic is worth a read. In summary: The data revolution is now transforming the world of finance. A recent Deloitte survey reveals that traditional roles are being automated. To be a human working in finance, you need skills in data science, analytics, and visualization. More than manipulating spreadsheets, you need to create business value with data-informed innovations.
I’ve been reflecting on Elijah Meeks’ provocative essay, “3rd Wave Data Visualization”. In this post, I want to reflect on the tension between his first and third “waves.” I’ll refer to these as attitudes. (Meeks himself acknowledges that none of his “waves” have washed away. Each lives on.) He refers to them as Wave 1: Clarity and Wave 3: Convergence.
Upon re-reading his argument a few times, I believe we may usefully understand the contrast Meeks highlights as the tension between these two imperatives:
Attitude 1: Design with Clarity. (Make sure we don’t miss the message.)
Attitude 2: Bring back the Creativity and Fun. (Give us some enjoyment.)
I’ll talk about these attitudes in more detail in a later post.
For now, I’m going to spend some time going out and evaluating a number of data visualizations bearing in mind questions such as these:
How clear is this visualization? How easy is it to understand and interpret? Is that a good or a bad thing?
How creative and fun is this visualization? Am I motivated to explore it further? Why or why not?
Are there times, places, and audiences for whom clarity is more important than creativity? And vice versa?
This impressive interactive data visualization demonstrates the value of the format. More than merely interesting, or intriguing, or even fun — it massively amplifies the communicative power of its subject matter.
Hal Varian, Google’s chief economist, gave a nice summary of a major need of our era.
“The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.
“I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. … being able to access, understand, and communicate the insights you get from data analysis — are going to be extremely important.”
Spend a while on visualization, making graphs of various properties of the data and trying to get a feel for how everything fits together.
Test the performance of a variety of standard algorithms (random forests, SVMs, elastic net, etc.) to see how they compare. It’s often very informative to look at which data points are the least well predicted by standard algorithms, as this can give you a good idea of what direction to move in. (Be warned: Home-brew algorithms can be useful later on in a project, but in the early stages you want to try out as many things as possible, not get bogged down in the details of implementing a particular algorithm.)
Then move into the nitty-gritty details once you have a sense for the lay of the land.
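As an illustration of that kind of first-pass comparison, here is a sketch using scikit-learn (assumed available) on synthetic regression data; substitute your real X and y:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; swap in your own dataset here
X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

models = {
    "random_forest": RandomForestRegressor(random_state=0),
    "svm": SVR(),
    "elastic_net": ElasticNet(),
}

# Mean cross-validated R^2 for each standard algorithm
results = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}

for name, score in results.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Looking at which points each model predicts worst (e.g. via cross_val_predict residuals) is a natural next step in the spirit of the advice above.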
Of course, all this assumes a certain kind of problem, where the data is already in numeric/categorical form. For more “interesting” datasets, such as the recent Automated Essay Scoring competition, a lot of the early work is in feature extraction — just looking for numbers which you can pull out of the data. That tends to be a bit more creative, and I use a variety of tools to see what works best. However, one of the joys of this kind of problem is that every one is different, so it’s hard to give general advice.