Have you seen plots of data that are ridiculously explanatory and you can’t understand why? Do you want to plot data and have no idea what sort of plot to use? Do you want to distinguish between plots that are informative and plots that are simply flashy?
ggplot2 is a plotting system for R. It’s an easy way to make decent plots rather quickly, and it’s so helpful that it might be the only thing you use R for. I once gave a lesson at Galvanize’s Data Science Immersive that introduced ggplot2 without expecting any prior knowledge of R.
Perhaps you don’t care about ggplot2 in particular; this lesson, or whatever you want to call it, should also serve as a decent introduction to plotting in general.
I tend to keep two principles in mind when I plot data:
1. Plotting is how we perceive abstract data
I define “plotting” as the process of mapping abstract data to concrete things that we can perceive. Data is abstract and doesn’t have a fundamental representation in the real world. We map abstract data onto concrete things in the real world so that we can understand the abstract data, and in order to make good plots, we have to choose mappings that people understand.
Some people think of data tables and histograms as pure, true representations of data. I don’t. These are simple plots, and they may be easier to understand than complex plots, but they still involve deciding how to represent abstract data as a concrete thing we can perceive.
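To make the idea of a mapping concrete, here is a minimal sketch in ggplot2. The data frame and its column names are invented for illustration; they are not from the lesson's files.

```r
library(ggplot2)

# Hypothetical toy data: the column names are made up for this example.
spending <- data.frame(
  day    = c(1, 2, 3, 4, 5),
  amount = c(120, 95, 143, 80, 110)
)

# Each abstract variable is mapped to something we can perceive:
# day -> horizontal position, amount -> vertical position.
p <- ggplot(spending, aes(x = day, y = amount)) +
  geom_point()

print(p)
```

Even this plain scatterplot involves a decision: nothing about the data says that day has to go on the horizontal axis.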
2. Escape flatland
Plots are more informative when they relate or compare several things, and an enthralling plot relates several variables without being distracting. I pay lots of attention to how many different variables I am representing in a single plot, and whenever possible, I try to add more. I look for interactions in plots the same way that I look for interactions in statistical models.
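As a rough sketch of what adding variables can look like in ggplot2, the example below starts with two variables and ends with four. The data and column names are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data; every column name here is invented for illustration.
flows <- data.frame(
  day     = rep(1:4, times = 4),
  amount  = c(10, 12, 9, 14, 7, 8, 6, 9, 20, 18, 22, 25, 15, 13, 16, 12),
  type    = rep(c("deposit", "withdrawal"), each = 4, times = 2),
  account = rep(c("A", "B"), each = 8)
)

# Two variables: horizontal and vertical position only.
flat <- ggplot(flows, aes(x = day, y = amount)) +
  geom_point()

# Four variables: position, color for type, and one panel per account.
rich <- ggplot(flows, aes(x = day, y = amount, color = type)) +
  geom_line() +
  facet_wrap(~ account)
```

Printing `flat` and then `rich` shows the difference: the second plot lets you compare deposits against withdrawals within each account, which the first cannot.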
Now let’s focus more specifically on ggplot2. The first thing to do is set up your computer. If you have R installed and can write some simple SQL, you can download all of the files that you will need with either of the following commands:
- git clone firstname.lastname@example.org:tlevine/ggplot-not-r
- wget --recursive --level=2 http://dada.pink/ggplot-not-r/
If not, no problem—all of these files are linked here as well.
I’ve prepared a toy project that I hope will give you some understanding of plotting. We’re going to play around with some data provided by treasury.io, which offers a daily feed of deposits and withdrawals from different accounts within the federal treasury. You can download the full historical feed as a SQLite3 database here.
Your task is to play around with the data and make some plots. A good exercise is to come up with an arbitrary question for yourself, such as: “How does spending differ between Tuesdays and Thursdays?”
Base your code on treasury.r so that you don’t have to know how R works. Query the data with the sqldf function, and then pass the results of those queries to the ggplot function to make plots.
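The query-then-plot workflow might look something like the sketch below. It assumes the sqldf and ggplot2 packages are installed, and it uses a small stand-in data frame rather than the real database; the table name and columns are hypothetical, so check treasury.r for the actual ones.

```r
library(sqldf)
library(ggplot2)

# Stand-in for a table from the treasury database. The table name and
# columns are hypothetical, not the real treasury.io schema.
transactions <- data.frame(
  weekday = c("Tuesday", "Tuesday", "Thursday", "Thursday", "Thursday"),
  type    = c("withdrawal", "deposit", "withdrawal", "withdrawal", "deposit"),
  amount  = c(100, 40, 250, 50, 75)
)

# sqldf runs SQL against data frames in the current R session
# and returns the result as another data frame.
by_day <- sqldf("
  SELECT weekday, SUM(amount) AS total
  FROM transactions
  WHERE type = 'withdrawal'
  GROUP BY weekday
")

# Pass the query result straight to ggplot.
p <- ggplot(by_day, aes(x = weekday, y = total)) +
  geom_point()
```

The SQL does the filtering and aggregation, so the ggplot call only has to describe the mapping.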
If you aren’t comfortable with SQL, just use the full tables for making the plots. This only limits how much you can filter and reshape the data before plotting; you’ll still be able to make rather complex plots.
Start out by changing the contents of the aes function call and by adding geoms other than geom_point.
Here are some aesthetics (aes) that you might try:
And here are some geoms you might try:
You can look at the ggplot2 documentation for further reference. (Feel free to ignore the parts that you don’t understand.)
An important thing to remember at this point is not to worry about making your plots look pretty—just try to make plots that tell you something.
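Here is one way that kind of experimenting might go. The aesthetics (size, color, fill, alpha) and geoms (geom_point, geom_boxplot, geom_jitter, geom_histogram) are all standard ggplot2, but they are my own picks for illustration, and the toy data is made up.

```r
library(ggplot2)

# Hypothetical toy data; the column names are invented.
toy <- data.frame(
  weekday = rep(c("Tuesday", "Thursday"), each = 5),
  amount  = c(12, 15, 11, 30, 14, 22, 25, 19, 21, 40),
  count   = c(3, 4, 2, 8, 3, 5, 6, 4, 5, 9)
)

base <- ggplot(toy, aes(x = weekday, y = amount))

# Same data and position mapping, different aesthetics and geoms.
points <- base + geom_point(aes(size = count, color = weekday), alpha = 0.6)
boxes  <- base + geom_boxplot(aes(fill = weekday))
jitter <- base + geom_jitter(width = 0.1)

# A histogram needs only one variable; the heights are computed counts.
histo <- ggplot(toy, aes(x = amount)) + geom_histogram(bins = 5)
```

Print each object in turn (for example `print(boxes)`) and notice which ones actually tell you something about the data.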
When You Get Bored
Here are a few directions you can go when you get tired of ggplot or the above toy project.
Use your own data
If I were you, I’d probably just want to start playing with my own data. Figure out how to load your data into R, then start plotting it.
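If your data is in a CSV file, loading it can be as short as the sketch below. The example writes a tiny temporary file so that it runs on its own; the file contents and column names are invented, and in practice you would point read.csv at your own file.

```r
library(ggplot2)

# Write a tiny CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("height,weight", "150,52", "165,61", "180,77"), path)

# read.csv returns a data frame, which is what ggplot expects.
mine <- read.csv(path)
str(mine)

p <- ggplot(mine, aes(x = height, y = weight)) +
  geom_point()
```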
More about the Grammar of Graphics
Once you start seeing plots as mappings between abstract data and concrete elements, different plots start seeming much more similar. Look at a few plots, such as these, and deconstruct them into their grammatical components. Answer at least these questions about each plot:
- What variables are represented?
- What real-world element is each abstract variable mapped to?
If you would like to delve deeper, try these questions:
- What is the coordinate system?
- What happens when two data points collide on the plot?
- What are the scales for the different variables?
- How is the data transformed before it is plotted? That is, what mathematical operations were applied?
- Alternatively, just answer the following question: How would you write this plot in ggplot2?
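For a sense of how those questions line up with ggplot2 code, here is a sketch in which each line answers one of them. The data is made up, and this is just one possible decomposition.

```r
library(ggplot2)

# Hypothetical toy data.
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  value = c(1, 10, 100, 1000, 2, 20, 200, 2000, 5, 50, 500, 5000)
)

p <- ggplot(toy, aes(x = group, y = value)) +  # variables and their mappings
  # Collisions: nudge overlapping points sideways.
  geom_point(position = position_jitter(width = 0.1, height = 0)) +
  # Transformation: summarize each group by its median, drawn as a cross.
  stat_summary(fun = median, geom = "point", shape = 4, size = 4) +
  # Scale: value is shown on a log scale.
  scale_y_log10() +
  # Coordinate system: Cartesian, with the axes swapped.
  coord_flip()
```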
Make Your Graphs Look Better
In general, if you want to make something interesting, follow most of the rules and then break one or two.
I’d recommend reading Design Elements, by Timothy Samara (or you can just read this summary of the rules). Now try applying these design elements to your plots.
There are also some special rules for visual displays of quantitative information. You can read about them in any of Edward Tufte’s books, such as The Visual Display of Quantitative Information. Here are two main concepts to try to understand:
- Escaping Flatland (presenting multivariate data)
- Data-ink ratio
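As a small illustration of the second concept, the sketch below draws the same line twice, the second time with less ink spent on things that aren't data. Using theme_minimal for this is my own choice rather than something Tufte prescribes, and the data is made up.

```r
library(ggplot2)

# Hypothetical toy data.
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

# ggplot2's default theme: gray panel, major and minor grid lines.
default <- ggplot(toy, aes(x = x, y = y)) +
  geom_line()

# Same data, less non-data ink: no gray panel, no minor grid lines.
leaner <- default +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())
```

Print both and ask which marks you could erase without losing any information.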
Learn More R
Here are some additional materials for learning R:
I tried to keep it short so you could get on with the toy project, so I left out the following materials, which are also great:
- Visualizing Data, by William Cleveland, has many fine examples of different ways to represent datasets with one variable and two variables.
- Here are some slides about ggplot2.
- The Grammar of Graphics, by Leland Wilkinson, is obviously relevant, but I dislike the order in which it’s presented. Consider reading it in the order of this outline.