In this post, you’re going to learn the 20% of Matplotlib that you’ll use 80% of the time.
(This guide is emphatically not meant to be comprehensive. Instead, it will show you how to get up and running quickly with the most useful commands.)
Just a couple quick things before we dive in.
If you need to get a professional data science environment set up on your computer, we have a guide for that: Step-by-Step Guide to Setting Up a Professional Data Science Environment.
And, if you’d rather watch an in-depth video tutorial, we have that for you right here: YouTube – Matplotlib Mega Tutorial.
Alright, ready to get started?
Table of Contents
- 80/20 Matplotlib
- Primary Objects
- Loading Data Using Pandas
- Our First Graph—A Histogram
- Other Graph Types
- Using Color
- Plotting Multiple Graphs—Same Axes
- Plotting Multiple Graphs—Different Axes and Subplots
- Saving Figures
- Additional Matplotlib Information
Primary Objects
One of the first things to ask when faced with a new Python package is, “What are the primary data structures, methods, and other objects?”
Since Matplotlib is a data visualization package, the primary objects we’re going to be using relate to pieces of a data visualization.
First, there’s the Matplotlib figure object, which is essentially the entire image. There could be one graph on that image, or a million graphs on it—the whole image is the figure. The figure also includes any text on the image, including plot titles, axis labels, annotations, legends, etc.
Second, there’s the Matplotlib axes object, which is the part of the figure where the graph actually happens. One figure can contain multiple axes objects, in order to have multiple graphs. The axes object is where we’ll actually be plotting our scatter plots, bar charts, line charts, etc.
One important note about confusing terminology. On a graph, you’ll typically have an x-axis and a y-axis, which are your horizontal and vertical lines on the bottom and side of your graph. Be careful not to confuse this kind of “axis” with the Matplotlib “axes” object. An axis is a single dimension of a graph, like an x-axis, while a Matplotlib axes object relates to where a whole plot is being graphed. A good synonym for “axes” might be “graph”, so you can think about it like that if you want.
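To make the figure/axes/axis distinction concrete, here is a minimal sketch that creates both primary objects and pokes at them. It only uses plt.subplots() plus the Figure and Axes classes; nothing here depends on any dataset.

```python
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure

# One call creates both primary objects: the figure and a single axes.
fig, ax = plt.subplots()

print(isinstance(fig, Figure))  # the whole image
print(isinstance(ax, Axes))     # the "graph" that lives on the figure

# The figure keeps track of every axes object it contains.
print(fig.axes == [ax])

# Each axes object owns its own x-axis and y-axis (the "single dimension" kind).
print(ax.xaxis)
print(ax.yaxis)
```

So an axes object (plural-looking name, one graph) has two axis objects hanging off of it, and the figure holds the axes.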
Here’s a very helpful graphic from the Matplotlib website showing the different pieces of a Matplotlib image.
Loading Data Using Pandas
Let’s get some data to play with. We’re going to be using the world happiness dataset from Kaggle, which you can download here: https://www.kaggle.com/unsdsn/world-happiness#2019.csv. Specifically, we’ll look at the 2019 data.
We’re going to use the Python package pandas to load the data, which is very often how we’ll load data in data science projects. We’ll be doing all of our coding in a Jupyter notebook.
If you want to learn how to use the pandas package, see our guide 80/20 Pandas—Pandas for Data Science.
From the first five rows of the dataset, you can see that we have a list of countries and some data about those countries: GDP per capita, healthy life expectancy, and the happiness score, for example.
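The loading step might look roughly like the sketch below. The file name 2019.csv and the column names are assumptions based on the Kaggle download, and the small DataFrame is a made-up stand-in so the snippet runs even without the file.

```python
import pandas as pd

# With the Kaggle file saved next to the notebook, the real load would be:
#     df = pd.read_csv("2019.csv")
# Stand-in frame so this sketch runs without the download.
# The values are made up; the column names are assumed to match the Kaggle file.
df = pd.DataFrame({
    "Country or region": ["Country A", "Country B", "Country C",
                          "Country D", "Country E", "Country F"],
    "Score": [7.6, 7.1, 6.3, 5.4, 4.6, 3.8],
    "GDP per capita": [1.40, 1.30, 1.10, 0.90, 0.60, 0.30],
    "Healthy life expectancy": [1.00, 0.95, 0.85, 0.70, 0.50, 0.35],
})

# Look at the first five rows.
print(df.head())
```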
Our First Graph—A Histogram
Suppose we want to take a look at the distribution of happiness scores. We can do this in Matplotlib using a histogram. Let’s go ahead and import Matplotlib and get ready to create our first data visualization!
The first thing you’ll notice is that we’re actually importing the pyplot module from Matplotlib, and we’re aliasing it as plt, which is the common way to import pyplot. Most of the time, pyplot is the only thing you’ll need to import from Matplotlib.
But if you do want to see what else we could import from Matplotlib, you can use tab-complete in Jupyter Notebooks to see what other subpackages and modules are there.
But like we said above, we’ll just be sticking to pyplot.
Let’s go ahead and plot a histogram of the Score column in our dataset. We can do this in three short lines of code.
First, we create the figure and axes objects (which are the two primary objects we mentioned in the beginning of the article). Second, we plot our histogram. And third, we show our plot.
Notice that ax.hist(), which plots the histogram, is a method on our axes object. You plot data visualizations on axes objects, not on the figure object. The figure is a container for axes, titles, and a few other things, while the plotting itself is done on the axes.
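Here is a sketch of the import plus those three steps. The DataFrame is a random, made-up stand-in for the Kaggle data (with the real file you would use the df loaded by pandas), and capturing the return value of ax.hist() is optional; it is only there so we can inspect the bins.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: random, made-up happiness scores.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Score": rng.normal(5.4, 1.1, size=150)})

# 1. Create the figure and axes objects.
fig, ax = plt.subplots()

# 2. Plot the histogram on the axes (10 bins by default).
counts, bin_edges, _ = ax.hist(df["Score"])

# 3. Show the plot (Jupyter also renders it at the end of the cell).
plt.show()
```

If you don't need the bin counts, ax.hist(df["Score"]) on its own is enough.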
You’ll notice that our x-axis has a scale going from roughly three to eight, while the y-axis has a scale going from zero to about thirty. The x-axis ticks show every integer between three and eight inclusive, while the y-axis ticks only show multiples of five.
Adding a Figure Title
Let’s add some text to our image to make it more informative. First, we’ll add a title using the figure.suptitle() method on our figure object.
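As a sketch, adding the title is one extra line. The title text is just an example, and the random scores are a made-up stand-in for the Score column.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: random, made-up happiness scores.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Score": rng.normal(5.4, 1.1, size=150)})

fig, ax = plt.subplots()
ax.hist(df["Score"])

# The title belongs to the figure, so it is a method on fig, not ax.
title = fig.suptitle("Distribution of Happiness Scores")
```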
Adding Axis Labels
And we usually want to add an x-axis label and a y-axis label as well, to describe what the axis represents.
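Axis labels are set on the axes object, since each graph has its own x-axis and y-axis. A minimal sketch, with example label text and made-up stand-in data:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: random, made-up happiness scores.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Score": rng.normal(5.4, 1.1, size=150)})

fig, ax = plt.subplots()
ax.hist(df["Score"])
fig.suptitle("Distribution of Happiness Scores")

# Labels describe what each axis represents.
ax.set_xlabel("Happiness score")
ax.set_ylabel("Number of countries")
```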
This is looking good! We’ve already learned how to do most of the important things in Matplotlib—we can create a graph, we can add a title, and we can add axis labels. I’d say this represents about 60% of what you’ll want to do in Matplotlib.
But, with just a few more pieces of functionality we can get you up to 80%. So let’s keep going.
Other Graph Types
Histograms are only one type of graph that you’ll want to plot. There are also bar charts, line charts, scatter plots, and more. Let’s show how to do some of those really quickly.
If we want to look at the relationship between Score and GDP per Capita, that’s a perfect opportunity for a scatter plot. We can use the axes.scatter() method to accomplish this.
We can see a very strong positive correlation there, which is good information to have.
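A scatter plot along those lines could be sketched like this. The data is a random stand-in built to have a positive relationship (it is not the real Kaggle data), and putting GDP on the x-axis is just one reasonable choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: made-up values with a positive relationship.
rng = np.random.default_rng(0)
gdp = rng.uniform(0.0, 1.7, size=150)
df = pd.DataFrame({
    "GDP per capita": gdp,
    "Score": 3.4 + 2.2 * gdp + rng.normal(0.0, 0.5, size=150),
})

fig, ax = plt.subplots()

# One point per country: x is GDP per capita, y is the happiness score.
points = ax.scatter(df["GDP per capita"], df["Score"])

fig.suptitle("Happiness Score vs. GDP per Capita")
ax.set_xlabel("GDP per capita")
ax.set_ylabel("Happiness score")
```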
Maybe we want to see a line chart of all of the GDP values sorted from smallest to largest. We can sort our data using the Python sorted() function, and then use the axes.plot() method to plot a line chart.
(In this graph, we removed our x-axis label since the x-axis doesn’t really mean anything in this graph, other than being a general numeric index for each data point.)
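The sorted line chart might look like the sketch below, again with made-up stand-in values for the GDP column. When ax.plot() gets a single sequence, it uses 0, 1, 2, ... as the x-values, which is why the x-axis is only an index here.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: random, made-up GDP values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"GDP per capita": rng.uniform(0.0, 1.7, size=150)})

# sorted() returns a new list, smallest value first.
gdp_sorted = sorted(df["GDP per capita"])

fig, ax = plt.subplots()

# With only y-values given, the x-values default to 0, 1, 2, ...
(line,) = ax.plot(gdp_sorted)

fig.suptitle("GDP per Capita, Sorted")
ax.set_ylabel("GDP per capita")
```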
Finally, let’s say we want to present the happiness scores for the top five countries. This is a perfect use case for a bar chart, which we can create using the axes.bar() method. First, we create two variables to hold the data for our top five countries and those countries’ scores, then we create the plot.
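Here is one way the bar chart could be sketched. The country names and scores are made up, and the explicit sort is an assumption so the snippet does not rely on the rows already being ranked.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the Kaggle data (made-up names and scores).
df = pd.DataFrame({
    "Country or region": ["Country A", "Country B", "Country C", "Country D",
                          "Country E", "Country F", "Country G"],
    "Score": [6.9, 7.6, 5.1, 7.2, 7.4, 4.3, 7.5],
})

# Two variables: the top five countries and their scores.
top_five = df.sort_values("Score", ascending=False).head(5)
countries = top_five["Country or region"]
scores = top_five["Score"]

fig, ax = plt.subplots()

# One bar per country; the bar height is the happiness score.
bars = ax.bar(countries, scores)

fig.suptitle("Top Five Happiness Scores")
ax.set_ylabel("Happiness score")
```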
Using Color
Let’s talk about color for a bit, since effective use of color is one of the most important parts of data visualization.
First, we can change the color of a graph by passing in a color argument. For scatter plots the parameter is called c, while for most other plot types (histograms and bar charts, for example) it is called color. Let’s change the color of our scatter plot to green.
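A sketch of the green scatter plot, using made-up stand-in data in place of the Kaggle columns:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: made-up values with a positive relationship.
rng = np.random.default_rng(0)
gdp = rng.uniform(0.0, 1.7, size=150)
df = pd.DataFrame({
    "GDP per capita": gdp,
    "Score": 3.4 + 2.2 * gdp + rng.normal(0.0, 0.5, size=150),
})

fig, ax = plt.subplots()

# A single color name applies to every point.
points = ax.scatter(df["GDP per capita"], df["Score"], c="green")

ax.set_xlabel("GDP per capita")
ax.set_ylabel("Happiness score")
```

For a histogram, the equivalent would be something like ax.hist(df["Score"], color="green").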
For scatter plots, we can do something much more powerful though—we can color each individual data point by another variable. For example, let’s say that we want to color each data point based on the healthy life expectancy. We can simply pass the healthy life expectancy data to the parameter c.
Look at that! There’s obviously a pattern here. But, there’s nothing here to tell us what those colors mean… Let’s fix that with a colorbar.
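Coloring by a third variable and adding the colorbar might be sketched as follows. The data is a made-up stand-in, and the colorbar label text is just an example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: made-up, loosely related values.
rng = np.random.default_rng(0)
gdp = rng.uniform(0.0, 1.7, size=150)
df = pd.DataFrame({
    "GDP per capita": gdp,
    "Score": 3.4 + 2.2 * gdp + rng.normal(0.0, 0.5, size=150),
    "Healthy life expectancy": 0.2 + 0.5 * gdp + rng.normal(0.0, 0.1, size=150),
})

fig, ax = plt.subplots()

# Passing a column to c colors each point by that column's value.
points = ax.scatter(
    df["GDP per capita"], df["Score"], c=df["Healthy life expectancy"]
)

# The colorbar explains what the colors mean; it needs the scatter's return value.
cbar = fig.colorbar(points, ax=ax)
cbar.set_label("Healthy life expectancy")

ax.set_xlabel("GDP per capita")
ax.set_ylabel("Happiness score")
```

Notice that the colorbar is drawn on its own small axes, so the figure now holds two axes objects.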
The cmap Parameter
If we want to change the color palette, we can pass in an argument for the cmap parameter (“color map”).
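For example, a sketch using the built-in "plasma" color map (an arbitrary pick; any named Matplotlib color map would work the same way), with made-up stand-in data:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: made-up, loosely related values.
rng = np.random.default_rng(0)
gdp = rng.uniform(0.0, 1.7, size=150)
df = pd.DataFrame({
    "GDP per capita": gdp,
    "Score": 3.4 + 2.2 * gdp + rng.normal(0.0, 0.5, size=150),
    "Healthy life expectancy": 0.2 + 0.5 * gdp + rng.normal(0.0, 0.1, size=150),
})

fig, ax = plt.subplots()

# cmap picks the palette used to turn the c values into colors.
points = ax.scatter(
    df["GDP per capita"],
    df["Score"],
    c=df["Healthy life expectancy"],
    cmap="plasma",
)
fig.colorbar(points, ax=ax)
```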
Plotting Multiple Graphs—Same Axes
Now that we’ve covered a lot of the functionality that you can do with a single graph, let’s talk about how to do multiple graphs. The first kind of multi-graph plot that we’ll work with is where you plot multiple graphs on the same axes. Doing this is incredibly easy—you simply add one more line of code with your new plot.
For example, let’s say that we want to plot two line charts on the same graph—our first line chart will be the happiness score of each country, and the second will be the healthy life expectancy of each country.
We can simply have two axes.plot() method calls on the same axes object. (We’ll remove the plot text for now.)
Adding a Legend
In order to tell which graph is which, we can add a legend to our graph. First, we’ll need to pass in a label parameter to each plot method. Then, we can call the axes.legend() method to add the legend to our graph with the correct labels.
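Putting the two line charts and the legend together could look like this sketch. The values are a made-up stand-in for the two columns, and the label text is an example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: made-up values for the two columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Score": np.sort(rng.normal(5.4, 1.1, size=150))[::-1],
    "Healthy life expectancy": rng.uniform(0.1, 1.1, size=150),
})

fig, ax = plt.subplots()

# Two plot calls on the same axes object draw two lines on the same graph.
ax.plot(df["Score"], label="Happiness score")
ax.plot(df["Healthy life expectancy"], label="Healthy life expectancy")

# The legend picks up the label passed to each plot call.
legend = ax.legend()
```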
Very often, we’ll want to do multiple histograms on the same graph, to compare the distributions of two variables. For example, let’s say we want to split the countries up into “high happiness” and “low happiness” groups, and then plot the GDP per capita distributions for each of those groups to see if the distributions of GDP per capita are different.
First, we’ll create our variables.
Then, we’ll plot the two histograms on the same axes.
Alpha Parameter for Transparency
The orange histogram is covering a large part of the blue histogram, which isn’t good. We can fix this by using the alpha parameter, which sets the transparency of each graph. We’ll set the alpha to something less than 1 so that each graph becomes transparent and we can see both graphs clearly.
The alpha parameter is useful anytime you need to plot multiple overlapping graphs like this.
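Here is a sketch of the whole two-histogram example. Splitting the groups at the median score is an assumption (the text does not say where the cutoff is), alpha=0.5 is one reasonable value below 1, and the data is a made-up stand-in.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: made-up values with a positive relationship.
rng = np.random.default_rng(0)
gdp = rng.uniform(0.0, 1.7, size=150)
df = pd.DataFrame({
    "GDP per capita": gdp,
    "Score": 3.4 + 2.2 * gdp + rng.normal(0.0, 0.5, size=150),
})

# Create the variables: GDP per capita for each happiness group.
# Assumption: "high" means at or above the median score.
cutoff = df["Score"].median()
high_gdp = df.loc[df["Score"] >= cutoff, "GDP per capita"]
low_gdp = df.loc[df["Score"] < cutoff, "GDP per capita"]

fig, ax = plt.subplots()

# alpha below 1 makes each histogram see-through where they overlap.
_, _, high_patches = ax.hist(high_gdp, alpha=0.5, label="High happiness")
_, _, low_patches = ax.hist(low_gdp, alpha=0.5, label="Low happiness")

ax.set_xlabel("GDP per capita")
ax.legend()
```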
Plotting Multiple Graphs—Different Axes and Subplots
Rather than doing multiple plots on the same axes, what if we want totally separate axes for each plot? This is where we can use subplots with the plt.subplots() function.
Let’s say we want to create a two-by-two figure, with four axes total. Before plotting anything, let’s just see what happens when we create that two-by-two figure with four axes.
You can see that the axs object is now an array (technically a NumPy ndarray) that holds four different axes objects, one axes per graph. Since we didn’t plot anything, each axes shows up as a blank rectangle for now.
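A quick sketch of that empty two-by-two figure, with a couple of checks on what comes back:

```python
import matplotlib.pyplot as plt
import numpy as np

# Two rows and two columns of axes on a single figure.
fig, axs = plt.subplots(2, 2)

print(isinstance(axs, np.ndarray))  # axs is a NumPy array...
print(axs.shape)                    # ...with one entry per graph

# Individual axes are picked out by row and column.
top_left = axs[0, 0]
bottom_right = axs[1, 1]
```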
Let’s get each axes object from the array and plot histograms for four of our variables.
Beautiful! But once again, we’re missing text which means that we don’t know what data each plot has. We can add text using each of the axes objects. In addition to a title for the whole figure, we can also add a title to each axes, now that we have multiple axes objects.
It looks like our axes titles are getting all mixed up with our x-axis tick labels. Let’s use the plt.tight_layout() function to see if we can fix that.
There we go.
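The finished four-panel figure could be sketched like this. The choice of four columns is an assumption (including "Social support", which is assumed to exist in the Kaggle file), the values are random stand-ins, and looping over axs.flat is just a compact alternative to indexing each axes by hand.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: random, made-up values for four columns.
rng = np.random.default_rng(0)
columns = ["Score", "GDP per capita", "Healthy life expectancy", "Social support"]
df = pd.DataFrame({column: rng.normal(1.0, 0.3, size=150) for column in columns})

fig, axs = plt.subplots(2, 2)

# axs.flat walks through the four axes objects one at a time.
for ax, column in zip(axs.flat, columns):
    ax.hist(df[column])
    ax.set_title(column)  # a title for this one axes

# A title for the whole figure.
fig.suptitle("World Happiness Distributions")

# Adjust the spacing so titles and tick labels don't overlap.
plt.tight_layout()
```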
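Saving Figures
Finally, we’ve created some cool graphs that we might want to share with others. How do we save graphs out to files? We can save images using the plt.savefig() function, which picks the file format from the file extension. Recent versions of Matplotlib can also write JPG files (through the Pillow library), but PNG is lossless and usually the better choice for graphs, so let’s use that.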
If we look in our current directory, we’ll find our image there—and we can open it like any normal image.
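A minimal sketch of saving a figure as a PNG. The file name happiness_scores.png is made up for this example, and the data is a random stand-in.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the Kaggle data: random, made-up happiness scores.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Score": rng.normal(5.4, 1.1, size=150)})

fig, ax = plt.subplots()
ax.hist(df["Score"])
fig.suptitle("Distribution of Happiness Scores")

# The .png extension tells Matplotlib which format to write.
out_path = Path("happiness_scores.png")
plt.savefig(out_path)

# Print the full path of the saved image.
print(out_path.resolve())
```

plt.savefig() saves the current figure; fig.savefig(out_path) does the same thing for a specific figure object.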
Additional Matplotlib Information
And with that, we’ve finished the main section of 80/20 Matplotlib! This is enough Matplotlib functionality to get you through many of the data visualization tasks you need to do.
However, Matplotlib is a rather big, hairy, extensive package—so before wrapping up, there are a few more things that you should at least be aware of in case you run into them.
Two Matplotlib Styles: Object-Oriented vs. MATLAB
You’ll notice that we’ve occasionally used the plt (pyplot) module to do certain things, such as showing figures, saving figures, and creating subplots. Some functionality lives in pyplot rather than on the figure or the axes. This can be confusing at first, but just know that for the most part you’ll only be using the figure and axes objects.
But, there is a way of using Matplotlib that relies pretty much entirely on plt, without using fig or ax at all. This way of using Matplotlib is called the MATLAB style, where there’s a “hidden” state in the background where your figure and axes live.
For example, in the MATLAB-style way, you can create a scatter plot by just calling plt.scatter().
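As a rough sketch of what that looks like (made-up stand-in data again), note that fig and ax never appear. Every call goes through plt, which keeps track of the "current" figure and axes for you.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: made-up values with a positive relationship.
rng = np.random.default_rng(0)
gdp = rng.uniform(0.0, 1.7, size=150)
score = 3.4 + 2.2 * gdp + rng.normal(0.0, 0.5, size=150)

# Start a fresh figure; it becomes the hidden "current" figure.
plt.figure()

# Each call below acts on the current figure and axes.
plt.scatter(gdp, score)
plt.xlabel("GDP per capita")
plt.ylabel("Happiness score")
plt.title("Happiness Score vs. GDP per Capita")

# The hidden objects are still there if you ask for them.
current_ax = plt.gca()
```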
Although this seems simpler at first, it becomes more convoluted very quickly as your graphs get more complex. If you prefer this style after doing your own research, then go for it—otherwise, we recommend sticking with the more explicit fig, ax way of using Matplotlib, which is called the object-oriented style of using Matplotlib.
Matplotlib generally recommends the object-oriented style as well. Here’s a quote from the Matplotlib website:
“For more complicated applications, this explicitness and clarity becomes increasingly valuable, and the richer and more complete object-oriented interface will likely make the program easier to write and maintain.”
Searching for Matplotlib Documentation and Answers
Finally, customizing Matplotlib graphs can take a lot of work and research. The package is so powerful that you can do pretty much anything with it… but doing what you want can take some digging.
The documentation will very often be your friend in this case, as will StackOverflow. When you find yourself needing to do something, just Google your problem and the documentation (or a StackOverflow answer) should pop right up.
I would recommend including either axes or figure in your search so that you get results for the object-oriented style, rather than the MATLAB style.
And with that, you’ve just learned the 20% of Matplotlib that will get you 80% of the value. Feels pretty good, right?
PS, here are some extra resources for you.
- Here’s a great introductory guide on the Matplotlib website: https://matplotlib.org/tutorials/introductory/usage.html
- Seaborn is a nice data visualization tool built on top of Matplotlib: https://seaborn.pydata.org/
- Plotly is an interactive data visualization package for Python: https://plotly.com/python/
- And finally, here’s the Project Data Science Matplotlib Mega Tutorial if you want to dive deeper: https://youtu.be/axSTGczvYIE