
In this post, you’re going to learn the 20% of pandas that you’ll use 80% of the time.

(This guide is emphatically not meant to be comprehensive. The goal is to get you up and running quickly with the most useful commands.)

Before we dive in, just a couple of items.

In case you want to learn more than just pandas, check out the other Project Data Science 80/20 Guides.

If you need to get a professional data science environment set up on your computer, we have a guide for that: Step-by-Step Guide to Setting Up a Professional Data Science Environment.

And if you do want a deeper guide to pandas, here’s our longer video tutorial: YouTube – Pandas Mega-Tutorial.

Alright, ready to get started?



(One quick note—pandas is technically lowercase, so we’ll usually write it lowercase too. We don’t want any pandas developers yelling at us for using the wrong case. Not that they would do that. They’re probably very nice people.)

80/20 Pandas

Primary Data Structures

One of the first things to ask when faced with a new Python package is, “What are the primary data structures, methods, and other objects?”

In the case of pandas, there are two primary objects you’re going to be working with: the pandas Series and the pandas DataFrame.

First, let’s look at the pandas Series.
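
If you want to follow along, you can import pandas and pull up the Series docstring with something like this:

```python
import pandas as pd

# Print the Series docstring (in a Jupyter notebook, `pd.Series?` works too)
help(pd.Series)
```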

(By the way, notice that we’re importing pandas “as pd”, meaning that we’ll reference the pandas package as pd from here on. This is the way that pandas is usually imported. It’s just a convention established by the creators of pandas and the people who use it.)

As you can see from the docstring above, a Series is a “One-dimensional ndarray with axis labels.”

Translating that into more normal language, you can usually think about a Series like a column of data. In an Excel spreadsheet of data, you’ll often have rows and columns—a Series is just like one of those columns, with the column name as part of it too. Let’s create a Series really quickly just so you can see what it looks like.
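
Here's a minimal sketch, with a few example values:

```python
scores = pd.Series([7.769, 7.600, 7.554], name="Score")
print(scores)
# 0    7.769
# 1    7.600
# 2    7.554
# Name: Score, dtype: float64
```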

Now a DataFrame is pretty straightforward once you know what a Series is: a DataFrame is essentially just a collection of Series gathered into a single object.

So if a Series is a single column in an Excel spreadsheet, a DataFrame is the whole spreadsheet with lots of columns.

And here’s a contrived example of a DataFrame, created from two Series objects.
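
Something like this (the values are just for illustration):

```python
countries = pd.Series(["Finland", "Denmark", "Norway"], name="Country")
scores = pd.Series([7.769, 7.600, 7.554], name="Score")

# Stack the two Series side by side into a single DataFrame
df = pd.concat([countries, scores], axis=1)
print(df)
#    Country  Score
# 0  Finland  7.769
# 1  Denmark  7.600
# 2   Norway  7.554
```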

But we don’t usually create DataFrames or Series from scratch—usually we have some data to work with. So let’s get some data!

Loading Data

We’re going to download the World Happiness dataset from Kaggle and specifically look at the 2019 data. You can get the data here: https://www.kaggle.com/unsdsn/world-happiness#2019.csv.

So we download the data and move it into the same directory as our Jupyter notebook. Now we can load it using one of the most common pandas data loading functions, read_csv().
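
Assuming you saved the file as 2019.csv next to your notebook, loading it looks like this:

```python
import pandas as pd

df = pd.read_csv("2019.csv")
type(df)  # pandas.core.frame.DataFrame
```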

We assign the results of the function to a variable called “df”, which stands for “data frame” and is a conventional variable name used when loading data in pandas. You’ll also notice that if we check the type of the df object, we do in fact have a pandas DataFrame.

There are many other data loading functions, but two of the most common that you may use are read_excel() and read_sql().
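
Here's a rough sketch of what those look like. All of the filenames, sheet names, and table names here are hypothetical, so adapt them to your own data sources:

```python
# Reading an Excel file (requires the openpyxl package for .xlsx files)
df_excel = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Reading from a SQL database, using a hypothetical SQLite file as an example
import sqlite3

conn = sqlite3.connect("data.db")
df_sql = pd.read_sql("SELECT * FROM happiness", conn)
```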

Looking at Data

So how do we look at our data? If we want to see a handful of rows, we can use df.head().
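
By default, head() returns the first five rows:

```python
df.head()
```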

We can also just get a list of our columns using df.columns, and if we want to know how many rows and columns are in our data we can use df.shape.
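
Both of these are attributes rather than methods, so there are no parentheses:

```python
print(df.columns)  # an Index of all the column names
print(df.shape)    # a tuple of (number of rows, number of columns)
```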

And if we want to know the data types of each column, we can look at the df.dtypes attribute on the DataFrame.
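
Again, dtypes is an attribute, not a method:

```python
df.dtypes  # the data type of each column
```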

So we can see that our rank is an integer, country is a string (object), and everything else is a float.

Descriptive Statistics

We probably want to start understanding our data a little better now, which is where the df.describe() method comes in handy. We’re going to pass the parameter “include=’all’” to this method so that describe gives us some information about all of our columns, rather than just the numeric columns.
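
Here's the full call:

```python
df.describe(include="all")
```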

And with that single command, we now have some good descriptive statistics for every single one of our columns! Getting descriptive stats is an important first step in exploratory data analysis (EDA), to start understanding your data. Notice that the “Country or region” column is displaying different information, since it’s a non-numeric column.

Displaying a Single Column

Let’s actually look at our country column now, and see which countries are represented in our data. We’ll use the bracket notation to access the column, just like how you would access a dictionary key in Python.
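
In the 2019 file, the column is named "Country or region":

```python
df["Country or region"]
```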

You see that we get a pandas Series returned—which is exactly what we would expect, since DataFrames are basically a collection of Series. If I want to see if any country or region is listed more than once, I can use the Series.value_counts() method on our Series.
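
That looks like this:

```python
df["Country or region"].value_counts()
```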

All of our values are exactly 1, which means that each country or region only shows up once in our data. (If any value were greater than 1, it would be at the very top.) It’s not very useful with this specific dataset, but the Series.value_counts() method is usually extremely useful for counting how many rows you have for each value in a column.

Plotting a Histogram

What if we want to look at a single numeric column, like Score? We can access it exactly the same as before if we want.

But since this is numeric data, there’s a much better way to view this data: with a histogram. We’ll use the Series.hist() method to plot a histogram of the Score column.
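
Note that Series.hist() uses matplotlib under the hood, so matplotlib needs to be installed:

```python
import matplotlib.pyplot as plt

df["Score"].hist()
plt.show()  # not needed in Jupyter, where the plot renders inline
```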

Now we can see the shape of our data. Most of our values lie somewhere between 4.0 and 6.5, with a couple odd peaks in there around 4.5 and 6.0.

Aggregating, Sorting, and Filtering Data

If we want to find the minimum, maximum, and average scores, we can use the Series.min(), Series.max(), and Series.mean() methods on our Series. (We could also just look back at our df.describe() method if we wanted.)
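
For the Score column, that looks like this:

```python
print(df["Score"].min())
print(df["Score"].max())
print(df["Score"].mean())
```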

Sorting Data

If we want to sort our data, we can do that using the df.sort_values() method. Let’s sort countries by life expectancy to see what the healthiest countries are. We’ll use the parameter “ascending=False” to sort from largest values to smallest values.
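
In the 2019 file, the relevant column is "Healthy life expectancy":

```python
df.sort_values("Healthy life expectancy", ascending=False).head()
```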

Filtering Data

Let’s talk about filtering our data for a minute, because filtering your DataFrames is something that you’ll do all the time.

To filter your data, first you create a boolean mask, which is basically a Series of True and False values that marks which rows you want to keep. It’s very straightforward to create a boolean mask based on a Series in your dataset—you simply use the Series in a logical statement.

For example, let’s say we want to look at just the countries with a Score greater than seven. Creating our boolean mask is a single statement that looks like this.
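
We'll save it in a variable called "mask" (the name is just our choice):

```python
mask = df["Score"] > 7
mask.head()
# 0    True
# 1    True
# 2    True
# 3    True
# 4    True
# Name: Score, dtype: bool
```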

The first values are True, indicating that the countries with a score over seven currently sit at the top of our DataFrame. We can now pass this boolean mask into our DataFrame using the bracket notation again, which will filter down to only the rows where the mask value is True.
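
Passing the mask in brackets keeps only the rows where the mask is True:

```python
df[mask]
```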

There are actually quite a few countries with a score over seven!

We can create our boolean mask however we want, to do any kind of complicated filtering we can think of. For example, let’s create a boolean mask where the score is over seven and the healthy life expectancy is under one. We can use the logical operators in Python to combine the two Series together.
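
Note the parentheses around each comparison; they're required because & binds more tightly than the comparison operators:

```python
mask = (df["Score"] > 7) & (df["Healthy life expectancy"] < 1)
df[mask]
```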

And there we go, a small handful of countries with a score above seven and a healthy life expectancy under one.

Note that you can create the boolean mask right in the brackets if you want—you technically don’t need to create a separate variable.
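
For example, this one line does the same filtering as before:

```python
df[df["Score"] > 7]
```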

Creating New Columns

Let’s say we want to create a new column, such as score times ten. It’s very easy to do that by simply multiplying the column by 10.

We can assign the result to a new column by using the same bracket notation we’re used to, with the new column name in the brackets.
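
Like this (the column name "Score_x10" is just our choice):

```python
# Multiply the column by 10 and assign the result to a new column
df["Score_x10"] = df["Score"] * 10
df.head()
```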

You can see the new column on the far right, which is where columns get created by default.

Combining Multiple Columns

It’s also just as easy to combine two columns together, such as multiplying the Score column and the Generosity column to create a fun new made-up metric, “Scorgerosity”. (Which isn’t a real metric, by the way… but maybe it should be!)
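
Multiplying two columns works the same way as multiplying by a scalar:

```python
df["Scorgerosity"] = df["Score"] * df["Generosity"]
df[["Country or region", "Score", "Generosity", "Scorgerosity"]].head()
```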

Looks like Iceland comes out ahead of these other four countries on the Scorgerosity metric. Sounds like a lovely place!

Generic Column Manipulation Using Series.map()

You can also do any kind of special column manipulation you want using the Series.map() method, which is something that you’ll probably start using all the time once you get the hang of it. The trick with the Series.map() method is that you pass it a Python function that accepts a single value as input and returns a single value as output.

This is easier to understand with an example. Let’s say we want to create some categories based on our score column. For countries with a score over 6, we want to categorize those as “High Score”. For countries with a score under 4, we want to categorize those as “Low Score”. And finally, we want to categorize everything else as “Middle Score”.

First, let’s create a Python function that accepts score as a single parameter and returns a category.
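
Here's a sketch (the function name categorize_score is our own choice):

```python
def categorize_score(score):
    if score > 6:
        return "High Score"
    elif score < 4:
        return "Low Score"
    else:
        return "Middle Score"
```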

Now we can pass that function to the Series.map() method and assign the results to a new “Score_Category” column.
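
Note that we pass the function object itself, with no parentheses:

```python
df["Score_Category"] = df["Score"].map(categorize_score)
df[["Country or region", "Score", "Score_Category"]].head()
```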

Another very common way to pass simple functions into the Series.map() method is to use a Python lambda function, which is an “anonymous” function you can define right where you need it. Let’s use a lambda function to create a column showing whether or not the country name starts with the letter “N”.
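
Something like this (the column name "Starts_With_N" is just our choice):

```python
df["Starts_With_N"] = df["Country or region"].map(lambda name: name.startswith("N"))
df[["Country or region", "Starts_With_N"]].head()
```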

We got you, Norway and Netherlands!

Group By Operations

One of the last pieces of functionality we’ll look at is DataFrame.groupby(). This “group by” method is somewhat similar to an Excel pivot table (although pandas can also do pivot tables exactly like Excel using the DataFrame.pivot_table() method), and it’s pretty much exactly the same as a SQL “GROUP BY” clause.

Let’s say we want to look at the average GDP for countries with the same rounded score—so we want to get the average GDP for countries with a score of 7, the average GDP for countries with a score of 6, etc.

First, we’ll create a column with a rounded score, rounding to the nearest whole number.
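
(We'll call the new column "Rounded_Score".)

```python
df["Rounded_Score"] = df["Score"].round()
```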

Now, we can use the DataFrame.groupby() method on this rounded score to group all of our data by the scores.
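
Like this:

```python
grouped = df.groupby("Rounded_Score")

# The result is a DataFrameGroupBy object, not a DataFrame;
# .groups shows which row indices fall into each group
print(grouped.groups)
```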

The DataFrameGroupBy object isn’t very helpful by itself—we have to tell it what we want to do with the groups. But, for illustrative purposes, we’ve printed out the groups above so you can see what the DataFrameGroupBy object is storing under the hood. For each rounded score, we have a list of the row indices for the countries with that score.

What did we want again?

That’s right, we wanted to get the average GDP for each group. First, we can call the DataFrameGroupBy.mean() method on our group by object to calculate the averages for every numeric column in our dataset, by group.
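
One note: in recent pandas versions you'll want to pass numeric_only=True here, since non-numeric columns like "Country or region" can't be averaged (older versions silently dropped them):

```python
grouped.mean(numeric_only=True)
```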

But we really only wanted GDP, so after calculating the mean we can filter down to just “GDP per capita” using our bracket notation again.
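
Chaining the bracket notation onto the result gives us just the one column:

```python
grouped.mean(numeric_only=True)["GDP per capita"]
```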

Notice that the full averaged group by object was a DataFrame, then we accessed a single column in that DataFrame to return a Series object. The GDP certainly is much higher on average for countries with higher scores—an important and interesting finding.

Saving Data to a File

Finally, we’ve done a lot of work on our data and now we want to save it back out to a different CSV file. We can do that using the DataFrame.to_csv() method. We’ll pass “index=False” as a parameter, since our DataFrame index doesn’t currently carry any useful information. (That’s a topic for a whole other post.)
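
Like this (the output filename "2019_processed.csv" is just our choice):

```python
df.to_csv("2019_processed.csv", index=False)
```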

I can go to this folder and open this file up with Excel now, just to check that it wrote everything out correctly.

Looks good to me!

Conclusion

And with that, you’ve just learned the 20% of pandas that will get you 80% of the value. How’s that for efficient?

Happy learning!

PS—here’s the pandas documentation, in case you need it: https://pandas.pydata.org/docs/.


