In this post, you’re going to learn the 20% of NumPy that you’ll use 80% of the time.
(This guide is emphatically not meant to be comprehensive. Instead, it will show you how to get up and running quickly with the most useful commands.)
By the way—if you need to get a professional data science environment set up on your computer, we have a guide for that: Step-by-Step Guide to Setting Up a Professional Data Science Environment.
Ready to get started?
Table of Contents
- 80/20 NumPy
- Primary Data Structures
- Loading and Looking at NumPy Arrays
- Slicing and Indexing
- Mathematical Functions
- Element-Wise Operations and Vectorization
- Filtering Using Boolean Masks
- If/Then Operations Using np.where()
- Sorting Arrays
- Getting NumPy Arrays from Pandas DataFrames
- Creating Ranges in NumPy
- Reshaping Arrays
Primary Data Structures
One of the first things to ask when faced with a new Python package is, “What are the primary data structures, methods, and other objects?”
In the case of NumPy, there’s really only one new data structure you need to know: the array. More specifically, the homogeneous multidimensional array, meaning an array that stores only a single type of data and can have multiple dimensions.
Loading and Looking at NumPy Arrays
Let’s import NumPy (which we’ll alias as “np”, since this is the usual convention), create a very simple array, and then look at the data type of that array.
So we created a very simple array by just passing a Python list to the np.array() function. We can see that it still looks like the list we passed in, only the type of the object is now numpy.ndarray, where “nd” stands for “n-dimensional”, which is just another way of saying multidimensional.
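Here is a minimal sketch of that step. The list values and the name `simple_array` are placeholders picked for illustration.

```python
import numpy as np

# Build a one-dimensional array from a plain Python list.
simple_array = np.array([1, 2, 3, 4, 5])

print(simple_array)        # [1 2 3 4 5]
print(type(simple_array))  # <class 'numpy.ndarray'>
```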
Loading NumPy Data
Let’s get some real data to play with, though, to make it more interesting. We’ll use the Python machine learning package scikit-learn to load data on the chemical properties of various wines. (Yay wine!)
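One way to load this data, assuming scikit-learn is installed, is its `load_wine()` helper. The variable names `wine` and `data` are just choices made for this sketch.

```python
from sklearn.datasets import load_wine

# load_wine() returns a container object; its .data attribute is the
# NumPy array of measurements (one row per wine).
wine = load_wine()
data = wine.data

print(data)
```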
There we go, that’s better.
So what exactly do we have here? It looks like a bunch of nested Python lists—and that is, in fact, one way that you can think about it. We can access the first row of data just like we would with a list of lists in Python.
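A quick sketch of grabbing that first row, again assuming the array comes from scikit-learn's `load_wine()`:

```python
from sklearn.datasets import load_wine

data = load_wine().data

# Index 0 returns the first row, just as it would for a list of lists.
first_row = data[0]
print(first_row)
```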
We have 13 different numbers here. Just for curiosity’s sake, what are these numbers?
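Assuming the data was loaded with scikit-learn's `load_wine()`, the column labels should be available on the returned object as `feature_names`:

```python
from sklearn.datasets import load_wine

wine = load_wine()

# feature_names lists the column labels in the same order as the
# columns of wine.data.
feature_names = wine.feature_names
print(feature_names)
```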
These 13 different features (or columns, or attributes) are all different chemical properties and characteristics of wines.
Let’s look at the ndarray.shape attribute of our array to see what the dimensions of our data are. This is one of the very first things you usually do with NumPy arrays.
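Checking the shape is a one-liner. This sketch reloads the array so it runs on its own:

```python
from sklearn.datasets import load_wine

data = load_wine().data

# .shape is a tuple with one entry per dimension: (rows, columns) here.
print(data.shape)  # (178, 13)
```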
So it looks like we have 13 columns, which we already knew, and we have 178 rows. Although the two dimensions here can be interpreted in different ways, we usually think about the first dimension as rows and the second dimension as columns. Later, we’ll see what happens when we add a third dimension to our data.
Since this data is just rows and columns, we can think about it just like tabular data in an Excel sheet.
Slicing and Indexing
We’ve seen how to look at a single row, using the normal Python list bracket notation. But how do we look at a single column? If we actually did have a nested list of lists, we would have to use a for-loop to iterate through each row and get the value for the column that we wanted to look at.
In NumPy, though, we can use slicing and indexing to locate just the data we want to see. Slicing a NumPy array means that we return multiple values from inside the array, and we use indices (plural of index) which are just the numeric locations of the data. This is very similar to indexing into a normal Python list—here, we just do it in multiple dimensions, and we have more flexibility.
The trick to getting a column out of the data is that we pass multiple values into the square brackets, one value for each dimension. Rows are the first dimension (index 0, since NumPy counts dimensions from zero) and columns are the second dimension (index 1).
Let’s say we want to look at the magnesium levels, for example. Looking back at our feature names, we can see that it’s at column index 4 (where the first index is 0). So we want to look at all of the rows, and just the column at index 4.
If this seems complicated, just think about it like an Excel sheet with columns and rows, where we have 13 columns with different wine properties, and each row is a specific wine. We want to look at just the magnesium column.
Slicing by Column
In NumPy, we specify “all of the rows” using a colon. Here’s how we look at just the magnesium values.
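As a sketch, with the colon in the row position and 4 in the column position (the name `magnesium` is just a label for this example):

```python
from sklearn.datasets import load_wine

data = load_wine().data

# ":" selects every row; 4 selects the single column at index 4.
magnesium = data[:, 4]

print(magnesium)
print(magnesium.shape)  # (178,)
```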
That looks like about 178 values, which is what we should expect since we’re getting all of the rows for a single column. If you look back at our first full row of data, you’ll see that the value of 127 matches the value at index 4, so it looks good to me! Notice that we’re returning a one-dimensional array here, which you can double-check with the ndarray.shape attribute on our array.
If we want to get the alcohol and magnesium columns, we can pass a list of both of their indices into the array for the second dimension.
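A sketch of selecting two columns at once, using indices 0 (alcohol) and 4 (magnesium):

```python
from sklearn.datasets import load_wine

data = load_wine().data

# A list in the column position selects several columns in that order.
alcohol_and_magnesium = data[:, [0, 4]]

print(alcohol_and_magnesium.shape)  # (178, 2)
```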
And this time, we get back out a two-dimensional array since we’re getting all of the rows for two columns.
Mathematical Functions
So far we’ve just been looking at how to get arrays and index into arrays. Now let’s do some math!
First, what’s the average alcohol content of these wines? We can find that out using the np.mean() function.
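A minimal sketch, assuming alcohol is the column at index 0 of the wine array:

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine().data
alcohol = data[:, 0]

# np.mean() collapses the whole column into a single average value.
average_alcohol = np.mean(alcohol)
print(average_alcohol)  # roughly 13
```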
So it looks like we have an average alcohol content of about 13%. What are our min and max alcohol contents?
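The min and max follow the same pattern. This is a sketch; `np.min()` and `np.max()` are used here, though the `ndarray.min()` and `ndarray.max()` methods would do the same job.

```python
import numpy as np
from sklearn.datasets import load_wine

alcohol = load_wine().data[:, 0]

# Both functions reduce the column to a single number.
lowest = np.min(alcohol)
highest = np.max(alcohol)
print(lowest, highest)
```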
As you can see, doing math of this kind in NumPy is very straightforward. There are all kinds of math functions you can call in the package.
Element-Wise Operations and Vectorization
You may notice that the phrase element-wise shows up a lot in NumPy’s documentation for these math functions. This just means that the function performs the math operation on each item in the array, rather than doing some kind of aggregation.
For example, if we have a simple array of integers…
The np.square() function operates element-wise and squares each individual number in our array…
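Put together, the two steps might look like this. The integers are arbitrary placeholder values.

```python
import numpy as np

numbers = np.array([1, 2, 3, 4, 5])

# np.square() returns a new array with each element squared.
squared = np.square(numbers)
print(squared)  # [ 1  4  9 16 25]
```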
So this is what element-wise means.
Now this brings us to an incredibly important part of why NumPy exists, and it’s a concept called vectorization. Vectorization is why NumPy crunches numbers so quickly, and why it lies at the heart of many other Python statistics and machine learning packages.
To understand vectorization, let’s think about how we would operate on a normal Python list of data.
In regular Python, we would store data in a list. If we wanted to do some mathematical operation on the data in that list, we would need to loop through each value one by one and do the operation. For example, here’s how we might square each number in a list.
Or, we could use a list comprehension—but this is basically just a for-loop in disguise.
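Both plain-Python versions, sketched on a placeholder list:

```python
numbers = [1, 2, 3, 4, 5]

# Explicit for-loop: visit each value and append its square.
squares_loop = []
for n in numbers:
    squares_loop.append(n ** 2)

# List comprehension: the same loop written in one line.
squares_comprehension = [n ** 2 for n in numbers]

print(squares_loop)           # [1, 4, 9, 16, 25]
print(squares_comprehension)  # [1, 4, 9, 16, 25]
```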
NumPy on the other hand can operate on the entire list at once, completely doing away with the for-loop. NumPy treats the whole vector as the object that it’s operating on. And that means we can do cool things like this, where we multiply the whole vector by a number in a single operation.
Or maybe we square the array and then multiply it by 10.
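Here is a sketch of both operations on a small placeholder array, with no loop in sight:

```python
import numpy as np

numbers = np.array([1, 2, 3, 4, 5])

# Multiply every element by 10 in a single expression.
times_ten = numbers * 10
print(times_ten)  # [10 20 30 40 50]

# Square every element, then multiply each result by 10.
squared_times_ten = numbers ** 2 * 10
print(squared_times_ten)  # [ 10  40  90 160 250]
```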
Or, maybe we want to add two different arrays together, or multiply them by each other. That’s completely possible too with NumPy’s vectorized approach.
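For two arrays of the same length, the arithmetic pairs up elements by position. The arrays `a` and `b` below are made up for this sketch.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Each operation combines a[i] with b[i].
added = a + b
multiplied = a * b

print(added)       # [11 22 33]
print(multiplied)  # [10 40 90]
```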
This kind of vectorized approach to computation only gets better and better once you start doing things like multiplying vectors together, such as doing a vector dot product or multiplying matrices, something that is done all the time under the hood of machine learning algorithms.
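As a rough illustration with made-up values, a dot product and a matrix-vector product each take a single call or operator:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Dot product: 1*10 + 2*20 + 3*30
dot = np.dot(a, b)
print(dot)  # 140

# The @ operator performs matrix multiplication.
matrix = np.array([[1, 2], [3, 4]])
vector = np.array([1, 1])
product = matrix @ vector
print(product)  # [3 7]
```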
While you probably don’t need to work with the intricacies of vectorization too much in your day-to-day work, the algorithms and processes that you use rely heavily on NumPy’s vectorized approach to math.
Filtering Using Boolean Masks
Let’s return to slicing and indexing for a minute. Suppose we want to only look at data for wines where the alcohol content is under 13%. First, we can create a True/False boolean mask of which rows of data match that criterion.
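A sketch of building that mask, assuming alcohol is the column at index 0 (the name `low_alcohol_mask` is just a label for this example):

```python
from sklearn.datasets import load_wine

data = load_wine().data

# Comparing an array to a number yields an array of True/False values.
low_alcohol_mask = data[:, 0] < 13

print(low_alcohol_mask)
print(low_alcohol_mask.dtype)  # bool
```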
For each row of data, we get one value: True if that wine’s alcohol content is under 13%, and False otherwise.
Now, if we want to look at the full rows of data for just those wines, we can pass this boolean mask back into our array to index down to just the rows we want.
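Passing the mask into the square brackets keeps only the rows where the mask is True. A minimal sketch:

```python
from sklearn.datasets import load_wine

data = load_wine().data
low_alcohol_mask = data[:, 0] < 13

# Boolean indexing: rows with True are kept, rows with False are dropped.
low_alcohol_wines = data[low_alcohol_mask]

print(low_alcohol_wines.shape)
```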
It looks like we’ve filtered our original 178 wines down to 86, almost exactly half of our wines.
Maybe we’re interested in wines where the alcohol is under 13% and the magnesium is over 120. We can do that too, using the & operator, which NumPy applies element-wise to combine two boolean masks into a single mask. Wrap each comparison in parentheses, because & binds more tightly than < and >.
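A sketch of the combined filter, assuming alcohol is column 0 and magnesium is column 4:

```python
from sklearn.datasets import load_wine

data = load_wine().data

# Parentheses are required: & has higher precedence than < and >.
combined_mask = (data[:, 0] < 13) & (data[:, 4] > 120)
selected_wines = data[combined_mask]

print(selected_wines)
```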
Sure enough, we have six of these wines! And there they are—so beautiful, aren’t they? I wonder if they’re tasty.
If/Then Operations Using np.where()
Let’s say we want to create some categories based on our wines. Maybe if a wine is under 12% alcohol we want to categorize it as low alcohol, and if it’s over 14% we want to categorize it as high alcohol. We can use the np.where() function to do this. The np.where() function returns one value if a condition is true, and another if the condition is false.
First, let’s just try the “low alcohol” condition and return a blank string otherwise.
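A minimal sketch of the single-condition version. The label text "low" is an assumption made for this example.

```python
import numpy as np
from sklearn.datasets import load_wine

alcohol = load_wine().data[:, 0]

# np.where(condition, value_if_true, value_if_false)
labels = np.where(alcohol < 12, "low", "")
print(labels)
```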
But we want to split the “if_false” value into two different values based on another condition, so we can pass np.where() back in as that third argument to test that condition.
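Nesting one `np.where()` inside another might look like the sketch below. The label names, and in particular "medium" for everything in between, are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_wine

alcohol = load_wine().data[:, 0]

# The inner np.where() only decides rows that failed the first condition.
categories = np.where(
    alcohol < 12,
    "low",
    np.where(alcohol > 14, "high", "medium"),
)
print(categories)
```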
There we go, all three of our categories are now there.
Sorting Arrays
What if we wanted to see all of our alcohol values sorted from highest to lowest? We can use the np.sort() function for that. By default, it sorts values in ascending order…
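A sketch of the default ascending sort on the alcohol column:

```python
import numpy as np
from sklearn.datasets import load_wine

alcohol = load_wine().data[:, 0]

# np.sort() returns a sorted copy and leaves the original untouched.
sorted_ascending = np.sort(alcohol)
print(sorted_ascending)
```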
But to do descending order, we use this indexing trick…
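A sketch of that trick: sort ascending first, then walk the result backwards.

```python
import numpy as np
from sklearn.datasets import load_wine

alcohol = load_wine().data[:, 0]

# [::-1] reverses the ascending result, giving descending order.
sorted_descending = np.sort(alcohol)[::-1]
print(sorted_descending)
```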
Reversing an Array
What did we just do there? Well, NumPy indexing follows a pattern of start:stop:step, where you can specify what index you want to start at, what index you want to stop at, and what step size you want to use to skip through the data (for example, a step of 2 would return every other value). A step value of -1 simply means “go through the array backwards”, essentially reversing the order of the array.
So by starting with an array sorted ascending, we can use [::-1] to reverse the sort.
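To see start:stop:step in isolation, here is a small sketch on a placeholder array of the integers 0 through 9:

```python
import numpy as np

values = np.arange(10)

print(values[2:8:2])  # [2 4 6]      start at 2, stop before 8, step 2
print(values[::2])    # [0 2 4 6 8]  every other value
print(values[::-1])   # [9 8 7 6 5 4 3 2 1 0]  reversed
```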
Sorting Using Argsort
Now suppose that this is all nice and everything, but what we actually want to do is sort the whole data array by our alcohol values. How do we do that?
There’s this very useful NumPy function called np.argsort() that we can use. Rather than sorting the array, np.argsort() returns the indices that would sort the array.
Another way to think about it is like this. For each value in an array, replace that value with the index of the value (where it currently sits in the array). Then sort that array of indices, but sort it by the actual values themselves. What you’re left with is an array of indices shuffled around.
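A tiny made-up example can make this concrete. The three values below are placeholders.

```python
import numpy as np

values = np.array([30, 10, 20])

# The smallest value (10) sits at index 1, the next (20) at index 2,
# and the largest (30) at index 0.
order = np.argsort(values)
print(order)          # [1 2 0]

# Indexing with those positions yields the sorted values.
print(values[order])  # [10 20 30]
```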
And if we pass that array of indices back into our original array, what we get is the sorted array.
First, here’s what those indices look like.
And then if we pass the indices back into the alcohol array, we get out the sorted array.
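Both steps on the alcohol column, as a sketch (the name `sort_indices` is just a label for this example):

```python
import numpy as np
from sklearn.datasets import load_wine

alcohol = load_wine().data[:, 0]

# Positions that would put the alcohol values in ascending order.
sort_indices = np.argsort(alcohol)
print(sort_indices)

# Indexing with those positions gives the same result as np.sort().
sorted_alcohol = alcohol[sort_indices]
print(sorted_alcohol)
```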
The cool thing about this is that now we can pass these indices into the full data array and sort all of our data by the alcohol column. We’ll also reverse the indices as we pass them in, to get all of our data sorted by alcohol descending.
Here we’re printing out just the first four columns in order to see more clearly that the values are sorted by alcohol values descending.
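A sketch of the full-array sort, reversing the indices to get descending order and then slicing off the first four columns for display:

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine().data

# Row order that sorts by alcohol (column 0), reversed for descending.
descending_indices = np.argsort(data[:, 0])[::-1]
sorted_data = data[descending_indices]

# Show only the first four columns to keep the output readable.
print(sorted_data[:, :4])
```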
Getting NumPy Arrays from Pandas DataFrames
Let’s look at some other ways of creating and getting access to NumPy arrays.
One of the most common ways of getting access to NumPy arrays is by loading data into a pandas DataFrame first, and then using the DataFrame.values or Series.values attribute to return the NumPy arrays that are at the heart of those pandas objects.
Let’s load the Kaggle country happiness data for 2019, which you can get here: https://www.kaggle.com/unsdsn/world-happiness#2019.csv.
We’re not going to discuss pandas too much in this post, but if you want a quick introduction then check out our 80/20 Pandas—Pandas for Data Science post.
So you can see that we have a list of countries and data about those countries in our DataFrame. Now let’s use the DataFrame.values attribute to get out the NumPy array from this data.
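Since the CSV itself isn't bundled here, this sketch uses a tiny made-up DataFrame as a stand-in. The column names and values are hypothetical, not taken from the Kaggle file.

```python
import pandas as pd

# Made-up rows standing in for the happiness data.
df = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "score": [7.5, 6.1, 4.9],
})

# .values returns the underlying data as a NumPy array.
values = df.values
print(values)
print(values.dtype)  # object
```

The pandas documentation also offers `DataFrame.to_numpy()` for the same purpose, if you prefer a method call.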
Notice that the dtype (data type) of this array is “object”, which means the array holds generic Python objects; in practice that usually signals string or mixed data. Remember that one key part of NumPy arrays is that they’re “homogeneous”, meaning a single data type, so the whole array needs a data type that can accommodate all of the data. Since we have string data in our dataset (the country names), which can’t be represented as numbers, the whole array ends up as an object array.
But if we pull out the values for one of our numeric columns, you can see that the data type is now a numeric data type—specifically, a float array.
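Continuing with the same made-up stand-in DataFrame (hypothetical column names), a single numeric column comes back as a float array:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "score": [7.5, 6.1, 4.9],
})

# A single numeric column keeps its numeric dtype.
scores = df["score"].values
print(scores.dtype)  # float64
```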
A very common way to create, share, and analyze data is using CSV and Excel files that can be easily loaded using pandas, which means we’ll often interact with NumPy arrays by getting them from the pandas DataFrames.
Creating Ranges in NumPy
In Python, we have the range() function which can give us consecutive numbers, or numbers spaced out evenly. In NumPy, we have a very similar function called np.arange() which returns a NumPy array. We’ll typically use np.arange() anytime we want to create a sequence of integers.
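A couple of quick examples, with arbitrary start, stop, and step values:

```python
import numpy as np

# One argument: integers from 0 up to (but not including) 10.
print(np.arange(10))        # [0 1 2 3 4 5 6 7 8 9]

# start, stop, step: the stop value itself is excluded.
print(np.arange(2, 20, 4))  # [ 2  6 10 14 18]
```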
One common thing we might need to do, though, is to create an array with float values rather than integers. For example, we might need to create finely spaced x-axis values for graphing a function, or perhaps we want to try a series of small floats in a machine learning model as hyperparameters.
In this case, for creating an array of evenly spaced floats, we’ll use the np.linspace() function.
For example, here we’ll create an array of 50 floats spaced evenly between 0 and 1 inclusive.
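A sketch of that call (the name `x_values` is just a label for this example):

```python
import numpy as np

# start, stop, number of points; the stop value is included by default.
x_values = np.linspace(0, 1, 50)

print(x_values)
print(len(x_values))  # 50
```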
Reshaping Arrays
So far, we’ve dealt with data as it exists and haven’t tried changing the shape at all. In this last section, we’ll look at how to reshape arrays.
Let’s say we have some image data, such as this handwritten digits dataset from scikit-learn. First, we’ll load the data and look at the shape of the data array that scikit-learn gives us.
What does a single row of data look like?
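Assuming scikit-learn is installed, loading the digits and peeking at the shape and first row might look like this:

```python
from sklearn.datasets import load_digits

digits = load_digits()
digit_data = digits.data

# One row per image, one column per pixel.
print(digit_data.shape)  # (1797, 64)

# The first image, stored as a flat run of 64 pixel values.
first_digit = digit_data[0]
print(first_digit)
```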
Hmm, this image data doesn’t look much like an image, does it? This is supposed to represent a handwritten digit, but currently it just looks like a bunch of numbers.
This is because the image data is currently flattened, meaning that the original shape of the data was squished down into a one-dimensional array. This data was originally a two-dimensional square image.
Luckily for us, we can get the image data back into its original shape using the np.reshape() function. Since our array has a length of 64, we can take the square root of this number to get the dimensions of our original square image: 8 by 8.
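A sketch of the reshape, taking the first row of the digits data and turning it into an 8 by 8 grid:

```python
import numpy as np
from sklearn.datasets import load_digits

first_digit = load_digits().data[0]

# Rearrange the 64 values into 8 rows of 8 columns.
image = np.reshape(first_digit, (8, 8))

print(image)
print(first_digit.shape, image.shape)  # (64,) (8, 8)
```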
There we go! We just reshaped a one-dimensional array with length 64 into an 8 by 8 two-dimensional array. And if you stare at this reshaped data, you might even be able to tell what digit it is…
But before we plot the digit (just for fun, because we can)—let’s verify the shapes of our original data array and our new reshaped array.
Just as we expected.
Alright, here’s what the digit looks like if we plot it using matplotlib.
That’s a zero alright!
Flattening an Array
And if we wanted to—which sometimes we do—we can flatten that image data back into a one-dimensional array. This time, we’ll use the ndarray.flatten() method on the array itself.
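A sketch of the round trip, reshaping the first digit and then flattening it again:

```python
import numpy as np
from sklearn.datasets import load_digits

first_digit = load_digits().data[0]
image = np.reshape(first_digit, (8, 8))

# flatten() returns a new one-dimensional copy of the array.
flattened = image.flatten()

print(flattened.shape)  # (64,)
```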
And we’re back to our original array.
And with that, you’ve just learned the 20% of NumPy that will get you 80% of the value. Feels pretty good, right?
PS—here’s the NumPy documentation, in case you need it: https://numpy.org/.
And here’s a very handy “Quick Start” guide by the NumPy folks: https://numpy.org/devdocs/user/quickstart.html.