80/20 Jupyter Notebooks - Project Data Science

In this post, you’re going to learn the 20% of Jupyter Notebooks that you’ll use 80% of the time.

(This guide is emphatically not meant to be comprehensive. Instead, it will show you how to get up and running quickly with the most useful commands.)

Jupyter notebooks are one of the most powerful tools of the modern data scientist, and they’re only getting more popular. In this guide, you’ll learn how to run code in a notebook, document your code, and share the results of your notebook easily using a GitHub link. (Enabling you to share your data science work with anyone, anywhere!)

One quick thing before we dive in: if you need to get a professional data science environment set up on your computer, we have a guide for that, the Step-by-Step Guide to Setting Up a Professional Data Science Environment.

Alright, ready to get started?




80/20 Jupyter Notebooks

Primary Objects

One of the first things to ask when faced with a new technology is, “What are the primary objects and functionality?”

With Jupyter Notebooks, the primary object is… the notebook! (Surprise!)

So what is a notebook? At its simplest, a code notebook is a place where you can:

  1. Write code in Python (or other languages); and,
  2. Document that code with rich text—specifically, a markup language called Markdown.

That’s it!

Jupyter notebooks are the code notebooks created by Project Jupyter, which also maintains other tools such as JupyterLab. They are the most popular code notebooks these days.

Creating a Jupyter Notebook

Let’s go ahead and create a Jupyter notebook. To do that, we’re going to need to open up a terminal (or shell) and launch Jupyter notebooks from there. If you need to set up your data science environment on your computer, see our Step-by-Step Guide to Setting Up a Professional Data Science Environment.

First, we’ll open up a terminal.

I’m running the terminal shell called Zsh on my Mac, so this is what it looks like. Your terminal may look slightly different.

The Notebook Server

Now, we’ll run the command to launch Jupyter notebooks: “jupyter notebook”. I’m going to add an ampersand (&) after this command so that the notebook server runs in the background of the terminal, which is how I usually run Jupyter notebooks.

After hitting enter, you’ll see something that looks like this.

We’ve just started the notebook server, which is what runs everything about Jupyter notebooks.

A new tab should pop open in your browser. If it doesn’t, open your browser and copy the URL that shows up in the terminal into the address bar. The link will probably be http://localhost:8888.

The “8888” indicates a port number, and yours will probably be 8888. (Mine is 8889 because I’m already running Jupyter notebooks on my computer, and two servers can’t listen on the same port at the same time.)
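To see why a second server ends up on a different port, here is a minimal sketch using Python's standard socket module. The helper name port_is_free is made up for this illustration, and this is not how Jupyter itself picks a port; it only shows that a port already in use can't be claimed again.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to host:port, False if the port is taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


# Occupy a port. Asking for port 0 lets the operating system pick a free one.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen()
taken_port = server.getsockname()[1]

# While `server` holds the port, nobody else can bind to it.
print(port_is_free(taken_port))

server.close()
```

While the first socket is open, the check reports that the port is unavailable, which is the same situation a second notebook server runs into when 8888 is already in use.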

The notebook server that we’ve just started is essentially a small web server running on our computer. That web server is run by the Jupyter notebook program running in our terminal and lets us use this special application in our browser, with the application running right on our computer.

When we do things in the browser, the notebook server in the terminal will be doing everything that’s needed to keep the web app working—creating notebooks, saving notebooks, etc. If we were to go back to our terminal and end the notebook server process, our Jupyter notebook tab in our browser would stop working as well.

Basically, we have created a small web app running on our computer, that we can only access from our computer, in order to work with Jupyter notebooks.

(One quick note: don’t worry, you don’t need to know anything else about web servers or ports for now.)

The Notebook Dashboard

The screen that we’re currently looking at is called the notebook dashboard. This is where you create new notebooks and run existing notebooks. You can also do other things, like launch terminals and create text files—but usually we’ll just be creating and running notebooks.

Let’s create a new Jupyter notebook. To do that, we just click the “New” dropdown menu on the right-hand side of the screen, then select one of the options—selecting “Python 3” is a fine choice.

The other choices there will be your other Python environments on your system. Selecting the top Python 3 option runs the notebook using the virtual environment that’s currently active in your terminal. You’ll also notice the “Terminal” and “Text File” options that we mentioned before.

When we click “Python 3” in the notebook dashboard, a new notebook is created and will pop up in another tab in your browser.

Here’s our new notebook! Exciting.

Let’s go ahead and name our notebook by clicking the word “Untitled” at the top. We’ll just call the notebook “My First Jupyter Notebook”, but you could also name it something cooler.

Code Cells

The main object that we work with in notebooks is the cell, which is that blue highlighted text entry area at the bottom of the image above. There are two main types of cells: code cells and markdown cells.

A code cell is the default cell that shows up. Here, we can simply enter some Python code.

To run the code cell, we can simply use the keyboard shortcut Shift+Enter.

Tada! There’s the output of our Python code—our “hello world” for Jupyter notebooks.
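As a minimal sketch, a first code cell along these lines would do the job. The function name print_hello matches the one used later in this guide, but the exact body and greeting text here are assumptions.

```python
def print_hello():
    """Print a short greeting."""
    print("Hello, world!")


# Calling the function in the same cell shows its output right below the cell.
print_hello()
```

Type this into the empty cell and press Shift+Enter to run it.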

Notice that before the cell runs, the “In” doesn’t have a number in the brackets. This indicates that the cell hasn’t been run yet. Afterwards, the number 1 shows up to indicate that this code cell is the first cell that has been run in the notebook.

Let’s add some code to the next code cell and run it to see what happens.

There are two important things to notice here.

First, the number in the square brackets beside “In” now says 2, indicating that this was the second cell we’ve run in the notebook.

Second, did you notice that we still have access to our function from the first code cell, even though we’re in a completely different cell now? This is because there’s a Python kernel running as part of the notebook that keeps track of all of the variables, functions, and classes that we define anywhere in the notebook.

The Notebook Kernel

What is a kernel? From the Jupyter notebook documentation:

“Kernels are programming language specific processes that run independently and interact with the Jupyter Applications and their user interfaces. IPython is the reference Jupyter kernel, providing a powerful environment for interactive computing in Python.”

So underneath our Jupyter notebook, there’s basically a version of IPython running that gives us access to all of our IPython functionality—all of the basic Python functionality plus extra goodies.

So, looping back to our second point from above: this means that each cell in your Jupyter notebook has access to every object that has been defined anywhere in the notebook, but only after the cells with those object definitions have been run. For example, if I define a variable in a cell that hasn’t been run yet, Python will raise a NameError when I try to use that variable in another cell.
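One way to picture this is as a single shared namespace that every cell runs in. The sketch below is a toy model in plain Python, not the real IPython kernel: the cells are just strings, and the names kernel_namespace, cell_1, and cell_2 are made up for the illustration.

```python
# One shared dictionary stands in for the kernel's memory (a toy model, not IPython itself).
kernel_namespace = {}

cell_1 = "def print_hello():\n    return 'Hello, world!'"
cell_2 = "greeting = print_hello()"

# Running cell 2 first fails, because print_hello has not been defined yet.
try:
    exec(cell_2, kernel_namespace)
except NameError as error:
    print(f"NameError: {error}")

# After cell 1 has run, cell 2 can see the function that cell 1 defined.
exec(cell_1, kernel_namespace)
exec(cell_2, kernel_namespace)
print(kernel_namespace["greeting"])
```

The order in which cells are run matters, not the order in which they appear on the page.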

We can see all of the objects currently defined by running the dir() function.

Most of these are objects that come with IPython and Jupyter notebooks—the only one of these that we’ve created is “print_hello()”.

A more helpful command is the IPython magic command called “%whos”, which only shows us the objects that we’ve defined.
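Since “%whos” is an IPython magic rather than plain Python, here is a rough plain-Python approximation of the idea: list the names in a namespace and skip the ones that start with an underscore. The helper user_defined_names is hypothetical, and this is not how “%whos” is implemented. Inside a real notebook, a few IPython helper names would also slip through this simple filter.

```python
def user_defined_names(namespace):
    """Return the sorted names in a namespace that do not start with an underscore."""
    return sorted(name for name in namespace if not name.startswith("_"))


# A toy namespace with one function and one variable defined in it.
namespace = {}
exec("def print_hello():\n    print('Hello, world!')\n\nanswer = 42", namespace)

print(user_defined_names(namespace))  # ['answer', 'print_hello']
```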

Restarting the Kernel

So what if we want to reset everything back to an empty Python environment, like starting the notebook from scratch? There are two main ways to do this. First, you can simply “Close and Halt” the notebook and reopen it.

Or, if you want to keep working but just want to reset the environment, you can just restart the kernel instead.

Let’s restart the kernel and see what happens.

Nothing changes at first—all of the code cell numbers are still there, and all of the cell output is still there. This is an important thing to remember: Jupyter notebooks will retain exactly what the notebook looks like, even after closing it and starting it back up again, and even though the underlying Python environment has been reset.

But, we notice a difference if we run our “%whos” cell again.

The numbers go back to starting from “1” again, and the namespace is currently empty, meaning that we haven’t defined any variables, functions, or classes yet.
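Here is a small sketch of what a restart does, again using a toy model rather than the real kernel. The ToyKernel class and its methods are invented for this illustration: it keeps a namespace and an execution counter, and restarting throws both away.

```python
class ToyKernel:
    """A toy stand-in for a notebook kernel: a namespace plus an execution counter."""

    def __init__(self):
        self.namespace = {}
        self.execution_count = 0

    def run_cell(self, source):
        """Run a cell's source in the shared namespace and return its In [n] number."""
        exec(source, self.namespace)
        self.execution_count += 1
        return self.execution_count

    def restart(self):
        """Forget every definition and start counting from zero again."""
        self.namespace = {}
        self.execution_count = 0


kernel = ToyKernel()
print(kernel.run_cell("def print_hello():\n    print('Hello, world!')"))  # 1
print(kernel.run_cell("print_hello()"))  # prints the greeting, then 2

kernel.restart()
print("print_hello" in kernel.namespace)  # False
print(kernel.run_cell("x = 1"))  # 1
```

The cell text and old output you see in the notebook live in the notebook file, not in the kernel, which is why they survive a restart while the definitions do not.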

Interrupting the Kernel

One last thing about the kernel before we move on. Let’s say you write some code that gets stuck somehow. Maybe you’ve accidentally created an infinite loop, or the code you’re trying to run is going to take forever. Here, I’ve created an infinite while loop that’s never going to stop running.

To stop the execution of a code cell, you can interrupt the kernel like this.

This will cause a KeyboardInterrupt just like you could do in a normal Python shell or IPython shell, interrupting the execution of the code.
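As a sketch of what happens, the function below loops forever until a KeyboardInterrupt arrives, then stops cleanly and reports how many passes it finished. The names count_until_interrupted and tick are made up for this example. In a notebook, the interrupt would come from interrupting the kernel; here a fake tick function raises it so the example ends on its own.

```python
import itertools


def count_until_interrupted(tick):
    """Call tick(n) for n = 0, 1, 2, ... until a KeyboardInterrupt stops the loop."""
    completed = 0
    try:
        for n in itertools.count():  # an endless counter, like `while True`
            tick(n)
            completed += 1
    except KeyboardInterrupt:
        # Interrupting the kernel raises this exception inside the running cell.
        pass
    return completed


def fake_tick(n):
    """Pretend the user interrupts the kernel during the fourth pass."""
    if n == 3:
        raise KeyboardInterrupt


print(count_until_interrupted(fake_tick))  # 3
```

Without the try/except, the interrupt would simply stop the cell and show a KeyboardInterrupt traceback, which is usually all you need.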

Markdown Cells

So far, we’ve been working exclusively with code cells. Let’s talk about the other type of Jupyter notebook cell—the Markdown cell.

To create a Markdown cell, click outside of the text area for a code cell (or just hit the Escape key) so that the cell turns blue. Then, you can go up to the dropdown list for cell type and select Markdown. Or, you can simply use the keyboard shortcut, which is just the letter “m”.

(I pretty much always recommend using the keyboard shortcuts.)

Notice that the “In” text goes away from the left-hand side of the cell, to indicate that we now have a Markdown cell.

If you haven’t used Markdown before, here’s a very handy cheat sheet: Cheat Sheet – Markdown Guide. But it’s fairly easy to pick up—here are some of the most common styles that you’ll find yourself using.

(By the way, Markdown is the same markup language that GitHub uses for README files, so you’ll be running into it in several places as a data scientist.)

That’s what the Markdown cell looks like before we run it. After we run it (once again using Shift+Enter), this is what it looks like.

Here’s a very simple example of how we might document some of our code using Markdown cells and code cells.
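To make the pairing concrete, here is a sketch of a Markdown cell followed by a code cell, written out as the JSON structure that a notebook file stores, assuming the version 4 notebook format. The cell contents are invented for the example, and in practice Jupyter writes this file for you when you save.

```python
import json

notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # A Markdown cell: a heading, bold text, and a bulleted list.
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Greeting demo\n",
                "\n",
                "The function below prints a **friendly** greeting.\n",
                "\n",
                "- `print_hello()` takes no arguments\n",
                "- It returns *nothing* and only prints",
            ],
        },
        {
            # A code cell that has not been run yet: no number, no output.
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": [
                "def print_hello():\n",
                "    print('Hello, world!')\n",
                "\n",
                "print_hello()",
            ],
        },
    ],
}

# Serialize to the JSON text that would go into an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json[:60])
```

The Markdown cell explains what the code is for, and the code cell right below it does the work.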

Saving and Sharing a Notebook

Finally, let’s look at how we can save and share our Jupyter notebooks. It looks like it’s been over an hour since we saved our notebook… dangerous!

To save, you can just use “Command-S” (or Ctrl-S on Windows or Linux), or you can use the “Save and Checkpoint” menu item.

Halting the Notebook and Quitting the Notebook Server

Let’s go ahead and close our Jupyter notebook. If you haven’t halted your notebook, you’ll see in the notebook dashboard that the notebook is still running (down at the very bottom).

We can simply select the notebook and click “Shutdown” at the top.

Now, let’s close our notebook dashboard tab and go back to our terminal. Since I used the ampersand after the “jupyter notebook” command, I can hit Enter and still use my terminal.

Committing a Notebook to Version Control Using Git

As the last piece of saving and sharing our notebook, we’re going to save our notebook to version control using git, and we’re going to push it up to GitHub for sharing (and remote storage).

The first thing that I’m going to do is stop my Jupyter server. Since our process is running in the background, I’m going to use the “kill %1” command to stop the process.

Then I’m going to create a new project directory for my notebook to live in, and I’m going to move the notebook into that directory. This will become my repository (repo) for version control using git.

I can change into that directory, set up a new git repo, and commit my notebook.

Storing and Sharing a Notebook with GitHub

Now I’ll head to GitHub to create a new remote repository to push the code to.

I’ll add the new remote URL to my git repo on my computer, then push the code to GitHub.

If I go back to my repo on GitHub and click on the link for my Jupyter notebook…

Voila! Look at that—our Jupyter notebook shows up perfectly right on GitHub. Now, you can share your code with anyone you want, just by sharing a GitHub link!

Pretty cool, eh?

(One important note is that you can’t run code on GitHub, only view it.)

Additional Resources

That concludes the main part of this article, but here are a couple of other resources that you might find useful as you dive deeper into Jupyter notebooks.

First, there are the Jupyter notebook extensions, which add a lot of useful functionality to the notebooks. If you find yourself wanting to do more with the notebooks than the default functionality allows, try finding an extension.

Second, if you need to host a Jupyter notebook server somewhere other than your computer (such as creating a server for yourself and other colleagues at work), check out JupyterHub and more specifically The Littlest JupyterHub for small servers. These allow you to host a Jupyter server on another computer and access it over the Internet (or on your private network).

Third and last, the package Papermill is a very cool extension to Jupyter notebooks that lets you parameterize your notebooks and run them from the command line like any other Python script. Companies like Netflix use notebooks for all kinds of workflows using tools like Papermill. (See this nifty blog post for a taste of how Netflix uses notebooks: Notebook Innovation at Netflix.)

Conclusion

And with that, you’ve just learned the 20% of Jupyter notebooks that will get you 80% of the value. Feels pretty good, right?

Happy learning!


