Guides for Other Operating Systems

A brief note before we get started: If you’re on a Mac or Windows, we’ve got you covered with these other articles:

Video Tutorial

And if you prefer to watch a video, here’s a tutorial that goes through these steps.

Overview of This Guide

The first step of any data science project is having a good local development environment where you can explore data in Jupyter notebooks, write Python scripts to process data and train models, and keep track of your code.

If you don’t have a rock-solid data science environment set up, this article is going to help you get the exact same environment that professional data scientists use.

(And don’t worry—all of the software used here is completely free.)

Here’s exactly what we’re going to be setting up:

  1. The terminal: zsh
  2. Programming language: Python
  3. Virtual environment: conda
  4. Code editor: VS Code (Visual Studio Code)
  5. Version control: git and GitHub
  6. Code notebooks: Jupyter notebooks
  7. Project organization: Cookiecutter Data Science

Then, at the very end, I’ll give you a Checklist for New Data Science Projects that you can refer to each time you need to set up your environment.

In each section, we’ll discuss what each one of these is used for.

If you run into any issues along the way, one of the best places to find answers is by searching for your problem on StackOverflow (or just use Google—the top result is often StackOverflow anyway). StackOverflow is a phenomenal source of information and will probably become your best friend as you get deeper into data science and programming.

Important Note about Linux

One of the great things about open-source software like Linux is that people are free to modify it however they want to fit their needs.

But, that does complicate matters when it comes to a guide like this. There are many different versions (“flavors”) of Linux, and each has its own slightly different ways of doing things.

This guide will call out those differences as much as possible, but there might be some areas where you need to do some digging on your own to figure something out. And that’s perfectly ok—troubleshooting is one of the biggest things you’ll do as a data scientist!

The terminal: zsh

First off, what exactly is a terminal?

A terminal is basically just a way of giving your computer commands via text. This is important because you’ll run Python scripts via the terminal, you’ll launch Jupyter notebooks through the terminal, and you’ll do lots of other things through the terminal.

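(By the way, you’ll often see the word “shell” used interchangeably with “terminal”. Strictly speaking, the terminal is the window you type into and the shell, such as bash or zsh, is the program that interprets your commands, but you’ll see both terms used.)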

As a matter of fact, many computers only have a terminal, meaning that you have to know how to use the terminal if you want to do things on them. For example: if you set up an instance on Amazon Web Services (AWS) to do some machine learning, or to create a web app to serve your machine learning algorithm, you’ll need to interact with that machine via the terminal.

Don’t worry if you aren’t comfortable with the terminal yet, but do start using it as much as you can!

There are many different terminals, but all of them do basically the same thing: interact with your computer via text. The terminal we’re going to be setting up is zsh, and specifically the Oh My Zsh framework.

Note: If you would prefer to use the default terminal on your Linux, that’s totally fine. You probably have bash installed as the main terminal that opens up, and bash is a perfectly good terminal. If you want to keep using bash, feel free to skip this section.

Installation Instructions

Here are the step-by-step instructions for getting zsh set up:

  1. First, we’ll check to see if zsh is installed.
    • Search for “terminal” on your computer and open it up.
    • Try running zsh --version. If a version prints out, then you already have zsh installed and you can skip to Step #3.
    • If you get an error, then you need to install zsh using the instructions in Step #2.
  2. If you need to install zsh, do the following:
    • Go to this zsh installation guide on GitHub and find the heading for your version of Linux. There’s a good chance that you’ll need to run either sudo apt install zsh (for Ubuntu/Debian), or sudo yum update && sudo yum -y install zsh (for CentOS/Red-Hat). Run the appropriate command in your terminal.
  3. Now, verify that you have zsh installed by running zsh --version. You may need to exit out of your terminal and open it up again for the changes to take effect.
  4. Now, we’ll install Oh My Zsh.
  5. Exit out of the terminal and open it back up again. You should see a terminal that looks something like this! (It might not look exactly the same—that’s ok.)

Congratulations on getting your terminal set up!

If you want to learn more about the terminal and about zsh, check out these resources:

Programming language: Python

So why would we choose Python over another programming language, like R?

Well, over the last handful of years Python has only kept growing in popularity for data science as well as other fields. These days, pretty much any job description for data science that you see will list Python as a requirement, and most of the popular machine learning libraries these days are written with Python as their primary language: TensorFlow, PyTorch, Scikit-Learn, etc. While many people still use R, Python has grown to be the most popular programming language in the space.

One big reason for this is that Python is a full scripting language that can be used for pretty much anything. Not only can you do your data analysis and machine learning modeling in Python, but you can write a full application in Python as well. This means that the data science you do in Python can be more easily integrated into web applications and other applications, which is difficult or impossible with many other data analysis tools. As you’ll learn, data engineering and software engineering skills are a big part of data science, and Python lets you work in those spaces as well.

And practically speaking, more people are simply going to want you to know Python as a data scientist when you apply for jobs—which is why at Project Data Science, we’re focused on helping you learn Python.

(Of course, learning more tools is a great idea once you already have one under your belt. The best programmers use the right tool for the job, no matter what that tool is.)

One additional note. We’re going to be installing Python and conda through the Miniconda distribution of Python, even if you already have Python installed on your system. This is because we want to use conda to install packages and manage our virtual environments, which is something we’ll discuss further down.

Let’s help you get Python installed!

Installation Instructions

Here are the step-by-step instructions for getting Python set up:

  1. Check your current Python version.
    • You probably already have Python installed. Open up your terminal and run python --version. If you get a version printed out, then you already have Python installed.
    • Even if you already have Python installed though, you’re going to want to download and install Miniconda using the next few steps. This is because we’re going to be using conda for our virtual environment manager later on.
  2. Install a new version of Python using Miniconda. (Here are the official Miniconda installation instructions for Linux, if you want to just follow those: Installing on Linux.)
    • Go to the Miniconda home page and download either the “Miniconda3 Linux 64-bit” or “Miniconda3 Linux 32-bit” installer for Python 3.8, depending on if your system is 32-bit or 64-bit.
    • Open up a terminal and navigate to where you downloaded the file. If your download is in your Downloads folder, you can probably run cd ~/Downloads in your terminal.
    • Run bash followed by the name of the installer file you downloaded to start installing Miniconda.
    • Go through the installation instructions. Accepting the defaults should be fine.
  3. Verify your installation.
    • Open up a terminal (or close your terminal and re-open).
    • Run conda env list in the terminal. You should get a short list printed out.
    • If you get an error, you’ll need to do some troubleshooting. See the conda Linux installation page for some troubleshooting tips.
  4. See which version of Python is your default.
    • Run which python in the terminal. If your Miniconda installation is your default, then you should see a path printed out that has “miniconda” in it, like “miniconda3/bin/python”.
    • If you aren’t using the Miniconda version as your default, that’s ok. When you go to create a virtual environment using conda, it will create it using your Miniconda setup.
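
If you’d like to double-check this from inside Python itself, here is a minimal sketch that reports which interpreter is actually running. It only uses the standard library. The function name describe_interpreter is made up for this example, and the “miniconda” check assumes your install folder has “miniconda” in its name (the default, but not guaranteed).

```python
import sys


def describe_interpreter():
    """Summarize which Python interpreter is running this script."""
    path = sys.executable or ""
    return {
        "path": path,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Assumption: a Miniconda install keeps "miniconda" in its folder name.
        "looks_like_miniconda": "miniconda" in path.lower(),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Save this in a file and run it with python from your terminal. The path it prints should match what which python shows.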

Virtual environment: conda

Since you just installed Miniconda, you have conda installed.

What is conda? Well, there are two main purposes of conda:

  1. First, conda is used to install Python packages.
    • We’ll use conda to install data science packages like Pandas, NumPy, and Scikit-Learn.
  2. Second, conda is used to create and manage virtual environments.
    • Virtual environments are basically separate installations of Python that have their own set of packages that they have access to.
    • This means each project you work on can have its own set of packages, and you don’t have to worry about your projects having conflicting package requirements (like one project needing scikit-learn version 0.21 and another project needing version 0.23).

There are other ways to install Python packages (like pip, which we’ll use sometimes in conjunction with conda), and there are other ways to create and manage virtual environments (like venv). But conda is used by a lot of data scientists, and we’ll be using conda throughout Project Data Science projects.

If you want to learn more about virtual environments, check out this documentation page:

Using conda to create a virtual environment

Let’s walk through how you can create, use, and remove a conda virtual environment.

  1. Create a conda virtual environment.
    • First, open up a terminal. Type which python to see which version of Python would run right now if you were to run python in your terminal.
    • Run conda env list to see the list of conda environments you already have on your system. You should probably only have the “base” environment, which is the default conda environment that gets set up when you install Miniconda.
    • Run this command to create a new conda environment named “my-env” with pandas installed:
      • conda create -n my-env pandas
    • Run conda env list again, and you should now see “my-env” listed.
  2. Activate your conda virtual environment.
    • Run conda activate my-env to activate your new environment.
    • Now run which python. You should see a different path printed out to the screen. This means that if you were to run python right now, you would be using your virtual environment.
  3. Install a new package in your virtual environment.
    • With your conda environment activated, run conda install seaborn to install the data visualization package seaborn. This will only get installed in your “my-env” environment, and not in “base”.
  4. Run Python and verify that you have pandas and seaborn installed.
    • In your terminal, type python. You should see a Python interpreter open up.
    • Run import pandas in the Python interpreter. If you don’t get an error, then you successfully imported the pandas package!
    • Try running import seaborn to verify that seaborn got installed as well.
    • Type exit() to get out of the Python interpreter and back into your terminal, zsh.
  5. Deactivate your conda environment.
    • Run conda deactivate.
    • Now if you type which python, you should see your original Python path printed out again.
  6. Remove your virtual environment.
    • Let’s go ahead and delete your virtual environment, since this was just for testing purposes.
    • Run conda remove --name my-env --all.
    • After that finishes running, run conda env list again. You shouldn’t see your “my-env” environment any longer.

Congratulations on creating, using, and removing your first conda virtual environment! It’s a best practice to create a new virtual environment for each project that you work on.
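
The import check in step 4 can also be done from a small script, which is handy when an environment has many packages. This is a rough sketch using only the standard library. The helper name find_missing is hypothetical, and it only tells you whether a package can be found, not whether it is the version you expect.

```python
import importlib.util


def find_missing(packages):
    """Return the names from `packages` that cannot be imported here."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(["pandas", "seaborn"])
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("pandas and seaborn are both available.")
```

Run it once with your environment activated and once after conda deactivate to see the difference between the two Python installations.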

Code editor: VS Code (Visual Studio Code)

People can have very strong feelings about their preferred code editor, but the truth is that there are a lot of good editors out there and it doesn’t matter so much which one you go with, as long as you like it and know how to use it.

Our preferred editor is VS Code, for these simple reasons:

  1. It’s popular among data scientists and developers.
  2. It’s easy to use.
  3. It has very useful functionality right out of the box with all of the features you would expect like multi-line select, an integrated terminal, and debugging tools.
  4. You can use extensions to make it as powerful as you need it to be.

If you don’t have a favorite code editor, I would highly recommend giving VS Code a shot.

Installation Instructions

Installing VS Code is very easy. Simply go to the VS Code website, click “Download”, and install using the instructions.

Usage Instructions

Here are some of the most useful features of VS Code.

  1. Launch VS Code from the terminal.
    • This is a very common way to launch VS Code. Open up a terminal. Type code . (that’s the word “code” with a period after it), and hit enter.
    • VS Code should open up a new window showing you the directory you had open in the terminal (probably your home directory, with Documents and Downloads and all of those folders showing).
  2. Create a new Python file.
    • Right click in the sidebar on the left and select “New File”, then type in a file name ending in “.py”. You just created a new Python file, and VS Code knows to highlight the text using Python syntax highlighting rules.
    • Type “print(‘hello world!’)” into your Python script and save it.
  3. Open up a terminal right in VS Code.
    • In the Menu bar, select “Terminal” and then “New Terminal”. You’ll see a terminal open up in the bottom of VS Code.
    • In the terminal, run python followed by the name of your file to execute your Python script. You should see “hello world!” printed out to the console.
  4. Use multi-line select.
    • In your Python script, copy paste the “print(‘hello world!’)” statement a few more times.
    • On the first line, select the word “world” and use the keyboard shortcut “ctrl-d” to select the other instances of the word. You now have multiple cursors on different lines!
    • Change the word “world” to “multi-line select”, save the file, and run it again.
  5. Open up the command palette.
    • Use the keyboard shortcut “ctrl-shift-p” to open up the command palette.
    • Search for “lowercase” to show the command that you could use to lowercase a text selection.
    • The command palette is a great place to look for any kind of code editing functionality you might need.
  6. Search for a Python extension.
    • Find the “extensions” icon on the left-hand bar and click it. You should see a search bar where you can search for extensions.
    • Search for a Python extension (you can just search “python”), and install it.
  7. Look at the other icons on the left and other menu items to get a feel for what all is in VS Code.

You’ve just scratched the surface of what VS Code can do. You’ll explore the functionality further when you start diving into projects using it.
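
For reference, here is one way the test script from the steps above might look once you have tried multi-line select. It’s only a sketch: the greeting function is an addition for illustration, not something the steps require, and a single print statement works just as well.

```python
def greeting(target="world"):
    """Build the message that the script prints."""
    return f"hello {target}!"


if __name__ == "__main__":
    print(greeting())
    print(greeting("multi-line select"))
```

Note that the quotes in Python code need to be plain straight quotes. If you copy code from a web page and get “curly” quotes, Python will raise a syntax error.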

Version control: git and GitHub

Version control is an incredibly important software engineering tool that data scientists use. Essentially, version control helps you do these three things:

  1. Save “snapshots” of your code, so that you can track your changes over time and go back to different versions of code if needed. (For example, if you break something.)
  2. Develop new parts of your code, while keeping the primary version of your code clean (using “branches”).
  3. Collaborate with others on developing different parts of the same code.

The most popular version control software is called git, and there’s a very useful website called GitHub which is used in conjunction with git to store your code online and collaborate with others. You can think of git as being local (on your computer), and GitHub as being in the cloud.

Generally speaking, you’ll want to create a new git “repository” (or repo, for short) for each project you work on. A git repo is simply where all of your code for a project lives.

Here, we’ll show you how to set up git on your computer in order to start tracking your code, and how to create a GitHub account where you can store your code online.

Git Installation & Usage Instructions

First, we’ll get git installed locally.

  1. Install git.
    • You might already have git installed, in which case you don’t need to do anything. Try running git --version in your terminal. If you get a version printed out, you’re good to go.
    • Otherwise, check on the git installation page for the instructions for your version of Linux. You will probably run either sudo apt install git-all or sudo dnf install git-all in your terminal.
    • Then, run git --version to ensure that the install worked correctly (you may need to exit your terminal and open a new one).
    • In your terminal, run these two lines to configure your git for the first time so that git knows who you are:
      • git config --global user.name "Your Name Here"
      • git config --global user.email "you@example.com" (replace this placeholder with your own email address)
  2. Create a project with some code to use with our first git repo.
    • Open up a terminal.
    • Create a directory to test with git. You can use this command in your terminal: mkdir my-test-repo.
    • Open up that directory using VS Code by running this in the terminal: code my-test-repo. This should open up that directory (which will be empty) in VS Code.
    • In the left Explorer tab, right click and create three new files:
      1. A README file (by convention, named README.md)
      2. .gitignore (make sure to put the period in front of this file)
      3. A Python file (any name ending in .py)
    • The README file is something that will show up on GitHub when we push the code. The .gitignore file is a special file that’s used by git to ignore certain files and not track them in version control (which is very useful).
    • Let’s add some text to the README file.
      1. Open up the file by double-clicking it in the Explorer tab. On the first line, put this text (without the double quotes): “# My First README”
      2. On the second line, put this text: “This README will show up on GitHub when we push this code.”
    • Put some Python code into the Python file. You can put whatever code you want. If you don’t know what to put, just put a print statement like we did previously.
  3. Initialize a git repo for your project, and take a snapshot (“commit”) of your code.
    • Open up a terminal inside of your VS Code window by going to Terminal in the Menu bar, then “New Terminal”.
    • In that terminal, run ls -la to see the files you have in the current directory.
    • Run git init in your terminal to initialize a git repository in your project directory.
    • Run git add . in your terminal (take note of the period at the end there), which adds all of the files in the current directory to the “staging” area of git. The staging area is where we put files that we want to take a snapshot of.
    • Run git commit -m "First commit." in your terminal to take a snapshot of the current state of your code. In git, snapshots are called “commits”, and the “-m” flag in that command allows us to add a commit message—basically just a short message describing what code changes we made since the last commit.

We now have a very simple git repo with some code in it, and we’ve taken a snapshot of the current state of our code.
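
If you ever want to recreate the three starter files without clicking through the editor, a short Python script can do it. Treat this as a sketch under a few assumptions: the function name scaffold_test_repo and the file name hello.py are made up for this example, README.md is the conventional README name, and the .gitignore patterns are common choices for Python projects rather than a list from this guide.

```python
from pathlib import Path

# Assumption: these ignore patterns are common choices for Python projects,
# not a list taken from this guide.
GITIGNORE_LINES = ["__pycache__/", ".ipynb_checkpoints/", "*.pyc"]


def scaffold_test_repo(project_dir):
    """Create the README, .gitignore, and a small Python script."""
    project = Path(project_dir)
    project.mkdir(parents=True, exist_ok=True)

    (project / "README.md").write_text(
        "# My First README\n"
        "This README will show up on GitHub when we push this code.\n"
    )
    (project / ".gitignore").write_text("\n".join(GITIGNORE_LINES) + "\n")
    # "hello.py" is a hypothetical file name; use any name ending in .py.
    (project / "hello.py").write_text('print("hello world!")\n')
    return sorted(p.name for p in project.iterdir())


# Example usage: scaffold_test_repo("my-test-repo")
```

After running it, you would still run git init, git add . and git commit yourself in that directory, exactly as in step 3.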

GitHub Usage Instructions

Now, we’ll set up a GitHub account and push our git repo to the cloud, where we can share our code with others if we choose to. (For example, we could put a GitHub link on our resume to show companies projects that we’ve worked on.) It’s also a best practice to have our code backed up online (like on GitHub) so that if something happens to our computer, we still have all of our code stored safely.

Not only do git and GitHub store our code, but they also store all of our commits. This is important, because the commits are what we use to go back to different versions of code if we need. They also store some other things like “branches” and “tags”, which you’ll learn more about later.

  1. Create a GitHub account.
    • This is very straightforward—just go to GitHub and sign up for an account.
  2. Create a repo on GitHub.
    • First, go to your GitHub account and click the green “New” button under Repositories to create a new repository.
    • Give your repository a name like “my-test-repo”.
    • You can make the repo private or public, whichever you prefer, and you don’t need to create a README or .gitignore since we already did that on our computer.
    • Click “create repository”. (Your repo name on GitHub doesn’t have to match the name of the folder on your computer where your code is, but it’s common practice to have those names match.)
  3. Push your code to GitHub.
    • After you click create, GitHub should have instructions there for pushing an existing repo to GitHub. Copy those instructions line by line into your terminal in the directory where your git repo is (but don’t copy the dollar sign at the beginning, which just indicates that you should enter the instructions into a terminal like bash or zsh). You should see some text printed out when the code gets pushed to GitHub. You’ll probably be asked to authenticate. Note that GitHub no longer accepts your account password for git operations over HTTPS, so you’ll need a personal access token or an SSH key instead.
    • After you push the code to GitHub from your computer, refresh the GitHub repo page in your browser. You should see your README printed out and a list of the files in your project repo!

Your code and commits are now safely stored on GitHub. You could throw your computer out of a window (which isn’t recommended, by the way) and still have all of your code backed up online where you could retrieve it later.

If you want to learn more about git and GitHub, you can start by checking out this Hello World GitHub Guide. (After that, check out the other GitHub Guides if you want to go deeper.) If you want to really dive in, you can dive into these posts:

Code notebooks: Jupyter notebooks

Jupyter notebooks have become incredibly popular with data scientists over the last few years, and for good reason—they’re a great way to analyze data, run some experiments, and document your results in a way that others can follow along with. With notebooks, you create individual cells where you can either (a) write and run Python code, or (b) write Markdown code to document your findings.

(By the way, this is the same Markdown that we used for GitHub. Markdown is a “markup language”, which is basically just a way to write plain text that ends up getting formatted nicely. See here for more details: Markdown Guide.)

Jupyter notebooks aren’t usually used for end-to-end machine learning pipelines in a full production environment—which is where normal Python scripts come in—but notebooks are still an indispensable tool in your toolbelt.

We’ll install Jupyter notebooks using conda.

Installation & Usage Instructions

  1. Create a conda environment.
    • First, we’ll create a new conda environment like we do for all of our new projects. Run this in a terminal: conda create -n my-jupyter-env jupyter. This will create the virtual environment and install Jupyter at the same time.
    • Activate your environment by running conda activate my-jupyter-env.
  2. Run Jupyter notebooks.
    • In your terminal, run jupyter notebook. You should see some text printed to the console telling you that Jupyter notebooks is starting up and running.
    • A new window should open in your browser with Jupyter notebooks running. If it doesn’t, you may need to copy the link in your terminal and paste it into a browser window.
    • You now have Jupyter notebooks up and running! The notebooks are running in your terminal, and the way that you interact with the notebooks is through your browser. Your browser isn’t connecting to the Internet like usual, though—it’s simply connecting to your local computer, since that’s where you’re running Jupyter notebooks.
  3. Create a new notebook and play around with it.
    • In the top right of your browser window, you should see a dropdown that says “New”. Click that, and select “Python 3”. This creates a new Jupyter notebook running Python.
    • Type some Python into the cell in your new notebook, and use the keyboard shortcut “shift-enter” to run the cell.
    • In the next cell, change the cell type to Markdown (use the dropdown at the top of the notebook, and change it to markdown), enter some Markdown code, and use “shift-enter” to format the Markdown cell.
    • If you want, play around with other notebook functionality.
  4. Kill your Jupyter notebook process in your terminal.
    • To stop Jupyter notebooks from running, simply go back to your terminal and use the keyboard command “ctrl-c”. You may need to hit it twice. Jupyter notebooks should print out text saying that it’s shutting down.

You now know how to install Jupyter notebooks using conda, how to run Jupyter notebooks, and how to create a new notebook.

As a side note, Jupyter notebooks can be stored in version control just like regular Python scripts. (Fun fact—pretty much anything can be stored in version control, although you typically don’t want to store data or large files.)
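
Part of the reason notebooks work with version control is that a notebook file is just JSON text. The sketch below builds a tiny two-cell notebook by hand to show the idea. It assumes the version 4 notebook layout, the function name build_notebook is hypothetical, and a notebook saved by Jupyter itself will usually carry more metadata than this.

```python
import json


def build_notebook(markdown_text, code_text):
    """Return a dict laid out like a version-4 notebook file."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": markdown_text},
            {
                "cell_type": "code",
                "metadata": {},
                "source": code_text,
                "execution_count": None,
                "outputs": [],
            },
        ],
    }


if __name__ == "__main__":
    notebook = build_notebook("# My analysis", 'print("hello world!")')
    # Plain JSON text is what git ends up tracking for a notebook.
    print(json.dumps(notebook, indent=1))
```

In practice you’ll create notebooks through the browser as described above. This is only meant to show what ends up in the file that git tracks.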

Project organization: Cookiecutter Data Science

One of the hardest parts about being a data scientist (and about programming in general) is having a clear, logical way of organizing your code.

Thankfully, Cookiecutter Data Science is here to help out.

Cookiecutter Data Science is essentially a template project directory that you can use when you start new data science projects. Rather than having to always remember how to best organize your files, you can use this template as a good starting place and make adjustments from there.

One tip here is that you don’t have to use the entire Cookiecutter Data Science template if you don’t want to. There are a ton of files and code included in the template, and you probably won’t use much of it—especially if you’re just getting started in data science.

A good suggestion for starting out is to just use the main directories and scripts that you need. A good starting place is to only use the “data”, “notebooks”, and “src” folders.

Installation Instructions

If you want to install the full Cookiecutter Data Science template, you can follow these steps (which are from the Cookiecutter Data Science GitHub page).

  1. Install cookiecutter.
    • You can install it via pip or conda. Feel free to create a new conda environment, or you can install it directly into your “base” environment if you prefer.
    • To install via conda, run this command in your terminal: conda install -c conda-forge cookiecutter.
  2. Use cookiecutter to download and create a data science project template.
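    • Run cookiecutter in your terminal, followed by the URL of the Cookiecutter Data Science template (the exact command is given on the Cookiecutter Data Science GitHub page).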
    • In your terminal, you should get prompted to enter some information about the project you want to create. For anything that doesn’t make sense, just hit enter and it will be left blank or cookiecutter will choose a good default for you.
    • After you’re done entering all of the options, you should see that a new directory has been created with the name you gave it. There will be many files and directories in there, forming a good starting point for your project.
  3. Delete anything you don’t want.
    • As mentioned above, feel free to delete anything you aren’t going to use. Especially when you’re just starting out, you probably only need 25–50% of the files included in the template.
    • A good starting place is to just keep the “data”, “notebooks”, and “src” folders.

Another way that you can approach the Cookiecutter Data Science template is to simply look at the directory structure on the website and manually create only the folders and files that you want. This can sometimes be a faster way to create a minimal subset of what’s included.
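
The manual approach can be sketched in a few lines of Python. This creates only the three folders suggested above. The function name create_minimal_layout and the example project name are hypothetical, and you can add or remove folder names to taste.

```python
from pathlib import Path

# The three folders this guide suggests keeping from the template.
STARTER_FOLDERS = ["data", "notebooks", "src"]


def create_minimal_layout(project_dir):
    """Create the project directory with only the starter folders."""
    project = Path(project_dir)
    for folder in STARTER_FOLDERS:
        (project / folder).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in project.iterdir() if p.is_dir())


# Example usage: create_minimal_layout("my-project-directory")
```

Because exist_ok=True is set, running it again on an existing project won’t overwrite anything or raise an error.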

Conclusion & Environment Checklist

And with that, you now have a fully functioning professional data science setup! This is exactly the same basic coding environment that many professional data scientists use, so you should feel confident going into projects with these tools.

Checklist for a New Data Science Project Environment

To conclude, here is a formula to get you up and running with any new data science project. (Example code is shown as sub-bullets.)

  1. Open up your terminal.
  2. Create a new conda environment for your project.
    • conda create -n my-project-env pandas jupyter scikit-learn matplotlib seaborn
    • conda activate my-project-env
  3. Create a new project directory using cookiecutter.
  4. Open up your new project directory in VS Code.
    • code my-project-directory
  5. Open up a terminal in VS Code, initialize a new git repo, and take a first snapshot.
    • git init
    • git add .
    • git commit -m "First commit."
  6. Create a new repo on GitHub, then follow the instructions to push your code from your computer project directory to that repo.
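
Before starting a new project, it can help to confirm that every tool from this guide is reachable from your terminal. Here is a minimal sketch that looks each command up on your PATH. The function name locate_tools is hypothetical, and a “NOT FOUND” result only means the command isn’t on the PATH of the process running the script, which can happen if you skipped a section (for example, if you kept bash instead of zsh).

```python
import shutil

# Command names as used in this guide; "code" is the VS Code launcher.
TOOLS = ["zsh", "python", "conda", "code", "git", "jupyter", "cookiecutter"]


def locate_tools(names=TOOLS):
    """Map each command name to its path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name:<14}{path or 'NOT FOUND'}")
```

Run it with your project’s conda environment activated, since jupyter and cookiecutter are only visible in the environment where you installed them.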

And now you’re ready to go for your next data science project.
