80/20 Git—Git for Data Science. Project Data Science.

In this post, you’re going to learn the 20% of Git (and GitHub!) that you’ll use 80% of the time.

(This guide is emphatically not meant to be comprehensive. Instead, it will show you how to get up and running quickly with the most useful commands.)

Git is one of the most popular tools that nearly all programmers and data scientists need to know how to use. Simply put, Git is a tool that lets programmers track their code, save their code, and collaborate on their code with others. GitHub is a website dedicated to helping people collaborate on code using Git.

One quick thing before we dive in. If you need to get a professional data science environment set up on your computer, we have a guide for that: Step-by-Step Guide to Setting Up a Professional Data Science Environment.

Alright—Ready to get started?

80/20 Git

Primary Functionality

One of the first things to ask when faced with a new technology is, “What is the primary functionality? What is this supposed to do? What problem is it solving?”

With Git, the primary functionality is something called version control. To explain version control, we’ll divide it up into three separate parts: versioning, branching, and collaborating.

We’re going to show specific hands-on examples of all three of these later on, but first we’ll discuss them in the abstract.

(If you want to, you can jump straight down to the hands-on section: Creating a Git Repository.)

Versioning

Let’s say you’re developing a piece of code and saving it as you make changes. But, at some point you realize: “You know what? I shouldn’t have deleted that one function… I really need it now.”

If you’re just saving the file like normal, the way you might save any other text document on your computer, then there’s pretty much no way to get back to that version of your code. You can perhaps rewrite the function from memory, but the original is gone.

This is where Git helps. With Git, you can take snapshots of your code called commits. Each commit is a perfect record of exactly what all of your code looked like at a given moment in time.

So in our example above, if you had been using Git to store commits (snapshots) of your code, you could easily look back at an old commit to find the function that you need.
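As a rough sketch of what "looking back at an old commit" can look like, the commands below build a tiny throwaway repo, delete a function, and then print the older version of the file. The helper() function, the commit messages, and the placeholder name and email are all made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched
cd "$(mktemp -d)"
git init -q
# Placeholder identity, only needed if Git isn't configured on this machine
git config user.name "Example User"
git config user.email "user@example.com"

# First snapshot: main.py contains a (hypothetical) helper function
printf 'def helper():\n    return 42\n' > main.py
git add main.py
git commit -q -m "Add helper"

# Second snapshot: the function has been deleted
printf '# helper removed\n' > main.py
git commit -q -am "Remove helper"

# Print main.py exactly as it looked one commit ago
git show HEAD~1:main.py
```

Here HEAD~1 means "one commit before the current one", so the deleted function shows up again in the output even though it is gone from the file on disk.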

Versioning is like saving, but better.

Branching

Now let’s say that you have a large application, and you want to add a new feature to it. But, that feature is going to take a little while to develop, and you want to develop the feature using a separate version of the code just to keep things nice and clean.

This is where branching comes in. In Git, you would simply create a branch off of your code, which is essentially a way to track just your feature without interfering with the main codebase. All of your changes are isolated to just that branch, and you can see the main branch at all times if you want.

If you finish developing your feature and decide that you do want to incorporate it into your main codebase, you can merge your feature code back into the primary branch (the primary branch is usually called the master or main branch). And if you decide you don’t want that feature after all, no problem—just delete the branch and move on.

Another benefit of doing branches in this fashion is that you can easily continue to make small (or large) updates to the primary branch that are completely separate from your feature branch. For example, let’s say while you’re developing your big new feature, you discover a bug (yikes!) in your primary branch. That’s no problem, because you can just go to your primary branch and fix the small bug without worrying about the whole other development process that’s happening for the feature branch.

Collaborating

The ability to collaborate on code with other people is really just a result of the branching and versioning functionality.

When you’re working on code with other people, each of you can develop new features in branches, or you can make code changes in different commits, and then you can use Git to merge all of those branches and all of those commits back together into your primary branch.

Creating a Git Repository

We’re going to walk through the typical process of what you would use Git for in a real project.

Git is predominantly a command line tool (although there are other ways to use it), which means that we’ll need to open up a terminal to get started.

I’m going to create a new project directory to keep my code in, and then open up that directory inside of VS Code.

Once inside of VS Code, I’ll go ahead and open up a terminal at the bottom where we can use Git.

Without any code, there’s nothing to track using Git! So let’s create a simple Python file with a single function in it.
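If you'd rather do this step from the terminal, here's a minimal sketch. The directory name my-project and the add() function are assumptions for illustration; any small Python file will work.

```shell
# Create a project directory and move into it
# (the name "my-project" is just an example)
mkdir -p my-project
cd my-project

# Write a small Python file containing a single function.
# The add() function is an assumption, not necessarily the original file.
cat > main.py <<'EOF'
def add(a, b):
    return a + b
EOF

# Print the file to confirm what we wrote
cat main.py
```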

Initializing a Repo

To initialize a Git repository (repo) and start the process of version control, we can go to our terminal and run “git init”. Since I’m using Zsh with a Git-aware prompt, you’ll notice that my prompt responds to the presence of a Git repository and shows some helpful information, like what branch I’m on (master) and whether there are uncommitted changes to code (the “x” tells me that there are changes I haven’t committed yet).

So we’ve basically just told Git, “Hey—I’m about to start tracking some files. I want you to get ready to track files in this project directory for me.”

But, we haven’t actually started tracking anything yet.
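Here's a small sketch of the initialization step, run in a throwaway directory (created with mktemp) so it's safe to try anywhere.

```shell
# Start from an empty throwaway directory
cd "$(mktemp -d)"

# Tell Git to start managing this directory
git init

# Git keeps all of its bookkeeping in a hidden .git folder,
# which shows up when we list hidden files
ls -a
```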

Git Status

Let’s take a look at the status of our repository using the “git status” command. (A repository is just a directory of files whose history Git is tracking, by the way. Git keeps its records in a hidden .git folder inside that directory.)

Git tries to be pretty helpful. Here, Git is telling us a few things:

  1. We’re on the master branch. (The primary branch, also called “main”.)
  2. We don’t have any commits (snapshots) yet.
  3. We do have one untracked file, main.py.
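To reproduce that status report yourself, something like the following should work. It uses a throwaway directory, and the contents of main.py are an assumed example.

```shell
# Throwaway repo containing one file that Git isn't tracking yet
cd "$(mktemp -d)"
git init -q
printf 'def add(a, b):\n    return a + b\n' > main.py

# Ask Git what it currently sees in this directory;
# main.py should be listed under "Untracked files"
git status
```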

Staging Files Using Git Add

Git also tells us we can start tracking files using the “git add” command. Let’s do that and then run “git status” again.

So we’ve just staged a file to be committed, which Git tells us in the “Changes to be committed” section. But, we haven’t actually created a commit yet.

The “staging area” is helpful because it lets us collect the files that we want to be committed before actually committing the changes. If a commit is a snapshot, then the staging area is where you line up just the files that you want to take the snapshot of. (Like lining up a subset of family members for a photo.)
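A minimal sketch of staging, again in a throwaway directory with an assumed main.py:

```shell
# Throwaway repo with one untracked file
cd "$(mktemp -d)"
git init -q
printf 'def add(a, b):\n    return a + b\n' > main.py

# Stage main.py so it will be included in the next commit
git add main.py

# main.py should now appear under "Changes to be committed"
git status
```

Nothing has been committed at this point. The file is only lined up for the next snapshot.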

Committing Your Changes

Finally, let’s commit this new file. When you create a commit, you’ll want to add a short commit message with a description of what changes happened in that commit. You can do this using the command: “git commit -m ‘<commit-message-here>’”.

You’ll notice that Git tells us that 1 file was changed, and the small “x” has now disappeared from my terminal (because I’m using Zsh) showing me that I don’t have any more changes to be committed to my Git repo.

Let’s run git status again and see what it says.

Nothing to commit right now.
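Putting the commit step together, here's a sketch you can run in a throwaway directory. The commit message and the placeholder name and email are made up for illustration.

```shell
# Throwaway repo with main.py already staged
cd "$(mktemp -d)"
git init -q
# Git records who made each commit; these values are placeholders
git config user.name "Example User"
git config user.email "user@example.com"
printf 'def add(a, b):\n    return a + b\n' > main.py
git add main.py

# Take the snapshot, with a short message describing the change
git commit -m "Add main.py with add function"

# With everything committed, status reports a clean working tree
git status
```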

Git Log

We can see a brief summary of our previous commits using the “git log” command.

(You can type the letter “q” to get out of the git log screen once you’re done looking at your commits.)
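Here's a quick sketch of viewing the log, using a throwaway repo with a single commit (the commit message and identity are placeholders). The --oneline flag is an optional extra that prints a more compact view.

```shell
# Throwaway repo with one commit
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
printf 'def add(a, b):\n    return a + b\n' > main.py
git add main.py
git commit -q -m "Add main.py"

# Full log: hash, author, date, and message for each commit
git log

# Compact view: one commit per line
git log --oneline
```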

Storing Code on GitHub

Since we have a basic Git repository set up and our code committed, let’s talk about where we can store our code.

We’re already storing the code on our computer, but we typically want to store the code in another location as well. One reason for this is to have a backup of our code, and another reason is so that we can more easily share our code and collaborate with others on the same codebase.

One of the most popular places to store Git repositories is GitHub, a website devoted to helping people store and share Git repositories.

I’m going to log in to my GitHub account and create a new repository on GitHub to push my code to. This way, we’ll have a location on GitHub to store the code that’s currently only on our computer.

First, I’ll click the “New” button on GitHub.

Then I’ll give my repo a name (which doesn’t have to match your local repo name, but it often does) and a short description. I’ll leave everything else as the default and then click “Create Repository”.

GitHub now gives me instructions for what to do to get my code onto GitHub. We’ll use the instructions to “push an existing repo from the command line”, which is three lines of code.

Although Git has historically called the primary code branch the master branch, GitHub switched to calling the primary branch the main branch in 2020, so the code that we copy is also going to rename our branch for us.

Pushing Your Local Repo to GitHub

The first line of code that we’ll copy-paste from GitHub into our terminal tells Git that we’re adding a remote location named origin. A remote is what we call another place that we store our code. A remote doesn’t have to point to GitHub, but it often will.

After copy-pasting that code, let’s go ahead and run “git remote -v” to take a look at our remotes and the URLs that they point to.

So you’ll see that we have a remote named origin, and the URL is the same for both fetch (pulling code down from GitHub to our computer) and push (sending code up to GitHub from our computer).

The second line of code renames our branch from master to main.

And the third line of code pushes our code from our local computer up to GitHub, where it’s stored safely on their servers and is available to share (if we want) via a URL.
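The three commands described above look roughly like the sketch below. So that it runs without a GitHub account, it uses a bare repository on the local disk (fake-github.git) as a stand-in for GitHub. That stand-in is purely an assumption for this example, and with a real repo you'd use the URL that GitHub shows you instead.

```shell
cd "$(mktemp -d)"

# Stand-in for GitHub: an empty bare repository on the local disk
git init -q --bare fake-github.git
REMOTE_URL="$PWD/fake-github.git"

# A local project with one commit (identity values are placeholders)
mkdir my-project && cd my-project
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
printf 'def add(a, b):\n    return a + b\n' > main.py
git add main.py
git commit -q -m "Add main.py"

# 1. Register the remote location under the name "origin"
git remote add origin "$REMOTE_URL"
git remote -v

# 2. Rename the current branch to main
git branch -M main

# 3. Push main to origin; -u remembers origin/main as the upstream,
#    so later pushes can be a plain "git push"
git push -u origin main
```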

If we go back to our GitHub repo page and click refresh, we’ll see that we now have one file up there!

The README.md File

GitHub at this point gives us a very helpful suggestion: “Help people interested in this repository understand your project by adding a README.” There’s a special type of file named README.md that we can include to add helpful information about our repo. The “.md” extension indicates that this is a Markdown file—Markdown is a markup language that is used to create rich text documents using plain text. Essentially, it’s a way of creating nicely formatted text documents.

When we create a README.md file in the root of our project directory, GitHub renders it and turns it into a documentation page for our repo. For example, here’s what the top of the README.md file for the pandas GitHub repo looks like. This shows up automatically when you visit the page.

Let’s create a simple README.md file for our repo and push it to GitHub. At the same time, we’ll also create another common file called .gitignore, which tells Git which files it shouldn’t track. (We’ll just leave .gitignore empty for now.) We pretty much always create the README.md and .gitignore files right when we create new Git repos. They’re very standard.

Let’s go ahead and add these files to the staging area, commit them in a single commit, and then push the new code to GitHub. We’ll use the command “git add .” (that’s a period at the end) to stage all of the new changes and files. (Or we could add them separately, one at a time.)
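Here's a sketch of that stage, commit, and push sequence. As an assumption to keep it runnable offline, a local bare repository (fake-github.git) stands in for GitHub, and the README text, commit messages, and identity are placeholders.

```shell
cd "$(mktemp -d)"

# Setup: a stand-in remote plus a project that has already been pushed once
git init -q --bare fake-github.git
REMOTE_URL="$PWD/fake-github.git"
git init -q my-project
cd my-project
git config user.name "Example User"
git config user.email "user@example.com"
printf 'def add(a, b):\n    return a + b\n' > main.py
git add main.py
git commit -q -m "Add main.py"
git branch -M main
git remote add origin "$REMOTE_URL"
git push -q -u origin main

# Create the two new files (.gitignore stays empty for now)
printf '# My Project\n\nA small example repo.\n' > README.md
touch .gitignore

# Stage everything new or changed, commit it once, and push
git add .
git commit -m "Add README and .gitignore"
git push
```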

Now our GitHub repo has a nice README being displayed on the main page.

By the way, if you ever stage something for a commit and then you want to un-stage it, you can use the “git reset” command.
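For example, un-staging a single file might look like this sketch. The file notes.txt is hypothetical, and the repo lives in a throwaway directory.

```shell
# Throwaway repo with one commit (identity values are placeholders)
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
printf 'def add(a, b):\n    return a + b\n' > main.py
git add main.py
git commit -q -m "Add main.py"

# Stage a file we didn't really mean to include
printf 'scratch notes\n' > notes.txt
git add notes.txt
git status --short   # notes.txt is listed as staged ("A")

# Un-stage it; the file itself stays on disk untouched
git reset notes.txt
git status --short   # notes.txt is back to untracked ("??")
```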

Cloning a Repo from GitHub

So let’s say that we accidentally delete our code, or we spill coffee on our computer, or we throw our laptop out a window. I’ll go ahead and close VS Code, open up a terminal, and delete our entire project directory.

Poof, it’s completely gone!

Lucky for us, our code is stored on GitHub. This means we can get the entire repo and the entire history of commits back on our computer with a single command: clone.

We’ll grab the URL to clone from our GitHub page.

Then we hop back in our terminal and just run “git clone <URL>”.
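To try the whole delete-and-recover cycle safely, here's a sketch that uses a local bare repository (fake-github.git) as a stand-in for GitHub. The directory names and identity are assumptions, and with a real repo you'd pass the URL copied from your GitHub page to git clone.

```shell
cd "$(mktemp -d)"

# Setup: a project with one commit (identity values are placeholders)
git init -q original
(
  cd original
  git config user.name "Example User"
  git config user.email "user@example.com"
  printf 'def add(a, b):\n    return a + b\n' > main.py
  git add main.py
  git commit -q -m "Add main.py"
)

# Stand-in for GitHub: a bare copy of the project
git clone -q --bare original fake-github.git

# Simulate the disaster: the local project is gone
rm -rf original

# One command brings back the files and the full commit history
git clone fake-github.git my-project
cd my-project
ls
git log --oneline
```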

And we have our code back! Let’s open it back up in VS Code.

Looks like we have all of our files, and we’re back to exactly where we were.

Branching and Merging

Now let’s talk about our last big piece of functionality for Git: branching and merging.

Branching is what happens when you create a separate branch of your code to work on, usually to do something like develop a new feature or fix a bug. This branch gives you a copy of your code that’s isolated from the primary branch.

Merging is what happens when you want to bring the changes from your branch back into the primary branch. And actually, you can merge any branch into any other branch—merging just means to take changes from one branch and bring them into another branch.

Creating a Branch

First, let’s create a branch called “feature/add-subtract-function”.

We used the “git checkout -b” command to create the new branch and switch to it, all in one step. My Zsh shell has updated the red letters to show me that I’m now on a new branch.

But notice that we’re still in the same project directory and nothing looks different. The code is still exactly the same, the files are still exactly the same—only the branch name has changed so far.
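A minimal sketch of creating the branch, in a throwaway repo with one existing commit (the setup values are placeholders):

```shell
# Throwaway repo with one commit on main
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
printf 'def add(a, b):\n    return a + b\n' > main.py
git add main.py
git commit -q -m "Add main.py"
git branch -M main

# Create the new branch and switch to it in one step
git checkout -b feature/add-subtract-function

# List the branches; the asterisk marks the one we're on
git branch

# The files are unchanged: only the branch name is different
cat main.py
```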

Let’s go ahead and add our new subtract() function to main.py, and then save the file.

Using Git Diff to See Changes

Let’s introduce a new Git command: git diff. By running “git diff” in the terminal, we can see exactly what has changed in our files since our last commit (as long as we haven’t staged those changes yet).

Here we see the green “+” plus signs indicating lines that have been added since our last commit. In this case, we’ve added the subtract function.
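Here's a sketch that reproduces a diff like this one in a throwaway repo. The exact body of subtract() is an assumption.

```shell
# Throwaway repo with main.py committed (identity values are placeholders)
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
printf 'def add(a, b):\n    return a + b\n' > main.py
git add main.py
git commit -q -m "Add main.py"

# Append a subtract() function to the file without staging it
printf '\ndef subtract(a, b):\n    return a - b\n' >> main.py

# Show the unstaged changes; added lines are prefixed with "+"
git diff
```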

Now let’s add this file to the staging area and create a new commit.

If we run “git log” now, we’ll see three commits listed.

Merging a Branch

So far we’ve only added the subtract() function to our feature branch. Let’s say we’re done developing this new function, and now we want to bring it back into our main branch—this is called merging a branch back in. We can do this in two steps.

First, we’ll check out the main branch. Notice that after checking out the main branch, our subtract function disappears from main.py—that function doesn’t exist in this branch.

Now, we can merge in our feature branch.

Voila! In the terminal, Git tells us that we’ve just added three lines of code to main.py. And in the main.py file editor above, we can see that the subtract() function is now there.
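The whole branch-and-merge round trip can be sketched as follows, in a throwaway repo. The function bodies, commit messages, and identity are assumptions for illustration.

```shell
# Throwaway repo with one commit on main
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
printf 'def add(a, b):\n    return a + b\n' > main.py
git add main.py
git commit -q -m "Add main.py"
git branch -M main

# Feature branch with one extra commit
git checkout -q -b feature/add-subtract-function
printf '\ndef subtract(a, b):\n    return a - b\n' >> main.py
git add main.py
git commit -q -m "Add subtract function"

# Step 1: switch back to main (subtract() is not in this branch yet)
git checkout main

# Step 2: bring the feature branch's changes into main
git merge feature/add-subtract-function

# main.py on main now contains subtract()
cat main.py
```

Because main had no new commits of its own in this sketch, Git can simply move main forward to the feature branch's latest commit (a "fast-forward" merge).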

Let’s go ahead and push our changes up to GitHub.

We’ve just developed our first new feature and merged it back into our main codebase! Exciting!

Additional Resources

That’s the main part of the article, and these commands are enough to get you up and running with Git and GitHub.

There’s a lot more to learn about Git from here though, and the best way to learn is to simply play around with it. Here are some of the additional Git topics that you’ll probably need to learn along the way.

  • Deleting branches
  • Resolving merge conflicts
  • Pulling/fetching code from GitHub
  • Reverting your code back to a previous commit
  • Adding files to your .gitignore file
  • …and more!

But don’t worry about learning it all in a vacuum. Go create code and do projects, and you’ll pick up the pieces as you need them.

Conclusion

And with that, you’ve just learned the 20% of Git that will get you 80% of the value. Feels pretty good, right?

Happy learning!
