I know what you’re thinking: “An overview of all of data science?!”
But yes indeed! In this post we are going to try to accomplish the nearly-impossible: We’re going to give you a readable, interesting, comprehensive overview of all of data science. You can be the judge of the extent to which we’ve succeeded.
So let’s get started—what is data science?
Table of Contents
Specifically, here’s what we’re going to cover. Feel free to skip ahead liberally to find the things that you’re particularly interested in. Each section will have code as well as high-level explanations.
- Table of Contents
- An Overview of All of Data Science
- Problem Formulation and Project Planning
- Exploratory Data Analysis (EDA)
- Supervised Learning – Classification
- Supervised Learning – Regression
- Unsupervised Learning – Clustering
- Anomaly Detection
- Research and Statistics
- Recommendation Systems
- Time Series
- Natural Language Processing (NLP)
- Computer Vision
- Types of Data and Databases
- A Brief Overview of Other Data Science Topics
- Conclusion and Next Steps
- Appendix A: Data Science Flowchart
- Appendix B: Data Science Glossary
Why Data Science?
It’s no overstatement to say that data science is revolutionizing our society.
When it comes to machine learning, there are many high-profile examples that quickly come to mind: self-driving cars, computers beating humans at chess and Go, face detection used by Facebook and the government, and human-like chat bots are just a few of the most popular examples.
There are also many less-well-known use cases for machine learning—from the algorithms that power many of the apps that we use, to machine learning models that trade stocks, to models that scan forms in order to automatically fill out documents.
And for every example of machine learning we can think of, there are probably thousands of use cases behind the scenes that don’t get news articles written about them.
But if we expand our notion of data science to include aspects of data analysis, research, and statistics as well, then we can see how these skills have transformed nearly every industry on the planet in a very short amount of time. Consider that Excel was only released about 30 years ago, and today it’s considered the most ubiquitous data analysis tool that practically everyone working with computers should be acquainted with. The idea of split testing (also called A/B testing) an email or a landing page—which is technically a randomized controlled trial, the gold standard of research—is common enough that Facebook Ads has a button devoted to it.
Data analysis and data science are becoming integrated into every aspect of what it means to run a business, a nonprofit, a website, and even a country.
Data Science in a Sentence
If we had to wrap up all of data science into a single pithy sentence, it might look something like this:
“Data science is the use of data and computers to answer questions and build products.”
The word “computers” is important to distinguish data science from pure math or statistics (although these lines get blurrier every year).
The phrase “build products” is important to reference all of the fully integrated machine learning models that drive modern applications.
And using “data” to “answer questions” differentiates data science from computer science, and also covers pretty much everything else from pure research to exploratory data analysis.
The truth is that data science is a very broad, interdisciplinary field that evades attempts to put it into a nice box. It’s one part statistics, one part computer science, one part domain knowledge, plus a sprinkling of many other fields. One of the things you’ll need to do as a data scientist is decide where exactly in the field of data science you want to land—there are many options, and many of them look very different from one another.
But, there is a fairly solid foundation to the field of data science that consists of quantitative reasoning and programming skills—those two abilities are required for everything else that one does as a data scientist.
Alright, that’s enough of an introduction—let’s dive in.
An Overview of All of Data Science
Problem Formulation and Project Planning
Let’s say that you work at a medical company doing research on breast cancer. The company is interested in using cell measurements to determine if those cells are malignant (cancerous cells) or benign (non-cancerous cells). The CEO comes to you and says: “You’re our best data scientist. Can you do this?”
And this is how many data science problems start: with a big, fairly nonspecific ask from someone in the company. Now, before you ever touch a keyboard or analyze a dataset, it’s your job to gather requirements for the project, formulate a specific problem, and create a project plan. Even if the initial project plan is a very loose exploration (which is a good first step!), you need to decide that the exploration is your next action item, and you need to have some idea of what your goals are for the exploration.
It might seem funny to start our overview of data science with “soft skills”, but these are actually some of the most important skills a data scientist possesses. Often the hardest part about a project isn’t the technology or the algorithms or the technical side of things—the hardest part is creating a clear problem definition and making sure everyone is on the same page about what exactly you’re doing. There’s no use working for weeks on a problem, only to realize that you were working on the wrong problem the whole time. You’ll also need a large dose of people skills to be able to work with other people from all around the company, from executives to marketing specialists to business analysts to data engineers to software engineers.
You’re also going to need some domain knowledge at this point to help you understand what data you might have access to and what data might be helpful. You’ll need to have enough of an understanding of the field (in this case, medicine or biotech) that you can speak the lingo and talk with other people about your project and what you need to accomplish the project.
Above all, you need the ability to ask the right questions.
So you present your CEO with some questions that can help you form a clear project plan, and you talk with other people at the company to find the data that you need. Now, you’re ready to start!
Exploratory Data Analysis (EDA)
So let’s say we have access to a dataset with various attributes about tissue samples, as well as a column with the results of if the samples were malignant (cancerous cells) or benign (non-cancerous cells). We want to know if there’s a way to determine malignant cells from benign cells using the data that we have.
To start, we’ll load the data using Python, Jupyter notebooks, and a few other tools that we’ll mention below.
Using Python code inside a Jupyter notebook is one of our go-to approaches for doing the initial analysis of the data.
First we might ask: what percentage of the tissue samples actually ended up being malignant? To answer this question, we could load the data into Python using the pandas library, and then look at the value counts of the Diagnosis column.
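A sketch of that step might look like this (we're assuming the scikit-learn copy of the dataset, which we'll mention again below; there, 0 means malignant and 1 means benign):

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the data and put it into a pandas DataFrame
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# In scikit-learn's copy of this dataset, 0 = malignant and 1 = benign
df["Diagnosis"] = pd.Series(data.target).map({0: "malignant", 1: "benign"})

print(df["Diagnosis"].value_counts())                # frequencies
print(df["Diagnosis"].value_counts(normalize=True))  # proportions
```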
We would have just looked at frequencies (the counts of data by category) and proportions (percentage of data by category). Both of these are descriptive statistics for categorical data. (We also used the map function to convert the zeros and ones into the diagnosis labels.)
If we were to multiply those proportions by 100, we would have the percentage of data points per diagnosis.
But perhaps we’d rather see this information visualized like this, where we could easily see that we have far fewer malignant diagnoses than benign ones.
This would be a perfect example of data visualization, and specifically creating a bar chart.
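A minimal sketch of that bar chart, again assuming the scikit-learn copy of the data (the Agg backend line is only there so the sketch also runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
diagnosis = pd.Series(data.target).map({0: "malignant", 1: "benign"})

counts = diagnosis.value_counts()
ax = counts.plot(kind="bar")         # pandas calls Matplotlib under the hood
ax.set_xlabel("Diagnosis")
ax.set_ylabel("Number of samples")
ax.set_title("Benign vs. malignant samples")
plt.savefig("diagnosis_counts.png")  # or plt.show() in a notebook
```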
Now before we move on, let’s review some of the main tools that we’ve used so far:
- Python is the programming language we used.
- Jupyter notebooks is where we ran our Python code.
- In Python, we loaded our data as a pandas DataFrame (“pandas” is a Python package). A DataFrame is basically a spreadsheet-like representation of the data.
- (Technically we loaded the data from the Python package scikit-learn, but we’ll talk more about this later—the primary purpose of scikit-learn is machine learning.)
- To create the bar chart, we used Matplotlib, which is another Python package.
Now let’s say that we’re interested in learning more information about the columns in our data and what they mean. We would probably consult the metadata or the data dictionary, which would have some more information about each column such as a description of what data the column contains.
Here’s some of the metadata for our breast cancer dataset:
In it, you see information like:
- The number of instances, or rows of data. This might also be referred to as samples.
- The number of attributes, or columns of data. This might also be referred to as features.
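With the scikit-learn copy of the dataset, that metadata is available programmatically; a quick sketch:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# The DESCR attribute holds the dataset's metadata / data dictionary
print(data.DESCR[:500])

print("Instances (rows):  ", data.data.shape[0])
print("Attributes (cols): ", data.data.shape[1])
```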
Now maybe we’re interested in understanding what relationships our attributes might have with our diagnosis. By the way, since the cancer diagnosis is the main thing we’re interested in understanding, we call it our target.
So let’s take a look at one of our features (i.e. attributes): “mean radius”. From our metadata, we know that this feature refers to the average of the radiuses (radii) of the cell nuclei that were sampled from the suspect tissue. (The average is also known as the mean, by the way.)
To understand the relationship between mean radius and diagnosis, we might try plotting two different histograms to show the distribution of the mean radius column: one histogram for the benign samples, and one for the malignant samples. Each histogram groups data points into bins, where each bar’s height represents how many points fall into that bin and its width shows the range of mean radius values the bin covers. (For example, the far-right bars of the orange graph show that there are a few rows of data, but not many, where the mean radius was greater than 25.)
(A quick note. To plot two histograms, first we created two DataFrames: one for the data from the benign tumors, and one for the data from the malignant tumors. To create those two DataFrames, we used boolean masks which are basically lists of “True/False” values that say whether or not to include a row of data in your filtered DataFrame.)
We currently can’t tell which histogram belongs to which set of data, which is a problem. Let’s fix that by adding a legend:
Now we can tell that the blue histogram belongs to the benign samples, and the orange histogram belongs to the malignant samples, but we can’t actually see the whole blue graph since the orange graph covers it.
We can fix that by using a parameter called alpha, which reduces the opacity of each graph:
Let’s make one more edit to this graph—let’s add labels to our x-axis and y-axis so we know the units of our graph. In the case of our histograms, the x-axis shows the mean radius and the y-axis shows the number of data points that fall into each bin.
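Putting those steps together (boolean masks, the legend, alpha, and axis labels), one possible sketch looks like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["Diagnosis"] = pd.Series(data.target).map({0: "malignant", 1: "benign"})

# Boolean masks: a True/False value per row, used to filter the DataFrame
benign = df[df["Diagnosis"] == "benign"]
malignant = df[df["Diagnosis"] == "malignant"]

plt.hist(benign["mean radius"], bins=20, alpha=0.5, label="benign")
plt.hist(malignant["mean radius"], bins=20, alpha=0.5, label="malignant")
plt.xlabel("Mean radius")
plt.ylabel("Number of data points")
plt.legend()
plt.savefig("mean_radius_histograms.png")
```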
And if we wanted a slightly different data visualization that shows the same information, we could use a kernel density estimation (kde) plot instead of a histogram:
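One sketch of the KDE version (pandas computes the kernel density estimate via SciPy under the hood):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["Diagnosis"] = pd.Series(data.target).map({0: "malignant", 1: "benign"})

# One smooth density curve per diagnosis, on the same axes
ax = df.loc[df["Diagnosis"] == "benign", "mean radius"].plot.kde(label="benign")
df.loc[df["Diagnosis"] == "malignant", "mean radius"].plot.kde(ax=ax, label="malignant")
ax.set_xlabel("Mean radius")
ax.legend()
plt.savefig("mean_radius_kde.png")
```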
These graphs are looking pretty good!
By the way, everything that we’ve done so far falls into the realm of exploratory data analysis (EDA). EDA is how we understand our data—through looking at individual rows, descriptive statistics, data visualization, and other techniques. Understanding your data is a very important step to any data science project.
Now let’s go ahead and interpret our findings so far. Interpreting results is one of the primary goals of data science. (One of the other primary goals of data science is prediction. We’ll get to that later.)
Our current interpretation of our findings might be that benign cells usually have an average nucleus radius of less than 15 units (whatever our unit is), while malignant cells usually have an average nucleus radius of greater than 15 units.
Notice that we’re talking about the correlation between our feature and our target, meaning the relationship between them. Importantly, we aren’t implying any kind of causality, such as the idea that a large nucleus causes cells to be malignant. We’re simply noticing that a smaller nucleus radius usually indicates benign cells, and that a larger nucleus radius usually indicates malignant cells.
One very important thing to remember in data science is that correlation does not imply causation.
Now before we move on, we might want to take our data visualization work and turn it into a reusable Python function that we could use later. Let’s create a function that can plot histograms for any feature, not just mean radius. Our function will accept one parameter, which is the column that we want to plot.
We also added a plot title and used Python f-strings to customize title and axis labels.
Let’s call our function with a different column really quickly, just to see how it works:
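A sketch of that refactored function (the name plot_histograms is our own choice):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["Diagnosis"] = pd.Series(data.target).map({0: "malignant", 1: "benign"})

def plot_histograms(column):
    """Plot overlapping benign/malignant histograms for any feature column."""
    fig, ax = plt.subplots()
    ax.hist(df.loc[df["Diagnosis"] == "benign", column],
            bins=20, alpha=0.5, label="benign")
    ax.hist(df.loc[df["Diagnosis"] == "malignant", column],
            bins=20, alpha=0.5, label="malignant")
    # f-strings let us customize the title and axis label per column
    ax.set_title(f"Distribution of {column} by diagnosis")
    ax.set_xlabel(f"{column}")
    ax.set_ylabel("Number of data points")
    ax.legend()
    return ax

# Try it with a different column
plot_histograms("mean texture")
```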
Cleaning up and organizing your code is called refactoring, and it’s a necessary part of all coding.
Supervised Learning – Classification
Now let’s say that we wanted to try our hand at prediction. Specifically, we want to use our features (cell measurements and qualities) to predict our target (whether the cells are malignant or benign).
Our rationale for why we’re doing a data science project is very important, so let’s ask the question: why would it be helpful to be able to detect cancerous cells using cell measurements?
Well, there are all kinds of reasons why this could be useful, but one reason might be so that we can make faster and more accurate diagnoses of whether cells are cancerous or not. Faster diagnosis could lead to faster treatment, which could save lives and prevent suffering.
But, in some cases we might decide that there isn’t a strong enough rationale to pursue a data science project, especially in cases where data collection is difficult or expensive. Data science projects are most often chosen and pursued because they have clear, strong rationales.
So, we want to make predictions. This means that we want to create a model that can make predictions, a technique appropriately named predictive modeling. A model is simply an equation that takes data as an input and returns a prediction as an output.
You should always start out with a very simple model. This is called a baseline model, and is a way of telling you what the bare minimum results are that you should expect from any future models. Your baseline model gives you a good starting place. (See this excellent article.)
We’ve actually already stumbled upon a way to create our first model, when we were creating our histograms earlier. When we said that “most benign tumors have a mean radius less than 15 units” and “most malignant tumors have a mean radius greater than 15 units”, we were creating a logical way to make predictions. All we have to do to create a simple model is to turn those words into a Python function.
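Turned into code, that baseline model could be as simple as this (the 15-unit cutoff is just what we eyeballed from the histograms):

```python
def predict_diagnosis(mean_radius):
    """A hand-made baseline model: threshold eyeballed from our histograms."""
    if mean_radius < 15:
        return "benign"
    return "malignant"

print(predict_diagnosis(10))  # benign
print(predict_diagnosis(20))  # malignant
```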
The kind of question that we’re answering with this model is known as a binary classification problem: “classification” because we want to predict a discrete label or category (as opposed to a continuous numeric prediction, which is called regression—we’ll discuss regression later), and “binary” because there are only two labels/categories that we could predict.
Our current model is very simple and wouldn’t count as machine learning since we manually specified everything about the model (rather than having a computer figure out the model), but our simple model still helps us learn something important about modeling: fundamentally, any model is just a function that takes in inputs and returns a prediction as the output.
This is an important point that bears repeating: Fundamentally, any model is just a function that takes in inputs and returns a prediction as the output.
(By the way, in statistics the inputs are known as the independent variables, and the output is known as the dependent variable. In addition to a function, another way to think of a model is as an equation.)
Classification Model Evaluation Metrics
Now that we have a model, we should ask the question: how good is our model?
To measure how good our model is, we need a model evaluation metric that conveys information about what our model is getting right, and what it’s getting wrong. One of the easiest evaluation metrics to understand for binary classification is accuracy: the proportion of data points that we classify correctly.
To calculate accuracy, first we’ll need to make predictions using our data. A simple way to get predictions using our model is to loop through our rows of data one by one and pass each value of “mean radius” to our model like this:
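As a self-contained sketch (using the scikit-learn copy of the data, where "mean radius" happens to be the first column):

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
mean_radius = data.data[:, 0]          # "mean radius" is the first column

def predict_diagnosis(r):
    """Baseline model: threshold eyeballed from our histograms."""
    return "benign" if r < 15 else "malignant"

predictions = []
for r in mean_radius:                  # loop through the rows one by one
    predictions.append(predict_diagnosis(r))

print(predictions[:5])
```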
Now that we have our predictions, we need to compare them to our true values to answer the question: How many did we get right?
Using the Python package NumPy, we can quickly compare each prediction to each corresponding true target value. First, we convert our predictions list into a NumPy array.
Then, we can create a boolean array that indicates which predictions we got correct, and which we got wrong.
To calculate our accuracy, we can count the number of “True”s and divide that by the total number of data points. This is also the same as just using the “.mean()” method on the boolean array we’ve created. In both cases we get an accuracy of about 89%.
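In code, that NumPy comparison might look like this (a self-contained sketch; the exact accuracy you see may differ slightly from the 89% quoted here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
# In this dataset, 0 = malignant and 1 = benign
y_true = np.where(data.target == 0, "malignant", "benign")
predictions = np.array(
    ["benign" if r < 15 else "malignant" for r in data.data[:, 0]]
)

correct = predictions == y_true        # boolean array: which ones we got right
accuracy = correct.sum() / len(correct)
print(f"{accuracy:.1%}")               # same as correct.mean()
```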
Getting 89% accuracy for a manually specified model using only one feature doesn’t sound that bad, does it?
It just goes to show you how starting simple is a great strategy. It helps you work through the basic fundamentals of your problem, and it gives you a baseline metric that all future models should (hopefully) do better than. (And if your future models don’t beat your original metric…then you should go with your baseline model!)
Cross-Validation, Overfitting, and Underfitting
Now before we move on, this is the point where I tell you that we’ve violated one of the most important rules of evaluating predictive models: always evaluate your model on unseen data.
And here’s a slightly more technical version of that same imperative: never evaluate your model on data that was used to train your model. (We’ll talk about training models in a bit.)
We always want to evaluate our model on unseen data. This is the only way we can know if our model will generalize well to new data points.
A quick example will show the dangers of evaluating your model on the same data that was used to train your model.
Remember that a model is simply a function that takes features (data) as inputs, and returns a prediction as an output. So suppose we use our data to create a model that looks like this:
What this model does is this:
- First, it needs to have access to the full dataset.
- Then, when we pass in a mean radius, the model simply looks in our dataset for that example mean radius value.
- When it finds that mean radius value, it just returns exactly what the diagnosis was for that value.
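As a sketch, that lookup-table "model" might be nothing more than a plain Python dict built from the data:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
diagnoses = ["malignant" if t == 0 else "benign" for t in data.target]

# "Fit" the model by memorizing the entire dataset: mean radius -> diagnosis
lookup = dict(zip(data.data[:, 0], diagnoses))

def lookup_model(mean_radius):
    """Return whatever diagnosis the dataset recorded for this exact value."""
    return lookup[mean_radius]

# Evaluating on the SAME data we "trained" on looks great...
correct = sum(lookup_model(r) == d for r, d in zip(data.data[:, 0], diagnoses))
print(f"{correct / len(diagnoses):.1%}")

# ...but truly new data breaks it completely
try:
    lookup_model(35)
except KeyError:
    print("KeyError: the model has never seen a mean radius of 35")
```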
In a single phrase: we’ve just created a lookup table! Our model simply looks through all our rows of existing data and plucks out the correct answer.
Unsurprisingly, our model does quite well when we evaluate it, achieving 95% accuracy:
But hopefully you can see that this method of making “predictions” is completely flawed. We haven’t learned anything at all here—we’re just cheating every single time by looking at our data.
This becomes quickly apparent when we pass in truly new data. Let’s pass in a mean radius of 35, which is larger than any of the data points in our dataset:
The whole thing breaks spectacularly, because the only thing this model knows how to do is look up values that already exist in our data.
So our high accuracy here is a fraud, based on “cheating”. This kind of falsely high accuracy is called overfitting, and it’s what happens when we don’t evaluate our model on new, unseen data. Overfitting is one of the main things machine learning engineers watch out for.
Underfitting is the other main thing that they watch out for, and it’s perhaps a bit easier to explain. Underfitting is simply what happens when your model isn’t as good as it could be, because it isn’t capturing the relationships in the data.
Overfitting and underfitting are two opposites on a spectrum: anytime you move away from one end, you’re moving toward the other. The best models sit somewhere in the middle of the two extremes. There’s a concept known as the bias-variance tradeoff that describes this balancing act we have to do when we’re modeling.
So we want to evaluate our model on unseen data that we didn’t use when we created the model. But how do we accomplish that?
We can solve this problem a couple of different ways. A simple way to solve it is to use a test set. A test set is a portion of the data that you set aside in the very beginning, before you start building your model. Then, once you’re done building your model, you evaluate your model by making predictions on the test set.
A more common way to solve the problem these days is to use a technique called cross-validation. I won’t go into detail here on what cross-validation does, but it’s kind of like creating a bunch of smaller test sets, evaluating your model on each of those small test sets, and then averaging all of your results together.
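As a quick sketch of the idea with scikit-learn (here we borrow the logistic regression model discussed further below, just to have something to cross-validate):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X = data.data[:, [0]]   # mean radius only
y = data.target

# 5-fold cross-validation: five small "test sets", one accuracy score each
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)
print(scores.mean())
```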
For now, our original baseline model was simple enough that skipping the test set and cross-validation didn’t affect things too much. We can stick with 89% accuracy as our initial baseline model performance metric. But from now on, we’ll always use either test sets or cross-validation to evaluate our models on unseen data.
Fitting Our Model and Making Predictions
Let’s go ahead and refactor our code. We’re going to turn our model into a Python class that encapsulates the functionality we want. Specifically, we’re going to edit our model to automatically find the prediction threshold.
First, we can just set up the basic structure of our class with a docstring that describes the class and an __init__ method that sets up the variables our class will hold.
Now we’re going to create a method (a Python function defined inside a class) that takes our data as input and automatically finds the prediction threshold. Using data to find your model parameters is called fitting your model.
This method currently only works when we know that the benign data points are on the lower side of the classification threshold. We can modify our function to also work for data where the malignant cells are on the lower side.
Note that we’ve started using the convention of X and y for our data, which is very common in data science and machine learning. The capital X is a variable that holds all of our features, and the lowercase y is a variable that holds our target. Since X could contain many columns, it is technically a two-dimensional array, and since y only contains one column, it is a one-dimensional array.
In this case, our model only works when X contains a single feature, but most models accept an X variable with many features.
Let’s give our model a try.
First, we’re going to do that very important thing that we mentioned earlier—we’re going to split off part of our data into a test set so that we can evaluate our model on unseen data.
The data that we use to fit our model is called our training data, which is why we called two of the variables X_train and y_train. Now let’s fit our model using our training data.
First, we create an instance of our model class. Basically, this means that we create a new “empty” version of the class that hasn’t been fit yet. Then, we fit our model using the training data. This finds and stores the best threshold to use for classification. Finally, we get that threshold (which is a class attribute) to see what it is.
It looks like our model found an optimal threshold of 15.08, which is very close to what we originally thought the best classification threshold might be!
Let’s use this threshold to make predictions on our test set, and then see what accuracy we got. First, we need to add a “predict” method to our model class. We’ll just add it below the other methods.
If we call our predict method with our X_test data, then we get predictions that we can check against y_test to get our test set accuracy.
We got an 87% accuracy on our test set, which is fairly close to the 89% accuracy we got before. But, using a test set (or cross-validation) is incredibly important, especially as models get more complex.
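Putting the whole walkthrough together, here is one possible sketch. The fitting algorithm isn't spelled out above, so we assume a simple search over candidate thresholds (midpoints between sorted feature values) that maximizes training accuracy; the class name ThresholdModel and the random_state are our own choices, so your exact threshold and accuracy may differ a little from the 15.08 and 87% quoted above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

class ThresholdModel:
    """A one-feature classifier that learns a single cutoff value.

    Sketch only: fit() searches candidate thresholds and keeps whichever
    threshold and orientation give the best training accuracy.
    """

    def __init__(self):
        self.threshold = None
        self.low_label = None    # label predicted below the threshold
        self.high_label = None   # label predicted at/above the threshold

    def fit(self, X, y):
        X, y = np.asarray(X).ravel(), np.asarray(y)
        labels = np.unique(y)
        xs = np.sort(np.unique(X))
        candidates = (xs[:-1] + xs[1:]) / 2   # midpoints between values
        best_acc = -1.0
        for t in candidates:
            # Try both orientations, so the model also works for data
            # where the malignant cells are on the lower side
            for low, high in [(labels[0], labels[1]), (labels[1], labels[0])]:
                acc = (np.where(X < t, low, high) == y).mean()
                if acc > best_acc:
                    best_acc = acc
                    self.threshold, self.low_label, self.high_label = t, low, high
        return self

    def predict(self, X):
        X = np.asarray(X).ravel()
        return np.where(X < self.threshold, self.low_label, self.high_label)

# X is our one feature (mean radius), y is the target
data = load_breast_cancer()
X = data.data[:, 0]
y = np.where(data.target == 0, "malignant", "benign")

# Split off a test set BEFORE fitting, so we can evaluate on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = ThresholdModel()        # create an "empty", unfit instance
model.fit(X_train, y_train)     # fitting finds and stores the threshold
print(model.threshold)

predictions = model.predict(X_test)
accuracy = (predictions == y_test).mean()
print(f"Test accuracy: {accuracy:.1%}")
```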
Defining Supervised Learning
This problem that we’re trying to solve—using data to create a model that can make good predictions—is what’s known as supervised learning.
It’s called “supervised” because we have data where we know what the “correct” prediction should be, and there is a single correct prediction: in this dataset for example, we know that cells are either malignant or benign, and we have data that shows us which cells are which.
It’s called “learning” because rather than specifying the exact model ourselves, we write code to figure out what the specific model parameters should be (and sometimes what the model should be!). For example, our model above uses a simple threshold for prediction—this high-level equation is something that we defined. But, we didn’t define what the specific threshold should actually be. For that, we had an algorithm that figured out the threshold. We didn’t discuss the algorithm, but you could figure it out if you went through the “.fit()” code line by line.
So, just to recap: supervised learning is a type of machine learning; classification problems are a type of supervised learning; and binary classification problems are a specific type of classification problems.
We’ll discuss other types of machine learning and other types of supervised learning later on.
So far, we’ve been manually creating our algorithms and evaluating our results. Most of the time though, data scientists will use algorithms and models that have been developed and optimized by others.
One of the most popular classification models is called logistic regression. It operates very similarly to the model that we created, where it’s essentially trying to find a threshold to use for classification.
However, logistic regression does something a bit better for us: it gives us a probability of a data point having a certain prediction, rather than just the prediction itself. Let’s see what this looks like for a few data points.
First, we’ll load the model from the scikit-learn package, which is one of the most popular machine learning packages for data scientists. Then, we’ll fit the model.
Now, our model is ready to make predictions.
Let’s look at our graph of our feature and target again:
Let’s see what prediction our logistic regression model gives us for a data point that’s clearly in the benign region:
It predicts “benign”, just like we would expect it to. Now let’s look at the probabilities that our model had for predicting each value:
The first value is the probability of our data point being benign, and the second value is the probability of our data point being malignant. As you can see, our model was very certain about its benign prediction—it thought that there was a 98% chance of the cells being benign, and only a 1.6% chance of the cells being malignant.
Similarly, let’s look at a prediction for data that’s very much in the malignant region of the graph:
Once again, our model predicts what we would expect—malignant—and it gives it a very high probability of about 92%.
But now let’s see what happens when we try to predict a point in the ambiguous territory where the two graphs overlap, and where it isn’t clear what the cells are:
Our model predicted benign, but looking at the probabilities we can see that the prediction is basically a coin flip. Our model has a default prediction threshold of 0.5, which is why it predicted benign, but we really could have predicted either way.
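The whole logistic regression walkthrough can be sketched like this. The radii of 10, 25, and 15 are our own illustrative picks for the clearly benign, clearly malignant, and ambiguous regions; note that with string labels, scikit-learn orders the probability columns alphabetically, so "benign" comes first.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X = data.data[:, [0]]     # mean radius, as a 2-D array (rows x features)
y = np.where(data.target == 0, "malignant", "benign")

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(model.classes_)     # the order of the probability columns
for radius in [10, 25, 15]:   # clearly benign, clearly malignant, ambiguous
    prediction = model.predict([[radius]])[0]
    probabilities = model.predict_proba([[radius]])[0]
    print(f"mean radius {radius}: {prediction}, probabilities {probabilities}")
```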
Evaluating a Binary Classification Model
This leads us to the question: when we’re dealing with cancer, what probability should we accept for thinking that a tumor is benign? If you had cancer, would you be comfortable knowing that there was still a 48% probability of those cells being malignant?
In this case, when we’re dealing with cancer we might have a good reason to predict cells as malignant if there’s a smaller probability—maybe even 10% or 20%.
In order to help deal with this question, we need to consider the cost of type I errors versus the cost of type II errors (see the Wikipedia article for more on these types of errors). At this point, the metric of “accuracy” might not be all that we need. To differentiate the different types of incorrect predictions we’re making, we might need slightly different metrics like precision, recall, and the F1 score (see Evaluation of binary classifiers). A confusion matrix is a way of presenting your correct and incorrect predictions to help you see how many errors you’re making in what categories.
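As a sketch, here is how those metrics look in scikit-learn for our simple threshold-at-15 baseline, treating "malignant" as the positive class (our own framing, since a missed cancer is the costly error):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

data = load_breast_cancer()
y_true = np.where(data.target == 0, "malignant", "benign")
y_pred = np.where(data.data[:, 0] < 15, "benign", "malignant")

# Rows are true labels, columns are predicted labels
print(confusion_matrix(y_true, y_pred, labels=["benign", "malignant"]))

print("precision:", precision_score(y_true, y_pred, pos_label="malignant"))
print("recall:   ", recall_score(y_true, y_pred, pos_label="malignant"))
print("f1:       ", f1_score(y_true, y_pred, pos_label="malignant"))
```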
There’s a lot to digest in that paragraph that we’re not going to go over right now, but if you’re interested in learning more about classification problems you’ll definitely want to dive deeper there.
Other Classification Models
Finally, before we move on, we should mention some other popular classification models. Some of the models you might hear the most about are tree-based models like random forests and “boosted” models like XGBoost. For classification problems that involve images, video, or language, neural network models have become a great tool. (For example, convolutional neural networks (CNNs) are a powerful model for image classification.)
But even when we’re dealing with neural networks, the fundamental problem that we’re trying to solve in this case is a supervised learning classification problem that uses all of the topics we’ve discussed above! Many neural network models are just new, specialized ways of approaching the same fundamental problem of making predictions. (Of course, there are also neural networks that solve very different types of problems.)
And with that, we’ve covered classification—and we’ve also covered many of the foundational concepts within all of data science and machine learning!
Supervised Learning – Regression
Now let’s move on to another problem so that we can discuss one of the other main areas within supervised learning: regression.
Suppose that we have a dataset with some information about housing prices in Boston.
Here, each row represents an area within Boston. The “median-house-value” column is what we’ll treat as our target, and it represents the median house value of all of the houses in that part of town (in thousands of dollars—so you need to multiply the column by 1,000 to get the true value).
We also have other data, such as the per capita crime rate in that part of town (CRIM), the proportion of buildings built before 1940 (AGE), and the average number of rooms per house (RM).
So let’s ask a question: if we know the average number of rooms per house in a certain part of Boston, how well can we predict the median house value?
First, let’s look at our tried-and-true histogram for each one of these columns separately.
Right away, we see that the average number of rooms looks like it’s somewhere around six. How many rows of data do we have in this dataset, anyway?
Alright, looks like we have 506 rows, with each row representing a different region in Boston.
Here’s the histogram showing the distribution of our median house values:
So the vast majority of our data lies in the range from about $10,000 to $30,000 (this dataset was collected in the 1970s), and there’s a strange-looking concentration of data points right at the $50,000 line that suggests the data might have been capped at $50,000 for some reason. Indeed, if we look at some metadata on the University of Toronto website, we see that this is most likely the case.
Correlation and Pearson’s r
Now that we have a good understanding of each one of these columns, we can ask the main question: what is the relationship between them? To investigate this, a scatter plot is a great tool.
How might we interpret this scatter plot?
Well, let’s say that we have a region where houses have an average of five rooms. What is a good prediction for median house value? It looks like somewhere around $15,000 might be a good prediction.
What about a region where houses have an average of seven rooms? In that case, it looks like many of the regions have a median house value closer to $30,000.
This means that there appears to be a pretty strong positive correlation between average number of rooms and median house value—meaning that as one value increases, the other also tends to increase.
Contrast that with the scatter plot of per capita crime rate versus median house value:
Here it appears that as the crime rate increases, the median house value tends to decrease. (Although that trend has a lot of other things going on, obviously!) This means that crime rate and house value are negatively correlated.
Both of these findings make intuitive sense, so far.
There’s a way that we can quantify the correlation between two variables called the Pearson correlation coefficient, also known as Pearson’s r. This correlation coefficient gets close to 1 if there’s a strong positive correlation, close to -1 if there’s a strong negative correlation, and close to 0 if there’s not a strong correlation.
What is the value of Pearson’s r for both of these graphs?
You can see that the graph on the left with the positive correlation has a Pearson’s r value of 0.7 (fairly close to 1), while the graph on the right has a correlation coefficient value of -0.39 (which is a moderate negative correlation).
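If you want to compute Pearson's r yourself, NumPy makes it a one-liner. Here's a quick sketch with made-up numbers (not the actual Boston data):

```python
import numpy as np

# Hypothetical (rooms, value) pairs, purely for illustration
rooms = np.array([4.9, 5.5, 6.0, 6.4, 7.1, 7.8])
values = np.array([14.0, 17.5, 21.0, 24.0, 30.5, 38.0])  # in thousands

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(rooms, values)[0, 1]
print(round(r, 3))  # close to 1, indicating a strong positive correlation
```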
So let’s go back to considering just our RM column and our median house value column, since it looks like we have a pretty good relationship there. How could we create a model that uses the average number of rooms to predict the median house value?
This is a perfect example of a regression problem, where we’re trying to predict some continuous numeric value like house price.
Regression problems are another type of supervised learning, like classification. But unlike classification where we’re trying to predict a category or label, in regression we’re trying to predict a number like price, weight, temperature, etc.
(Quick note. You may notice that our classification model above was called logistic regression, even though we used it for classification. This can be pretty confusing. The reason it’s called logistic regression is because it technically is modeling a number: the probability of each category. We just take that probability and then use it to classify data as one category or another, making it a powerful classification model, even if what the model is actually doing is learning a probability.)
So—regression is for continuous numbers, and classification is for discrete categories or labels.
How can we create a model for this regression problem?
Remember that a model is simply a way of taking in a value (in this case, average number of rooms) and returning a prediction (the median house value). For regression problems, a simple way to represent this relationship is by a basic linear equation, like the ones you learned in algebra.
If you remember how to calculate slope and intercept for a linear equation, try to come up with a good equation yourself.
We can experiment with a few equations below to see which one seems to fit the data.
Hmm, a slope of two and an intercept of zero doesn’t look right. Let’s increase the slope a bit.
That slope looks closer, but now we’re way above our data. Let’s decrease our intercept to bring our line down.
There we go. It’s not perfect, but it looks good enough for now.
Here’s what our equation means, in English.
We think that you can model the relationship between the average number of rooms and median house value like this—the median house value in a region in Boston is approximately ten times the average number of rooms, minus forty. (…times $1,000, to get the units as individual dollars.) For example, if the average number of rooms in a region is 7.5, then we would predict the median house value in that region to be $35,000.
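Expressed as a tiny Python function, using the hand-fit slope and intercept described above:

```python
def predict_median_value(avg_rooms):
    # Our hand-fit model: slope 10, intercept -40, in thousands of dollars
    return 10 * avg_rooms - 40

print(predict_median_value(7.5))  # 35.0, i.e. $35,000
```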
Regression Model Evaluation Metrics
Our model looks like it approximately fits the data, but it’s obviously not perfect. There are a lot of data points that don’t lie close to the line.
So we may ask, how good is our model? This brings us back to the concept of model evaluation metrics. Our binary classification metrics won’t work here, because those metrics required us to calculate if a prediction was the correct or incorrect label, while now we’re dealing with continuous numbers that can be various degrees of closeness to the actual value.
To help us think about what evaluation metric makes sense for regression problems, let’s consider a single data point like this one in red towards the top of the graph, with an average number of rooms of 7.489 and a median house value of $50,000:
Our model would predict a median house value of $34,890. This is an error of $15,110.
Let’s say we calculated this error value for each data point. This would give us a list of errors, some of them positive and some of them negative, depending on if the model predicted below or above the real value. What we want is a way to combine all of these errors into a single quantitative metric that can be our model evaluation metric.
One way to combine all of these errors is to square each value, and then take the average of all of the squared values. This is called the mean squared error (MSE) and is a very common regression model evaluation metric.
If we were to take the square root of this number, that’s called the root mean squared error (RMSE) and is another very common metric. One reason for preferring the RMSE is that it has the same units as the target. Practically speaking, this means that you can represent your error in a way that’s easier to understand.
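Here's a sketch of both metrics, using the $50,000 data point from above plus two made-up predictions (these are not the post's actual numbers):

```python
import numpy as np

actual = np.array([50.0, 23.0, 31.5])      # true values, in thousands
predicted = np.array([34.89, 21.0, 33.0])  # hypothetical model outputs

errors = predicted - actual   # some negative, some positive
mse = np.mean(errors ** 2)    # mean squared error
rmse = np.sqrt(mse)           # root mean squared error, in the target's units
```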
Let’s split our data into a train and test set, create a function for our current model, use it to make some predictions, and then calculate the RMSE for our test set.
In this case, our RMSE is 6.12. We can interpret this as saying that on average, our predictions are off by about 6.12.
Translating this back to single dollars, this means that our predictions are usually off by about $6,000. This could be $6,000 over the real value, or under the real value—we just know that our error is about six thousand dollars on average.
This gives us a baseline model. We want all of our future models to do better than this—otherwise, we might as well just use this initial baseline model.
Just like with classification problems, most data scientists will use models that already exist. We can use the linear regression model from scikit-learn to automatically find the best linear model for us.
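The fitting step looks roughly like this. We use stand-in training data here; in the post, X would be the RM column of the training split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in (rooms, value) training data, not the real Boston numbers
X_train = np.array([[5.0], [5.5], [6.0], [6.5], [7.0], [7.5]])
y_train = np.array([12.0, 17.0, 21.5, 26.0, 31.0, 35.5])

model = LinearRegression()
model.fit(X_train, y_train)  # finds the best-fit slope and intercept
slope, intercept = model.coef_[0], model.intercept_
```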
Easy enough. What equation did our linear regression model find?
This means our linear regression model thinks that the best equation to model this relationship is this: (median house value) = 8.78 * (average number of rooms) – 32.6. Our original model that we came up with actually wasn’t too far away from this.
What is the RMSE on our test set when we use this model?
Our RMSE is 6.16, which you'll notice is actually slightly greater than our baseline model's. Why might this be?
Well, one reason is that we were looking at all of our data when we created our manual model, which technically violated the important principle of not training your model on data that will also be used to evaluate the model. We were just eyeballing the values, but we did violate that principle.
The linear regression model from scikit-learn was only looking at the training data, however, which is exactly what we want.
These RMSE values are both fairly close to each other, meaning that they perform about the same. But, we should probably prefer the scikit-learn model if we had to choose, to prevent possible overfitting.
Let’s do something we haven’t done yet, which is to use two different features as input to a model. We’ll use RM (rooms) and CRIM (crime rate) to predict median house value.
To understand how we might use two features in a model to predict a single value, let’s visualize the data.
The color of the scatter plot points shows the median house value, where red points are the most expensive houses and blue points are the least expensive. It’s hard to tell exactly what’s going on here, but one main takeaway is that the homes in the most expensive regions often have a lot of rooms and a low crime rate. (Which, once again, matches intuition.)
To use both of these features in a model, we can simply include both columns in our X data and pass this to the scikit-learn model.
This time, we’re going to use cross-validation to determine our model RMSE. When we use cross-validation, we don’t really need to worry about also using a test set. The cross-validation function from scikit-learn can give us the negative mean squared error for five different dataset splits (5-fold cross-validation), and then we can do a little math to get a positive root mean squared error.
So using both RM and CRIM gives us an RMSE of 6.5. What is our RMSE for models that just use one feature or the other, but not both?
We get the best RMSE (the lowest error) when we use both features, but if we had to only use one feature then the average number of rooms gives us the most accurate predictions.
Model Interpretation and Regression Analysis
With machine learning problems, sometimes what you really want is to interpret the results of your model, rather than just get predictions. We might ask: what information can our model give us about how the features are used to predict the target?
To fully answer that question would require learning about regression analysis, which is a classical area of study in statistics. You would need to learn about the assumptions of linear regression, and you would need to learn the nuances of looking at model coefficients. We won’t dive into those right now, but know that they are an important part of machine learning if what you really care about is interpretation.
There are a few other tools that can come in handy when interpreting your model, such as feature importances and partial dependence plots. Feature importance plots help you see which features are having the greatest impact on your target prediction. Partial dependence plots help you see how the features impact the target. For example, in our current dataset the partial dependence plots would show that more rooms results in a higher median house value, while a higher crime rate results in a lower median house value.
Linear regression is a popular regression model, but there are many others—polynomial regression, for example. Many of the most popular regression models have the same names as the classification models, although the internal algorithms differ: random forests, XGBoost, neural networks, etc.
k-Nearest Neighbors and Non-Parametric Models
Now, there’s one model that deserves a special mention because it will help us explain another machine learning concept. The model is k-nearest neighbors (k-NN), and the concept is parametric versus non-parametric models.
So far, our models have needed various equation coefficients in order to work. For example, in linear regression we assumed that we wanted a linear model, and then we had to find the slope and intercept that worked the best. These models are called parametric models, because they require us to find some numbers (coefficients/parameters) to complete the model.
The k-nearest neighbors model, on the other hand, doesn’t require any parameters or coefficients—it is a non-parametric model. The way that it works is simply by looking at the point you want to predict, finding a few points that are the closest to it, and using those points to come up with a prediction for the unknown point.
It’s kind of like using a person’s friends to predict that person’s personality, when you don’t know anything about the person but you do know the friends.
In this case, there aren’t any model parameters that we need to determine—we simply look at the data itself to decide what to predict. This is the idea behind non-parametric models.
Now even though k-NN doesn’t have model parameters, there is still one hyperparameter that we have to choose, which is the number of data points we want to look at (the “k” in k-NN). In the above example, we set k to three.
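A minimal one-dimensional sketch of the k-NN idea, with made-up points and k set to three:

```python
import numpy as np

# Hypothetical known data: average rooms -> median value (thousands)
rooms_known = np.array([5.0, 5.9, 6.1, 6.3, 7.0, 7.6])
values_known = np.array([13.0, 19.0, 21.0, 23.0, 30.0, 36.0])

def knn_predict(new_rooms, k=3):
    # Find the k known points closest to the new point...
    nearest = np.argsort(np.abs(rooms_known - new_rooms))[:k]
    # ...and average their target values as the prediction
    return values_known[nearest].mean()

print(knn_predict(6.2))  # averages the three nearest neighbors
```

Notice there are no learned coefficients anywhere: the "model" is just the stored data plus the hyperparameter k.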
Hyperparameters are essentially configuration options for the model itself that we choose, like knobs that we can use to change how the model operates. The logistic regression models and linear regression models that we used earlier had lots of different hyperparameters that we could have tweaked, but scikit-learn had defaults for all of them that we just accepted for simplicity’s sake.
Unsupervised Learning – Clustering
So far, the machine learning examples we’ve looked at have been types of supervised learning. Let’s look at our first example of unsupervised learning: clustering.
In supervised learning, there was a correct prediction for each row of data, and we were trying to create a model that could make the right predictions. By training our model on this known data and evaluating it using cross-validation, we could then assume that our models would work decently well on future unknown data.
In unsupervised learning, there simply isn’t a correct prediction. Rather than trying to make correct predictions, what we’re trying to do is find meaningful patterns.
- Supervised learning = making correct predictions.
- Unsupervised learning = finding meaningful patterns.
Now, one thing to mention here is that many datasets can be used for both supervised learning problems and unsupervised learning problems. The difference is in what question you’re trying to answer. And the data scientist is the one who frames the question.
For example, let’s load a classic wine dataset.
(Note: the sklearn dataset has “flavonoids” misspelled, so we just renamed the column after loading the data.)
People often use this dataset to learn how to predict the cultivar from the chemical characteristics of the wine, which is a supervised learning problem—and, specifically, a classification problem. (There’s another popular wine dataset that is used to model wine quality, which is a supervised learning regression problem.)
But, we can ignore the cultivar data completely and ask the question: “Can we use the chemical properties of the wines to identify distinct groupings of wines?” If we had used the word “clusters” instead of “groupings”, then we would have been very explicit about framing this as a clustering problem, which is unsupervised learning.
There is no perfectly correct answer to this question, since clustering is inherently asking a pattern-finding question, which differentiates it from supervised learning where we can determine how close our answers were to being right.
Let’s take a look at two properties of the wines, plotting together on a scatter plot.
As the title of the plot suggests, we’re asking the question: are there distinct wine clusters?
With just these two chemical properties plotted, it looks like there might be a couple of distinct groups that might make sense, although it’s hard to tell. This is the challenge of unsupervised learning.
Let’s manually define a couple of clusters to see how they look.
We manually specified these clusters, so we don't currently have a way of quantifying how “good” they might be. But as far as two clusters go, this looks pretty good.
To interpret our findings so far, we might say something like: “We think that these clusters represent two distinct kinds of wine, based on their flavonoid and proline profiles.”
To know whether these clusters had practical implications, we would need some more domain knowledge and might need to go sample some wines. (Hey, data science involves getting your hands dirty with the real world. I would say wine sampling falls in that category.)
Let’s turn to scikit-learn again to find a model that can automatically find good clusters for us. One of the most common clustering models is k-means, where k refers to the number of clusters you want to find.
First, we load the model and fit it with just the two features we’re currently considering:
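Here's a sketch of that step, using two synthetic blobs in place of the real flavonoid/proline data (the scales are chosen to mimic those two features):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups standing in for the wine data
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[1.5, 500], scale=[0.3, 50], size=(50, 2))
group_b = rng.normal(loc=[3.5, 1200], scale=[0.3, 50], size=(50, 2))
X = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one cluster label (0 or 1) per row
```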
The clusters that we get from this model look slightly different from the clusters that we found manually, but are fairly similar. However, there's something not quite right, which we'll address in just a minute. Take a look.
Notice how there are two yellow points way off to the left in an area that looks solidly purple? Why might that be? Looking at the scales of our axes can give us a clue.
Our flavonoids feature has a scale from around 1 to 5, while our proline scale goes from about 400 to 1,600. These two scales are wildly different, which affects our model’s ability to determine which points are close to each other. Our model is currently saying that those yellow points are closer to the yellow cluster, because the x-axis distance over to the yellow cluster is so much smaller than the y-axis distance down to the purple cluster.
Scaling, Standardization, and Normalization
To address this, we can use the technique of scaling or normalizing our data to transform the data to all be in the same range. Once our data has approximately the same range, our model will be more able to determine which points are close to which other points.
You’ll often hear the terms “scale”, “standardize”, and “normalize” used interchangeably, but there are actually subtle differences between them. Whatever term is used, though, the end goal of the process is to bring all of the data into the same smaller, focused range.
Let’s use the scikit-learn preprocessing class called StandardScaler.
You can see how the proline values are scaled down now. Let’s re-train the model with our scaled data.
Now you can see that those points on the left are now getting assigned to the cluster that we think they should be (even though the colors are flipped this time). And, notice the values on the axes: they’re all in roughly the same range, which was the point of scaling our data.
Clustering Evaluation Metrics
Since it’s easy to do so, let’s try experimenting with finding different numbers of clusters.
It’s hard to say which of these might be the best, but creating two clusters still looks like the most natural splitting. Although, if we needed to divide these wines up into five different clusters, the way on the right wouldn’t be a bad way of doing it.
If we want to have a quantitative way of assessing how “good” these clusters are, we can use something like the silhouette score, which is a metric based on how dense each cluster is and how separate it is from other clusters. Let’s look at the silhouette score for each of our four options.
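The loop looks something like this, shown on two well-separated synthetic clusters (so the actual scores will differ from the wine data's):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two clear synthetic clusters standing in for the scaled wine data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, -2], 0.5, size=(50, 2)),
               rng.normal([2, 2], 0.5, size=(50, 2))])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better, max 1
```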
We can see that the Silhouette score starts off around 0.55 (on a scale of -1 to 1, where 1 indicates the best clusters possible), and then slowly decreases as we try to find more clusters. Although there still isn’t a “correct” answer as far as how many clusters is best here, we can use this as additional evidence that two clusters is probably our best choice.
Because this is an important point, let’s emphasize it really quickly.
There is no single “correct” answer in unsupervised learning problems, which is exactly what differentiates these types of problems from supervised learning problems where there is a correct answer.
Anomaly Detection
Now is a good time to mention one popular use of both supervised learning techniques like classification and unsupervised learning techniques like clustering: anomaly detection. Anomaly detection is what it sounds like—detecting which data points might be anomalous, or out of the bounds of what is expected.
Anomaly detection is very important for preventing credit card fraud, for stopping hackers and malware, and for any kind of statistical process control (SPC), just to name a few use cases. Are the gears in a wind turbine making more noise than usual? This might be an anomaly indicating that one of the pieces of equipment is about to go bad—time to send out a technician!
Research and Statistics
Let’s now turn to a different domain of data science: research. Suppose we have a web page where customers can purchase a product, and we have an idea—would adding more customer testimonials increase the number of customers who purchase the product?
We can answer this through a specific form of research called A/B testing (or split testing).
First, we’ll create two different forms of the web page: one with the new customer testimonials, and the current page without them. Then, we’ll randomly split up our web traffic between the two pages—half of the customers will go to the new page, and half of them will go to the old page. One easy way to randomly assign traffic is just to assign one person to the new page, then the next person to the old page, and then repeat, swapping page assignment each time.
Here’s what our data might look like as we start collecting it.
The control group is the group that we’re not changing, while the treatment group is the one where we’re trying new things. The converted column is a 1 if a customer makes a purchase, otherwise it’s a 0. (A conversion is marketing-speak for when a customer takes an action that we want them to take, like purchasing a product or signing up for an email list.)
As we keep collecting data, we’ll slowly start seeing one of three different patterns:
- The new page will start getting more customer purchases than the old page;
- The new page will start getting fewer customer purchases than the old page; or,
- The new page will get about the same number of customer purchases as the old page.
As a first way to analyze this data, let’s simply get the percentage of customers who made a purchase for each page.
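With the data in a pandas DataFrame, this is a one-line groupby. Here's the pattern with a tiny, made-up slice of the conversion log:

```python
import pandas as pd

# Hypothetical toy rows shaped like the post's A/B test data
df = pd.DataFrame({
    "group":     ["control", "treatment", "control",
                  "treatment", "control", "treatment"],
    "converted": [1, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column is exactly the conversion rate
rates = df.groupby("group")["converted"].mean()
```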
So it looks like 12.0% of the customers in our control group made a purchase, while only 11.9% of customers in the treatment group made a purchase. Since the new page has a lower conversion rate than our control page, this would seem to indicate that there’s no reason to switch to using the new page.
But wait, what’s this…you say that the company has already switched to using the new page without waiting for the test results? Oh no! And apparently it takes the web developers quite a bit of work to switch from one version to the other, so they’d rather now make more changes if they don’t have to.
Should we recommend switching the page back to the old version?
Well now we need to consider the question a little more deeply. First, we can consider the idea of statistical significance. We see that there’s a difference between our conversion rates, but it is a very small difference after all—12.0% versus 11.9%. Anytime we’re dealing with data, there’s a lot of random variation in the data that we get. Is this actually a real difference, or just due to a random fluctuation?
The idea of statistical significance lets us ask the question, “What is the probability that I would have seen this result if there actually wasn't any difference between the two groups?” If the probability is small, then there probably is a real difference. This probability is called the p-value.
Let’s calculate the p-value for our data using a function from the scipy.stats Python package.
It looks like we got a p-value of 0.216, which means there's about a 22% probability that we would see results like these even if there weren't actually a difference between the two groups. That's well above the standard cutoffs for p-values, which are usually 10%, 5%, or even 1%, so we can't call this difference statistically significant.
So, we can probably report back to our web developers that they don’t need to change the website back. The old web page and the new web page perform very close to each other, and there’s a chance that the difference we saw is simply due to normal, random fluctuations in data.
The process we went through is what’s known as hypothesis testing, and an A/B test is a kind of randomized controlled trial (RCT) which is the gold standard for research.
There’s also a whole other realm of approaching hypothesis testing called Bayesian statistics, which is very powerful and becoming much more popular. (The way that we’ve been considering is called frequentist statistics.)
Speaking of Bayes: Bayes' theorem comes up quite often in data science. It's good to know what it is and when to apply it.
Now we’ll turn to a topic that should be familiar to most people: recommendation systems (or recommendation engines is another term). Two of the most common examples of recommendation systems are Amazon (for product recommendations based on what you buy) and Netflix (for movie recommendations based on what you watch), but many websites have recommendation systems as part of their functionality—just think about every e-commerce site, news site, or social media site that has a section titled “Recommended For You” or “You Might Also Like”.
The two main types of recommendation systems are content-based systems and collaborative filtering systems.
Content-Based Systems
A content-based system uses the content itself as the way to determine what to recommend. If you're watching a video of a data science presentation in Python, maybe you'd like this other data science tutorial video. If you're reading a news article about global politics, maybe you're interested in other articles about global politics.
One common way of creating a content-based recommendation system is to convert your products into vectors, and then choose a method of calculating distances between vectors.
Transforming text or products or people into vectors of numbers is called embedding, and it’s a major area of study for modern machine learning and neural networks. For example, one simple way to turn text into a vector is to create a term frequency vector, which is essentially just a word count.
One common distance that you can calculate between vectors of real numbers is called cosine similarity, which ranks two vectors as similar if they point in roughly the same direction. Let's calculate the similarities between each pair of these sentences.
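Here's a sketch of that calculation. The first two sentences come from the discussion; the third is our own stand-in, chosen so it shares no words with the first:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sentence 2 is a hypothetical stand-in; the post doesn't show it
sentences = ["hello world",
             "data science is useful in the world",
             "machine learning models"]

# Turn each sentence into a term frequency (word count) vector
vectors = CountVectorizer().fit_transform(sentences)
sim = cosine_similarity(vectors)  # pairwise similarity matrix
```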
Each matrix entry (i, j) corresponds to the similarity between sentence i and sentence j. For example, the value 0.267 is found at (1, 0) and (0, 1), which means that it is the similarity between sentence 0 (“hello world”) and sentence 1 (“data science is useful in the world”). Notice that each sentence’s similarity with itself is 1, which is the maximum similarity value. And notice that sentence 0 and sentence 2 have a similarity score of 0, meaning that they don’t have any words in common.
There’s a whole lot more to analyzing text—in fact, there’s a whole domain of data science called natural language processing (NLP) that intersects closely with the field of linguistics. (Data science is often very interdisciplinary. That’s part of what makes it so interesting and so powerful.)
Collaborative Filtering Systems
A collaborative filtering recommendation system uses a completely different approach from content-based recommendations. Rather than recommending things based on how similar they are, this kind of system recommends things based on what people like who are similar to you.
For example, you know how Amazon has a section telling you what other products people bought when they got what you’re buying? (Such as a “customers also bought” section.) This is the idea behind collaborative filtering. Let’s say Jenny bought a toaster and a spatula, and you’re buying a toaster. A recommendation system might suggest that you might also like a spatula.
The Python package Surprise is a popular choice for creating collaborative filtering recommendation systems in Python.
Most systems these days are hybrid recommendation systems, meaning that they’ll use a combination of different approaches to create the final recommendations that serve the end goal of having customers buy more products and consume more media.
Time Series
One common technical area within data science is dealing with time series data, meaning data where time is a crucial feature. Probably the most common example of time series data is the stock market, where we want to know what's going to happen tomorrow, and we assume that at least part of what happens tomorrow is based on what happened all the days before tomorrow.
Other common examples of problems involving time series data include sales forecasting, demand and utilization forecasting, and failure prediction (such as machine or sensor failure). Specialized neural networks called recurrent neural networks (RNNs) have been playing an important role in time series prediction recently, but the ARIMA model is still very popular as well.
Let’s take a quick look at one way that you might analyze time series data: through decomposition. Decomposition takes a time series graph and attempts to split it up into three parts: the overall directional trend; regular seasonal effects; and residual motion that doesn’t fall into either of the previous two.
First we’ll load some data on CO2 in the atmosphere:
Now we’ll run a statsmodels decomposition algorithm on our data and plot the results:
The initial data that we have looks pretty tidy, which results in a fairly clean decomposition. You can see a very clear directional trend upwards, indicating that the CO2 levels in the atmosphere have been steadily increasing over time.
You can also see a very clear and consistent seasonality, with yearly periods of high CO2 and periods of low CO2. This seasonal component looks like a sine or cosine wave.
Finally, the residual graph shows us that there’s still quite a bit of information that we didn’t capture.
Natural Language Processing (NLP)
Now we’re going to turn to the field of natural language processing, which is an area of data science devoted to analyzing and using text data.
Let’s ask an initial question: is it possible to create a model that can predict the category of a short piece of text by the words that appear in that article? This is a text classification problem.
First, let’s load a dataset of text from a “newsgroups” dataset. (A newsgroup is basically an online discussion forum.) This dataset has about 18,000 newsgroup messages divided up into 20 different categories. (Note that this is also a multiclass classification problem, since we’re going to be dealing with 20 categories that we’re trying to predict.)
We’ll start by just loading the training dataset (so that we can use the test dataset to evaluate our model.)
You can see that we have about 11,000 messages in the training dataset. Here are the 20 different categories that messages can be about:
The first message that we printed out above looks like it might be in the autos category—let’s print out the category of the message to check.
Our first step is to transform all of our articles, which are called documents in natural language processing, into vectors using some kind of document embedding technique. We could use the term frequency approach that we demonstrated above—which is also called the bag of words approach, since we sort of throw all our words into a bin together without regard for word order—or we could use another common technique called term frequency–inverse document frequency (tf-idf) which also takes into account how unique a word is.
Let’s start by using the bag of words / term frequency approach to vectorize all of our documents. (A quote terminology note: splitting up sentences into words is called tokenizing the sentences and is an important step of text vectorization.)
When dealing with text like this, we’ll often remove the most common words—these are called stop words and we don’t gain much value by analyzing them since they’re so common. Above, we told the vectorizer to remove the stop words as it was processing the text.
How many unique words are we dealing with? Let’s look at the size of our vocabulary.
Over 100,000 unique words! Now, that does seem like a lot of unique words considering that there are only somewhere around 170,000 words in use in the English language. Let’s look at 100 words from all throughout our sorted vocabulary:
You see that we have a lot of “words” which are really just numbers or nonsense strings. These words probably won’t help us classify our documents into the 20 different categories, but fully cleaning up this text would take quite a bit of work. And, some of them might actually help us with the classification. So we’ll keep all of the words for right now.
The Naive Bayes model is a good first choice for doing text classification, so let’s load and train this model.
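A minimal sketch of training a multinomial Naive Bayes text classifier with scikit-learn—using a tiny made-up corpus here rather than the real newsgroup messages:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set (stand-in for the vectorized newsgroup messages)
train_docs = ["the engine and wheels", "engine oil and brakes",
              "the game score tonight", "final score of the game"]
train_labels = ["autos", "autos", "sports", "sports"]

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)

model = MultinomialNB()
model.fit(X_train, train_labels)

# Predict the category of a new, unseen document
X_new = vectorizer.transform(["a blown engine"])
print(model.predict(X_new))  # → ['autos']
```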
It looks like we’re currently predicting the correct category for about 67% of the articles. Let’s use a confusion matrix to see where exactly we’re getting things wrong. (This is the same type of confusion matrix that was mentioned previously in the Classification section.)
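Computing a confusion matrix with scikit-learn looks something like this, shown with a tiny made-up example:

```python
from sklearn.metrics import confusion_matrix

y_true = ["autos", "autos", "sports", "hockey", "hockey"]
y_pred = ["autos", "sports", "sports", "hockey", "autos"]

# Rows are true labels, columns are predicted labels, in the order given
cm = confusion_matrix(y_true, y_pred, labels=["autos", "hockey", "sports"])
print(cm)
```

The diagonal counts correct predictions; every off-diagonal cell counts one specific kind of mistake (true label = row, predicted label = column).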
The diagonal of the confusion matrix shows how many times we predicted the right category for each of the 20 categories. All of the other numbers in the confusion matrix show times that we predicted incorrectly. Let’s look for some of the largest numbers off of the diagonal.
Towards the top left, we see the number 57 indicating that we predicted a category of “comp.windows.x” when the category was actually “comp.os.ms-windows.misc”. Those two categories are both about Windows, so it makes sense why our model would confuse them. We also predicted “comp.sys.ibm.pc.hardware” when we should have predicted “comp.os.ms-windows.misc”, and we predicted “comp.graphics” when we should have predicted “comp.os.ms-windows.misc” as well. For some reason, our model is having a hard time predicting “comp.os.ms-windows.misc” correctly.
If we want to see which categories we’re having a hard time predicting, we can use the F1-score to do that. The F1-score is another way of measuring accuracy—it is usually used in cases where you have imbalanced classes, meaning that one category shows up much more often than the others. Now, our classes aren’t imbalanced, as the counts below show:
But, the F1-score is still a good metric to use for multiclass classification because we can easily get a score for each category. Let’s calculate the F1-score for each category to see where we’re doing particularly well and poorly. First, we’ll load the metric from scikit-learn and use it to calculate the F1-score for our validation set, which is similar to a test set but is created from your training data.
Now let’s look at the scores:
The F1-score goes from 0 (not predicting anything correctly) to 1 (predicting everything correctly). It looks like we’re doing a very good job predicting the “sci.med” and “rec.sport.hockey” categories, and a very poor job predicting the “comp.os.ms-windows.misc” and “talk.religion.misc” categories. (If you look back up at our confusion matrix, you should be able to see that the “talk.religion.misc” category gets misclassified most of the time.)
If we wanted to create a single metric from these F1-scores that we could use to evaluate our model, we could average them all together in what is called a macro F1-score.
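With scikit-learn, the per-category and macro F1-scores can be computed like this (shown on a tiny made-up example rather than the newsgroups predictions):

```python
from sklearn.metrics import f1_score

y_true = ["autos", "autos", "sports", "hockey", "hockey", "sports"]
y_pred = ["autos", "sports", "sports", "hockey", "autos", "sports"]

# average=None returns one F1-score per category
per_class = f1_score(y_true, y_pred, average=None,
                     labels=["autos", "hockey", "sports"])

# average="macro" averages those per-category scores into a single number
macro = f1_score(y_true, y_pred, average="macro")

print(per_class)
print(macro)
```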
Really quickly, let’s look at our macro F1-score if we use the tf-idf vectorizer instead:
It looks like this gives us a boost! Looking at the per-category F1-scores again would help us see which categories improved the most from the tf-idf vectorizer, but for now we’ll go ahead and move on.
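Swapping in tf-idf is typically a one-line change, since scikit-learn’s TfidfVectorizer has the same interface as CountVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the engine and wheels", "engine oil and brakes",
        "the game score tonight", "final score of the game"]

# Same interface as CountVectorizer, but entries are weighted by how
# distinctive each word is across the corpus (tf-idf) rather than raw counts
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

print(X.shape)  # (documents, vocabulary size)
```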
What if we have a lot of uncategorized text and we want to find categories for it? We could try a normal clustering approach (like the k-means algorithm we discussed earlier), but a more common approach is to use a technique called topic modeling.
The goal of topic modeling is to find a set of “topics” that can be used to describe documents. For example, if we had a news article about how much chemists get paid, that article might be partially about a “finance” topic and partially about a “chemistry” topic. Topic modeling would ideally be able to find both topics and label this article with both of them.
Let’s try topic modeling with a dataset of news articles and headlines.
We can use pandas to read the data straight from a GitHub repository, which is pretty cool. Then, we drop the “category” column (since we’re going to be determining the topics ourselves) and vectorize the documents.
We’re going to be using the popular latent Dirichlet allocation (LDA) model for topic modeling. With this model, we just need to specify how many distinct topics we want the model to find. We’ll pick 10 topics to start.
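A minimal sketch of fitting scikit-learn’s LDA implementation—on a four-document toy corpus with 2 topics instead of 10:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stocks rose as markets rallied",
        "the team won the game in overtime",
        "investors sold stocks after the report",
        "the coach praised the team after the game"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# n_components is the number of topics to find (2 here; the post uses 10)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

# One topic distribution per document; each row sums to 1
print(doc_topics.shape)
```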
Let’s take a look at some of the attributes of our vectorizer and model.
Our vectorizer vocabulary shows us that we have about 70,000 unique words in this dataset (from about 400,000 news articles or headlines). The LDA model has an attribute called “components_” which essentially tells us how our words relate to our topics. For each of our ten topics, every word is assigned a numeric value indicating how much that word contributes to the topic.
We can look at our top words for each topic by getting the indices for the words with the highest values and then printing those words from our vocabulary.
First, we’ll take a look at the words for topic #0.
Let’s create a function to print out the main words for each topic.
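A sketch of such a helper—“top_words_per_topic” is our own hypothetical name, and the stand-in model below just mimics the `components_` array of a fitted scikit-learn LDA model:

```python
import numpy as np

def top_words_per_topic(model, feature_names, n_top_words=3):
    """Return the n_top_words highest-weighted words for each topic."""
    topics = []
    for weights in model.components_:
        # Indices of the largest weights, in descending order
        top_indices = np.argsort(weights)[::-1][:n_top_words]
        topics.append([feature_names[i] for i in top_indices])
    return topics

# Stand-in object with a components_ array like a fitted LDA model's
class FakeLDA:
    components_ = np.array([[0.1, 3.0, 0.5],
                            [2.0, 0.2, 0.1]])

vocab = ["apple", "stock", "game"]
for i, words in enumerate(top_words_per_topic(FakeLDA(), vocab, n_top_words=2)):
    print(f"Topic #{i}: {' '.join(words)}")
```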
And here are the top words by topic:
You can definitely see some distinct topics coming out of this model. For example, Topic #3 appears to be at least partially about gaming, and topic #9 is partially about the stock market and tech companies like Apple. The topics aren’t completely distinct and some combinations might not make sense, but this looks pretty good for a first attempt.
Now that we have some idea of what topics might exist in our dataset, we can convert each news article or headline into its “distribution” of topics. By using the “.transform()” method in our LDA model, we can turn a document into a vector of 10 numbers representing the 10 topics that we found—the higher the number, the more that article is about that topic.
For example, our first article below appears to be very much about Topic #5 and not really about any of the other topics.
What is this article about?
This short headline does appear to be news about health, which is what topic #5 is partially about. Notice that even though the headline doesn’t use any of the top-10 topic words, it uses words like “longevity” that probably get used quite a bit alongside other words about health. Our model is smart enough to determine which words often get used together and to use all of those words to create a topic.
Let’s create a helper object and function to help us look at the top articles/headlines for each topic.
What are the top headlines for topic #7?
For the most part, these do seem like they align with the top words that we saw for topic #7—words about international news, Obama, and (strangely) box office movies and other entertainment.
Other NLP Tasks and Applications
We’ve really just started scratching the surface of natural language processing. There’s a lot more to word and document embedding that we haven’t covered, there’s machine translation, and there are the generative models like the series of GPT models that can create human-like text with simple prompts. Natural language processing is a huge area of inquiry and is a very developed subset of data science that has benefited greatly from neural network advances over recent years.
Computer Vision
We haven’t yet touched on one of the biggest modern applications of machine learning: computer vision.
Before talking about some of the modern advances, let’s discuss one of the foundational computer vision problems: handwritten digit recognition.
We spend so much time early on in life learning how to recognize letters and numbers that we forget that this isn’t necessarily an easy thing to do. Not only is it challenging for humans to learn at first, but it’s not at all intuitive how to get a computer to recognize numbers and letters either.
If we’re trying to design an algorithm that recognizes letters and numbers where the font is uniform—like a printed page, for example—then we can probably just write an algorithm that recognizes those specific letters. But, we still would need to know exactly what font is being used, how large the letters are, etc. If we want to recognize handwritten digits, though, the task becomes much harder. Imagine creating an algorithm to be able to tell what these numbers are, for example:
So how could we develop a system that is able to recognize handwritten numbers and letters, written by any individual, at any size?
This is where machine learning comes in and why it is such a powerful tool—rather than developing the specific algorithm that determines what these numbers are, we can choose a model and have the model learn how to differentiate numbers.
First, before talking about the models that we might use for digit recognition, let’s talk about how the image data is structured.
We can turn a grayscale or color (RGB) image into a vector (array), much like how we turn text into vectors—only with images, the data is already represented as numbers which makes our job a little bit easier. As a specific example: for grayscale images, you can imagine the image as a grid of numbers, where each number is 0 if that pixel is black, 255 if that pixel is white, and somewhere in-between 0 and 255 for different shades of gray.
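As a concrete (made-up) illustration, here’s a tiny 3×3 “image” as a NumPy array, along with its flattened version:

```python
import numpy as np

# A tiny 3x3 grayscale "image": 0 is black, 255 is white
image = np.array([[  0, 128, 255],
                  [ 64,   0, 192],
                  [255, 255,   0]])

# Concatenate the rows into one long feature vector
flattened = image.flatten()

print(image.shape)      # (3, 3)
print(flattened.shape)  # (9,)
print(flattened)
```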
The second “images” array is how the image data looks originally, and the “data” array above that is the “flattened” version where we take each row of pixels and concatenate them all together into a single big list. We can look at this image using matplotlib:
So this is how our data looks—grayscale images are two-dimensional arrays, and color images are technically three-dimensional arrays, where the third dimension carries the color information. Notice how the 0 values in our array match up with the black pixels in our image, and the higher numbers match up with values closer to white.
Now let’s think about what kind of model we need to take image data as an input and predict what number it is.
Since our input data is just an array of numbers and our target is a category (specifically, the labels of “0” through “9”), we can actually use multiclass classification models that we’ve already discussed such as logistic regression or a random forest classifier. These methods will give you some fairly decent results for simpler problems.
But, models that require us to flatten our image data don’t perform very well for more difficult computer vision problems. This is because the “flattening” procedure actually removes quite a bit of useful information from our data. Specifically, in the original 2D and 3D images we can tell which pixels are close to which other pixels, and which pixels have which colors.
Convolutional Neural Networks (CNNs)
This is where modern neural network models like convolutional neural networks (CNNs) come into the picture. CNNs are able to take the image data as input just as it is in its original 2D or 3D format, retaining all of that helpful spatial information. The end result is much more powerful computer vision models—and in fact, CNNs are at the heart of much modern progress in difficult computer vision tasks. Here is a popular visual depiction of what a CNN does:
Let’s take a look at the fashion MNIST dataset and see how a basic logistic regression model does versus a CNN that’s designed to work with image data. First, we’ll load the data.
Our data is very similar to MNIST where we have 10 different categories, but rather than handwritten digits we’re now dealing with articles of clothing.
This is a simple example of image classification, where you classify an image with a single label of what’s in the image. (The ImageNet dataset is a much harder, much more complex image classification problem.)
First, we’ll train a logistic regression model on the flattened images and see what kind of cross-validation accuracy we get.
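To keep this sketch self-contained we use scikit-learn’s small built-in digits dataset rather than Fashion-MNIST, but the pattern—flatten the images, then cross-validate a logistic regression—is the same:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

digits = load_digits()  # 1,797 8x8 grayscale digit images

# Flatten each 8x8 image into a 64-element feature vector
X = digits.images.reshape(len(digits.images), -1)
y = digits.target

model = LogisticRegression(max_iter=5000)
scores = cross_val_score(model, X, y, cv=3)
print(scores.mean())  # mean cross-validation accuracy
```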
We get 83% accuracy with a basic logistic regression model, which isn’t too shabby. But, you can tell that this is a harder task than the original MNIST dataset we worked with where we were getting over 90% accuracy.
Now let’s try a convolutional neural network from the popular library Keras which is a part of the TensorFlow neural network package. We’ll create a pretty standard, simple architecture for our neural network.
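Here’s a generic small CNN in Keras along those lines—a sketch, not the exact architecture or parameter count from this post:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A fairly standard small CNN for 28x28 grayscale images and 10 classes:
# two convolution + pooling stages, then a dense softmax classifier
model = keras.Sequential([
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.build(input_shape=(None, 28, 28, 1))

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Training would then be a call like `model.fit(X_train, y_train, epochs=10, validation_split=0.1)`, with the images reshaped to `(n, 28, 28, 1)` and scaled to the 0–1 range.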
As you can probably tell, this model is much more complex than a basic logistic regression. With this model, we have 90,978 parameters whose specific values we need to learn in order to produce a model that can make predictions from images of clothing. Let’s train our model now.
After 10 epochs (an epoch is a full processing of the data through the neural network), it looks like our validation accuracy is right around 89%—much better than our logistic regression. We can look at the learning curve to see how well our model is training as it processes the data (and keep an eye out for overfitting, which is a big consideration for neural networks). What we want to see is that our training accuracy and validation accuracy stay fairly close together and don’t diverge too much.
You can see that the training accuracy and validation accuracy start diverging after two to four epochs, and that divergence indicates overfitting (where we’re learning the noise in the data and not learning the underlying patterns anymore). We could implement the popular technique of early stopping to stop training the model after two to four epochs and prevent that overfitting—this is a very common approach with neural networks.
Because of their ability to better capture the information inside of images, neural networks—and more specifically convolutional neural networks—are the current go-to models for computer vision tasks.
Other Computer Vision Tasks and Applications
What are some even harder computer vision tasks that neural networks can help with?
One such task is object detection, where you recognize all of the objects in an image and where they are. In order to specify where an object is, we use bounding boxes.
Face recognition is a popular application of computer vision which has improved dramatically over the last few years and has gotten a lot of press recently, positive and negative.
Some forms of video analysis are essentially just normal image analysis—where you extract frames of the video as still images, and then use those images in your machine learning models—but there are also more complex tasks within video analysis, such as predicting what action is happening in a video.
Types of Data and Databases
We’ve talked about a lot of different types of data so far, but we haven’t actually talked about how to categorize those types of data. We’ll also mention what kinds of databases exist to store each type of data, since storing data is a necessary precursor to all data science.
The main type of data that we’ve dealt with so far is tabular data—or, more simply, data stored in a table. Spreadsheets (like Excel spreadsheets) are one of the most common ways to store tabular data. With tabular data, you have rows, each containing one record or entry about something—for example, each row in your dataset could contain information about one person. You also have columns (or fields), each containing a single attribute or characteristic of your rows. If each row in your dataset is about a person, then perhaps one column stores the person’s height, another column stores the person’s age, etc. Every row has the same columns, and the rows and columns together form the table.
When we load data into pandas DataFrames, we’re using tabular data. When we deal with data in Excel spreadsheets (or Google Sheets), we’re using tabular data. Also, single tables in databases can be thought of as tabular data, although we’ll discuss these databases more in the following paragraphs. The following DataFrame that we used in the classification section is a great example of tabular data.
Relational Data, SQL, and RDBMSs
The next type of data that we’ll discuss is relational data, which is actually an extension of tabular data. With relational data, your data is still stored in tables that have rows and columns—but now, your data is split into many different tables that must be joined together in order to get the full information about your data. Those joins are done using keys (primary and foreign keys), which are the columns that you use to connect different tables.
Here you can see an example database schema, which is the specific way that these tables are designed. The schema shows what information the tables store and how they’re connected using keys. For example, below you can see that the table Contacts is connected to the table Users by the UserID field, which is the key. UserID is called a primary key in the Users table and a foreign key in the Contacts table.
When we’re connecting these tables together using the keys, there are several different ways to do it: left joins, inner joins, outer joins, and (less commonly) right joins and cross joins. An inner join, for example, will connect two tables on the key provided and then return only the rows where that key was found in both tables.
Let’s say we have one table with customers and another table with purchases. An inner join between those tables on the key of customer ID will only return rows where a customer has made a purchase (and leave out all other rows). You can think about an inner join like the intersection of two sets in a Venn diagram.
Similarly, you can think about left joins as returning all of the left table, even if the key isn’t found in the right table. This would be like returning all customers from the customers table, even if those customers haven’t made a purchase. The following image shows an example of that happening, where there are nulls in the columns from the table on the right where the key from the table on the left wasn’t found.
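If you’d like to experiment with these joins, here’s a self-contained sketch using Python’s built-in sqlite3 module (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Ana"), (2, "Ben"), (3, "Cai")])
cur.executemany("INSERT INTO purchases VALUES (?, ?)",
                [(1, 9.99), (1, 4.50), (3, 20.00)])

# INNER JOIN: only customers who have made a purchase
inner = cur.execute("""
    SELECT c.name, p.amount
    FROM customers c
    INNER JOIN purchases p ON c.customer_id = p.customer_id
""").fetchall()

# LEFT JOIN: every customer, with NULL (None) amounts for those
# who have no matching row in the purchases table
left = cur.execute("""
    SELECT c.name, p.amount
    FROM customers c
    LEFT JOIN purchases p ON c.customer_id = p.customer_id
""").fetchall()

print(inner)  # Ben is missing: he has no purchases
print(left)   # Ben appears, paired with None
```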
With the invention of relational data structures also came the invention of a new language used to interact with data stored this way: SQL (which stands for “structured query language”). SQL is used to query data stored in relational tables. (The text on the left in the image above is the SQL code that you would use to do a left join between two tables.)
For example, imagine that we have a database of information about customers. Here’s the query that we can use to get just the rows of data where a person lives in Paris.
And here’s the SQL query we can use to find out how many customers live in each country, using the SQL “GROUP BY” clause.
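Here’s a runnable sketch of both query patterns using Python’s built-in sqlite3 module (with made-up customer data, since we don’t have the W3Schools database locally):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (name TEXT, city TEXT, country TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    ("Ana", "Paris", "France"),
    ("Ben", "Lyon", "France"),
    ("Cai", "Berlin", "Germany"),
    ("Dee", "Paris", "France"),
])

# Filter rows with WHERE: only customers who live in Paris
paris = cur.execute(
    "SELECT name FROM customers WHERE city = 'Paris'").fetchall()

# Count customers per country with GROUP BY
per_country = cur.execute("""
    SELECT country, COUNT(*) FROM customers
    GROUP BY country ORDER BY country
""").fetchall()

print(paris)
print(per_country)  # [('France', 3), ('Germany', 1)]
```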
Now let’s say that we have these database tables with information about customers, orders, products, etc.
Finally, here’s a slightly more complex query that we can use to find out how much each customer has spent in total on products.
If you want to try SQL on your own, all of the SQL screenshots above are from the W3Schools SQL Tryit Editor.
The databases used to store relational data are the most common types of databases in the world; they’re called relational database management systems (RDBMSs). Some of the most popular RDBMSs are PostgreSQL, MySQL, SQL Server, and Oracle. It is also becoming more popular to host databases in the cloud, such as the relational database offerings from Amazon Web Services or Google Cloud Platform. The top three databases below are all RDBMSs.
But, fundamentally, any RDBMS is dealing with relational data that’s stored in tables, and you query that data using SQL. The details differ, but the foundation of relational data stays the same.
Key-Value Stores, JSON, and NoSQL
So far, both tabular data and relational data are stored using rows and columns. What kind of data isn’t stored in tables? Well, key-value stores and key-value databases are probably the next most common way to store data.
If you’re familiar with Python, then a Python dictionary is a great example of a key-value store. You can access your data using keys, and those keys are used to return values, which could contain other nested keys and values. For example, take a look at this example dictionary that stores the Gettysburg Address.
Each key has a value associated with it. When we access the “name” key of the dictionary, we get the value “Gettysburg Address” returned.
But, those values can also be more complex data structures like lists or other dictionaries. For example, the “manuscripts” key returns a list as the value, and the “word_counts” key returns another dictionary, where words are stored as keys and word counts are stored as values.
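Here’s a small dictionary along those lines—the structure mirrors the description above, though the specific word counts are illustrative:

```python
# A key-value record like the one described above; the word counts
# here are illustrative values, not actual counts from the speech
speech = {
    "name": "Gettysburg Address",
    "year": 1863,
    "manuscripts": ["Nicolay copy", "Hay copy", "Bliss copy"],
    "word_counts": {"nation": 5, "people": 3, "dedicated": 4},
}

print(speech["name"])                   # a simple value
print(speech["manuscripts"][0])         # a value that is a list
print(speech["word_counts"]["nation"])  # a value inside a nested dictionary
```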
This nested key-value structure is essentially what the JSON format (JavaScript Object Notation) represents as text. JSON data is used all over the place, especially in anything to do with web apps and APIs. Data is often sent across the web using JSON, and data is received back as JSON. For example, if you want to use the AWS Comprehend text processing service, you can send AWS your text using JSON and get the results of the text analysis back as JSON. The GitHub repo https://github.com/public-apis/public-apis has a massive list of APIs that you can use to get data about all kinds of different topics, and many or most of these are going to use JSON.
If you’re looking for a database to store data in a key-value format, you’re going to be looking for NoSQL databases, so named because they don’t use the query language SQL (since the structure of the data falls outside of what SQL was designed to handle). One of the most popular NoSQL databases is MongoDB, which stores data in a JSON-like format.
(The data stores Redis and Memcached are also key-value based, but these are in-memory data stores that are used for different purposes than most of the databases that data scientists use.)
You can also store large amounts of unstructured data—like text data—using tabular, relational, or key-value databases, although key-value data stores (sometimes also called document stores or document databases) are usually used for this. Once text is vectorized and processed, though, a different data structure is frequently used—such as a columnar database, which stores data as columns rather than rows and is more efficient for processing large amounts of vector data.
Data Lakes and Data Warehouses
Let’s talk about data lakes and data warehouses now, which are terms you’ll hear frequently and can sometimes get confused.
A data warehouse is a type of database that is used for storing data specifically for analytical purposes. Many databases are designed for transactions, meaning lots of quick reads and writes as data gets accessed and updated—and these databases are sometimes called transactional databases. Data warehouses, on the other hand, are designed specifically for storing a large amount of data for the purpose of analysis, which requires a very different type of database architecture. Data warehouses are most often relational databases, and the table schema design is often a star schema (called that because the schema roughly looks like a star with many tables pointing away from a single central table).
A data lake, on the other hand, usually refers to a large amount of unstructured data stored somewhere awaiting processing and analysis. Unlike data warehouses and other relational databases, this data isn’t structured yet and needs some amount of processing (potentially significant processing) before it can be used for analysis or other purposes. Data lakes are often used to store large amounts of data where the specific purpose of the data hasn’t yet been defined.
You could roughly think about a data lake as being like a bunch of files on your computer in different folders, where each file contains some data that may or may not be structured.
Graph Data
The last type of data that we’ll discuss for now is graph data, which has become increasingly important over the last few decades and has started making its way into machine learning. A graph is a set of things (nodes or vertices) connected by relationships (edges).
The Internet is a graph where web pages are the nodes and links between pages are the edges. Social networks (including virtual social networks like Facebook) are graphs where people are the nodes and the relationships between them are the edges. Transportation infrastructure is a graph where cities and terminals are nodes and the roads between them are the edges. Below is a graph where Wikipedia language versions are the nodes (vertices) and the edges represent the editors or contributors to Wikipedia.
Graph databases like Neo4j have made it much easier to work with graph data at scale, including running typical graph algorithms like path-finding and community detection. Below is an example of a Neo4j graph showing relationships between people and organizations.
As you can see, graph data results in very intuitive visualizations where certain aspects of the network can be fairly easily discerned just from looking at the graph. Much of reality has a graph-like structure, so graph data can be very powerful in many different domains.
Machine learning techniques for graphs are still developing, but node embedding and graph embedding are two current techniques, and graph neural networks (GNNs) are being developed as well.
A Brief Overview of Other Data Science Topics
And with that, we’ve covered a lot of data science fundamentals—from types of data and databases, to basic data analysis, to advanced neural networks used for image recognition.
There’s still a lot more to data science, though—we’ve barely scratched the surface—so now is the part of the article where we look at a plethora of other applications of data science, machine learning, neural networks, and artificial intelligence. Since there are so many applications and areas of speciality, we’ll just spend a short amount of time on each.
Recurrent Neural Networks (RNNs)
Recurrent neural networks (RNNs) have been used for data that has some kind of time element (temporal data). This could be classic time series data like sales data over time, but RNNs have also been used extensively for natural language processing since words occur sequentially in a temporal sequence. Long short-term memory (LSTM) models are a popular type of RNN that are commonly used for natural language processing (NLP) problems.
Transformers and Attention
While we’re talking about neural networks used for NLP, we should mention Transformer models, which have once again revolutionized this space. Transformer models use a technique called attention, which was designed based on how humans pay attention to different parts of sequences (like words in sentences) as we process them. Transformer models can be trained much faster than models like LSTMs and have been replacing LSTMs in much NLP work. The models in the GPT series are Transformers that can be used to generate text based on short bits of input. Models that are primarily used to generate text are called generative models, since the primary goal isn’t necessarily prediction.
Reinforcement Learning
Neural networks of all kinds are used in the field of reinforcement learning, where the goal is to train agents (independently acting entities) to achieve rewards (like points in a game) in various environments (like a game). For example, reinforcement learning could be used to train an agent to play the game Pac-Man. Reinforcement models still take data as input and return something as output—but in this case, the input is the state of the environment, and the output is an action. For example, if the Pac-Man game state is that there’s only a single dot left on the map, and it’s to the right of the Pac-Man character, then the desired action the model should output would be “go to the right”. Some of the most popular examples of reinforcement learning have been achieved by Google DeepMind, such as when DeepMind’s AlphaGo program beat one of the world’s best Go players in 2016. Reinforcement learning has also been used to significantly reduce the air conditioning costs in large data centers.
While we’re discussing reinforcement learning, DeepMind, and neural networks, it seems like a good time to mention artificial intelligence (AI). Although the specific goals of AI differ based on who you talk to, the general idea behind AI is to create computer programs (and robots) that can accomplish tasks and goals much in the same way that humans are able to, through learning and acting in complex environments to accomplish meaningful goals.
Many people view current machine learning accomplishments (like computer vision) as narrow AI, meaning algorithms that are able to accomplish specific tasks but aren’t able to accomplish many different tasks of different types. AI that seeks to be able to accomplish many different types of tasks and goals is called strong AI or artificial general intelligence (AGI). Research into AGI spans a huge range of scientific fields: mathematics, computer science, computational neuroscience, linguistics, philosophy, psychology, information theory, robotics, and more. The work being done on AI is just as diverse as the fields that are drawn from.
This seems like a good time to bring up one of the current holy grails of machine learning and AI: self-driving cars.
Self-driving cars are an attractive target for the field of data science and machine learning for several different reasons. First, driving a car is something that humans do every day with a fairly high degree of success. Second, driving a car requires navigating a multitude of complex situations, governed by a mixture of well-defined and ill-defined rules, that demand deep knowledge of the real world (imagine driving a car through a construction zone at night in the rain, for example). And third, humans currently spend an exorbitant amount of time driving. According to one study, for example, the average American spends almost 300 hours driving per year. What if that time were freed up to do other things? And, what if people who currently drive for a living could be free to find other jobs? There’s a lot of productivity to be unlocked and—frankly—a lot of stress to be reduced by creating a system that can drive a car by itself. Imagine a world where a tired person can simply sleep in their car while it drives home, rather than forcing themselves to stay awake and potentially cause an accident.
Self-driving cars really got their first major boost during the DARPA Grand Challenge series held by the United States’ Defense Advanced Research Projects Agency (DARPA).
Since then, the company Waymo (a subsidiary of Google/Alphabet) has been one of the main entities pursuing full self-driving capabilities, which it has been working on for about the last decade. Tesla has also made headlines in recent years for its pursuit of partial and full self-driving capabilities. As of November 2020, the full self-driving beta had been released to a small number of Tesla customers.
Some of the main technologies used in self-driving car technology are the sensors that detect the environment in front of and around the car: regular camera sensors; radar sensors; and lidar sensors (which use lasers to accurately measure depth of field).
Optimization and Gradient Descent
Let’s turn away from the high-level topics we’ve been discussing and return to some more specific fields of data science.
The area of optimization is a massively important area for data science, because optimization algorithms are the reason why machine learning models can be trained using data in the first place. Typically, every model that we train with data will have a cost function (also called a loss function or an objective function), which is essentially a function that can tell us how bad our current model is doing. The goal, then, is to reduce the value of the cost function to as low as we can get it—a local or global minimum of the function.
For example, if we’re fitting a linear regression model to some data (like in the plot below), our cost function might be the mean squared error (MSE).
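As a quick sketch of what that cost function looks like in code (using NumPy and some made-up data points, just for illustration):

```python
import numpy as np

# Hypothetical data that roughly follows the line y = 2x
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    return np.mean((y_true - y_pred) ** 2)

# Predictions from a candidate model y = 2x — the lower the MSE,
# the better this line fits the data
y_pred = 2.0 * X
print(mse(y, y_pred))
```

A perfect fit would give an MSE of exactly 0; the optimization algorithms below are what search for the parameters that drive it as low as possible.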
With certain mathematical equations (models), the minimum of the function can be calculated directly from the equation itself—this is called the analytical solution to the problem. However, with pretty much all machine learning that we care about using these days, the analytical solution is either too hard, too slow, or just plain impossible. This is where optimization comes in, allowing us to find solutions through a series of approximations. This process is sometimes called finding the numerical solution.
The fundamental optimization algorithm in machine learning is called gradient descent, and pretty much all modern machine learning optimization algorithms are based on it.
The basic method of gradient descent looks like this:

1. Pick some random parameters for our model, just to have a beginning model to work with. (It will probably be a terrible model.)
2. “Look around” at the current parameters and see in which direction we need to tweak them to make the cost function lower (meaning a less bad model—a better model). The way we do this is to use calculus to find the gradient of the cost function.
3. Tweak the model parameters a small amount in the direction we just found.
4. Repeat steps 2 and 3 until the tweaks we’re making to our model parameters are very, very small (meaning that we’ve hit a minimum of the cost function).
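Those steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation—it fits a simple line y = m*x + b to made-up data generated from y ≈ 3x + 2, using the MSE as the cost function:

```python
import numpy as np

# Hypothetical data generated from y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 0.5, 100)

m, b = 0.0, 0.0  # step 1: arbitrary starting parameters (a terrible model)
lr = 0.01        # learning rate: how big each tweak is

for _ in range(5000):
    error = (m * x + b) - y          # current predictions vs. reality
    grad_m = 2 * np.mean(error * x)  # step 2: gradient of the MSE cost
    grad_b = 2 * np.mean(error)      #         with respect to m and b
    m -= lr * grad_m                 # step 3: tweak downhill
    b -= lr * grad_b                 # (the loop is step 4: repeat)

print(m, b)  # should end up close to 3 and 2
```

The learning rate is a hyperparameter: too small and training crawls, too large and the tweaks overshoot the minimum and the parameters diverge.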
The specific area of convex optimization is one of the most popular subfields within optimization, and stochastic gradient descent is a popular specific type of gradient descent. Adam is a popular modern optimization algorithm. When we’re using neural networks, the algorithm for backpropagation is one of the most important elements of optimization.
Dimensionality reduction is another area that we haven’t discussed yet. Here, the main question we’re asking is this: can we retain most of the information inside of our data while reducing the number of features? This can be thought of as being very closely related to the concept of compression, such as compressing a file on your computer to retain the same information but take up much less space. Dimensionality reduction can help address the curse of dimensionality, which is a problem that arises when you have too many features (dimensions) and not enough data points. Dimensionality reduction can also help with data visualization (compressing 10 dimensions into 2 dimensions to visualize on a simple graph, for example), noise reduction, and specific fields like bioinformatics.
Principal component analysis (PCA) is one of the most common dimensionality reduction techniques, although t-SNE is a popular choice for data visualization applications specifically. For example, below is a t-SNE visualization of the MNIST handwritten digits dataset, compressed down to just two dimensions and plotted by digit.
Another type of dimensionality reduction is feature selection, where we just choose which features we want to keep and which we want to get rid of. Feature selection can be thought of as a part of data cleaning and feature engineering, two important elements of preparing data for modeling.
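Feature selection can also be automated. As one hedged example (among many possible approaches), scikit-learn’s `SelectKBest` scores each feature against the target and keeps only the top k:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 instances, 4 features

# Keep only the 2 features most strongly associated with the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, X_selected.shape)  # (150, 4) (150, 2)
print(selector.get_support())     # boolean mask of which features were kept
```

Unlike PCA, which creates new combined features, feature selection keeps a subset of the original features intact—which often makes the resulting model easier to interpret.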
Audio analysis is the data science subfield that involves working with audio data. Some common applications here are speech recognition (voice-to-text), speech synthesis (text-to-voice), noise removal, and audio classification of many different types. Data science for audio data uses a whole new set of terminology such as amplitude, frequency, power, etc., and intersects very closely with the physics of sound. One of the most popular algorithms used when processing audio data is the Fourier transform (FT), which decomposes an audio signal into its respective frequencies.
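To make the Fourier transform concrete, here’s a small sketch using NumPy’s FFT on a synthetic “audio” signal (a made-up one-second recording containing a 50 Hz tone plus a quieter 120 Hz tone):

```python
import numpy as np

# One second of signal sampled at 1000 Hz: a 50 Hz tone
# plus a quieter 120 Hz tone
sample_rate = 1000
t = np.arange(0, 1, 1 / sample_rate)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# The (fast) Fourier transform decomposes the signal into frequencies
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The two strongest frequency components should be our two tones
strongest = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(sorted(strongest))  # [50.0, 120.0]
```

Real audio work builds on exactly this idea: speech recognition pipelines, for example, typically start by converting raw waveforms into frequency-domain representations like spectrograms.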
Conclusion and Next Steps
Finally, I want to conclude by saying that this post is really just the beginning—a jumping-off point to further inquiry. For example, we didn’t discuss genomics at all, a modern hotspot of data science. We also didn’t discuss genetic algorithms, which have their own dedicated base of researchers working on them. There are thousands of other areas out there like these for you to discover, if you wish. And even the topics we did cover, we covered only very briefly.
So where do you go from here?
The only way to truly learn something is to try it yourself. I wouldn’t consider someone to have really learned a topic within data science until they had done a project with that topic themselves—or maybe three projects.
So if you want to learn data science, pick something to learn and dive into it fully. Get obsessed. This article is a great resource for building the breadth of your knowledge, but now it’s time for you to figure out where you want to go deeper.
And I believe the best way to learn is to pick a project, and then learn everything that you need to do to complete that project. Progress happens when people solve problems, and problems often look like projects. Pick a project that you care about or are really interested in, and go for it.
And if you still need some guidance on where to start, try this Project Data Science course bundle: Introduction to Practical Data Science in Python – Course Bundle.
Whatever you pick, dive in fully and let yourself get absorbed in the material. Learning data science can be a life-long journey if you let it be, and true mastery requires focused work without distraction for extended periods of time.
Best of luck, and happy learning!
Appendix A: Data Science Flowchart
Here’s a guide to what kind of data science you need for different types of problems. It also serves as a useful reminder of what kinds of data science are out there.
- If you have a well-defined and known outcome you want to model or predict, you probably need supervised learning.
- If that outcome is a discrete category (like True/False, or like a type of animal in a picture), then you probably want a classification model.
- If that outcome is a continuous number (like weight or house price), then you probably want a regression model.
- If you’re looking for patterns in data where there isn’t a known right answer, you probably need unsupervised learning.
- If you want to identify distinct groups, you might try clustering.
- If you’re trying to find outliers in data, then that’s the goal of anomaly detection.
- If your problem can be described as some entity or actor trying to achieve a reward or goal, then you might have a reinforcement learning problem.
- If you’re dealing with text, then you’re in the realm of natural language processing.
- If you’re looking for patterns in text, topic modeling is a good choice.
- If you’re specifically looking at data over time, you probably have a time series problem.
- If you’re dealing with a lot of data or special kinds of data, you might be in the realm of neural networks and deep learning.
- If you have image data, you probably want to try convolutional neural networks (CNNs).
- If you have sequential data like text or time series data, you might want to try recurrent neural networks (RNNs) like LSTMs, or you may want to check out Transformers.
- If you have data that is most naturally structured as a graph, then you should try graph theory approaches and graph databases.
- If you need to understand or create the algorithms that actually train a machine learning model, then you’re dealing with the field of optimization.
- And if you have another problem… then you’re on your own! Good luck!
Appendix B: Data Science Glossary
- A/B test—See: split test
- accuracy—The number of correct predictions divided by the total number of predictions.
- ARIMA—A popular time series model.
- attribute (data)—See: feature
- bag of words—An approach to NLP tasks that treats documents without regard for the ordering of the words. For example, creating a term frequency vector is using a bag of words approach. This term is often used synonymously with term frequency vectors. Also known as: term frequency vector.
- baseline model—A simple first model that is used to understand what kind of results can be expected from the problem and how hard the problem is, as well as to establish a lowest evaluation metric score that all subsequent models should beat. See also: model, predictive modeling.
- Bayes’ theorem—A foundational theorem in probability and statistics that shows how to update beliefs in the face of new evidence (and how to assess the probability of unknown events). The formula for Bayes’ theorem is: P(A|B) = P(B|A) * P(A) / P(B). See also: Bayesian statistics.
- Bayesian statistics—A method of statistical analysis that relies heavily on Bayes’ theorem, where probabilities are expressed by full distributions (rather than point estimates) and the analyst’s assumptions about distributions are made explicit before data is collected. Contrasted with frequentist statistics. See also: frequentist statistics, Bayes’ theorem.
- bias-variance tradeoff—Describes the tension between making a model more flexible (but prone to overfitting) vs. less flexible (but prone to underfitting). See also: overfitting, underfitting.
- boolean masks—An array of True and False values which is used to filter other data. A common example is to use a NumPy array or pandas Series as a boolean mask to filter down other data (such as another NumPy array or a pandas DataFrame).
- class (data)—In classification problems, a class is one category from the target categories. Example: if the two categories (or labels) you could predict are “malignant” and “benign”, then those are your two classes. Also known as: label.
- classification—A supervised learning task where the goal is to correctly predict which class most accurately describes an instance of data. When there are two possible classes to predict, it’s known as binary classification, and otherwise it’s known as multiclass classification. See also: regression.
- clustering—An unsupervised learning method of identifying distinct groups of data points from unlabeled data. See also: k-means.
- collaborative filtering—A type of recommendation system that makes recommendations based on comparing the ratings and actions of people who have similar behavior. Example: recommending movies that a user hasn’t seen yet, but that similar users have enjoyed. See also: recommendation system, content-based recommendation system.
- computer vision—The problem of getting computers to use visual data effectively. In the realm of machine learning, includes topics such as image classification and object detection. Convolutional neural networks (CNNs) are common machine learning tools for computer vision. See also: image classification, object detection, convolutional neural networks (CNNs).
- content-based recommendation system—A type of recommendation system that makes recommendations based on the actual content of the things being recommended. Example: recommending other movies about pirates if someone watches a movie about pirates. See also: recommendation system, collaborative filtering.
- control group—In a randomized controlled trial (RCT), this is the group where no experimental changes are made.
- convolutional neural networks (CNNs)—A type of neural network model created specifically for computer vision problems.
- corpus—The full set of documents for an NLP task. A corpus is composed of many documents. See also: documents, tokens.
- database—A collection of organized data stored on a computer, typically in special software that is designed for storing, updating, and accessing data. See also: RDBMSs.
- deep learning—The use of neural networks with many layers of neurons. See also: neural network.
- dependent variable—See: target
- descriptive statistics—A summary statistic that helps to summarize a dataset or feature. Examples: mean, median, standard deviation, count.
- digit recognition—A common image classification task: developing a model that can determine which digit (0–9) is shown in an image. See also: image classification.
- dimensionality reduction—The task of reducing the number of features in a dataset while losing as little information as possible.
- distribution—The distribution of a feature shows how many instances of that feature fall into a certain range or category.
- documents—Whole pieces of text that are analyzed as a single unit. Examples: in an analysis of tweets, a single tweet would be a document. See also: tokens, corpus.
- evaluation metric—A statistic that is used to determine how well a model is performing, usually on predictive tasks. Examples: accuracy, RMSE, F1-score
- F1 score—A model evaluation metric for classification that combines precision and recall. In the case of multiclass classification, the macro F1-score is often used to combine F1-scores for all classes.
- feature—An input to our model. In tabular data, this is a column. Features are usually represented by an uppercase X. Also known as: attribute, independent variable, exogenous variable.
- feature engineering—The process of creating new features from existing features in order to create a better model.
- forecasting (sales, demand and utilization)—The task of using time series data to make predictions about the future values of the time series.
- frequency—The number of times a value shows up. See also: proportion, percentage.
- frequentist statistics—A method of statistical analysis that views probabilities as relying on the frequencies of observations, and where probabilities are often expressed as point estimates (rather than distributions). Contrasted with Bayesian statistics. See also: Bayesian statistics.
- histogram—A histogram shows the distribution of numeric data by binning (or bucketing) values into ranges and plotting those ranges as bars in a bar chart.
- hyperparameter—A setting of the model or training process that is chosen by the data scientist (rather than learned from the data) in order to get the model to achieve the desired results. Examples: a model’s learning rate, or the “k” in k-NN or k-means.
- image classification—The computer vision task of developing a model capable of determining what is present in an image. A type of classification problem using images. See also: object detection, convolutional neural networks (CNNs), computer vision, classification.
- imbalanced classes—When the classes in your target are not present in roughly equal proportions. For example, in fraud detection there are many fewer fraudulent transactions than legitimate ones, which means that the classes are severely imbalanced for machine learning modeling tasks.
- independent variable—See: feature
- instances—Individual data points. In tabular data, these are rows. Also known as: samples.
- Jupyter notebooks—A popular coding notebook format where you can both (a) write Python code, and (b) write rich text using Markdown.
- k-means—A popular clustering algorithm. See also: clustering.
- k-nearest neighbors (k-NN)—A popular non-parametric model for regression and classification. Makes predictions by identifying the closest known data points and predicting an aggregate of their target values.
- Keras (subpackage)—A high-level neural network API that’s part of the TensorFlow package. A popular choice for building and training neural networks. See also: TensorFlow (package), neural network.
- label—See: class
- latent Dirichlet allocation (LDA)—A common topic modeling algorithm. See also: topic modeling.
- linear model—See: linear regression
- linear regression—A model of the form y = b0*x0 + b1*x1 + b2*x2 + …, where each feature is multiplied by some constant coefficient (including potentially zero) and summed to create the output. See also: linear model.
- logistic regression—A basic type of classification model for supervised learning problems.
- machine learning—Modeling where the goal is to have an algorithm or model use data to learn how to achieve better performance on a task.
- Matplotlib (python package)—The foundational data visualization library in Python.
- mean—The average of a set of data.
- mean squared error (MSE)—The average of the squared errors, where the error is the distance from the predicted value to the actual value. A common regression model evaluation metric. See also: root mean squared error (RMSE), evaluation metric.
- MNIST—The most popular digit recognition dataset.
- model—An equation that takes features as input and returns an output, and is ideally useful for helping to describe or predict reality in some sense. See also: predictive modeling, baseline model.
- modeling—The process of creating a model from data, including choosing the model, gathering the data, interpreting the model, etc.
- naive Bayes—A common model for text classification tasks. See also: text classification.
- natural language processing (NLP)—Any data science tasks or machine learning models that deal with language, often in the form of written text.
- negative correlation—When two variables tend to move in opposite directions with each other: one variable increases while the other decreases, and vice versa.
- neural network—A type of model inspired in part by how neurons in the brain fire. A very popular type of model for complex machine learning problems, especially problems involving computer vision, audio, language, or reinforcement learning.
- NumPy (Python package)—The most popular Python numerical computation package. NumPy is a foundational package used by many other Python packages, including pandas and scikit-learn.
- overfitting—What happens when a model starts learning the noise in the data that won’t generalize to future unseen data. Characterized by a large gap between training and validation errors. See also: underfitting, bias-variance tradeoff.
- pandas (python package)—A popular data loading and manipulation library in Python. The main objects are the DataFrame and the Series.
- parameter (modeling)—A variable in a model that can be learned from the data and that influences the model’s output. Example: in the linear model y = m*x + b, both m and b are parameters that need to be set (or learned) in order to have an equation that can make predictions. Also known as: coefficient.
- percentage—Percentage is defined as the proportion times 100. Example: a proportion of 0.5 is 50%.
- positive correlation—When two variables tend to increase together and decrease together.
- predictive modeling—Modeling where the goal is predictive accuracy. This is often the type of modeling used in machine learning. See also: model, baseline model.
- principal component analysis (PCA)—A common dimensionality reduction technique.
- proportion—The ratio of one item’s portion compared to the whole. A percentage is a type of proportion. Examples: 1/3, or 33%, or 0.33. See also: percentage, frequency.
- Python—The most popular data science programming language. A general-purpose programming language.
- random forest—A popular model for classification and regression problems. Based on decision tree models.
- randomized controlled trial (RCT)—A type of research used to determine if some kind of experimental intervention has an effect. Data is randomly assigned to either a control group or a treatment group, and then the outcome of the treatment group is compared against the outcome of the control group to see if there’s a difference. See also: control group, treatment group.
- recommendation engine—See: recommendation system
- recommendation system—A machine learning system that recommends a small number of items for a user from a much larger list of options. Examples: the most common example might be Netflix, which recommends movies and shows to users out of a very large catalog. See also: collaborative filtering, content-based recommendation system.
- recurrent neural networks (RNNs)—A type of neural network designed for sequential data, such as time series and text. See also: long short-term memory (LSTM) model.
- regression—A supervised learning task where the goal is to correctly predict a continuous numeric output that most accurately describes an instance of data. See also: classification.
- root mean squared error (RMSE)—The square root of the mean squared error (MSE). One benefit the RMSE has over the MSE is that the RMSE is in the same units as the variable being predicted. A common regression model evaluation metric. See also: mean squared error (MSE), evaluation metric.
- samples—See: instances
- scatter plot—A type of data visualization where one feature is plotted on one axis, another feature is plotted on another axis, and the instances (data points) are plotted as individual points on the graph.
- scipy (Python package)—A popular Python package for all kinds of scientific applications. The scipy.stats subpackage is very popular in machine learning.
- split test—A type of randomized controlled trial (RCT), often conducted on web pages or emails, to see which version is best out of a number of options. Also known as: A/B test. See also: randomized controlled trial (RCT).
- statsmodels (Python package)—A popular statistics package in Python for data science and machine learning.
- stop words—Extremely common words that are often excluded in an NLP analysis.
- supervised learning—A type of machine learning problem where the goal is to accurately predict a class (classification) or a numeric value (regression). See also: unsupervised learning.
- t-SNE—A common dimensionality reduction technique specifically for visualizing data in two dimensions.
- target—The output of our model in supervised learning—the thing we want to model or predict. The target is usually represented by a lowercase y. Also known as: dependent variable, endogenous variable. See also: class, label.
- TensorFlow (package)—A popular neural network package. See also: Keras (subpackage), neural network.
- term frequency vector—A vector showing the count of each word present in the document, where the columns are all of the words in the vocabulary. Also known as: bag of words. See also: term frequency–inverse document frequency (tf-idf) vector.
- term frequency–inverse document frequency (tf-idf) vector—A vector where each value indicates not only how many times a word shows up in a document, but also how special that word is overall in the corpus. A common alternative to term frequency vectors. See also: term frequency vector.
- text classification—The task of classifying text as one of a number of categories. Text classification is a type of classification specifically focused on NLP tasks. See also: classification, naive Bayes.
- time series—Data sequenced by time. Examples: stock market data, regional temperature, sensor data.
- time series decomposition—The method of taking time series data and decomposing it into three separate series: overall trend, seasonal trends, and residuals. The three decomposed series combine to form the original time series data.
- tokenizing—The process of taking documents and turning them into tokens (individual words).
- tokens—Individual words that are split out of documents. A single document gets split into many tokens. See also: documents, corpus.
- topic modeling—The process of identifying distinct topics in a text corpus. See also: latent Dirichlet allocation (LDA).
- treatment group—In a randomized controlled trial (RCT), this is the group where some experimental changes are made in order to see if there’s a positive effect when compared to the control group.
- underfitting—What happens when a model isn’t learning the underlying pattern in the data. Characterized by very high training and validation errors. See also: overfitting, bias-variance tradeoff.
- unsupervised learning—A type of machine learning problem where there is no ground truth target to predict. One of the main goals of unsupervised learning is finding patterns in unlabeled data. See also: supervised learning.
- vectorization—Most commonly, the process of taking a text document and converting it into a numeric vector for the purpose of using it in machine learning models. See also: term frequency vector, term frequency–inverse document frequency (tf-idf) vector.
- vocabulary—All of the unique words in a corpus for an NLP task.
- X—The usual variable for the features. See also: feature.
- XGBoost—A popular model for classification and regression problems. In the class of models known as “boosted” models.
- y—The usual variable for the target. See also: target.