This is the first installment of a new series on supervised classification. If you’ve got a labeled training set with multiple classes, we’re going to figure out how to predict those classes. For these tutorials we will be using scikit-learn, so we will loosely follow their algorithm cheat-sheet, focusing of course on the classification quadrant. The difference is that this won’t just be a link to documentation: I’ll briefly introduce the mathematical basis of each classifier and focus on the practical usage of each algorithm. Along the way we’ll also learn some preprocessing, data cleaning, data imputation, and things to look for when training a classifier. For this post I want to briefly introduce some preprocessing steps that will come in handy and that I’ll reference in later posts.
Data preprocessing and cleaning can often be the most challenging part of a problem. I could spend an entire series on these techniques (and maybe later I will), but for now let’s make sure we’re all on the same page with the basics. The most common problems I run into are missing or incorrect data and class imbalance. The first step I always take is to look at the reliability of each feature individually, as this can inform how much data cleaning is needed. Perhaps the features you’re interested in don’t need any cleaning at all! Don’t count on it… So in this post I want to cover methods to deal with:
- Missing Data
- Class imbalance
- Getting zero mean and unit variance features
- One-Hot Encoding
- Text Label Encoding
- Data visualization
Taking your time on these steps and LOOKING AT YOUR DATA along the way will greatly improve your results. Did you notice that LOOK AT YOUR DATA is capitalized? That’s because it is very important. Before we get started, make sure you’re set up for doing data science. You can follow my getting started guide and how I configure my environment here.
Missing Data
For these tutorials we will be using the titanic dataset from seaborn. You can load it with:
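A minimal sketch of the loading step, assuming seaborn’s `load_dataset` helper (it downloads the data on first use, so it needs a network connection the first time). The wrapper name `load_titanic` is just my illustration, not something seaborn provides:

```python
import seaborn as sns


def load_titanic():
    """Return seaborn's titanic example dataset as a pandas DataFrame.

    load_titanic is a hypothetical convenience wrapper; the actual work is
    done by seaborn.load_dataset, which fetches and caches the data.
    """
    return sns.load_dataset("titanic")
```

Calling `titanic = load_titanic()` gives the DataFrame used in the rest of this post.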
You can quickly take a look at the data with titanic.describe(). We can also use titanic.shape to see that we have 891 rows and 15 columns. You can quickly see that even in the first few rows there are a few NaN values. Let’s see how those are distributed:
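One way to count the missing values per column, sketched with plain pandas. The helper name `count_missing` is hypothetical:

```python
import pandas as pd


def count_missing(df: pd.DataFrame) -> pd.Series:
    """Number of missing (NaN/None) entries in each column."""
    # isnull() marks each missing cell as True; summing counts them per column.
    return df.isnull().sum()
```

Running `print(count_missing(titanic))` lists every column alongside its number of missing entries.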
So we see that we’re missing some age values, a couple of passengers are missing their embarkation information, and most records do not have a deck value. Let’s replace the missing ages with the median age.
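A minimal sketch of median imputation with `fillna`. The helper name `fill_with_median` is my own, and I’m assuming the column is numeric:

```python
import pandas as pd


def fill_with_median(values: pd.Series) -> pd.Series:
    """Return a copy with missing entries replaced by the column median."""
    # Series.median skips NaN by default, so it is computed from known values only.
    return values.fillna(values.median())
```

For the ages this would be `titanic["age"] = fill_with_median(titanic["age"])`.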
fillna is a great function which can take a single value, as we’ve used here, as well as a dict or Series of values to fill (a plain list is not accepted). These allow you to implement whatever sort of imputation you like, and there are quite a few of them. For most cases, mean or median substitution is sufficient as a first attempt, provided not too many records are missing data. When too many records are missing data, as with deck above, we have two options: attempt imputation or ignore the feature. To keep this cursory explanation brief, we’re going to ignore deck. For now…
Class imbalance
If we look at the classes we are trying to predict, we will see that there is some imbalance: more passengers did not survive on the Titanic.
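A quick way to check the balance is to count the rows per class. This is a sketch using pandas’ `value_counts`; the helper name `class_counts` is hypothetical:

```python
import pandas as pd


def class_counts(labels: pd.Series, normalize: bool = False) -> pd.Series:
    """Rows per class, most frequent first; fractions if normalize=True."""
    return labels.value_counts(normalize=normalize)
```

`class_counts(titanic["survived"])` gives the raw counts, and passing `normalize=True` gives the share of each class instead.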
There are many ways to deal with class imbalance, but the simplest is to upsample the underrepresented class or downsample the overrepresented class. Typically this only becomes a problem if the class imbalance is very large (larger than it is here), in which case it is usually best to downsample as long as you have enough data. If you don’t have enough data to downsample, it’s very likely you don’t have enough data for accurate classification, but you can always try upsampling. Here I’ll demonstrate upsampling using scikit-learn.
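A minimal sketch of upsampling, assuming `sklearn.utils.resample` is the utility in question. The helper name `upsample_minority` and its parameters are my own illustration:

```python
import pandas as pd
from sklearn.utils import resample


def upsample_minority(df: pd.DataFrame, label: str, random_state: int = 0) -> pd.DataFrame:
    """Upsample every smaller class (with replacement) to the size of the largest class."""
    counts = df[label].value_counts()
    target = int(counts.max())
    parts = []
    for cls, n in counts.items():
        rows = df[df[label] == cls]
        if n < target:
            # Draw rows with replacement until this class matches the largest one.
            rows = resample(rows, replace=True, n_samples=target, random_state=random_state)
        parts.append(rows)
    return pd.concat(parts, ignore_index=True)
```

With the titanic frame this would be `balanced = upsample_minority(titanic, "survived")`. If you hold out a test set, upsample only the training split so duplicated rows don’t end up on both sides.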
Getting zero mean and unit variance features
It is often desirable to transform features to have zero mean and unit variance to reduce the effect of differing feature scales. This can be done easily with scikit-learn’s StandardScaler. Let’s make the Titanic fares zero mean and unit variance with the following:
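A minimal sketch with `StandardScaler`, assuming a single numeric column. The helper name `standardize` is hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def standardize(values: pd.Series) -> pd.Series:
    """Scale a numeric column to zero mean and unit variance."""
    scaler = StandardScaler()
    # StandardScaler expects a 2-D array, so reshape to one column and flatten the result.
    scaled = scaler.fit_transform(values.to_numpy().reshape(-1, 1))
    return pd.Series(scaled.ravel(), index=values.index, name=values.name)
```

For the fares this would be `titanic["fare"] = standardize(titanic["fare"])`. Note that `StandardScaler` divides by the population standard deviation, so the `std` reported by `describe()` (which uses the sample formula) will come out very slightly above 1.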
Now you can see that our fares have zero mean and unit variance!
One-Hot Encoding
One-hot encoding is one way to transform a categorical variable into a boolean representation of each possible category. For the Titanic dataset we can use the class categorical variable as an example. This could be represented as a continuum, with first class being 1 and third class being 3, or we can treat it as a categorical variable and use one-hot encoding. Pandas has a simple utility for one-hot encoding.
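A minimal sketch, assuming the utility in question is `pandas.get_dummies`. The helper name `one_hot` is my own:

```python
import pandas as pd


def one_hot(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace `column` with one indicator column per category."""
    # prefix=column yields names like "class_First", "class_Second", ...
    dummies = pd.get_dummies(df[column], prefix=column)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)
```

With the titanic frame, `one_hot(titanic, "class")` swaps the class column for one flag column per passenger class. Depending on your pandas version the flags come back as booleans or as 0/1 integers.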
We see that we now have boolean flags for each class. We could repeat this process for any other categorical variable, such as who, and so on.
Text Label Encoding
Occasionally you’ll want to represent categories as unique integers. I’ve mostly run into this case for recommendations. Although scikit-learn has sklearn.preprocessing.LabelEncoder, if your data is in pandas you can use the built-in categorical data type in pandas.
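A minimal sketch of the pandas route, using the categorical dtype’s integer codes. The helper name `encode_labels` is hypothetical:

```python
import pandas as pd


def encode_labels(values: pd.Series) -> pd.Series:
    """Map each distinct category to an integer code via pandas' categorical dtype."""
    # For plain strings the codes follow the sorted order of the categories;
    # missing values are encoded as -1.
    return values.astype("category").cat.codes
```

For example, `titanic["embarked_code"] = encode_labels(titanic["embarked"])`. To map codes back to text, keep the categorical column around and look at its `.cat.categories`.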
Data visualization
There is so much to cover here, but I highly recommend working through this notebook, which provides a great introduction to visualization of datasets.
Keep an eye out for the next installment, which will cover linear support vector classifiers.