Supervised Classification: An Introduction and Preprocessing
This is the first installment of my new series, a guide to supervised classification. If you have a labeled training set with multiple classes, we're going to figure out how to predict those classes. For these tutorials we will be using scikit-learn, so we will loosely follow their algorithm cheat-sheet, focusing of course on the classification quadrant. The difference is that this won't just be a link to documentation: I'll briefly introduce the mathematical basis of each classifier and focus on its practical usage. Along the way we'll also cover some preprocessing, data cleaning, data imputation, and things to look for when training a classifier. In this post I want to briefly introduce some preprocessing steps that will come in handy and that I'll reference in later posts.
Data preprocessing and cleaning can often be the most challenging part of a problem. I could spend an entire series on these techniques (and maybe later I will), but for now let's make sure we're all on the same page with the basics. The most common problems I run into are missing or incorrect data and class imbalance. The first step I always take is to look at the reliability of each feature individually, as this tells me how much data cleaning is needed; perhaps the features you're interested in don't need any cleaning at all! Don't count on it… So in this post I want to cover methods to deal with:
- Missing Data
- Class imbalance
- Getting zero mean and unit variance features
- One-Hot Encoding
- Text Label Encoding
- Data visualization
Taking your time on these steps and LOOKING AT YOUR DATA along the way will greatly improve your results. Did you notice that LOOK AT YOUR DATA is capitalized? That's because it is very important. Before we get started, make sure you're set up for doing data science; you can follow my getting started guide and how I configure my environment here.
For these tutorials we will be using the titanic dataset from seaborn. You can load it with:
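```python
import seaborn as sns

# Load the Titanic passenger dataset bundled with seaborn
titanic = sns.load_dataset('titanic')
```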
Missing Data
You can quickly take a look at the data with titanic.head(), titanic.sample(n=5), or titanic.describe(). We can also use titanic.shape to see that we have 891 rows and 15 columns. You can quickly see that even in the first few rows there are a few NaN values. Let's see how those are distributed:
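A quick way to do that is to count the missing values in each column; a minimal sketch:

```python
# Count the NaN values in each column
print(titanic.isnull().sum())
```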
So we see that we're missing some age values, a couple of passengers are missing their embarkation information, and most records do not have a deck value. Let's replace the missing ages with the median age.
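One way to do this with pandas, a sketch (the age column name comes from the seaborn dataset):

```python
# Replace missing ages with the overall median age
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
```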
fillna is a great function which can take a single value, as we've used here, as well as a Series or list of values to fill. This lets you implement any sort of imputation you like, and there are quite a few of them. In most cases mean or median substitution is sufficient as a first attempt, provided not too many records are missing data. When too many records are missing data, as with deck above, we have two options: attempt imputation or ignore the feature. For brevity of this cursory explanation we're going to ignore deck. For now…
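As an illustrative aside (we already filled age with the overall median above, so this isn't needed for the rest of the post), here is a sketch of filling with a Series: imputing each passenger's missing age with the median age of their passenger class instead.

```python
# Median age per passenger class, broadcast back to every row via the index
class_median_age = titanic.groupby('pclass')['age'].transform('median')

# fillna accepts a Series aligned on the index, enabling group-wise imputation
titanic['age'] = titanic['age'].fillna(class_median_age)
```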
Class imbalance
If we look at the classes we are trying to predict, we will see that there is some imbalance: more passengers did not survive the Titanic than survived.
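A quick check of the class distribution, as a sketch:

```python
# 0 = did not survive, 1 = survived
print(titanic['survived'].value_counts())
# The counts show noticeably more non-survivors than survivors
```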
There are many ways to deal with class imbalance, but the simplest is to upsample the underrepresented class or downsample the overrepresented class. Typically this only becomes a problem when the class imbalance is very large (larger than it is here), in which case it is usually best to downsample, as long as you have enough data. If you don't have enough data to downsample, it's very likely you don't have enough data for accurate classification, but you can always try upsampling. Here I'll demonstrate upsampling using scikit-learn's resample:
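A sketch of what that can look like, assuming we upsample the survivors to match the number of non-survivors:

```python
import pandas as pd
from sklearn.utils import resample

# Separate the majority and minority classes
died = titanic[titanic['survived'] == 0]
survived = titanic[titanic['survived'] == 1]

# Upsample the minority class with replacement until the classes are balanced
survived_upsampled = resample(survived,
                              replace=True,
                              n_samples=len(died),
                              random_state=42)

# Recombine into a balanced DataFrame
titanic_balanced = pd.concat([died, survived_upsampled])
print(titanic_balanced['survived'].value_counts())
```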
Getting zero mean and unit variance features
It is often desirable to transform features to have zero mean and unit variance to reduce scale and bias effects. This can be done easily with scikit-learn's StandardScaler. Let's make the titanic fares zero mean and unit variance with the following:
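A sketch, here writing the result into a new fare_scaled column (the column name is my own choice):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# StandardScaler expects a 2D array, hence the double brackets; ravel() flattens it back
titanic['fare_scaled'] = scaler.fit_transform(titanic[['fare']]).ravel()

# The mean should be ~0 and the standard deviation ~1
print(titanic['fare_scaled'].mean(), titanic['fare_scaled'].std())
```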
Now you can see that our fares have zero mean and unit variance!
One-Hot Encoding
One-hot encoding is one way to transform a categorical variable into a boolean representation of each possible category. For the titanic dataset we can use the class categorical variable as an example. This could be represented as a continuum, with first class being 1 and third class being 3, or we can treat it as a categorical variable and use one-hot encoding. Pandas has a simple utility for one-hot encoding called get_dummies.
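A minimal sketch:

```python
import pandas as pd

# One-hot encode the passenger class into one boolean column per category
class_dummies = pd.get_dummies(titanic['class'], prefix='class')
print(class_dummies.head())

# Attach the new columns alongside the original DataFrame if desired
titanic = pd.concat([titanic, class_dummies], axis=1)
```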
We see that we now have boolean flags for each class. We could repeat this process for any other categorical variable, such as sex, who, and so on.
Text Label Encoding
Occasionally you'll want to represent categories as unique integer values. I've mostly run into this case for recommendations. Scikit-learn has sklearn.preprocessing.LabelEncoder for this, but if your data is in pandas you can use pandas' built-in categorical data type instead.
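A sketch using the embark_town column as an example (the choice of column and the new column name are mine):

```python
# Convert to the pandas categorical dtype
titanic['embark_town'] = titanic['embark_town'].astype('category')

# .cat.codes assigns each category a unique integer (missing values become -1)
titanic['embark_town_code'] = titanic['embark_town'].cat.codes
print(titanic[['embark_town', 'embark_town_code']].head())
```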
Data visualization
There is too much to cover here, but I highly recommend working through this notebook, which provides a great introduction to visualizing datasets.
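That notebook is the real resource here, but as a tiny illustrative example of looking at your data with seaborn (not taken from the notebook):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Survival counts broken down by passenger class
sns.countplot(data=titanic, x='class', hue='survived')
plt.show()
```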
Keep an eye out for the next installment, which will cover linear support vector classifiers.