In this detailed guide, I will present to you the essential steps of Data Preprocessing in the field of Deep Learning and Data Science.
Furthermore, I will show you how to implement these steps in Python.
Table of Contents
- Introduction: Why do we need Data Preprocessing?
- Numerical Data
- Feature Scaling
- Handling Missing Values
- Handling Outliers
- Categorical Data and String Data Types
- Data Splitting
1. Introduction: Why do we need Data Preprocessing?
When training a neural network or a machine learning model, you must always keep in mind that the quality of the training data determines the quality of your model. The data you will encounter in practice will not be clean in most cases. This means the data will contain non-uniform data formats, missing values, outliers, and features with very different ranges. In short, the data is not ready to be used as training data for your model. For this reason, the data must be preprocessed in various ways.
The purpose of this article is to show you how certain preprocessing steps can be applied to these problematic data samples so that they can be used as appropriate training data.
All preprocessing steps will first be discussed in theory. After each section, I will put this theory into practice and show you how to use Python to perform these preprocessing steps.
2. Numerical Data
In the first half of this article, I will introduce the most common preprocessing steps for numeric data. Numeric data represents data in the form of scalar values. These scalar values have a continuous range. This means there is an infinite amount of possible values.
Integers and floating-point numbers are the most commonly used numeric data types. Although numeric data can be fed directly into neural network models, the data may still require some preprocessing steps, for example, when the ranges of the input features are very different.
3. Feature Scaling
It is in the nature of real-world data that the ranges of its values vary widely. Some features may have a very small range of values, while others have a very large one. Imagine your input data consists of only two features. Let’s call these features 𝜃1 and 𝜃2. Let's say 𝜃2 takes values from a small range between 0 and 0.5, while the values of 𝜃1 are between 10 and 2500.
Can you imagine why this can be problematic when you train a neural network on this data?
To explain why this can cause problems, let’s use a graphical model. The left graph shows the plot of a loss function for which we search the weights that result in a global minimum. It is quite noticeable that the very different feature ranges of 𝜃1 and 𝜃2 make this loss function very skewed and elongated:
This becomes even clearer when we observe the contours of the loss function below:
The red areas of the contours show higher values of the loss function, while the blue area represents the lower values. Remember, when training a neural network, we use gradient descent to find a particular set of weights that minimizes the loss function.
In the case of a very skewed loss function like this one, gradient descent takes large, oscillating steps along the steep direction of the loss surface (caused by the feature with the larger value range) while making only slow progress along the shallow direction.
The gradient would oscillate a lot, which would cause the model to take much longer to find the global minimum:
In the worst case, the minimum might never be found at all. In practice, we can prevent this problem by applying so-called feature scaling.
Feature scaling is a method in data science used to standardize the range of independent features in a dataset. Feature scaling would prevent the mentioned problem and improve the overall performance of the model. Let’s see what it would look like in a graphical model:
On the left side, you can see our loss function with its contours without feature scaling, as shown before. On the right side, you can see the same loss function after feature scaling of 𝜃1 and 𝜃2. It can be clearly seen that the standardization of the features 𝜃1 and 𝜃2 has resulted in a much more symmetrical loss function.
In this case, the gradient descent can go straight towards the minimum of the loss function without any oscillation. In addition, it allows us to use a much higher learning rate, which reduces the overall training time of the model.
Now that we have seen the benefits of feature scaling let’s deal with the question of how feature scaling can be done in practice. It turns out that there are several ways to do it.
The easiest way to standardize the features is to rescale them to a new range between 0 and 1. This can be achieved with the following equation:

x' = (x − min(X)) / (max(X) − min(X))

For a given feature X, subtract from each value x the lowest value in X and divide the result by the difference between the highest and lowest values in X. This procedure causes all values of the feature X to take new values in the range 0–1.
This rescaling method does not depend on an underlying distribution of X. Let’s take a look at how this scaling changes certain feature values.
At the top left, you can see the graph of randomly generated numbers in a range between 0 and 100. The following graph shows the same distribution, but for a much larger range. Consider these two diagrams as an example of two features in a dataset that have very different ranges.
If you apply the rescaling method introduced above, you can see that all these values are projected into a much thinner area. Regardless of their previous ranges, all values now lie in a range between 0 and 1; only the relative proportions have remained the same.
In Python, you can implement this feature scaling technique by using the sklearn library. The following code shows how this technique can be implemented in practice.
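Since the original snippet is not shown here, the following is a minimal sketch of min-max scaling with scikit-learn's MinMaxScaler; the sample values are made up for illustration, mimicking two features with very different ranges:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features with very different ranges,
# standing in for theta_1 and theta_2 from the example above
X = np.array([[10.0, 0.05],
              [1200.0, 0.30],
              [2500.0, 0.50]])

# MinMaxScaler rescales each feature (column) to the range [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```

Note that `fit_transform` learns the minimum and maximum from the data and applies the rescaling in one step; on new data (e.g. a test set) you would call only `transform` with the already fitted scaler.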
Another technique for dealing with very different feature ranges is applying normalization. During normalization, you must treat your feature X as a simple mathematical vector. This is not a problem at all, since all values of X are numeric and every value in X would represent an entry in a vector.
In the next step, you divide each element in your feature vector X by the norm of that vector:

x'_i = x_i / ‖X‖

As the norm, you can use the Manhattan (L1) norm or the simple Euclidean (L2) norm. Applying normalization cancels out the magnitude of each value in X and forces the values to lie between 0 and 1.
Let's see what it looks like in practice. As before, you can see on the left side the plots of random values from two highly different ranges. The corresponding values after normalization are shown on the right side:
The following python code shows how normalization can be applied in practice:
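As the original snippet is not reproduced here, this is a minimal sketch (with made-up sample values) that divides each feature column by its Euclidean norm, exactly as described above; `sklearn.preprocessing.normalize(X, axis=0)` would do the same:

```python
import numpy as np

# Hypothetical feature columns with very different ranges
X = np.array([[10.0, 0.05],
              [1200.0, 0.30],
              [2500.0, 0.50]])

# Treat each feature (column) as a vector and divide it by its
# Euclidean (L2) norm, so every column becomes a unit vector
norms = np.linalg.norm(X, axis=0)
X_normalized = X / norms

print(X_normalized)
```

Be aware that scikit-learn's `Normalizer` class normalizes rows (samples) by default, not columns, which is why the sketch above uses NumPy directly to match the per-feature description in the text.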
The third method that can be used to rescale the range of features is called standardization. Feature standardization makes the values of a feature have zero mean and unit variance.
To perform standardization, you must first calculate the mean μ and the standard deviation σ of the values in the feature X you want to standardize. In the next step, you subtract the mean μ from each feature value and divide the resulting difference by the standard deviation σ:

z = (x − μ) / σ
As before, let’s see how this technique applies to two very different ranges of values. It can be noticed that after standardization all values are distributed around the zero mean, with most values lying within one standard deviation of it:
The following code snippet shows how standardization can be applied in Python, again using the sklearn library:
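Since the snippet is not shown here, a minimal sketch with scikit-learn's StandardScaler follows; the sample values are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features with very different ranges
X = np.array([[10.0, 0.05],
              [1200.0, 0.30],
              [2500.0, 0.50]])

# StandardScaler subtracts the per-feature mean and divides by
# the per-feature standard deviation: z = (x - mu) / sigma
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

print(X_standardized.mean(axis=0))  # close to zero
print(X_standardized.std(axis=0))   # close to one
```

As with the other scalers, fit on the training data only and reuse the fitted scaler to `transform` the validation and test sets.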
4. Handling Missing Values
Unfortunately, the datasets you encounter in practice will often have missing values. Missing values in a dataset are commonly represented as ‘NaN’, ‘NA’, ‘None’, ‘ ’, or ‘?’.
For example, the famous “Wine Quality” dataset contains quite a lot of missing values:
Of course, this is an issue that must be appropriately handled because neural networks and machine learning models can not work with this kind of data.
Firstly, I should mention that there is no perfect way to handle missing data in your dataset. In most cases, you should just try out different methods and see what gives you the best results.
In the following, I would like to present two techniques for dealing with missing data that have given me satisfying results in the past. The first technique is very simple: you delete every data sample in which a missing value occurs:
This technique completely eliminates the issue of missing values. One big disadvantage, however, is that deleting data samples reduces the size of your dataset.
This might not be a problem when you have a huge dataset with millions of samples and only a few hundred missing values. But since the performance of neural networks and machine learning models scales approximately with the size of the dataset, deleting samples from a very small dataset could decrease the performance of your model dramatically.
This is where the technique of data imputation comes in handy. Data imputation means the replacement of a missing value by a certain value that you calculate. This calculated value is very often either the mean, median or the mode of values within the feature where the missing value occurs. Let’s look at an example. Here you see again the first few samples of the wine quality dataset, where three different features contain a missing value:
In the imputation step, we compute in this case the mean of all values of a feature where the missing values are observed.
After that we replace the missing values with the calculated mean values:
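The imputation step described above can be sketched with scikit-learn's SimpleImputer; the feature matrix below is made up for illustration, standing in for the wine-quality columns with missing values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix; np.nan marks the missing values
X = np.array([[7.4, 0.70, np.nan],
              [7.8, np.nan, 0.76],
              [np.nan, 0.65, 0.04],
              [11.2, 0.58, 0.56]])

# Replace each missing value with the mean of its feature column;
# strategy can also be "median" or "most_frequent" (the mode)
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Switching the `strategy` argument is all it takes to try median or mode imputation instead of the mean.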
Please note that imputation does not necessarily give better results. In my opinion, however, it is better to keep data than to discard it, which is why I personally always prefer imputation over deletion.
5. Handling Outliers
Let’s talk about how we can deal with outliers in our dataset. An outlier is a data instance whose values differ significantly from those of the other instances in the dataset.
For example: [0.5, 0.41, 0.75, 27.4, 0.1, 2.78]
Oftentimes outliers are harmless. They may simply be statistical anomalies or errors in the data. Sometimes, however, outliers can make the training process of a neural network or a machine learning model problematic.
As in the case of missing values, there is no perfect solution for handling these special occurrences. Most techniques for dealing with outliers resemble the methods for missing values: deleting the affected observations, transforming the values, binning them, treating the outliers as a separate group, imputing replacement values, and other statistical methods. The easiest way to get rid of outliers is simply to delete them. And just as with missing values, we can also impute outliers.
We can use mean, median, or mode imputation. Another possible method is to apply certain transformations. Taking the logarithm of all values within the feature where the outlier occurs reduces the variation caused by extreme values.
Another possible transformation is binarization, where you transform all values of a feature using a binary threshold: all values above the threshold are marked as 1, and all values equal to or below it are marked as 0. If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the remaining data as two different groups, build an individual model for each group, and then combine the outputs.
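The two transformations just mentioned can be sketched as follows, using the outlier example from above; the threshold of 1.0 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# The feature from the example above, containing the outlier 27.4
x = np.array([[0.5], [0.41], [0.75], [27.4], [0.1], [2.78]])

# Log transform: compresses the extreme value
# (log1p computes log(1 + x), which is safe for values near zero)
x_log = np.log1p(x)

# Binarization: values strictly above the threshold become 1,
# all values equal to or below it become 0
binarizer = Binarizer(threshold=1.0)
x_bin = binarizer.fit_transform(x)

print(x_log.ravel())
print(x_bin.ravel())
```

After the log transform, the outlier 27.4 shrinks to about 3.35 and no longer dwarfs the other values.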
6. Categorical Data and String Data Types
In the previous sections, various preprocessing techniques for continuous numerical data were discussed. In this section, we look at another type of data that is discrete and commonly referred to as categorical data.
Categorical data is the data that is discrete and generally takes a limited number of possible values such as “age”, “gender”, “country” and so on.
Normally, you encounter categorical data as a string data type in your dataset, but this does not always have to be the case. For example, “age”, which is a discrete, categorical variable, has an integer data type.
Neural networks and other machine learning models cannot work with categorical string data directly. We need to apply some transformations to that data before we can train a model on it.
The entire process of converting categorical string data into a format that can be used by deep learning or machine learning model will be covered in the following.
Let’s take a look at a categorical feature of a dataset that is called “job”. This feature contains string values that stand for different job titles:
Of course, we cannot feed these values into a neural network, or any other machine learning model, as they are. We first need to convert the categorical string data into a numeric format. This process is called encoding.
The encoding process assigns a numeric value to each string value, with identical categorical values having the same numeric values:
You can do this step by hand or just use the built-in sklearn function called LabelEncoder:
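A minimal sketch of this step with scikit-learn's LabelEncoder follows; the job titles are sample values for illustration, and note that LabelEncoder assigns integers according to the sorted order of the unique values, so the exact mapping in the article's dataset may differ:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values of the categorical feature "job"
jobs = ["admin", "technician", "entrepreneur", "admin", "management"]

# LabelEncoder maps each distinct string to an integer,
# assigned in alphabetically sorted order of the unique values
encoder = LabelEncoder()
jobs_encoded = encoder.fit_transform(jobs)

print(list(encoder.classes_))
print(jobs_encoded)
```

Identical strings always receive identical integers, and `encoder.inverse_transform` recovers the original labels when needed.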
At this point, you may think, okay, we got what we wanted. We have exchanged all strings with numerical values, we are done. But unfortunately, it is not as easy as it seems at first glance.
Although we could feed these values into a neural network, we would also cause a problem during training, because the features are still categorical.
The problem is that we have quite arbitrarily assigned numeric values to the string values. In fact, the values were given in ascending order, depending on the location of the categorical string data instance in the dataset. As a result, our model could give higher numerical values a higher meaning during the training.
In this particular example, the job title “entrepreneur” has a higher numerical value than the title “admin”. In real life, some would argue that an entrepreneur may indeed be considered a job with more value. When training an algorithmic model, however, this is not necessarily the case.
In fact, it depends on the given problem that we try to solve, whether or not the given job title is more important for this particular problem.
By using categorical data we would introduce a bias towards a particular job feature.
Remember that the whole purpose of a deep learning model is to identify important patterns and relations in the data and learn from them.
Meaning to find a certain set of weights that enables the network to perform a correct prediction given the input features. During the training, we have to let the neural network decide for itself how and to what extent a particular job title and its corresponding numerical value contribute to solving the given problem.
If we introduce this bias, the network might not find the necessary weights, or at least would have a harder time doing so, because:
- First, the network would have to recognize this distortion
- Second, it would have to make appropriate corrections to the weights in order to neutralize this bias
The issue of introducing a bias by using categorical data can be solved by a technique which is called one-hot-encoding.
One-hot-encoding performs a binarization of the categorical values. In our case, we assigned numeric values to the categorical string feature “job” during encoding. As a result, we got a vector in which the string values were replaced by numerical values.
During one-hot-encoding, we create for each categorical feature value a sparse row vector. This vector contains only zeros except for one single entry, where the value is one. The index of this entry corresponds to the numeric value to which one-hot-encoding is applied:
In our particular example, the job title “admin” has the encoded value of 3. The one-hot-encoded vector of this job title has a value of 1 in the fourth column, which corresponds to index 3. Remember, this is because vector indices start from zero.
The columns represent the job titles. I have abbreviated the titles with the names x_1 to x_5. The column x_4 thus represents the title admin.
A person in a dataset with this job title admin would simply have the value 1 in the column that represents this job title, and zeros otherwise, just like here:
With one-hot-encoding, we get rid of categorical data and make sure that all feature entries are treated the same at the beginning of the training. The significance of a feature on the prediction of the network is determined only during the training of the neural network.
A simple and quick implementation of one-hot-encoding applied to the previous example can be performed with the sklearn OneHotEncoder function:
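As the snippet itself is not reproduced here, this is a minimal sketch with scikit-learn's OneHotEncoder; the job titles are hypothetical sample values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical "job" column; OneHotEncoder expects a 2-D array
# with one column per categorical feature
jobs = np.array([["admin"], ["technician"], ["entrepreneur"],
                 ["admin"], ["management"]])

# fit_transform returns a sparse matrix by default;
# toarray() converts it to a dense array for display
encoder = OneHotEncoder()
jobs_onehot = encoder.fit_transform(jobs).toarray()

print(encoder.categories_)
print(jobs_onehot)
```

Each row contains exactly one 1, and identical job titles map to identical rows, so no artificial ordering between the categories remains.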
7. Dataset Splitting
Before we can train our neural network, there is one last step to do. We have to divide our dataset into so-called training, validation and test sets.
The training set is, as the name implies, the dataset on which we train the network. The model sees and learns from this data.
In contrast, the validation set serves to evaluate a model without bias as we train the network, adjust the hyperparameters, and try different model architectures. The evaluation is unbiased, as the model only occasionally sees this data during training to calculate some evaluation metrics, but the network never learns from this data.
Once a model is completely trained, its final performance is measured on the test set. The test dataset provides the gold standard for truly evaluating the final performance of a trained model. Very often the validation set is used as the test set, but this is not good practice.
A tricky question might be what ratio should be used for these three separate datasets?
In general, the training set should be as large as possible. On the other hand, you still need to provide enough data samples for the validation, and especially for the testing phase, to get a meaningful performance result.
In most cases, you’ll need to adjust the ratio to the size of your entire dataset. For medium-sized datasets with about a couple of thousands to tens of thousands of data samples, typically 70% of the data samples are included in the training set, 10% in the validation set, and the remainder in the test set. In this case, you will be able to provide enough examples of your neural network and at the same time be able to accurately evaluate and test your model.
If you encounter very large data sets with millions of data samples, you can significantly reduce the sizes of validation and test sets. Only a few percent of these millions of data instances is more than enough to test the performance of your model.
You can divide an arbitrary dataset into two separate datasets, for example for training and testing purposes, using the sklearn.model_selection module:
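A minimal sketch follows, using scikit-learn's train_test_split twice to produce the 70/10/20 split discussed above; the dataset is randomly generated for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 3 features, binary labels
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# First split off the test set (20% of the data) ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ... then carve a validation set out of the remaining 80%:
# 0.125 * 80 samples = 10 samples, i.e. 10% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.125, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

Setting `random_state` makes the split reproducible, and for classification tasks the `stratify` parameter keeps the class proportions equal across the three sets.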