In the previous article we presented our experimental design. In this article we will see how we dealt with one of our greatest confounding variables: the preprocessing stage. Consider the following scenario. Say we choose one of the 2,400 datasets we collected (60 datasets from different problems, and from each of those we sampled 40 datasets, hence 2,400; see the first article in the series for more details: https://g-stat.com/the-effect-of-machine-learning-algorithms-advancement-level-over-their-performance-when-solving-business-analytics-problems-part-1/). Over that dataset we want to run each of our learning algorithms (see the first article in the series), and each algorithm has different requirements at the data preprocessing stage. Linear regression, for example, assumes (among other things) that the features are normally distributed, which often leads to various transformations being applied to the features so that the normality assumption holds; other learning algorithms, such as trees, do not need this. This creates a problem: if we transform the data differently for each model, it becomes hard to compare the models, because one can argue that the difference in results is due to the specific preprocessing applied to the data and not to the model itself. This is why we built a generic preprocessing pipeline that outputs a dataset ready for modeling. That way nobody can argue that each model got a different dataset that could alter the results, and if an algorithm produces a better result than the rest, we know it is superior.
The pipeline we created handles all kinds of datasets regardless of the type of problem (regression, classification, multi-class classification). Most of the steps are the same for every problem type, except for one or two stages, and we will point out the differences below. It is important to understand and remember that our goal is not to find the best preprocessing pipeline; our goal is to create a good enough common ground for all the algorithms. Our implementation of the pipeline is not necessarily the best one, but it is good enough.
Stages:
1. Clear no variance stage – in this step we look at each feature and check whether it has more than one unique value; if a feature has only one unique value, it is removed. A sketch of this check follows.
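A minimal sketch of this stage in pandas (illustrative only; the helper name drop_constant_features is an invented one, not taken from the repository linked at the end):

```python
import pandas as pd

def drop_constant_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every column that holds a single unique value."""
    constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    return df.drop(columns=constant_cols)

# Example: column "b" is constant and gets removed.
df = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7]})
print(drop_constant_features(df).columns.tolist())  # ['a']
```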
2. Missing values detection and imputation – in this stage we detect missing values in each feature and impute them with the following strategy (a sketch follows the two sub-steps):
2.1. Each missing value was replaced with the value of 0.
2.2. We created an indicator column which indicates, for each sample, whether there was a missing value for that specific feature or not: 1 – no missing value, 0 – missing value.
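A minimal sketch of this imputation strategy, assuming pandas DataFrames (the column suffix "_not_missing" is an illustrative choice, not necessarily the one used in the repository):

```python
import pandas as pd

def impute_with_indicator(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in df.columns:
        if df[col].isna().any():
            # 1 - no missing value, 0 - missing value, as described above
            out[f"{col}_not_missing"] = df[col].notna().astype(int)
            # every missing value is replaced with 0
            out[col] = df[col].fillna(0)
    return out

df = pd.DataFrame({"x": [1.0, None, 3.0]})
print(impute_with_indicator(df))
```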
3. Outliers detection and treatment – to find the outliers we used a clustering technique called Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), which is a variation of the DBSCAN algorithm. The main characteristic of DBSCAN is that there is no need to determine the number of clusters up front; instead, we need to determine two other main hyperparameters:
– Epsilon – the maximum distance between two data points for them to be considered part of the same cluster.
– Minimum number of points in a cluster.
For empirical reasons we chose epsilon to be the maximum between the median value plus the standard deviation and the value 1, and we set the minimum number of data points to 1% of the number of points of the specific feature being processed. However, in HDBSCAN there is no need to determine epsilon at all: we only need to determine the minimum number of points in a cluster, and the algorithm makes its clustering decision by evaluating all possible epsilon values over the points. So we are down to one main hyperparameter, which we set to 1%. Eventually the algorithm outputs, for each point, one of three possibilities: a core point, which is the center of a cluster; a connected point, which lies within epsilon distance of a core point but does not itself have the minimum number of points within epsilon distance of it; or an outlier, a point that is not within epsilon distance of any core point, which is exactly what we need in order to detect the outliers.
After detecting the outliers, we dealt with them by replacing each one with the 99th percentile value if it was extreme towards the maximum, or with the 1st percentile value if it was extreme towards the minimum. A sketch of this stage follows.
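A rough sketch of per-feature outlier handling, assuming the hdbscan Python package is used (HDBSCAN labels noise points with -1, which here play the role of outliers); the exact implementation in the repository may differ:

```python
import numpy as np
import hdbscan

def cap_outliers(x: np.ndarray) -> np.ndarray:
    """Detect outliers in a single feature with HDBSCAN and cap them at the 1st/99th percentiles."""
    min_cluster_size = max(2, int(0.01 * len(x)))  # roughly 1% of the points, the one hyperparameter
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(x.reshape(-1, 1))
    outliers = labels == -1                        # HDBSCAN marks noise points with the label -1
    low, high = np.percentile(x, [1, 99])
    capped = x.copy()
    capped[outliers & (x > high)] = high           # extreme towards the maximum -> 99th percentile
    capped[outliers & (x < low)] = low             # extreme towards the minimum -> 1st percentile
    return capped
```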
4. Scaling – we used standardization for all the continuous features we had, that is, subtracting the mean and dividing by the standard deviation.
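With scikit-learn this standardization step is essentially a one-liner (a minimal sketch, assuming the continuous column names are known):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def standardize(df: pd.DataFrame, continuous_cols: list) -> pd.DataFrame:
    out = df.copy()
    # (x - mean) / std for every continuous column
    out[continuous_cols] = StandardScaler().fit_transform(df[continuous_cols])
    return out
```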
5. Categorizing – so far, all the stages we discussed (1-4) were applied to every dataset regardless of the type of problem. This stage is different. Its goal is to deal with categorical features that have many levels, which could increase the dimensionality of the problem, so we wanted to avoid this by grouping similar levels into a single level. For a classification problem we merged levels of a categorical feature whose distributions with respect to the target are the same up to an epsilon value, which we set to 2%. To make this clearer, say we have a feature named "X" with some levels, of which levels "a" and "b" have the same distribution with the target variable: for 71% of the samples with level "a" the target value is 1 and for the other 29% it is 0, while for 69% of the samples with level "b" the target value is 1. In that case we would merge all the samples that have the value "a" or "b" into a new level, "a-b". We only did this when the number of different levels of the feature was greater than 10. For a regression problem we did something different: we kept only the levels that cover at least 80% of all the samples, and whatever was left was merged into a single level. We did this with a cumulative count of samples per level, starting from the level with the greatest number of samples, moving on to the next biggest one, and so on, until we reached at least 80% of the samples; the remaining levels were merged into a single level, as in the sketch below.
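A minimal sketch of the regression variant only (keeping the most frequent levels up to 80% coverage); the "other" label and the function name are illustrative choices:

```python
import pandas as pd

def group_rare_levels(s: pd.Series, coverage: float = 0.8) -> pd.Series:
    """Merge all levels outside the smallest set of most frequent levels reaching the coverage."""
    freq = s.value_counts(normalize=True)                # levels sorted from most to least frequent
    n_keep = int((freq.cumsum() < coverage).sum()) + 1   # smallest prefix reaching at least 80%
    kept = freq.index[:n_keep]
    return s.where(s.isin(kept), other="other")          # everything else becomes a single level
```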
6. Correlations – the goal of this stage is feature selection. We used Spearman correlation because of the categorical features present, and we declared any two features with an absolute correlation above 70% to be correlated; of such a pair we dropped the feature that is most correlated with the target variable.
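A rough sketch of this filter, assuming the features in X are already numerically encoded and y is the target; which member of a correlated pair gets dropped follows the rule stated above:

```python
import pandas as pd

def drop_correlated(X: pd.DataFrame, y: pd.Series, threshold: float = 0.7) -> pd.DataFrame:
    corr = X.corr(method="spearman").abs()                # feature-feature Spearman correlations
    target_corr = X.corrwith(y, method="spearman").abs()  # feature-target Spearman correlations
    to_drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold and a not in to_drop and b not in to_drop:
                # drop one feature of the correlated pair, chosen by its correlation with the target
                to_drop.add(a if target_corr[a] >= target_corr[b] else b)
    return X.drop(columns=list(to_drop))
```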
7. One hot encoding – we created dummy variables for each level of each categorical feature we got.
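One-hot encoding can be sketched with pandas' get_dummies (illustrative; scikit-learn's OneHotEncoder works just as well):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
encoded = pd.get_dummies(df, columns=["color"])     # one dummy column per level of "color"
print(encoded.columns.tolist())                     # ['size', 'color_blue', 'color_red']
```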
8. Time series data preparation – this stage is only relevant for time series problems. In this stage we used the window method with 5 time points, meaning that we created 5 new lagged features for each original feature and tried to predict the outcome at the next time point.
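A sketch of the window method with 5 lags, assuming a pandas DataFrame ordered by time (the column and function names are illustrative):

```python
import pandas as pd

def make_window_features(df: pd.DataFrame, target_col: str, window: int = 5) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        for lag in range(1, window + 1):
            out[f"{col}_lag{lag}"] = df[col].shift(lag)   # value of the feature `lag` steps back
    out["target"] = df[target_col]                        # the outcome one step ahead of the lags
    return out.dropna()                                   # drop the first rows that lack full history
```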
All the stages were implemented using sklearn custom transformers; the code is available here: https://github.com/houhashv/MLProject/blob/master/mlproject/pre_processing/transformers.py.
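For readers unfamiliar with the pattern, a custom transformer is simply a class with fit and transform methods; the skeleton below illustrates it for the no-variance stage (an illustration of the pattern, not the code from the repository):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ClearNoVariance(BaseEstimator, TransformerMixin):
    """Learn the constant columns in fit() and drop them in transform()."""

    def fit(self, X: pd.DataFrame, y=None):
        self.constant_cols_ = [c for c in X.columns if X[c].nunique(dropna=False) <= 1]
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return X.drop(columns=self.constant_cols_)
```

Transformers written this way can be chained together with sklearn.pipeline.Pipeline, which is what turns the whole preprocessing flow into a single reusable object.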
At the end of this pipeline we have a dataset ready to be run through all the models. We will discuss this in the next article, stay tuned.