Using Custom Transformers in Your Machine Learning Pipelines with scikit-learn

Hey there, are you a data scientist or a data analyst? If you are, I bet you've heard of the scikit-learn library, and it is probably one of your primary machine learning packages. It is popular for good reason: its wide range of modules covers most of the stages of a standard machine learning pipeline.

Still, I often see a lot of users using scikit-learn the wrong way. By "the wrong way" I mean not using its full capabilities: users tend to reach only for the estimators, i.e. the machine learning models themselves. Much of the time this is a symptom of the "copy-paste" era we live in, where people copy code, make small adjustments, and move on. That is fine as far as it goes, but it keeps you fixated on the solution you copied, without doing any further research on how to make it better or what else could be done with the same module.

No matter which IDE (integrated development environment) you program in, you probably have lines upon lines of code just for the preprocessing stages of your pipeline. This makes your code very hard to understand and very hard to manage, and as you surely know, organized, readable code is crucial for maintaining a machine learning pipeline.
One of the most interesting and powerful features of scikit-learn is the custom transformer, which is by far one of the best ways to implement a stage in your machine learning pipeline. A custom transformer lets the developer package any stage in a generic way that is easy to develop and maintain, because of how cleanly it integrates into your code. Just as importantly, this approach makes it easier for a code reviewer to understand what you wrote.

Let us unfold a custom transformer by looking at the following code:

from sklearn.base import BaseEstimator, TransformerMixin


class CustomTransformer(BaseEstimator, TransformerMixin):
    """A general class for creating a step in a machine learning pipeline."""

    def __init__(self):
        super(CustomTransformer, self).__init__()

    def fit(self, X, y=None, **kwargs):
        """
        An abstract method that is used to fit the step and to learn by examples.
        :param X: features - DataFrame
        :param y: target vector - Series
        :param kwargs: free parameters - dictionary
        :return: self: the class object - an instance of the transformer - Transformer
        """
        return self

    def transform(self, X, y=None, **kwargs):
        """
        An abstract method that is used to transform according to what happened in the fit method.
        :param X: features - DataFrame
        :param y: target vector - Series
        :param kwargs: free parameters - dictionary
        :return: X: the transformed data - DataFrame
        """
        return X

    def fit_transform(self, X, y=None, **kwargs):
        """
        Perform fit and transform over the data.
        :param X: features - DataFrame
        :param y: target vector - Series
        :param kwargs: free parameters - dictionary
        :return: X: the transformed data - DataFrame
        """
        self = self.fit(X, y)
        return self.transform(X, y)

We can see that a custom transformer is indeed a class, and as a class it inherits from two classes that scikit-learn provides and that are mandatory: BaseEstimator and TransformerMixin. This is what makes the magic happen, but inheriting from these classes requires the developer to implement three methods: fit, transform, and fit_transform. Each method receives your data (X, the features, and y, the target) as local variables to be used inside the method. fit implements the stage's logic and learns the behavior from the data in X and y; transform applies the logic that was learned in the fit stage and transforms your data. In both cases, the y argument is optional.
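To make the pattern concrete, here is a small hypothetical transformer (not from the project discussed here) that learns the column means in fit and fills missing values in transform. Note that scikit-learn already ships SimpleImputer for exactly this job; this sketch only illustrates the fit/transform split:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MeanImputeTransformer(BaseEstimator, TransformerMixin):
    """Fills missing numeric values with the column means learned in fit."""

    def fit(self, X, y=None, **kwargs):
        # learn the behavior over the training data
        self.means_ = X.mean(numeric_only=True)
        return self

    def transform(self, X, y=None, **kwargs):
        # apply what was learned in fit to any data with the same schema
        return X.fillna(self.means_)


train = pd.DataFrame({"age": [20.0, None, 40.0]})
print(MeanImputeTransformer().fit_transform(train))  # the NaN becomes 30.0
```

Because the class inherits from TransformerMixin, fit_transform comes for free: it simply calls fit and then transform on the same data.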

On top of the custom transformer we have created, we were able to build more custom transformers, which finally enabled us to make our whole machine learning pipeline generic. Just a side note: scikit-learn already implements many transformers, so you don't have to reinvent the wheel. If you wish to see the transformers we used, you can take a look here:
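For instance, scikit-learn's built-in StandardScaler exposes exactly the same fit/transform interface, so it drops into a pipeline like any custom transformer:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)               # learns the mean and std of the training data
scaled = scaler.transform(X_new)  # applies that same scaling to unseen data
print(scaled)                     # (4 - mean) / std, using train statistics
```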

The next code snippet shows an implementation of a complete preprocessing flow, bundled together from custom transformers into a Pipeline, a scikit-learn class that chains the execution of multiple stages. The raw data is passed from one stage to the next, each stage learning its transformation and applying it, with the ability to apply the same transformations to new, unseen data that has the same schema as the data sent to the fit method of the transformation.

def features_pipeline(index, df, X_test, y_test, columns, folder, type):
    """
    Run a preprocessing problem.
    :param index: the index of the sample from the original dataset - int
    :param df: the learning dataset to perform analysis over - DataFrame
    :param X_test: the test dataset - DataFrame
    :param y_test: the test target - DataFrame
    :param columns: the columns dictionary, where each key has a list of feature names - dictionary
    :return: a list of one tuple with results and information about the model_pipeline_run run
    """
    start_time = time.time()
    key_cols = columns["key"]
    target_cols = columns["target"]
    columns = get_cols(df, key_cols + target_cols, 0.005)
    columns["key"] = key_cols
    columns["target"] = target_cols
    X_train, _, y_train, _ = train_test_split(df.drop(columns["target"][0], axis=1), df[columns["target"][0]],
                                              test_size=0.3, random_state=42)
    # preprocessing pipeline stages
    clear_stage = ClearNoCategoriesTransformer(categorical_cols=columns["categoric"])
    imputer = ImputeTransformer(numerical_cols=columns["numeric"], categorical_cols=columns["categoric"])
    outliers = OutliersTransformer(numerical_cols=columns["numeric"], categorical_cols=columns["categoric"])
    scale = ScalingTransformer(numerical_cols=columns["numeric"])
    if type == "classification":
        categorize = CategorizeByTargetTransformer(categorical_cols=columns["categoric"])
    else:
        categorize = CategorizingTransformer(categorical_cols=columns["categoric"])
    correlations = CorrelationTransformer(numerical_cols=columns["numeric"], categorical_cols=columns["categoric"],
                                          target=columns["target"], threshold=0.9)
    dummies = DummiesTransformer(columns["categoric"])
    steps_feat = [("clear_non_variance", clear_stage),
                  ("imputer", imputer),
                  ("outliers", outliers),
                  ("scaling", scale),
                  ("categorize", categorize),
                  ("correlations", correlations),
                  ("dummies", dummies)]
    pipeline_feat = Pipeline(steps=steps_feat)
    X_train = pipeline_feat.fit_transform(X_train, y_train).reset_index(drop=True)
    X_test = pipeline_feat.transform(X_test).reset_index(drop=True)
    finish_time = time.time()
    time_in_minutes = (finish_time - start_time) / 60
    return (folder, index, X_train.copy(True), y_train.copy(True).values, X_test.copy(True), y_test.copy(True).values,
            time_in_minutes)

We can clearly see which stages are in my preprocessing pipeline: "clear_non_variance", "imputer", "outliers", "scaling", "categorize", "correlations", "dummies". Each one was created to enrich the basic transformers with new capabilities, which you can view at the GitHub link we provided before. The code shown is part of the "g-stat" and "yoss the boss of data" project, which is open source; if you wish to be part of our community and contribute some of your knowledge and experience, feel free to join us. Furthermore, you can download our package using pip with this command: pip install gstatyoss. To become part of our community, please email me:
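The key pattern in the snippet above, fit_transform on the training set and transform only on the test set, can be sketched with built-in transformers and toy data (hypothetical values, not the project's dataset):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"age": [20.0, None, 40.0]})
X_test = pd.DataFrame({"age": [None, 30.0]})

pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                       ("scaling", StandardScaler())])

X_train_t = pipe.fit_transform(X_train)  # statistics are learned on train only
X_test_t = pipe.transform(X_test)        # the same statistics are reused here
print(X_test_t)
```

Because the pipeline is fit once on the training data and then only applied to the test data, no information from the test set leaks into the learned statistics.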

I hope you will see how deep this rabbit hole goes by starting to explore the transformers that exist in scikit-learn. Next time we shall see how to do something similar using PySpark. Yours, Yossi.
