[Sklearn] Transforming Columns by Their Type

1. What is ColumnTransformer?

The scikit-learn library provides a class called 'ColumnTransformer' in the sklearn.compose module. It applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately; the features generated by each transformer are then concatenated to form a single feature space. This is useful for heterogeneous or columnar data, where several feature extraction mechanisms or transformations need to be combined into a single transformer.
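To make the "transform separately, then concatenate" idea concrete, here is a minimal sketch. The toy DataFrame and its column names ("age", "city") are made up for illustration and do not come from any real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column and one categorical column
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "city": ["Seoul", "Busan", "Seoul"],
})

ct = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["age"]),   # standardize the numeric column
        ("onehot", OneHotEncoder(), ["city"]),  # one-hot encode the categorical column
    ]
)

features = ct.fit_transform(df)
# The stacked result may be sparse or dense; convert to a dense array either way
features = features.toarray() if hasattr(features, "toarray") else np.asarray(features)

# One scaled column followed by two one-hot columns (Busan, Seoul)
print(features.shape)  # (3, 3)
```

The output columns follow the order of the transformers list, so the scaled "age" column comes first and the one-hot columns come after it.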

 

2. How to use ColumnTransformer?

ColumnTransformer

  • sklearn.compose.ColumnTransformer(transformers, *, remainder='drop', ...)
  • Parameters
    • transformers : list of (name, transformer, columns) tuples
      • name : Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using set_params and searched in grid search.
      • transformer : Estimator that must support fit and transform (e.g. MinMaxScaler, StandardScaler, OneHotEncoder, ...). The special strings 'drop' and 'passthrough' are also accepted.
      • columns : Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name.
    • remainder : What to do with columns that are not listed in transformers. The default 'drop' discards them; 'passthrough' keeps them unchanged.
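The parameters above can be sketched on a plain NumPy array. The array values and the transformer name 'minmax' are assumptions chosen for illustration; the example shows positional column indexes, the effect of remainder, and how the name is used with set_params.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical array: columns 0 and 1 are scaled, column 2 is not listed
X = np.array([
    [0.0, 10.0, 7.0],
    [5.0, 20.0, 8.0],
    [10.0, 30.0, 9.0],
])

ct = ColumnTransformer(
    transformers=[("minmax", MinMaxScaler(), [0, 1])],  # integers = positional columns
    remainder="passthrough",  # the default 'drop' would discard column 2
)
out = ct.fit_transform(X)
print(out)

# The name 'minmax' is the prefix for the nested transformer's parameters
ct.set_params(minmax__feature_range=(-1, 1))
out_rescaled = ct.fit_transform(X)
print(out_rescaled)
```

Passed-through columns are appended after the transformed ones, so column 2 ends up last in both outputs. The same `name__parameter` syntax is what a grid search would use as keys in its parameter grid.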

 

# Import libraries
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# X_train_full is assumed to be a pandas DataFrame holding the training features

# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns
                    if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == 'object']

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns
                  if X_train_full[cname].dtype in ['int64', 'float64']]

# Preprocessing for numerical data: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data: impute the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols),
    ])
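As a rough sketch of how such a preprocessor might behave end to end, the block below rebuilds it on a tiny made-up DataFrame. The data, the column names ("rooms", "area", "type"), and the explicit column lists are hypothetical stand-ins for X_train_full and the dtype-based selection above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X_train_full, with missing values in both column types
X_train_full = pd.DataFrame({
    "rooms": [3.0, np.nan, 2.0, 4.0],
    "area": [70.0, 55.0, np.nan, 90.0],
    "type": pd.Series(["flat", "house", np.nan, "flat"], dtype="object"),
})
numerical_cols = ["rooms", "area"]   # listed by hand here for clarity
categorical_cols = ["type"]

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), numerical_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

def to_dense(a):
    """Return a dense NumPy array whether the input is sparse or dense."""
    return a.toarray() if hasattr(a, "toarray") else np.asarray(a)

# Learn the imputation values and categories from the training data
X_train_prepared = to_dense(preprocessor.fit_transform(X_train_full))
print(X_train_prepared.shape)  # (4, 4): 2 numeric + 2 one-hot columns

# New data with a missing number and a category never seen during fit
X_new = pd.DataFrame({
    "rooms": [3.0],
    "area": [np.nan],
    "type": pd.Series(["castle"], dtype="object"),
})
X_new_prepared = to_dense(preprocessor.transform(X_new))
print(X_new_prepared)
```

Because of handle_unknown='ignore', the unseen category "castle" is encoded as all zeros instead of raising an error, and the missing "area" value is filled with the constant 0.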

 

 

Source: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html