
[Sklearn] Dealing with Categorical Variables : Encoders

1. What are Categorical Variables?

A categorical variable takes only a limited number of values. Consider a survey that asks how often you eat breakfast and provides four options : "Never", "Rarely", "Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of categories. If people responded to a survey about which brand of car they owned, the responses would fall into categories like "Kia", "Hyundai", and "BMW". In this case, the data is also categorical.

 

2. Approaches for dealing with categorical variables

Drop Categorical Variables

The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This approach will only work well if the dropped columns do not contain useful information.
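The code in this post scores each approach with a score_dataset helper that is never defined here. A minimal sketch of such a helper, assuming it fits a RandomForestRegressor and returns the validation mean absolute error (the exact model and metric are assumptions):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Assumed helper: fit a random forest and return the MAE on the validation set
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)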

 

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('../../KAGGLE/Kaggle_House_Price/train.csv')

# Separate target from predictors 
y = data.SalePrice
X = data.drop(['SalePrice'], axis = 1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()] 
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

# Score from Approach 1 (Drop Categorical Variables)
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

 

Ordinal Encoding

Ordinal encoding assigns each unique value to a different integer. This approach assumes an ordering of the categories : "Never" < "Rarely" < "Most days" < "Every day" maps to 0 < 1 < 2 < 3. This assumption makes sense in this example, because there is an indisputable ranking to the categories. Not all categorical variables have a clear ordering in their values, but we refer to those that do as ordinal variables. For tree-based models, we can expect ordinal encoding to work well with ordinal variables.
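By default, OrdinalEncoder assigns the integers in sorted (alphabetical) order of the category values, so for a genuinely ordinal variable the intended order should be passed explicitly. A minimal sketch using the breakfast survey above (the column name and toy data are made up for illustration):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data for the survey example
df = pd.DataFrame({"Breakfast": ["Never", "Every day", "Rarely", "Most days"]})

# Pass the categories explicitly so the integers respect the real ordering
encoder = OrdinalEncoder(categories=[["Never", "Rarely", "Most days", "Every day"]])
df["Breakfast_encoded"] = encoder.fit_transform(df[["Breakfast"]])
# "Never" -> 0.0, "Rarely" -> 1.0, "Most days" -> 2.0, "Every day" -> 3.0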

 

One-Hot Encoding

One-hot encoding creates new columns indicating the presence of each possible value in the original data. Suppose that in the original dataset, "Color" is a categorical variable with three categories : "Red", "Yellow", and "Green". The corresponding one-hot encoding contains one column for each possible value and one row for each row in the original dataset. Wherever the original value was "Red", we put a 1 in the "Red" column; wherever it was "Yellow", we put a 1 in the "Yellow" column, and so on.
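A quick illustration of that layout, using pandas.get_dummies (covered in more detail at the end of this post):

import pandas as pd

# Toy version of the "Color" example above
df = pd.DataFrame({"Color": ["Red", "Yellow", "Green", "Red"]})
print(pd.get_dummies(df["Color"]))
#    Green  Red  Yellow
# 0      0    1       0
# 1      0    0       1
# 2      1    0       0
# 3      0    1       0
# (recent pandas versions print True/False instead of 1/0)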

 

In contrast to ordinal encoding, one-hot encoding does not assume an ordering of the categories. Thus we can expect this approach to work particularly well if there is no clear ordering in the categorical data.

 

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values; we generally won't use it for variables taking more than 15 different values.
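Before choosing an approach, it helps to check how many unique values (the cardinality) each categorical column actually takes. A small sketch, reusing the X_train built above:

# Count the unique values in each categorical column, highest first
object_cols = X_train.select_dtypes(include=['object']).columns
print(X_train[object_cols].nunique().sort_values(ascending=False))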

 

3. How to use Encoders

Ordinal Encoder

Scikit-learn provides the OrdinalEncoder class to encode categorical features as an integer array.

 

OrdinalEncoder

  • sklearn.preprocessing.OrdinalEncoder(*, categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='error', unknown_value=None, encoded_missing_value=nan)
  • Parameters
    • categories : Categories per feature.
    • dtype : Desired dtype of output.
    • handle_unknown : When set to 'error', an error will be raised if an unknown categorical feature is present during transform.
  • Methods
    • fit(X) : Fit the OrdinalEncoder to X.
    • fit_transform(X) : Fit to data, then transform it.
    • transform(X) : Transform X to ordinal codes.

 

# Score from Approach (Ordinal Encoding)
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
# (with the default handle_unknown='error', transform() will fail if the
#  validation data contains a category never seen during training)
ordinal_encoder = OrdinalEncoder()
object_cols = X_train.select_dtypes(include=['object']).columns
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach (Ordinal Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

 

Label Encoder

Scikit-learn provides the LabelEncoder class to encode target labels with values between 0 and n_classes-1. It is meant for the target variable y rather than the input features, and it only accepts a single 1-D array at a time.
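For its intended use, LabelEncoder is fit on the target values directly. A minimal sketch (the city labels are made up for illustration):

from sklearn.preprocessing import LabelEncoder

# Encode string targets as integers 0..n_classes-1 (assigned in sorted order)
le = LabelEncoder()
y_encoded = le.fit_transform(["paris", "paris", "tokyo", "amsterdam"])
# y_encoded is array([1, 1, 2, 0]); le.classes_ is ['amsterdam', 'paris', 'tokyo']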

 

Label Encoder

  • sklearn.preprocessing.LabelEncoder()
  • Methods
    • fit(y) : Fit label encoder.
    • fit_transform(y) : Fit label encoder and return encoded labels.
    • transform(y) : Transform labels to normalized encoding.

 

# Score from Approach (Label Encoding)
from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# LabelEncoder only accepts a single 1-D array, so fit a separate encoder
# to each column with categorical data
object_cols = X_train.select_dtypes(include=['object']).columns
for col in object_cols:
    label_encoder = LabelEncoder()
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

print("MAE from Approach (Label Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

 

One-Hot Encoder

Scikit-learn provides the OneHotEncoder class to encode categorical features as a one-hot numeric array. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical features.

 

OneHotEncoder

  • sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, ...)
  • Parameters
    • categories : Categories per feature.
    • drop : {'first', 'if_binary'}, Specifies a methodology to use to drop one of the categories per feature.
  • Methods
    • fit(X) : Fit the OneHotEncoder to X.
    • fit_transform(X) : Fit to data, then transform it.
    • transform(X) : Transform X using one-hot encoding.

 

from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# (handle_unknown='ignore' encodes categories unseen during fit as all zeros;
#  in scikit-learn >= 1.2 the sparse parameter is named sparse_output)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Make sure all columns have string names (newer scikit-learn versions
# reject mixed string/integer column names)
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print("MAE from Approach (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

 

 

Pandas.get_dummies()

Pandas provides the pandas.get_dummies function to convert categorical variables into dummy/indicator variables.

 

get_dummies

  • pandas.get_dummies(data, prefix=None, ...)
  • Parameters
    • data : Data of which to get dummy indicators
    • columns : Column names in the DataFrame to be encoded.

 

import pandas as pd 

# Apply pandas.get_dummies to each column with categorical data
X_obj_train = pd.get_dummies(X_train[object_cols])
X_obj_valid = pd.get_dummies(X_valid[object_cols])

# Remove categorical columns (will replace with dummies)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
X_train = pd.concat([num_X_train, X_obj_train], axis=1)
X_valid = pd.concat([num_X_valid, X_obj_valid], axis=1)

# get_dummies builds columns from the values it sees, so the train and
# validation frames can end up with different columns; align on the
# training columns, filling categories missing from validation with 0
X_train, X_valid = X_train.align(X_valid, join='left', axis=1, fill_value=0)

print("MAE from Approach (pandas.get_dummies):") 
print(score_dataset(X_train, X_valid, y_train, y_valid))

 

 

Source from :

 
