1. What are Missing Values?
Missing data (or missing values) are defined as data values that are not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data.
There are three types of missing values according to the mechanisms of missingness.
- Missing completely at random(MCAR) : The probability that a value is missing is related neither to the specific value that would have been obtained nor to the set of observed responses.
- Missing at random(MAR) : The probability that a value is missing depends only on the observed data, not on the missing value itself. This is a more realistic assumption than MCAR in most applied studies.
- Missing not at random(MNAR) : If the characteristics of the data meet neither those of MCAR nor MAR, then they fall into the category of MNAR.
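Whatever the mechanism, the first practical step is usually to measure how much data is missing. A minimal sketch with pandas (the toy DataFrame and its column names are made up for illustration, not taken from the House Price dataset):

```python
import numpy as np
import pandas as pd

# Toy DataFrame with missing entries (illustrative data only)
df = pd.DataFrame({
    'LotArea': [8450, 9600, np.nan, 11250],
    'GarageYrBlt': [2003, np.nan, np.nan, 1998],
    'SalePrice': [208500, 181500, 223500, 140000],
})

# Count missing values per column
missing_per_col = df.isnull().sum()
print(missing_per_col)

# Fraction of missing values per column
print(df.isnull().mean())
```

`isnull()` returns a boolean mask, so `sum()` counts missing entries and `mean()` gives the missing fraction per column.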
2. Three Approaches for Dealing with Missing Values
A simple option : Drop columns with missing values
The simplest option is to drop columns that contain missing values. Unless most values in the dropped columns are missing, however, the model loses access to a lot of potentially useful information with this approach.
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the data
data = pd.read_csv('../../KAGGLE/Kaggle_House_Price/train.csv')
# Select target
y = data.SalePrice
# To keep things simple, we'll use only numerical predictors
predictors = data.drop(['SalePrice'], axis=1)
X = predictors.select_dtypes(exclude=['object'])
# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
# Drop columns
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
A better option : Imputation
Imputation fills in the missing values with some number. For instance, we can fill in the mean value along each column. The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.
An Extension to Imputation
Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values, or rows with missing values may be unique in some other way. In that case, the model can make better predictions by considering which values were originally missing. In this approach, we impute the missing values as before, and additionally, for each column with missing entries in the original dataset, we add a new column that indicates which entries were imputed.
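A minimal self-contained sketch of this extension (the toy DataFrame, its column name, and the `_was_missing` suffix are illustrative choices, not part of any library API):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training data with one column containing missing values (illustrative only)
X_train_toy = pd.DataFrame({'GarageYrBlt': [2000.0, np.nan, 1990.0, np.nan]})

cols_with_missing_toy = [c for c in X_train_toy.columns
                         if X_train_toy[c].isnull().any()]

# Work on a copy so the original frame is untouched
X_plus = X_train_toy.copy()

# For each column with missing entries, add a boolean indicator column
for col in cols_with_missing_toy:
    X_plus[col + '_was_missing'] = X_plus[col].isnull()

# Impute as usual; the indicator columns pass through (True/False become 1.0/0.0)
imputer = SimpleImputer()
imputed = pd.DataFrame(imputer.fit_transform(X_plus), columns=X_plus.columns)
print(imputed)
```

The model can now learn from both the imputed column (mean-filled here) and the indicator column marking which rows were originally missing.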
3. How to impute missing values?
SimpleImputer
The scikit-learn library provides the SimpleImputer class, which replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) computed along each column, or using a constant value.
There are other imputers, such as IterativeImputer and KNNImputer, which perform multivariate imputation of missing values. Check the scikit-learn docs.
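As a quick sketch of KNNImputer, which fills each missing value from the nearest rows measured on the features that are present (the toy array here is made up for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the last row is missing its first feature
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0]])

# Each missing value is replaced by the mean of that feature
# over the n_neighbors nearest rows (nan-aware Euclidean distance)
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(X)
print(filled)
```

With both other rows as the two nearest neighbors, the missing entry becomes the mean of their first feature.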
SimpleImputer
- sklearn.impute.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, ...)
- Parameters
- missing_values : The placeholder for the missing values
- strategy : The imputation strategy. There are four strategies: 'mean', 'median', 'most_frequent', and 'constant'.
- fill_value : When strategy is 'constant', fill_value is used to replace all occurrences of missing_values
- Methods
- fit(X) : Fit the imputer on X.
- fit_transform(X) : Fit to data, then transform it.
- transform(X) : Impute all missing values in X.
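The strategy and fill_value parameters can be sketched on a toy array (illustrative data, not from the House Price set):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# strategy='median' replaces NaNs with each column's median
median_imputer = SimpleImputer(strategy='median')
med = median_imputer.fit_transform(X)
print(med)

# strategy='constant' replaces every NaN with fill_value
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
const = constant_imputer.fit_transform(X)
print(const)
```

Note that the statistic is computed per column during fit, so the same learned values are reused when transform is later applied to validation data.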
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
Source from :