1. What is Cross Validation?
We want to evaluate almost every model we ever build. In most applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?
Many people make a huge mistake when measuring predictive accuracy: they make predictions with their training data and compare those predictions to the target values in that same training data. There are many metrics for summarizing model quality; let's consider one called Mean Absolute Error (MAE). With the MAE metric, we take the absolute value of each error and then average those absolute errors. The code for calculating MAE is as follows:
from sklearn.metrics import mean_absolute_error
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
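For context, here is a minimal sketch of how a model like melbourne_model might be built before computing the in-sample MAE above. The DecisionTreeRegressor, the melb_data.csv file, and the chosen feature columns are assumptions for illustration, not part of the original snippet.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical dataset and columns; any table with a numeric target works.
melbourne_data = pd.read_csv("melb_data.csv").dropna(axis=0)
y = melbourne_data["Price"]
X = melbourne_data[["Rooms", "Bathroom", "Landsize"]]

melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)

# In-sample MAE: predictions are compared against the very data used for fitting.
predicted_home_prices = melbourne_model.predict(X)
print(mean_absolute_error(y, predicted_home_prices))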
2. The Problem of In-Sample Scores
The measure we computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.
Since a model's practical value comes from making predictions on new data, we should measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that excluded data to test the model's accuracy on data it hasn't seen before. This data is called validation data.
3. How to Make Validation Data?
Train-Test Split
The scikit-learn library has a function train_test_split to break up the data into two pieces. We'll use some of that data as training data to fit the model, and use the other data as validation data to calculate mean_absolute_error.
train_test_split
- sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True)
- Parameters
- *arrays : sequence of indexables with same length
- test_size : If float, the proportion of the dataset to include in the test split; if int, the absolute number of test samples
- train_size : If float, the proportion of the dataset to include in the train split; if int, the absolute number of train samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)
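Putting the split to use, here is a minimal sketch, again assuming the hypothetical DecisionTreeRegressor from above: fit on the training piece only, then compute MAE on the held-out piece.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=42)

# Fit on the training data only...
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)

# ...and evaluate on data the model has never seen.
val_predictions = model.predict(X_test)
print(mean_absolute_error(y_test, val_predictions))

The validation MAE is typically noticeably worse than the in-sample MAE from section 1; that gap is exactly what the in-sample score hides.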
KFold
The scikit-learn library has a class KFold that provides train/test indices to split data into train/test sets. It splits the dataset into k consecutive folds; each fold is then used once as validation while the k-1 remaining folds form the training set.
KFold
- sklearn.model_selection.KFold(n_splits=5, *, shuffle=False, random_state=None)
- Parameters
- n_splits : Number of folds
- random_state : When shuffle is True, affects the ordering of the indices
- Methods
- get_n_splits : Returns the number of splitting iterations in the cross-validator
- split : Generate indices to split data into training and test sets
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf.get_n_splits(X)
# Result will be 3
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
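Rather than hand-rolling the loop, the same evaluation can be done with scikit-learn's cross_val_score. This is a sketch assuming the hypothetical regression model from earlier and numeric arrays X and y.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

kf = KFold(n_splits=3, shuffle=True, random_state=42)

# scikit-learn scorers follow a "higher is better" convention, so MAE is negated.
scores = cross_val_score(DecisionTreeRegressor(random_state=1), X, y,
                         scoring="neg_mean_absolute_error", cv=kf)
print(-scores)          # per-fold MAE
print(-scores.mean())   # average MAE across the 3 folds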
StratifiedKFold
The scikit-learn library has a class StratifiedKFold that provides train/test indices to split data into train/test sets. This cross-validation object is a variation of KFold that returns stratified folds: the folds are made by preserving the percentage of samples for each class.
StratifiedKFold
- sklearn.model_selection.StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)
- Parameters
- n_splits : Number of folds
- random_state : When shuffle is True, affects the ordering of the indices
- Methods
- get_n_splits : Returns the number of splitting iterations in the cross-validator
- split : Generate indices to split data into training and test sets; unlike KFold, stratification requires the class labels y
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=3)
skf.get_n_splits(X)
# Result will be 3
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
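To see the stratification at work, here is a small self-contained sketch; the toy labels are made up for illustration.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 9 samples, two classes in a 2:1 ratio (six 0s, three 1s).
X_toy = np.arange(18).reshape(9, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

skf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X_toy, y_toy):
    # Every test fold keeps the 2:1 class ratio: two 0s and one 1.
    print(np.bincount(y_toy[test_index]))   # -> [2 1] for each fold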