1. What is Cross Validation?
We want to evaluate almost every model we ever build. In most applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?
Many people make a huge mistake when measuring predictive accuracy: they make predictions with their training data and compare those predictions to the target values in that same training data. There are many metrics for summarizing model quality; let's consider one called Mean Absolute Error (MAE). With the MAE metric, we take the absolute value of each error and then average those absolute errors. The code for calculating MAE is as follows:
from sklearn.metrics import mean_absolute_error
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
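For context, here is a minimal sketch of how a model like melbourne_model might be built before computing the in-sample MAE above. The DecisionTreeRegressor, the melb_data.csv file, and the chosen feature columns are assumptions for illustration, not part of the original snippet.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical dataset and columns; any table with a numeric target works.
melbourne_data = pd.read_csv("melb_data.csv").dropna(axis=0)
y = melbourne_data["Price"]
X = melbourne_data[["Rooms", "Bathroom", "Landsize"]]

melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)

# In-sample MAE: predictions are compared against the very data used for fitting.
predicted_home_prices = melbourne_model.predict(X)
print(mean_absolute_error(y, predicted_home_prices))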
2. The Problem of In-Sample Scores
The measure we computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.
Since a model's practical value comes from making predictions on new data, we should measure performance on data that wasn't used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that excluded data to test the model's accuracy on data it hasn't seen before. This data is called validation data.
3. How to Make Validation Data?
Train-Test Split
The scikit-learn library has a function train_test_split to break up the data into two pieces. We'll use some of that data as training data to fit the model, and use the other data as validation data to calculate mean_absolute_error.
train_test_split
- sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True)
- Parameters
- *arrays : sequence of indexables with same length
- test_size : If float, the proportion of the dataset to include in the test split; if int, the absolute number of test samples
- train_size : If float, the proportion of the dataset to include in the train split; if int, the absolute number of train samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)
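Putting the split to use, here is a minimal sketch, again assuming the hypothetical DecisionTreeRegressor from above: fit on the training piece only, then compute MAE on the held-out piece.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=42)

# Fit on the training data only...
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)

# ...and evaluate on data the model has never seen.
val_predictions = model.predict(X_test)
print(mean_absolute_error(y_test, val_predictions))

The validation MAE is typically noticeably worse than the in-sample MAE from section 1; that gap is exactly what the in-sample score hides.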
KFold
The scikit-learn library has a class KFold that provides train/test indices to split data into train/test sets. It splits the dataset into k consecutive folds; each fold is then used once as validation while the k-1 remaining folds form the training set.
KFold
- sklearn.model_selection.KFold(n_splits=5, *, shuffle=False, random_state=None)
- Parameters
- n_splits : Number of folds
- random_state : When shuffle is True, affects the ordering of the indices
- Methods
- get_n_splits : Returns the number of splitting iterations in the cross-validator
- split : Generate indices to split data into training and test sets
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf.get_n_splits(X)
# Result will be 3
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
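Rather than hand-rolling the loop, the same evaluation can be done with scikit-learn's cross_val_score. This is a sketch assuming the hypothetical regression model from earlier and numeric arrays X and y.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

kf = KFold(n_splits=3, shuffle=True, random_state=42)

# scikit-learn scorers follow a "higher is better" convention, so MAE is negated.
scores = cross_val_score(DecisionTreeRegressor(random_state=1), X, y,
                         scoring="neg_mean_absolute_error", cv=kf)
print(-scores)          # per-fold MAE
print(-scores.mean())   # average MAE across the 3 folds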
StratifiedKFold
The scikit-learn library has a class StratifiedKFold that provides train/test indices to split data into train/test sets. This cross-validation object is a variation of KFold that returns stratified folds: the folds are made by preserving the percentage of samples for each class.
StratifiedKFold
- sklearn.model_selection.StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)
- Parameters
- n_splits : Number of folds
- random_state : When shuffle is True, affects the ordering of the indices
- Methods
- get_n_splits : Returns the number of splitting iterations in the cross-validator
- split : Generate indices to split data into training and test sets; unlike KFold, stratification requires the class labels y
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=3)
skf.get_n_splits(X)
# Result will be 3
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
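To see the stratification at work, here is a small self-contained sketch; the toy labels are made up for illustration.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 9 samples, two classes in a 2:1 ratio (six 0s, three 1s).
X_toy = np.arange(18).reshape(9, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

skf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X_toy, y_toy):
    # Every test fold keeps the 2:1 class ratio: two 0s and one 1.
    print(np.bincount(y_toy[test_index]))   # -> [2 1] for each fold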