
[Models] Underfitting and Overfitting

Note: This is just a tiny subset of the full modeling workflow. We must understand the domain knowledge behind our training datasets and do statistical analysis first.

 

1. Underfitting and Overfitting Problems

While we do modeling, we need a reliable way to measure model accuracy. Using such a metric, we can experiment with alternative models and see which one gives the best predictions.
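
For example, a common metric is mean absolute error (MAE). A minimal sketch of computing it with scikit-learn (the prices below are made-up placeholder numbers):

from sklearn.metrics import mean_absolute_error

y_true = [200000, 150000, 320000]   # hypothetical house prices
y_pred = [210000, 140000, 300000]   # hypothetical predictions
print(mean_absolute_error(y_true, y_pred))   # (10000 + 10000 + 20000) / 3 = 13333.33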

Overfitting is the problem where a model matches the training data almost perfectly but does poorly on validation data and other new data. On the other side, underfitting is the problem where a model fails to capture important distinctions and patterns in the data.
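
As a rough illustration of how both problems show up in the numbers, here is a minimal sketch, assuming the X_train/X_valid split created in section 3 below (the depth values are arbitrary):

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

for depth in [1, 30]:   # a very shallow tree vs. a very deep tree
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    valid_mae = mean_absolute_error(y_valid, model.predict(X_valid))
    # the depth-1 tree underfits: high error on both sets;
    # the depth-30 tree overfits: near-zero training error but worse validation error
    print(depth, train_mae, valid_mae)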

 

2. How to solve these problems?

For an underfitting problem, we can choose a more flexible, more complicated model (this means the model has more variance), and we can also tune hyperparameters by increasing the degrees of freedom and other capacity parameters. For an overfitting problem, we can choose a more restrictive, simpler model (this means the model has more bias), and we can tune hyperparameters by decreasing the degrees of freedom and other capacity parameters.

In scikit-learn, the GridSearchCV function helps with this work. We must apply our metric to validation data, not to training data.
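
Here is a minimal sketch of GridSearchCV with the same decision tree and candidate values used in section 3; note that GridSearchCV cross-validates inside the training data rather than using our explicit validation split:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {'max_leaf_nodes': [5, 50, 500, 5000]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid,
                      scoring='neg_mean_absolute_error',  # scikit-learn maximizes scores, so MAE is negated
                      cv=5)
search.fit(X_train, y_train)
print(search.best_params_)   # the max_leaf_nodes value with the best cross-validated MAE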

 

3. Code

# Machine learning workflow

# Import libraries 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error 

# Read the data
X_raw = pd.read_csv('../../KAGGLE/Kaggle_House_Price/train.csv', index_col='Id')
X_test_raw = pd.read_csv('../../KAGGLE/Kaggle_House_Price/test.csv', index_col='Id')

# Obtain target and predictors
y = X_raw.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_raw[features].copy()
X_test = X_test_raw[features].copy()

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)
                                                      
# Helper function to compare candidate values of max_leaf_nodes
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Make best model 
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, X_train, X_valid, y_train, y_valid)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

"""
The result will be as follows:

Max leaf nodes: 5            Mean Absolute Error:  347380
Max leaf nodes: 50           Mean Absolute Error:  258171
Max leaf nodes: 500          Mean Absolute Error:  243495
Max leaf nodes: 5000         Mean Absolute Error:  254983
"""

# Define best model 
best_model = DecisionTreeRegressor(max_leaf_nodes=500, random_state=0) 

# Fit the final model on all of the training data (train + validation)
best_model.fit(X, y)

# Generate test predictions
preds_test = best_model.predict(X_test)

# Save predictions in format used for competition scoring
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

 

 

Source: https://www.kaggle.com/learn
