Note : This is just a tiny subset of the full modeling workflow. We must first understand the domain knowledge behind our training dataset and carry out statistical analysis.
Step 1 : Selecting data for modeling
We start by picking a few variables using our intuition. To choose variables/columns, we need to see a list of all columns in the dataset. That is done with the columns property of the DataFrame.
import pandas as pd
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.columns
# The Melbourne data has some missing values (some houses for which some variables weren't recorded).
# dropna(axis=0) drops every row that contains at least one missing value
melbourne_data = melbourne_data.dropna(axis = 0)
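To see what dropna(axis=0) does, here is a minimal sketch on a small hand-made DataFrame. The toy values and the names toy, missing_per_column and toy_clean are made up for illustration and are not taken from the Melbourne dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the real Melbourne file): two rows contain a missing value
toy = pd.DataFrame({
    'Rooms': [2, 3, 4, 3],
    'Price': [500000.0, np.nan, 900000.0, 750000.0],
    'Landsize': [150.0, 200.0, np.nan, 320.0],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# axis=0 drops every ROW that has at least one missing value
toy_clean = toy.dropna(axis=0)
print(toy.shape, '->', toy_clean.shape)
```

Dropping rows is the simplest way to handle missing values, but it can discard a lot of data, so comparing the shape before and after is a useful sanity check.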
Step 2 : Selecting the prediction target
We can pull out a variable with dot-notation. The single column is stored in a Series object.
y = melbourne_data.Price
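As a small sketch of column selection, the example below compares dot-notation with bracket notation on made-up toy data (the values and variable names are assumptions for illustration). Dot-notation only works when the column name is a valid Python identifier that does not clash with a DataFrame attribute, while bracket notation works for any column name.

```python
import pandas as pd

# Hypothetical toy data, not the real Melbourne file
toy = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Price': [500000.0, 650000.0, 900000.0],
})

# Both forms return the same single column as a pandas Series
y_dot = toy.Price
y_bracket = toy['Price']

print(type(y_dot))
print(y_dot.equals(y_bracket))
```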
Step 3 : Choosing proper features
The columns passed to our model as inputs are called 'features'. Sometimes we use every column except the target as features; other times we use only a subset of them. By convention, the feature data is called X.
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
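Before modeling, it usually helps to take a quick look at the selected features. The following is a minimal sketch on made-up toy data (toy, toy_features and X_toy are hypothetical names); it shows that selecting with a list of column names returns a DataFrame, which can then be inspected with describe() and head().

```python
import pandas as pd

# Hypothetical toy data, not the real Melbourne file
toy = pd.DataFrame({
    'Rooms': [2, 3, 4, 3],
    'Bathroom': [1.0, 2.0, 2.0, 1.0],
    'Landsize': [150.0, 200.0, 450.0, 320.0],
    'Price': [500000.0, 650000.0, 900000.0, 750000.0],
})

# Selecting with a list of names keeps only those columns, as a DataFrame
toy_features = ['Rooms', 'Bathroom', 'Landsize']
X_toy = toy[toy_features]

# Summary statistics (count, mean, std, min, quartiles, max) per feature
print(X_toy.describe())

# First few rows of the feature table
print(X_toy.head())
```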
Step 4 : Building our models
We use the scikit-learn library to create our models. In code, this library is written as sklearn. The steps to build and use a model are as follows :
- Define : What type of model will it be? Which parameters need to be specified?
- Fit : Capture patterns from provided data.
- Predict : Just what it sounds like.
- Evaluate : Determine how accurate the model's predictions are.
Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures that we get the same results in each run.
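The effect of random_state can be checked with a small sketch using scikit-learn's DecisionTreeRegressor, the model type used in this post. The toy arrays below are made up; the two feature columns are deliberately identical so that splits on either feature are tied, which is one place where a decision tree's randomness can show up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: both feature columns are identical, so candidate splits are tied
X_toy = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

# Two models built with the same random_state on the same data
model_a = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)
model_b = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)

# Same seed: same predictions and the same feature chosen at every node
same_predictions = np.array_equal(model_a.predict(X_toy), model_b.predict(X_toy))
same_split_features = np.array_equal(model_a.tree_.feature, model_b.tree_.feature)
print(same_predictions, same_split_features)
```

The particular number passed to random_state does not matter for model quality; what matters is using the same number whenever results need to be comparable across runs.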
# Modeling
from sklearn.tree import DecisionTreeRegressor
## Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)
## Fit model
melbourne_model.fit(X, y)
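The code above covers the Define and Fit steps. The remaining Predict and Evaluate steps could look like the sketch below, which uses made-up toy data and mean_absolute_error from sklearn.metrics as one possible accuracy measure (the choice of metric is an assumption, not something fixed by this post).

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data, not the real Melbourne file
toy = pd.DataFrame({
    'Rooms': [2, 3, 4, 3, 5],
    'Landsize': [150.0, 200.0, 450.0, 320.0, 600.0],
    'Price': [500000.0, 650000.0, 900000.0, 750000.0, 1200000.0],
})
X_toy = toy[['Rooms', 'Landsize']]
y_toy = toy.Price

# Define and fit, as in the steps above
toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(X_toy, y_toy)

# Predict: one predicted price per input row
predictions = toy_model.predict(X_toy)
print(predictions)

# Evaluate: mean absolute error between true and predicted prices
mae = mean_absolute_error(y_toy, predictions)
print(mae)
```

Note that this evaluates the model on the same rows it was trained on, so the error is overly optimistic; a fair estimate requires data the model has not seen during fitting.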
Source : https://www.kaggle.com/learn