[Models] How to Make a Model

Note : This is just a small subset of the full modeling workflow. We must first understand the domain knowledge behind our training dataset and carry out statistical analysis.

 

Step 1 : Selecting data for modeling

We need to start by picking a few variables using our intuition. To choose variables/columns, we need to see a list of all columns in the dataset. That is done with the columns property of the DataFrame.

 

import pandas as pd 

melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns

# The Melbourne data has some missing values (some houses for which some variables weren't recorded.)

# dropna drops missing values
melbourne_data = melbourne_data.dropna(axis = 0)
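To see what dropna(axis = 0) actually removes, here is a minimal sketch on a tiny made-up DataFrame. The name toy and its values are hypothetical stand-ins, not rows from melb_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for melb_data.csv
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [500000.0, np.nan, 900000.0],
})

# List the column names, as with melbourne_data.columns
print(list(toy.columns))

# Count the missing values in each column
print(toy.isnull().sum())

# axis=0 drops every row that contains at least one missing value
cleaned = toy.dropna(axis=0)
print(cleaned.shape)
```

Note that dropping rows this way can discard a lot of data when many columns have gaps, so it is worth checking the shape before and after.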

 

Step 2 : Selecting the prediction target

We can pull out a variable with dot-notation. The single column is stored in a Series object.

 

y = melbourne_data.Price
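As a quick check, the sketch below compares dot-notation with bracket notation on a made-up DataFrame (toy is a hypothetical name). Both return the same Series. Bracket notation is the safer choice when a column name contains spaces or clashes with a DataFrame attribute.

```python
import pandas as pd

# Hypothetical toy data, not the Melbourne dataset
toy = pd.DataFrame({"Rooms": [2, 3], "Price": [500000, 750000]})

y_dot = toy.Price          # dot-notation
y_bracket = toy["Price"]   # bracket notation, works for any column name

print(type(y_dot).__name__)
print(y_dot.equals(y_bracket))
```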

 

Step 3 : Choosing proper features

The columns passed to our model as inputs are called 'features'. Sometimes we use all columns except the target as features; other times we use fewer. By convention, this data is called X.

 

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

X = melbourne_data[melbourne_features]
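The two cases above (a hand-picked subset, or everything except the target) can be sketched as follows. The DataFrame toy and its values are made up for illustration. head() and describe() are a quick way to review the features before modeling.

```python
import pandas as pd

# Hypothetical toy data, not the Melbourne dataset
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1, 2, 2],
    "Price": [500000, 750000, 900000],
})

# Fewer features: select with a list of column names
X_some = toy[["Rooms"]]

# All columns except the target
X_all = toy.drop(columns=["Price"])

# Quick review of the selected features
print(X_all.head())
print(X_all.describe())
```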

 

Step 4 : Building our models

We use the scikit-learn library to create our models. In code, this library is written as sklearn. The steps for building and using a model are as follows :

  • Define : What type of model will it be? Which parameters need to be specified?
  • Fit : Capture patterns from provided data.
  • Predict : Just what it sounds like.
  • Evaluate : Determine how accurate the model's predictions are.

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures we get the same results in each run.
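A small sketch of the reproducibility claim, using synthetic data (X_toy, y_toy and X_new are hypothetical and generated with NumPy). Two trees fitted with the same random_state on the same data should give identical predictions. The sketch only checks that direction; it does not claim that different seeds must give different trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_toy = rng.random((50, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=50)
X_new = rng.random((10, 3))

# Same data, same random_state
model_a = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)
model_b = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)

# Compare predictions on unseen points
same = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print(same)
```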

 

# Modeling 
from sklearn.tree import DecisionTreeRegressor

## Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

## Fit model
melbourne_model.fit(X, y)
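The code above covers Define and Fit. The remaining two steps, Predict and Evaluate, might look like the sketch below. It runs on synthetic data (toy, X_toy and y_toy are hypothetical), and mean absolute error is one possible metric, chosen here as an assumption rather than something the original specifies.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the Melbourne dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=100),
    "Landsize": rng.uniform(100, 1000, size=100),
})
toy["Price"] = (
    100_000 * toy["Rooms"]
    + 500 * toy["Landsize"]
    + rng.normal(0, 20_000, size=100)
)

X_toy = toy[["Rooms", "Landsize"]]
y_toy = toy["Price"]

# Define and fit
toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(X_toy, y_toy)

# Predict : prices for the first five rows
first_five = toy_model.predict(X_toy.head())
print(first_five)

# Evaluate : mean absolute error on the training data
mae = mean_absolute_error(y_toy, toy_model.predict(X_toy))
print(mae)
```

Because the error here is measured on the same rows the model was fitted on, it will look better than the error on houses the model has never seen.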

 

 

Source : https://www.kaggle.com/learn
