
[Theorem] Validation Sets

1. Deciding What to Try Next

In machine learning, a model can still produce large errors even when the algorithm is implemented correctly. If so, what should we try next? There are several options:

 

  1. Get more training examples 
  2. Try smaller sets of features 
  3. Try getting additional features 
  4. Try adding polynomial features
  5. Try decreasing lambda 
  6. Try increasing lambda 

How do we choose among the options above? We need to evaluate the learning algorithm by running machine learning diagnostics.

 

2. Evaluating a Hypothesis

2.1 Linear Regression 

When linear regression overfits, it is hard to tell which features to fix or remove once there are many of them. So we divide the examples, roughly 70% versus 30%, into a training set and a test set, and evaluate the hypothesis with the two sets as follows:

 

For example, when the model overfits, the training error \(J(\theta)\) is low while the test error \(J_{test}(\theta)\) is high.

 

  • Training set (70%) : \(\begin{bmatrix}(x^{(1)},y^{(1)})\\ \vdots \\(x^{(m)},y^{(m)})\end{bmatrix}\)
  • Test set (30%) : \(\begin{bmatrix}(x_{test}^{(1)},y_{test}^{(1)})\\ \vdots \\(x_{test}^{(m_{test})},y_{test}^{(m_{test})})\end{bmatrix}\)

 

$$ J(\theta )=\frac{1}{2m}\sum _{i=1}^m(h_{\theta }(x^{(i)})-y^{(i)})^2 $$

$$ J_{test}(\theta )=\frac{1}{2m_{test}}\sum _{i=1}^{m_{test}}(h_{\theta }(x_{test}^{(i)})-y_{test}^{(i)})^2 $$
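To make this concrete, here is a minimal NumPy sketch of the 70%/30% procedure. The synthetic data, the split code, and the helper name `j_mse` are ours, purely for illustration: fit \(\theta\) on the training split only, then compare \(J(\theta)\) and \(J_{test}(\theta)\).

```python
import numpy as np

def j_mse(theta, X, y):
    """Squared-error cost J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(y)
    return np.sum((X @ theta - y) ** 2) / (2 * m)

# Synthetic data for illustration: y is roughly linear in x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 1.0 + rng.normal(0, 1, size=100)

# Shuffle, then split 70% / 30% into training and test indices.
idx = rng.permutation(len(x))
split = int(0.7 * len(x))
train, test = idx[:split], idx[split:]

# Design matrix with a bias column; fit theta on the training set only.
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

print("J_train =", j_mse(theta, X[train], y[train]))
print("J_test  =", j_mse(theta, X[test], y[test]))
```

With an overfit model, the first number would be small and the second large, which is exactly the symptom described above.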

 

2.2 Logistic Regression 

In logistic regression, there are two things to measure. One is the overfitting problem, and the other is the misclassification error.

 

Overfitting is handled the same way as in linear regression, by dividing the examples into a training set and a test set. The misclassification error is obtained as the average 0/1 error over the test set.

 

$$ J(\theta )=-\frac{1}{m}\sum _{i=1}^m\left[y^{(i)}\log h_{\theta }(x^{(i)})+(1-y^{(i)})\log \left(1-h_{\theta }(x^{(i)})\right)\right] $$

$$ J_{test}(\theta )=-\frac{1}{m_{test}}\sum _{i=1}^{m_{test}}\left[y_{test}^{(i)}\log h_{\theta }(x_{test}^{(i)})+(1-y_{test}^{(i)})\log \left(1-h_{\theta }(x_{test}^{(i)})\right)\right] $$

$$ \text{Test error}=\frac{1}{m_{test}}\sum _{i=1}^{m_{test}}err\left(h_{\theta }(x_{test}^{(i)}),\ y_{test}^{(i)}\right) $$

where \(err(h_{\theta }(x),y)=1\) if the thresholded prediction disagrees with the label (i.e. \(h_{\theta }(x)\geq 0.5\) and \(y=0\), or \(h_{\theta }(x)<0.5\) and \(y=1\)), and \(0\) otherwise.
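A minimal sketch of this 0/1 error in NumPy; the sigmoid outputs and labels below are hypothetical values we made up for illustration:

```python
import numpy as np

def misclassification_error(h, y):
    """Average 0/1 error: err = 1 when the 0.5-thresholded
    prediction disagrees with the true label, else 0."""
    predictions = (h >= 0.5).astype(int)
    return np.mean(predictions != y)

# Hypothetical sigmoid outputs h_theta(x_test) and true test labels.
h_test = np.array([0.9, 0.3, 0.6, 0.2, 0.8])
y_test = np.array([1, 0, 0, 1, 1])

print(misclassification_error(h_test, y_test))  # 2 wrong out of 5 -> 0.4
```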

 

3. Model Selection

In polynomial regression we must also choose the degree \(d\); if we pick \(d\) using the test set, the test error is no longer a fair estimate of how well the parameters generalize to new examples. So we divide the examples into three datasets: training set (60%) + validation set (20%) + test set (20%).

 

  • Training set (60%) : \(\begin{bmatrix}(x^{(1)},y^{(1)})\\ \vdots \\(x^{(m)},y^{(m)})\end{bmatrix}\)
  • Validation set (20%) : \(\begin{bmatrix}(x_{val}^{(1)},y_{val}^{(1)})\\ \vdots \\(x_{val}^{(m_{val})},y_{val}^{(m_{val})})\end{bmatrix}\)
  • Test set (20%) : \(\begin{bmatrix}(x_{test}^{(1)},y_{test}^{(1)})\\ \vdots \\(x_{test}^{(m_{test})},y_{test}^{(m_{test})})\end{bmatrix}\)

 

  • Train set : \(J(\theta )=\frac{1}{2m}\sum _{i=1}^m(h_{\theta }(x^{(i)})-y^{(i)})^2\)
  • CV set : \(J_{val}(\theta )=\frac{1}{2m_{val}}\sum _{i=1}^{m_{val}}(h_{\theta }(x_{val}^{(i)})-y_{val}^{(i)})^2\)
  • Test set : \(J_{test}(\theta )=\frac{1}{2m_{test}}\sum _{i=1}^{m_{test}}(h_{\theta }(x_{test}^{(i)})-y_{test}^{(i)})^2\)

We can then choose a model with the following procedure (a sketch follows the list):

  1. Optimize the parameters \(\theta\) using the training set. 
  2. Pick the polynomial degree \(d\) with the lowest validation error \(J_{val}(\theta)\). 
  3. Estimate the generalization error \(J_{test}(\theta)\) using the test set. 
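Here is a minimal NumPy sketch of the three steps. The synthetic cubic data, the 60/20/20 split code, and the helper names `poly_design` and `j_mse` are ours for illustration, not from the original course material:

```python
import numpy as np

def poly_design(x, d):
    """Design matrix [1, x, x^2, ..., x^d] for polynomial degree d."""
    return np.column_stack([x ** k for k in range(d + 1)])

def j_mse(theta, X, y):
    """Squared-error cost 1/(2m) * sum((X @ theta - y)^2)."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

# Synthetic data for illustration; the true curve is cubic.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = x ** 3 - x + rng.normal(0, 0.3, size=200)

# 60% / 20% / 20% split into training, validation, and test indices.
idx = rng.permutation(len(x))
a, b = int(0.6 * len(x)), int(0.8 * len(x))
tr, va, te = idx[:a], idx[a:b], idx[b:]

# Steps 1-2: fit theta on the training set for each candidate degree d,
# then keep the degree with the lowest validation error.
best_d, best_theta, best_jval = None, None, np.inf
for d in range(1, 11):
    X = poly_design(x, d)
    theta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
    jv = j_mse(theta, X[va], y[va])
    if jv < best_jval:
        best_d, best_theta, best_jval = d, theta, jv

# Step 3: estimate the generalization error on the untouched test set.
X_best = poly_design(x, best_d)
print("chosen d =", best_d, " J_test =", j_mse(best_theta, X_best[te], y[te]))
```

Because \(d\) is chosen on the validation set and the test set is only touched once at the end, the reported test error remains a fair estimate of generalization.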

 

Difference between validation sets and test sets: the validation set is used to compare the performance of different models and to select one of them, while the test set is used to measure the performance characteristics of the chosen model, such as its accuracy. In that sense, the validation set takes part in the model-building process, but the test set must not.
