
Data Science/Classification (6)
[Models] Classification Models Note: This is just a tiny subset of a full modeling workflow. We must first understand the domain knowledge of our training dataset and perform statistical analysis. import pandas as pd x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_train.csv") y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_train.csv") x_test = pd.rea..
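The excerpt above cuts off before the modeling step. A minimal sketch of the same workflow shape, using hypothetical synthetic data in place of the airline CSVs so it runs offline (the column names and the majority-class baseline are assumptions, not part of the original post):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the airline dataset loaded in the post.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
})
df["label"] = (df["feature_a"] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Hold out the last 20% as a test set before any modeling.
train = df.iloc[:80]
test = df.iloc[80:]

# Majority-class baseline: any real classifier should beat this accuracy.
majority = train["label"].mode().iloc[0]
baseline_acc = (test["label"] == majority).mean()
print(round(baseline_acc, 3))
```

Establishing a trivial baseline like this before fitting models gives a floor to compare every later classifier against.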
[Theorem] Bias vs Variance 1. Intersection between Bias and Variance Let's review the overfitting and underfitting problems. Underfitting occurs when we use a polynomial of too low a degree; overfitting occurs when we use a polynomial of too high a degree. So, when we plot the training error \(J(\Theta)\) against the degree of the polynomial, we can see that at lower degrees the error is high,..
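The train-error-versus-degree curve described above can be sketched numerically. This is an illustrative example with assumed synthetic data (a noisy sine curve), not the post's own experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=x.size)

# Alternate points between a training set and a validation set.
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

def errors(deg):
    # Fit a degree-`deg` polynomial and return (train MSE, validation MSE).
    coef = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coef, x_val) - y_val) ** 2)
    return train_err, val_err

for d in (1, 4, 9):
    tr, va = errors(d)
    print(d, round(tr, 3), round(va, 3))
```

Training error can only go down as the degree grows (each higher-degree model contains the lower-degree ones), while validation error typically falls and then rises again, which is the bias/variance trade-off the post is plotting.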
[Theorem] Validation Sets 1. Decide what to do? In machine learning, errors can arise even when the algorithm is set up correctly. If so, what should we try next? We can consider solutions such as the following: Get more training examples Try smaller sets of features Try getting additional features Try adding polynomial features Try decreasing lambda Try increasing lambda Then, how can we select among the above solutions? So we nee..
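Choosing among options like "increase lambda" or "decrease lambda" is exactly what a validation set is for. A minimal sketch, assuming synthetic linear data and a closed-form ridge fit (the splits, lambda grid, and `ridge_fit` helper are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
X = rng.normal(size=(n, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y = X @ w_true + rng.normal(scale=0.3, size=n)

# 60/20/20 split: fit on train, pick lambda on validation, report on test.
X_tr, y_tr = X[:36], y[:36]
X_va, y_va = X[36:48], y[36:48]
X_te, y_te = X[48:], y[48:]

def ridge_fit(X, y, lam):
    # Closed-form ridge regression solution (no intercept, for brevity).
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
val_errs = [np.mean((X_va @ ridge_fit(X_tr, y_tr, lam) - y_va) ** 2)
            for lam in lambdas]

best = lambdas[int(np.argmin(val_errs))]
w = ridge_fit(X_tr, y_tr, best)
test_err = np.mean((X_te @ w - y_te) ** 2)
print(best, round(test_err, 3))
```

The key design point: lambda is chosen only on the validation set, so the test error stays an honest estimate of generalization.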
[Theorem] Regularization 1. Regularization of Logistic Regression Because we don't know which theta terms cause overfitting, we shrink all of them. $$ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^{n}\theta_j^2 $$ \(\lambda\) is called the regularization parameter, which controls a trade-off between two different goals. The first goal is that we would lik..
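The regularized cost above translates directly into code. A minimal sketch (the function name and the convention of excluding \(\theta_0\) from the penalty, matching the sum starting at \(j=1\), are assumptions):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    # J(theta) = (1/2m) * sum_i (h(x_i) - y_i)^2 + lam * sum_{j>=1} theta_j^2
    m = len(y)
    residuals = X @ theta - y          # h_theta(x) - y for every example
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return np.sum(residuals ** 2) / (2 * m) + penalty
```

Larger `lam` makes large theta values expensive, pushing the fit toward the first goal (fit the data) only as far as the second goal (keep parameters small) allows.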
[Theorem] Overfitting 1. Overfitting in Linear Regression When the degree of freedom is low, \(H(x)\) can only predict output in a simple way and can't cover every case of x. This is called 'underfitting' or 'high bias'. When the degree of freedom is proper (not too low and not too high), the model predicts output well. When the degree of freedom is high, the model can predict the training output well but can't generalize well to predict new data..
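The low/proper/high degree-of-freedom story can be demonstrated on assumed toy data (a noisy quadratic), comparing training error against error on unseen points from the same curve:

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = np.linspace(0, 1, 10)
y_train = x_train ** 2 + rng.normal(scale=0.05, size=10)
x_new = np.linspace(0.05, 0.95, 10)  # unseen inputs from the same curve
y_new = x_new ** 2

errs = {}
for deg in (1, 2, 9):
    coef = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    new_err = np.mean((np.polyval(coef, x_new) - y_new) ** 2)
    errs[deg] = (train_err, new_err)
    print(deg, round(train_err, 4), round(new_err, 4))
```

Degree 1 underfits (high error everywhere), degree 2 matches the true curve, and degree 9 interpolates the training noise, so its training error is tiny while its error on new data is typically worse.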
[Theorem] Logistic Regression 1. What is a Classification Problem? Usually classification has two discrete outputs, zero and one, where zero is the 'negative output' and one is the 'positive output'. For example, in classifying spam mail, zero means the mail is not spam and one means it is spam. $$ y \in \{0, 1\} $$ Multiclass classification has multiple discrete outputs. $$ y \in \{0, 1, 2, \dots\} $$ 2. Logistic Regressi..
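The excerpt cuts off before the logistic hypothesis itself. A minimal sketch of the standard form \(h_\theta(x)=g(\theta^T x)\) with the sigmoid \(g\) (the `predict` helper and its 0.5 threshold are the conventional choice, assumed here rather than quoted from the post):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # h_theta(x) = g(theta . x), read as the probability that y = 1.
    p = sigmoid(np.dot(theta, x))
    return int(p >= 0.5)  # threshold at 0.5 for the 0/1 decision
```

Because the sigmoid output lies strictly between 0 and 1, thresholding it at 0.5 turns the continuous hypothesis into the discrete negative/positive labels defined above.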