
[Theorem] Multivariate Linear Regression

1. Multivariate Hypothesis

| feet (x1) | number of rooms (x2) | Built Age (x3) | Price of House |
|---|---|---|---|
| 1412 | 5 | 30 | 3520 |
| 1530 | 3 | 45 | 2420 |
| 642 | 2 | 56 | 1238 |

 

  • \(x^{i}_{j}\) : value of feature j in ith training example
  • \(x^i\) : the input features of the ith training example
  • \(m\) : the number of training examples
  • \(n\) : the number of features

For example, \(x_{3}^{2}\) means 45 (the Built Age of the 2nd example), and \(x_3\) means [30, 45, 56], a 3-dimensional vector of that feature across all examples.

\(x^2\) means [1530, 3, 45], i.e. the 2nd row of the chart, a 3-dimensional vector (4-dimensional once the bias term \(x_0 = 1\) is added below).
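As a quick illustration (not from the original post), the chart can be stored as a NumPy array, and the index notation then maps to row and column lookups. Note that Python indexing is 0-based, so the 2nd example and 3rd feature live at indices 1 and 2:

```python
import numpy as np

# Each row is one training example: [feet, number of rooms, Built Age]
X = np.array([[1412, 5, 30],
              [1530, 3, 45],
              [ 642, 2, 56]])
y = np.array([3520, 2420, 1238])   # Price of House

m, n = X.shape                     # m = 3 training examples, n = 3 features

x2 = X[1]           # x^2   : the 2nd training example -> [1530, 3, 45]
x3_2 = X[1, 2]      # x^2_3 : feature 3 of the 2nd example -> 45
feature3 = X[:, 2]  # x_3   : the Built Age column -> [30, 45, 56]
```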

 

The multivariate form of the hypothesis function is as follows.

 

$$ h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n $$

 

To write the hypothesis in matrix form, we add \(x_0\), which always has the value 1, so the total number of elements becomes \(n+1\).

 

$$ h_{\theta}(x) = \begin{bmatrix}\theta_0 & \theta_1 & \theta_2 & \cdots & \theta_n\end{bmatrix}\begin{bmatrix}x_0\\x_1\\x_2\\\vdots\\x_n\end{bmatrix} = \theta^T x $$
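As a minimal sketch (my own illustration, with made-up \(\theta\) values), the vectorized hypothesis is just a dot product once \(x_0 = 1\) is prepended to the feature vector:

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = theta^T x, where x[0] is the bias term x_0 = 1."""
    return theta @ x

theta = np.array([10.0, 2.0, 3.0, -0.5])   # theta_0 .. theta_3 (illustrative)
x = np.array([1.0, 1412.0, 5.0, 30.0])     # [x_0, feet, rooms, Built Age]
print(hypothesis(theta, x))                # predicted price for this example
```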

 

2. Gradient Descent for the Multivariate Hypothesis

The cost function is the same as that of one-variable linear regression.

 

$$ J(\theta _0,\theta _1,\ ...\ ,\theta _n)=\frac{1}{2m}\sum _{i=1}^m(h_{\theta }(x^i)-y^i)^2=\frac{1}{2m}\sum _{i=1}^m((\sum _{j=0}^n\theta _jx_j^i)-y^i)^2 $$
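In code, the cost function is only a couple of lines; this sketch assumes the \(x_0 = 1\) column has already been prepended to \(X\):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum over i of (h_theta(x^i) - y^i)^2."""
    m = len(y)
    errors = X @ theta - y          # h_theta(x^i) - y^i for every example
    return (errors @ errors) / (2 * m)
```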

 

The gradient descent updates for the multivariate hypothesis are the following:

 

$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \theta_2, \dots, \theta_n) $$

$$ \theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^m \left(h_{\theta}(x^i) - y^i\right) x_j^i $$

$$ j = 0, \dots, n $$
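A minimal vectorized sketch of this update, with all \(\theta_j\) updated simultaneously; again, \(X\) is assumed to contain the \(x_0 = 1\) column. The returned cost history is what makes the J plot in section 4.1 possible.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeat: theta_j := theta_j - alpha * 1/m * sum((h(x^i) - y^i) * x_j^i)."""
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m      # all partial derivatives at once
        theta = theta - alpha * gradient          # simultaneous update of every theta_j
        J_history.append(((X @ theta - y) ** 2).sum() / (2 * m))
    return theta, J_history
```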

 

3. Feature Scaling

Feature scaling is a way to speed up gradient descent by putting each of our input values on roughly the same scale. This matters because theta descends quickly on small ranges and slowly on large ranges, so it oscillates inefficiently down to the optimum when the ranges are very uneven.

 

The most commonly used target ranges are the following:

 

$$ -1 \le x_i \le 1 $$

$$ -0.5 \le x_i \le 0.5 $$

 

Mean normalization involves subtracting the average value of an input variable from each value of that variable, so that the new average of the input variable is zero.

 

$$ x_i := \frac{x_i - \mu_i}{S_i} $$

 

\(\mu_i\) is the average of all the values for feature \(i\), and \(S_i\) is the range of values (max − min). These values only need to be approximate, not exact, because scaling is only there to speed up computation.
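A sketch of mean normalization on the toy chart above (my own helper, not from the post); the same \(\mu_i\) and \(S_i\) must be reused when scaling new examples at prediction time:

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature to roughly [-0.5, 0.5] using (x - mean) / (max - min)."""
    mu = X.mean(axis=0)                  # per-feature average
    S = X.max(axis=0) - X.min(axis=0)    # per-feature range
    return (X - mu) / S, mu, S

X = np.array([[1412.0, 5.0, 30.0],
              [1530.0, 3.0, 45.0],
              [ 642.0, 2.0, 56.0]])
X_scaled, mu, S = mean_normalize(X)
```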

 

4. Tuning the Multivariate Model

4.1 J Plot

If gradient descent is working properly, \(J(\theta)\) should decrease after every iteration. So plotting \(J(\theta)\) against the number of iterations helps us judge whether gradient descent has converged.

 

Another way is an automatic convergence test: declare convergence if the cost function \(J(\theta)\) decreases by less than some small value \(\epsilon\), for example \(10^{-3}\), in one iteration.
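A rough sketch of both checks, reusing the J_history list returned by the gradient descent sketch in section 2 (the \(10^{-3}\) threshold follows the text above):

```python
import matplotlib.pyplot as plt

def has_converged(J_history, epsilon=1e-3):
    """Automatic convergence test: did J decrease by less than epsilon this iteration?
    A negative drop means J went up, which is the diverging case in section 4.2."""
    if len(J_history) < 2:
        return False
    drop = J_history[-2] - J_history[-1]
    return 0 <= drop < epsilon

def plot_J(J_history):
    """J plot: cost against iteration number; it should decrease monotonically."""
    plt.plot(J_history)
    plt.xlabel("iteration")
    plt.ylabel("J(theta)")
    plt.show()
```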

 

4.2 Alpha

If we see a plot where \(J(\theta)\) is actually increasing, that is a clear sign that gradient descent is not working, and it usually means we should use a smaller learning rate \(\alpha\).

 

If the learning rate is small enough, \(J(\theta)\) should decrease on every iteration, though convergence can be slow. If the learning rate is too large, \(J(\theta)\) may fail to decrease on every iteration and may not converge at all.

 

4.3 Derived Features

When predicting house prices, if we have features named "length" and "depth", the hypothesis has \(x_1\) and \(x_2\) for length and depth. But these two features can be combined into a single feature, the area \(x_1 \times x_2\).

 

So we can give the hypothesis a single feature \(x_1\) that stands for the area. By combining features this way, we save work on feature scaling, J plots, and tuning \(\alpha\).
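A tiny sketch of this combination step (the length and depth numbers are made up for illustration):

```python
import numpy as np

length = np.array([40.0, 25.0, 60.0])    # x1: lot length
depth = np.array([30.0, 50.0, 20.0])     # x2: lot depth

# Replace the two features with one derived feature: area = length * depth
area = length * depth                    # -> [1200., 1250., 1200.]
X = area.reshape(-1, 1)                  # the hypothesis now only needs x_1 = area
```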

 

4.4 Polynomial Form

Our hypothesis function need not be linear if a straight line does not fit the data well. We can change the behavior or curve of the hypothesis by making it a quadratic, cubic, or square root function, such as the following:

 

$$ h_{\theta }(x)=\theta _0+\theta _1x_1+\theta _2x_1^2 $$

$$ h_{\theta }(x)=\theta _0+\theta _1x_1+\theta _2\sqrt{x_1} $$

 

One important thing to keep in mind: if we choose features this way, feature scaling becomes very important. If a feature is squared, its range is squared too.
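A sketch of building these features by hand, which also shows why the ranges blow up (the size values are the feet column of the toy chart):

```python
import numpy as np

x1 = np.array([1412.0, 1530.0, 642.0])         # size in feet

X_quad = np.column_stack([x1, x1 ** 2])        # features for theta_1*x1 + theta_2*x1^2
X_sqrt = np.column_stack([x1, np.sqrt(x1)])    # features for theta_1*x1 + theta_2*sqrt(x1)

# x1 is on the order of 1e3 while x1^2 is on the order of 1e6,
# so feature scaling matters a lot for the quadratic version.
X_quad_scaled = (X_quad - X_quad.mean(axis=0)) / (X_quad.max(axis=0) - X_quad.min(axis=0))
```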

 

5. Normal Equation

If we have a data table with \(m\) examples and \(n\) features, we can build the following matrices:

 

$$ X=\begin{bmatrix}x_0^1&x_1^1&x_2^1&\cdots&x_n^1\\x_0^2&x_1^2&x_2^2&\cdots&x_n^2\\&&\vdots&&\\x_0^m&x_1^m&x_2^m&\cdots&x_n^m\end{bmatrix} $$

$$ y=\begin{bmatrix}y^1\\y^2\\\vdots\\y^m\end{bmatrix} $$

$$ \theta =\begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\vdots\\\theta_n\end{bmatrix} $$

 

The matrix \(X\) is an \(m \times (n+1)\) matrix, \(y\) is an \(m \times 1\) vector, and \(\theta\) is an \((n+1) \times 1\) vector.

 

The normal equation relating \(X\), \(y\), and \(\theta\) is derived as follows:

 

$$ X\theta =y $$

$$ X^TX\theta =X^Ty $$

$$ \theta =(X^TX)^{-1}X^Ty $$

 

So we can find the parameter vector \(\theta\) with the normal equation, without any iteration. Also, there is no need to do feature scaling with the normal equation.
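A minimal sketch of the normal equation on the toy chart; to keep \(X^T X\) invertible with only three examples, only the feet and rooms features are used here:

```python
import numpy as np

X = np.array([[1.0, 1412.0, 5.0],
              [1.0, 1530.0, 3.0],
              [1.0,  642.0, 2.0]])       # x_0 = 1 column prepended
y = np.array([3520.0, 2420.0, 1238.0])

# theta = (X^T X)^{-1} X^T y; solve() avoids forming the inverse explicitly
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
```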

 

| Gradient Descent | Normal Equation |
|---|---|
| Need to choose \(\alpha\) | No need to choose \(\alpha\) |
| Needs many iterations | No need to iterate |
| Works well when \(n\) is large | Slow if \(n\) is very large |

 

When implementing the normal equation, we can still obtain \(\theta\) even when \(X^T X\) is not invertible (for example, by using a pseudoinverse). There are two common causes of non-invertibility; a short sketch follows the list below.

 

  1. Redundant features, where two features are very closely related (e.g. linearly dependent).
  2. Too many features \((m \le n)\).
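As an illustration of the redundant-feature case (my own example: a third column that is just the first feature converted from feet to meters), np.linalg.pinv still returns a usable \(\theta\) even though \(X^T X\) is singular:

```python
import numpy as np

feet = np.array([1412.0, 1530.0, 642.0])
meters = feet * 0.3048                    # linearly dependent on feet -> redundant feature
y = np.array([3520.0, 2420.0, 1238.0])

X = np.column_stack([np.ones(3), feet, meters])   # X^T X is singular here

theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)       # pseudoinverse still gives a solution
print(theta)
```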
