1. What is a Neural Network?
With polynomial terms in linear regression and logistic regression, we end up with a huge number of features for the hypothesis. For example, a \(50 \times 50\) pixel image already has 2500 pixels, so the number of features for logistic regression becomes \(n = 2500 + \alpha\) (and grows very large once polynomial terms are added). With too many features, we run into overfitting and slow computation.
So setting up all the quadratic features is simply not a good way to learn a complex non-linear hypothesis. Instead, we use Neural Networks.
Like the human brain, a single node works as a neuron: it receives input data \(x_0, x_1, x_2, \dots, x_n\) and, through a sigmoid (logistic) activation function, produces the output \(H(x)\).
In a more complicated neural network, the system is divided into multiple layers, each with weighted parameters. The input data forms layer 1, the output \(H(x)\) forms the final layer, and every layer that is neither input nor output is a hidden layer.
To analyze a neural network, we index the layers by \(j\) and call the hidden-layer nodes activation units. \(\Theta^{(j)}\) denotes the matrix of weights controlling the function mapping from layer \(j\) to layer \(j+1\). The values of the activation nodes are obtained as follows:
$$ a_1^{(2)}=g(\Theta _{10}^{(1)}x_0+\Theta _{11}^{(1)}x_1+\Theta _{12}^{(1)}x_2+\Theta _{13}^{(1)}x_3) $$
$$ a_2^{(2)}=g(\Theta _{20}^{(1)}x_0+\Theta _{21}^{(1)}x_1+\Theta _{22}^{(1)}x_2+\Theta _{23}^{(1)}x_3) $$
$$ a_3^{(2)}=g(\Theta _{30}^{(1)}x_0+\Theta _{31}^{(1)}x_1+\Theta _{32}^{(1)}x_2+\Theta _{33}^{(1)}x_3) $$
$$ h_{\Theta }(x)=a_1^{(3)}=g(\Theta _{10}^{(2)}a_0^{(2)}+\Theta _{11}^{(2)}a_1^{(2)}+\Theta _{12}^{(2)}a_2^{(2)}+\Theta _{13}^{(2)}a_3^{(2)}) $$
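For concreteness, here is a minimal NumPy sketch of these activation equations for a small network; the weight values and the input are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative numbers only: input vector with bias unit x0 = 1
x = np.array([1.0, 0.5, -1.2, 0.3])          # x0, x1, x2, x3

# Theta^(1): 3 x 4 weights mapping layer 1 to the three activation units of layer 2
Theta1 = np.array([[ 0.1, -0.3,  0.2,  0.5],
                   [ 0.4,  0.1, -0.2,  0.3],
                   [-0.5,  0.2,  0.1,  0.0]])

a2 = sigmoid(Theta1 @ x)                     # a_1^(2), a_2^(2), a_3^(2)
a2 = np.insert(a2, 0, 1.0)                   # prepend bias unit a_0^(2) = 1

# Theta^(2): 1 x 4 weights mapping layer 2 to the single output unit
Theta2 = np.array([[0.3, -0.1, 0.2, 0.4]])
h = sigmoid(Theta2 @ a2)                     # h_Theta(x) = a_1^(3)
print(h)
```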
2. Vectorization of the Neural Network
$$ a_1^{(2)}=g(\Theta _{10}^{(1)}x_0+\Theta _{11}^{(1)}x_1+\Theta _{12}^{(1)}x_2+\Theta _{13}^{(1)}x_3) $$
$$ a_2^{(2)}=g(\Theta _{20}^{(1)}x_0+\Theta _{21}^{(1)}x_1+\Theta _{22}^{(1)}x_2+\Theta _{23}^{(1)}x_3) $$
$$ a_3^{(2)}=g(\Theta _{30}^{(1)}x_0+\Theta _{31}^{(1)}x_1+\Theta _{32}^{(1)}x_2+\Theta _{33}^{(1)}x_3) $$
$$ h_{\Theta }(x)=a_1^{(3)}=g(\Theta _{10}^{(2)}a_0^{(2)}+\Theta _{11}^{(2)}a_1^{(2)}+\Theta _{12}^{(2)}a_2^{(2)}+\Theta _{13}^{(2)}a_3^{(2)}) $$
To vectorize the terms above, we start by abstracting \(a\) as below.
$$ a_1^{(2)}=g(z_1^{(2)}) $$
$$ a_2^{(2)}=g(z_2^{(2)}) $$
$$ a_3^{(2)}=g(z_3^{(2)}) $$
For layer \(j = 2\), \(z\) is defined as below:
$$ a^{(j)}=g(z^{(j)}) $$
$$ z_k^{(2)}=\Theta _{k,0}^{(1)}x_0+\Theta _{k,1}^{(1)}x_1+\Theta _{k,2}^{(1)}x_2+...+\Theta _{k,n}^{(1)}x_n $$
So the vectorized versions of \(x\) and \(z\) are:
$$ x=\begin{bmatrix}x_0\\x_1\\x_2\\...\\x_n\end{bmatrix}\\ z^{(j)}=\begin{bmatrix}z_1^{(j)}\\z_2^{(j)}\\...\\z_n^{(j)}\end{bmatrix} $$
Generalizing this calculation to multiple hidden layers:
$$ z^{(j+1)}=\Theta ^{(j)}a^{(j)} $$
$$ x=a^{(1)} $$
$$ h_{\Theta }(x)=a^{(j+1)}=g(z^{(j+1)}) $$
By stacking multiple hidden layers, a neural network can learn complicated non-linear hypotheses.
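As a rough sketch of this vectorized computation (assuming a bias unit is prepended at every layer and using hypothetical layer sizes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Thetas):
    """Vectorized forward propagation: a^(j+1) = g(Theta^(j) a^(j)),
    adding the bias unit to every layer except the output."""
    a = x
    for Theta in Thetas:
        a = np.insert(a, 0, 1.0)       # bias unit a0 = 1
        z = Theta @ a                  # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)                 # a^(j+1) = g(z^(j+1))
    return a                           # h_Theta(x)

# Hypothetical shapes: 3 input features, one hidden layer of 5 units, 4 outputs
rng = np.random.default_rng(0)
Thetas = [rng.normal(size=(5, 4)),     # Theta^(1): 5 x (3 + 1)
          rng.normal(size=(4, 6))]     # Theta^(2): 4 x (5 + 1)
print(forward_propagate(np.array([0.2, -0.4, 0.7]), Thetas))
```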
3. Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we want to classify our data into one of four categories. We will use the following example to see how this classification is done: the algorithm takes an image as input and classifies it accordingly.
In multiclass classification, we define \(y\) as follows:
$$ y^{(i)}=\begin{bmatrix}1\\0\\0\\0\end{bmatrix},\begin{bmatrix}0\\1\\0\\0\end{bmatrix},\begin{bmatrix}0\\0\\1\\0\end{bmatrix},\begin{bmatrix}0\\0\\0\\1\end{bmatrix} $$
Each \(y^{(i)}\) represents a different class of image. The inner layers each provide us with new information, which leads to our final hypothesis:
$$ \begin{bmatrix}x_0\\x_1\\...\\x_n\end{bmatrix}\to \begin{bmatrix}a_0^{(2)}\\a_1^{(2)}\\...\\a_n^{(2)}\end{bmatrix}\to \begin{bmatrix}a_0^{(3)}\\a_1^{(3)}\\...\\a_n^{(3)}\end{bmatrix}\to ...\begin{bmatrix}h_{\Theta }(x)_1\\h_{\Theta }(x)_2\\h_{\Theta }(x)_3\\h_{\Theta }(x)_4\end{bmatrix} $$
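A small sketch of how such one-hot targets and predictions might look in NumPy (the class labels and output probabilities are made up):

```python
import numpy as np

# Hypothetical class labels 0..3 for a 4-class problem
labels = np.array([2, 0, 3, 1])

# One-hot encode: each y^(i) becomes one of the four unit vectors above
Y = np.eye(4)[labels]            # shape (m, 4)
print(Y)

# Given a hypothesis output h_Theta(x) (a 4-vector of "probabilities"),
# the predicted class is the index of the largest entry
h = np.array([0.1, 0.7, 0.15, 0.05])
print(np.argmax(h))              # -> 1
```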
4. Cost Function
We define a few variables that we need to use.
- \(L\) : total number of layers in the network
- \(s_l\) : number of units in layer \(l\)
- \(K\) : number of output units/classes
In regularized logistic regression, we have the following cost function:
$$ J(\theta )=-\frac{1}{m}\left[\sum _{i=1}^m y^{(i)}\log (h_{\theta }(x^{(i)}))+(1-y^{(i)})\log (1-h_{\theta }(x^{(i)}))\right]+\frac{\lambda }{2m}\sum _{j=1}^n\theta _j^2 $$
In a neural network we have multiclass classification, so we denote the \(k\)-th output of the hypothesis as \(h_\Theta(x)_k\). The cost function for neural networks is therefore more complicated:
$$ J(\Theta )=-\frac{1}{m}\left[\sum _{i=1}^m\sum _{k=1}^K y_k^{(i)}\log (h_{\Theta }(x^{(i)})_k)+(1-y_k^{(i)})\log (1-h_{\Theta }(x^{(i)})_k)\right]+\frac{\lambda }{2m}\sum _{l=1}^{L-1}\sum _{i=1}^{s_l}\sum _{j=1}^{s_{l+1}}(\Theta _{j,i}^{(l)})^2 $$
Because \(\Theta^{(l)}\) has dimension \(s_{l+1} \times s_l\) here, the inner sum over \(j\) is performed first.
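A minimal sketch of this regularized cost, assuming sigmoid activations everywhere and weight matrices whose first column multiplies the bias unit (that column is excluded from the regularization term):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Thetas, X, Y, lam):
    """Regularized cost J(Theta). X: (m, n) inputs, Y: (m, K) one-hot labels,
    Thetas: weight matrices whose first column multiplies the bias unit."""
    m = X.shape[0]
    A = X
    for Theta in Thetas:                                 # forward propagation
        A = np.hstack([np.ones((A.shape[0], 1)), A])     # add bias column
        A = sigmoid(A @ Theta.T)
    H = A                                                # (m, K) hypothesis outputs
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    reg = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return cost + reg
```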
5. Back Propagation
To minimize the neural network cost function, we use back propagation, just as we used gradient descent in regression problems. Back propagation computes the partial derivative terms:
$$ \frac{\partial }{\partial \Theta _{i,j}^{(l)}}J(\Theta ) $$
- \(i\) : row index of \(\Theta^{(l)}\)
- \(j\) : column index of \(\Theta^{(l)}\)
- \(l\) : layer index
We use forward propagation to get \(H(x)\). But to get the partial derivative terms of \(J\), we compute back propagation using the error terms \(\delta\).
$$ \delta ^{(4)}=a^{(4)}-y =h(x)-y $$
$$ \delta ^{(3)}=(\Theta ^{(3)})^T\delta ^{(4)}.*g'(z^{(3)}) $$
To generalize the terms above:
$$ \delta ^{(l)}=(\Theta ^{(l)})^T\delta ^{(l+1)}.*g'(z^{(l)})=(\Theta ^{(l)})^T\delta ^{(l+1)}.*a^{(l)}.*(1-a^{(l)}) $$
$$ l=2,\dots,L-1 $$
And also, from these \(\delta\) terms we can obtain the derivative terms, as summarized in Section 6.5.
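A small sketch that numerically checks \(g'(z)=a.*(1-a)\) for the sigmoid and applies one step of the \(\delta\) recursion; the shapes and values are hypothetical and bias handling is glossed over.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Numerically check g'(z) = a .* (1 - a) for the sigmoid
z = np.array([-1.0, 0.0, 2.0])
a = sigmoid(z)
eps = 1e-6
numeric_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric_grad, a * (1 - a)))            # -> True

# One step of the delta recursion with made-up shapes
delta4 = np.array([0.2, -0.1, 0.3, 0.05])                # delta^(4) = a^(4) - y
Theta3 = np.random.default_rng(1).normal(size=(4, 5))    # maps layer 3 -> layer 4
a3 = sigmoid(np.random.default_rng(2).normal(size=5))    # layer-3 activations
delta3 = (Theta3.T @ delta4) * a3 * (1 - a3)             # delta^(3)
print(delta3.shape)                                      # (5,)
```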
6. Summary
6.1 Multi-layer Perceptron
With the neural network above, we will review what we have studied. This network has 4 layers: one input layer, two hidden layers, and one output layer.
- \(l\) : layer index
- \(L\) : index of the final layer (the total number of layers, here \(L = 4\))
The vectorization of these layers is as below.
$$ x=a^{(1)}=\begin{bmatrix}x_0\\x_1\\x_2\end{bmatrix} $$
$$ a^{(2)}=\begin{bmatrix}a_0^{(2)}\\a_1^{(2)}\\a_2^{(2)}\\a_3^{(2)}\\a_4^{(2)}\end{bmatrix} $$
$$ a^{(3)}=\begin{bmatrix}a_0^{(3)}\\a_1^{(3)}\\a_2^{(3)}\\a_3^{(3)}\\a_4^{(3)}\end{bmatrix} $$
$$ a^{(4)}=\begin{bmatrix}a_0^{(4)}\\a_1^{(4)}\\a_2^{(4)}\\a_3^{(4)}\end{bmatrix}=\begin{bmatrix}h_{\Theta }(x)_1\\h_{\Theta }(x)_2\\h_{\Theta }(x)_3\\h_{\Theta }(x)_4\end{bmatrix} $$
6.2 Feed Forward Propagation
Forward propagation computes \(h_\Theta(x)\) from the weight parameters \(\Theta\). With the sigmoid function \(g(z)\), we obtain \(h_\Theta(x)\) as follows:
$$ x=a^{(1)} $$
$$ a^{(2)}=g(z^{(2)})=g(\Theta ^{(1)}a^{(1)}) $$
$$ a^{(3)}=g(z^{(3)})=g(\Theta ^{(2)}a^{(2)}) $$
$$ a^{(4)}=h_{\Theta }(x)=g(z^{(4)})=g(\Theta ^{(3)}a^{(3)}) $$
6.3 Parameter \(\Theta\)
Before constructing the parameter matrices \(\Theta\), we first write down the dimensions of the layer vectors.
$$ a^{(1)}=3\times 1 $$
$$ a^{(2)}=5\times 1 $$
$$ a^{(3)}=5\times 1 $$
$$ a^{(4)}=4\times 1 $$
The dimension of each \(\Theta\) matrix is (number of units in the upper layer) \(\times\) (number of units in the lower layer). If the lower layer vector does not already include a bias term, the dimension becomes (upper layer) \(\times\) (lower layer \(+ 1\)).
$$ \Theta^{(1)} = 5\times 3 $$
$$ \Theta^{(2)} = 5\times 5 $$
$$ \Theta^{(3)} = 4\times 5 $$
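Under this convention the shapes can be generated programmatically; a quick sketch (bias subtleties aside):

```python
import numpy as np

# Layer sizes as listed above (a^(1) through a^(4))
layer_sizes = [3, 5, 5, 4]

# Following the text's convention, Theta^(l) maps layer l to layer l+1,
# so its shape is (units in layer l+1) x (units in layer l)
Thetas = [np.zeros((layer_sizes[l + 1], layer_sizes[l]))
          for l in range(len(layer_sizes) - 1)]
print([T.shape for T in Thetas])   # [(5, 3), (5, 5), (4, 5)]
```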
6.4 Cost Function with Regularization
$$ J(\Theta )=-\frac{1}{m}\left[\sum _{i=1}^m\sum _{k=1}^4 y_k^{(i)}\log (h_{\Theta }(x^{(i)})_k)+(1-y_k^{(i)})\log (1-h_{\Theta }(x^{(i)})_k)\right]+\frac{\lambda }{2m}\sum _{l=1}^3\sum _{i=1}^{s_l}\sum _{j=1}^{s_{l+1}}(\Theta _{j,i}^{(l)})^2 $$
6.5 Back Propagation
To minimize the cost function \(J\), we need to calculate its derivative terms; this procedure is called back propagation.
Step 1 : Calculating feed forward propagation
$$ x=a^{(1)} $$
$$ a^{(2)}=g(z^{(2)})=g(\Theta ^{(1)}a^{(1)}) $$
$$ a^{(3)}=g(z^{(3)})=g(\Theta ^{(2)}a^{(2)}) $$
$$ a^{(4)}=h_{\Theta }(x)=g(z^{(4)})=g(\Theta ^{(3)}a^{(3)}) $$
Step 2 : Setting \(\delta\)
$$ \delta ^{(4)}=a^{(4)}-y=h(x)-y $$
$$ \delta ^{(3)}=(\Theta ^{(3)})^T\delta ^{(4)}.*g'(z^{(3)})=(\Theta ^{(3)})^T\delta ^{(4)}.*a^{(3)}.*(1-a^{(3)}) $$
$$ \delta ^{(2)}=(\Theta ^{(2)})^T\delta ^{(3)}.*g'(z^{(2)})=(\Theta ^{(2)})^T\delta ^{(3)}.*a^{(2)}.*(1-a^{(2)}) $$
$$ \Delta ^{(3)}:=\Delta ^{(3)}+\delta ^{(4)}(a^{(3)})^T $$
$$ \Delta ^{(2)}:=\Delta ^{(2)}+\delta ^{(3)}(a^{(2)})^T $$
$$ \Delta ^{(1)}:=\Delta ^{(1)}+\delta ^{(2)}(a^{(1)})^T $$
Step 3 : Setting Derivative Terms
With regularization (in practice the bias column \(j = 0\) is excluded from the \(\lambda\) term):
$$ \frac{\partial }{\partial \Theta ^{(l)}}J(\Theta ):=\frac{1}{m}(\Delta ^{(l)}+\lambda \Theta ^{(l)})=\frac{1}{m}(\delta ^{(l+1)}(a^{(l)})^T+\lambda \Theta ^{(l)}) $$
Without regularization:
$$ \frac{\partial }{\partial \Theta ^{(l)}}J(\Theta ):=\frac{1}{m}\Delta ^{(l)}=\frac{1}{m}\delta ^{(l+1)}(a^{(l)})^T $$
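Putting the three steps together, here is a rough sketch that accumulates the \(\Delta\) terms over the training set. It follows the simplified conventions of this post: no bias insertion between layers, and the \(\lambda\) term applied to the whole matrix rather than excluding the bias column.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single(x, y, Thetas):
    """Delta contributions of one training example (a sketch)."""
    # Step 1: forward propagation, keeping every layer's activations
    activations = [x]
    for Theta in Thetas:
        activations.append(sigmoid(Theta @ activations[-1]))
    # Step 2: error terms delta^(L), ..., delta^(2)
    deltas = [activations[-1] - y]                      # delta^(L) = a^(L) - y
    for Theta, a in zip(reversed(Thetas[1:]), reversed(activations[1:-1])):
        deltas.append((Theta.T @ deltas[-1]) * a * (1 - a))
    deltas.reverse()                                    # deltas[l] = delta^(l+2)
    # Step 3: Delta^(l) contribution delta^(l+1) (a^(l))^T for this example
    return [np.outer(deltas[l], activations[l]) for l in range(len(Thetas))]

def gradients(X, Y, Thetas, lam):
    """Average the Delta terms over all m examples and add regularization."""
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]
    for x, y in zip(X, Y):
        for D, d in zip(Deltas, backprop_single(x, y, Thetas)):
            D += d
    return [(D + lam * T) / m for D, T in zip(Deltas, Thetas)]
```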
'Data Science > Neural Network' 카테고리의 다른 글
[Tensorflow] Overfitting and Underfitting (0) | 2022.09.21 |
---|---|
[Tensorflow] Stochastic Gradient Descent (0) | 2022.09.21 |
[Tensorflow] Deep Neural Networks (0) | 2022.09.21 |
[Tensorflow] A Single Neuron (0) | 2022.09.20 |
[Theorem] Optimizing Neural Network (0) | 2022.09.19 |