1. What is a Neural Network?
With polynomial terms in linear regression and logistic regression, we end up with a huge number of features for the hypothesis. For example, a \(50 \times 50\) pixel image already has 2500 pixels, so the number of features for logistic regression becomes \(n = 2500 + \alpha\) (and grows very large once polynomial terms are added). With too many features, we run into overfitting and slow computation.
So setting up all the quadratic features is simply not a good way to learn a complex non-linear hypothesis. Instead, we use Neural Networks.
Like the human brain, a single node works as a neuron: it receives input data \(x_0, x_1, x_2, \dots, x_n\) and, through a sigmoid (logistic) activation function, produces the output \(H(x)\).
In a more complicated neural network, the system is divided into multiple layers, each with weighted parameters. The input data forms layer 1, the output \(H(x)\) forms the final layer, and every layer that is neither input nor output is a hidden layer.
To analyze a neural network, we index the layers by \(j\) and call the hidden-layer nodes activation units. \(\Theta^{(j)}\) denotes the matrix of weights controlling the function mapping from layer \(j\) to layer \(j+1\). The values of the activation nodes are obtained as follows:
$$ a_1^{(2)}=g(\Theta _{10}^{(1)}x_0+\Theta _{11}^{(1)}x_1+\Theta _{12}^{(1)}x_2+\Theta _{13}^{(1)}x_3) $$
$$ a_2^{(2)}=g(\Theta _{20}^{(1)}x_0+\Theta _{21}^{(1)}x_1+\Theta _{22}^{(1)}x_2+\Theta _{23}^{(1)}x_3) $$
$$ a_3^{(2)}=g(\Theta _{30}^{(1)}x_0+\Theta _{31}^{(1)}x_1+\Theta _{32}^{(1)}x_2+\Theta _{33}^{(1)}x_3) $$
$$ h_{\Theta }(x)=a_1^{(3)}=g(\Theta _{10}^{(2)}a_0^{(2)}+\Theta _{11}^{(2)}a_1^{(2)}+\Theta _{12}^{(2)}a_2^{(2)}+\Theta _{13}^{(2)}a_3^{(2)}) $$
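For concreteness, here is a minimal NumPy sketch of these activation equations for a small network; the weight values and the input are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative numbers only: input vector with bias unit x0 = 1
x = np.array([1.0, 0.5, -1.2, 0.3])          # x0, x1, x2, x3

# Theta^(1): 3 x 4 weights mapping layer 1 to the three activation units of layer 2
Theta1 = np.array([[ 0.1, -0.3,  0.2,  0.5],
                   [ 0.4,  0.1, -0.2,  0.3],
                   [-0.5,  0.2,  0.1,  0.0]])

a2 = sigmoid(Theta1 @ x)                     # a_1^(2), a_2^(2), a_3^(2)
a2 = np.insert(a2, 0, 1.0)                   # prepend bias unit a_0^(2) = 1

# Theta^(2): 1 x 4 weights mapping layer 2 to the single output unit
Theta2 = np.array([[0.3, -0.1, 0.2, 0.4]])
h = sigmoid(Theta2 @ a2)                     # h_Theta(x) = a_1^(3)
print(h)
```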
2. Vectorization of the Neural Network
$$ a_1^{(2)}=g(\Theta _{10}^{(1)}x_0+\Theta _{11}^{(1)}x_1+\Theta _{12}^{(1)}x_2+\Theta _{13}^{(1)}x_3) $$
$$ a_2^{(2)}=g(\Theta _{20}^{(1)}x_0+\Theta _{21}^{(1)}x_1+\Theta _{22}^{(1)}x_2+\Theta _{23}^{(1)}x_3) $$
$$ a_3^{(2)}=g(\Theta _{30}^{(1)}x_0+\Theta _{31}^{(1)}x_1+\Theta _{32}^{(1)}x_2+\Theta _{33}^{(1)}x_3) $$
$$ h_{\Theta }(x)=a_1^{(3)}=g(\Theta _{10}^{(2)}a_0^{(2)}+\Theta _{11}^{(2)}a_1^{(2)}+\Theta _{12}^{(2)}a_2^{(2)}+\Theta _{13}^{(2)}a_3^{(2)}) $$
To vectorize the terms above, we start by abstracting \(a\) as below.
$$ a_1^{(2)}=g(z_1^{(2)}) $$
$$ a_2^{(2)}=g(z_2^{(2)}) $$
$$ a_3^{(2)}=g(z_3^{(2)}) $$
For layer \(j = 2\), \(z\) is defined as below:
$$ a^{(j)}=g(z^{(j)}) $$
$$ z_k^{(2)}=\Theta _{k,0}^{(1)}x_0+\Theta _{k,1}^{(1)}x_1+\Theta _{k,2}^{(1)}x_2+...+\Theta _{k,n}^{(1)}x_n $$
So the vectorized versions of \(x\) and \(z\) are:
$$ x=\begin{bmatrix}x_0\\x_1\\x_2\\...\\x_n\end{bmatrix}\\ z^{(j)}=\begin{bmatrix}z_1^{(j)}\\z_2^{(j)}\\...\\z_n^{(j)}\end{bmatrix} $$
Generalizing this calculation to multiple hidden layers:
$$ z^{(j+1)}=\Theta ^{(j)}a^{(j)} $$
$$ x=a^{(1)} $$
$$ h_{\Theta }(x)=a^{(j+1)}=g(z^{(j+1)}) $$
By stacking multiple hidden layers, a neural network can learn complicated non-linear hypotheses.
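As a rough sketch of this vectorized computation (assuming a bias unit is prepended at every layer and using hypothetical layer sizes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Thetas):
    """Vectorized forward propagation: a^(j+1) = g(Theta^(j) a^(j)),
    adding the bias unit to every layer except the output."""
    a = x
    for Theta in Thetas:
        a = np.insert(a, 0, 1.0)       # bias unit a0 = 1
        z = Theta @ a                  # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)                 # a^(j+1) = g(z^(j+1))
    return a                           # h_Theta(x)

# Hypothetical shapes: 3 input features, one hidden layer of 5 units, 4 outputs
rng = np.random.default_rng(0)
Thetas = [rng.normal(size=(5, 4)),     # Theta^(1): 5 x (3 + 1)
          rng.normal(size=(4, 6))]     # Theta^(2): 4 x (5 + 1)
print(forward_propagate(np.array([0.2, -0.4, 0.7]), Thetas))
```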
3. Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we want to classify our data into one of four categories. We will use the following example to see how this classification is done: the algorithm takes an image as input and classifies it accordingly.
In multiclass classification, we define \(y\) as follows:
$$ y^{(i)}=\begin{bmatrix}1\\0\\0\\0\end{bmatrix},\begin{bmatrix}0\\1\\0\\0\end{bmatrix},\begin{bmatrix}0\\0\\1\\0\end{bmatrix},\begin{bmatrix}0\\0\\0\\1\end{bmatrix} $$
Each \(y^{(i)}\) represents a different class of image. The inner layers each provide us with new information, which leads to our final hypothesis:
$$ \begin{bmatrix}x_0\\x_1\\...\\x_n\end{bmatrix}\to \begin{bmatrix}a_0^{(2)}\\a_1^{(2)}\\...\\a_n^{(2)}\end{bmatrix}\to \begin{bmatrix}a_0^{(3)}\\a_1^{(3)}\\...\\a_n^{(3)}\end{bmatrix}\to ...\begin{bmatrix}h_{\Theta }(x)_1\\h_{\Theta }(x)_2\\h_{\Theta }(x)_3\\h_{\Theta }(x)_4\end{bmatrix} $$
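A small sketch of how such one-hot targets and predictions might look in NumPy (the class labels and output probabilities are made up):

```python
import numpy as np

# Hypothetical class labels 0..3 for a 4-class problem
labels = np.array([2, 0, 3, 1])

# One-hot encode: each y^(i) becomes one of the four unit vectors above
Y = np.eye(4)[labels]            # shape (m, 4)
print(Y)

# Given a hypothesis output h_Theta(x) (a 4-vector of "probabilities"),
# the predicted class is the index of the largest entry
h = np.array([0.1, 0.7, 0.15, 0.05])
print(np.argmax(h))              # -> 1
```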
4. Cost Function
We define a few variables that we need to use.
- \(L\) : total number of layers in the network
- \(s_l\) : number of units in layer \(l\)
- \(K\) : number of output units/classes
In regularized logistic regression, we have the following cost function:
$$ J(\theta )=-\frac{1}{m}\left[\sum _{i=1}^m y^{(i)}\log (h_{\theta }(x^{(i)}))+(1-y^{(i)})\log (1-h_{\theta }(x^{(i)}))\right]+\frac{\lambda }{2m}\sum _{j=1}^n\theta _j^2 $$
In a neural network we have multiclass classification, so we denote the \(k\)-th output of the hypothesis as \(h_\Theta(x)_k\). The cost function for neural networks is therefore more complicated:
$$ J(\Theta )=-\frac{1}{m}\left[\sum _{i=1}^m\sum _{k=1}^K y_k^{(i)}\log (h_{\Theta }(x^{(i)})_k)+(1-y_k^{(i)})\log (1-h_{\Theta }(x^{(i)})_k)\right]+\frac{\lambda }{2m}\sum _{l=1}^{L-1}\sum _{i=1}^{s_l}\sum _{j=1}^{s_{l+1}}(\Theta _{j,i}^{(l)})^2 $$
Because \(\Theta^{(l)}\) has dimension \(s_{l+1} \times s_l\) here, the inner sum over \(j\) is performed first.
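A minimal sketch of this regularized cost, assuming sigmoid activations everywhere and weight matrices whose first column multiplies the bias unit (that column is excluded from the regularization term):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Thetas, X, Y, lam):
    """Regularized cost J(Theta). X: (m, n) inputs, Y: (m, K) one-hot labels,
    Thetas: weight matrices whose first column multiplies the bias unit."""
    m = X.shape[0]
    A = X
    for Theta in Thetas:                                 # forward propagation
        A = np.hstack([np.ones((A.shape[0], 1)), A])     # add bias column
        A = sigmoid(A @ Theta.T)
    H = A                                                # (m, K) hypothesis outputs
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    reg = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return cost + reg
```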
5. Back Propagation
To minimize the neural network cost function, we use back propagation, just as we used gradient descent in regression problems. Back propagation computes the partial derivative terms:
$$ \frac{\partial }{\partial \Theta _{i,j}^{(l)}}J(\Theta ) $$
- \(i\) : row index of \(\Theta^{(l)}\)
- \(j\) : column index of \(\Theta^{(l)}\)
- \(l\) : layer index
We use forward propagation to get \(H(x)\). But to get the partial derivative terms of \(J\), we compute back propagation using the error terms \(\delta\).
$$ \delta ^{(4)}=a^{(4)}-y =h(x)-y $$
$$ \delta ^{(3)}=(\Theta ^{(3)})^T\delta ^{(4)}.*g'(z^{(3)}) $$
To generalize the terms above:
$$ \delta ^{(l)}=(\Theta ^{(l)})^T\delta ^{(l+1)}.*g'(z^{(l)})=(\Theta ^{(l)})^T\delta ^{(l+1)}.*a^{(l)}.*(1-a^{(l)}) $$
$$ l=2,\dots,L-1 $$
And also, from these \(\delta\) terms we can obtain the derivative terms, as summarized in Section 6.5.
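A small sketch that numerically checks \(g'(z)=a.*(1-a)\) for the sigmoid and applies one step of the \(\delta\) recursion; the shapes and values are hypothetical and bias handling is glossed over.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Numerically check g'(z) = a .* (1 - a) for the sigmoid
z = np.array([-1.0, 0.0, 2.0])
a = sigmoid(z)
eps = 1e-6
numeric_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric_grad, a * (1 - a)))            # -> True

# One step of the delta recursion with made-up shapes
delta4 = np.array([0.2, -0.1, 0.3, 0.05])                # delta^(4) = a^(4) - y
Theta3 = np.random.default_rng(1).normal(size=(4, 5))    # maps layer 3 -> layer 4
a3 = sigmoid(np.random.default_rng(2).normal(size=5))    # layer-3 activations
delta3 = (Theta3.T @ delta4) * a3 * (1 - a3)             # delta^(3)
print(delta3.shape)                                      # (5,)
```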
6. Summary
6.1 Multi-layer Perceptron
With the neural network above, we will review what we have studied. This network has 4 layers: one input layer, two hidden layers, and one output layer.
- \(l\) : layer index
- \(L\) : index of the final layer (the total number of layers, here \(L = 4\))
The vectorization of these layers is as below.
$$ x=a^{(1)}=\begin{bmatrix}x_0\\x_1\\x_2\end{bmatrix} $$
$$ a^{(2)}=\begin{bmatrix}a_0^{(2)}\\a_1^{(2)}\\a_2^{(2)}\\a_3^{(2)}\\a_4^{(2)}\end{bmatrix} $$
$$ a^{(3)}=\begin{bmatrix}a_0^{(3)}\\a_1^{(3)}\\a_2^{(3)}\\a_3^{(3)}\\a_4^{(3)}\end{bmatrix} $$
$$ a^{(4)}=\begin{bmatrix}a_0^{(4)}\\a_1^{(4)}\\a_2^{(4)}\\a_3^{(4)}\end{bmatrix}=\begin{bmatrix}h_{\Theta }(x)_1\\h_{\Theta }(x)_2\\h_{\Theta }(x)_3\\h_{\Theta }(x)_4\end{bmatrix} $$
6.2 Feed Forward Propagation
Forward propagation computes \(h_\Theta(x)\) from the weight parameters \(\Theta\). With the sigmoid function \(g(z)\), we obtain \(h_\Theta(x)\) as follows:
$$ x=a^{(1)} $$
$$ a^{(2)}=g(z^{(2)})=g(\Theta ^{(1)}a^{(1)}) $$
$$ a^{(3)}=g(z^{(3)})=g(\Theta ^{(2)}a^{(2)}) $$
$$ a^{(4)}=h_{\Theta }(x)=g(z^{(4)})=g(\Theta ^{(3)}a^{(3)}) $$
6.3 Parameter \(\Theta\)
Before constructing the parameter matrices \(\Theta\), we first write down the dimensions of the layer vectors.
$$ a^{(1)}=3\times 1 $$
$$ a^{(2)}=5\times 1 $$
$$ a^{(3)}=5\times 1 $$
$$ a^{(4)}=4\times 1 $$
The dimension of each \(\Theta\) matrix is (number of units in the upper layer) \(\times\) (number of units in the lower layer). If the lower layer vector does not already include a bias term, the dimension becomes (upper layer) \(\times\) (lower layer \(+ 1\)).
$$ \Theta^{(1)} = 5\times 3 $$
$$ \Theta^{(2)} = 5\times 5 $$
$$ \Theta^{(3)} = 4\times 5 $$
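Under this convention the shapes can be generated programmatically; a quick sketch (bias subtleties aside):

```python
import numpy as np

# Layer sizes as listed above (a^(1) through a^(4))
layer_sizes = [3, 5, 5, 4]

# Following the text's convention, Theta^(l) maps layer l to layer l+1,
# so its shape is (units in layer l+1) x (units in layer l)
Thetas = [np.zeros((layer_sizes[l + 1], layer_sizes[l]))
          for l in range(len(layer_sizes) - 1)]
print([T.shape for T in Thetas])   # [(5, 3), (5, 5), (4, 5)]
```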
6.4 Cost Function with Regularization
$$ J(\Theta )=-\frac{1}{m}\left[\sum _{i=1}^m\sum _{k=1}^4 y_k^{(i)}\log (h_{\Theta }(x^{(i)})_k)+(1-y_k^{(i)})\log (1-h_{\Theta }(x^{(i)})_k)\right]+\frac{\lambda }{2m}\sum _{l=1}^3\sum _{i=1}^{s_l}\sum _{j=1}^{s_{l+1}}(\Theta _{j,i}^{(l)})^2 $$
6.5 Back Propagation
To minimize the cost function \(J\), we need to calculate its derivative terms; this procedure is called back propagation.
Step 1 : Calculating feed forward propagation
$$ x=a^{(1)} $$
$$ a^{(2)}=g(z^{(2)})=g(\Theta ^{(1)}a^{(1)}) $$
$$ a^{(3)}=g(z^{(3)})=g(\Theta ^{(2)}a^{(2)}) $$
$$ a^{(4)}=h_{\Theta }(x)=g(z^{(4)})=g(\Theta ^{(3)}a^{(3)}) $$
Step 2 : Setting \(\delta\)
$$ \delta ^{(4)}=a^{(4)}-y=h(x)-y $$
$$ \delta ^{(3)}=(\Theta ^{(3)})^T\delta ^{(4)}.*g'(z^{(3)})=(\Theta ^{(3)})^T\delta ^{(4)}.*a^{(3)}.*(1-a^{(3)}) $$
$$ \delta ^{(2)}=(\Theta ^{(2)})^T\delta ^{(3)}.*g'(z^{(2)})=(\Theta ^{(2)})^T\delta ^{(3)}.*a^{(2)}.*(1-a^{(2)}) $$
$$ \Delta ^{(3)}:=\Delta ^{(3)}+\delta ^{(4)}(a^{(3)})^T $$
$$ \Delta ^{(2)}:=\Delta ^{(2)}+\delta ^{(3)}(a^{(2)})^T $$
$$ \Delta ^{(1)}:=\Delta ^{(1)}+\delta ^{(2)}(a^{(1)})^T $$
Step 3 : Setting Derivative Terms
With regularization (in practice the bias column \(j = 0\) is excluded from the \(\lambda\) term):
$$ \frac{\partial }{\partial \Theta ^{(l)}}J(\Theta ):=\frac{1}{m}(\Delta ^{(l)}+\lambda \Theta ^{(l)})=\frac{1}{m}(\delta ^{(l+1)}(a^{(l)})^T+\lambda \Theta ^{(l)}) $$
Without regularization:
$$ \frac{\partial }{\partial \Theta ^{(l)}}J(\Theta ):=\frac{1}{m}\Delta ^{(l)}=\frac{1}{m}\delta ^{(l+1)}(a^{(l)})^T $$
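Putting the three steps together, here is a rough sketch that accumulates the \(\Delta\) terms over the training set. It follows the simplified conventions of this post: no bias insertion between layers, and the \(\lambda\) term applied to the whole matrix rather than excluding the bias column.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single(x, y, Thetas):
    """Delta contributions of one training example (a sketch)."""
    # Step 1: forward propagation, keeping every layer's activations
    activations = [x]
    for Theta in Thetas:
        activations.append(sigmoid(Theta @ activations[-1]))
    # Step 2: error terms delta^(L), ..., delta^(2)
    deltas = [activations[-1] - y]                      # delta^(L) = a^(L) - y
    for Theta, a in zip(reversed(Thetas[1:]), reversed(activations[1:-1])):
        deltas.append((Theta.T @ deltas[-1]) * a * (1 - a))
    deltas.reverse()                                    # deltas[l] = delta^(l+2)
    # Step 3: Delta^(l) contribution delta^(l+1) (a^(l))^T for this example
    return [np.outer(deltas[l], activations[l]) for l in range(len(Thetas))]

def gradients(X, Y, Thetas, lam):
    """Average the Delta terms over all m examples and add regularization."""
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]
    for x, y in zip(X, Y):
        for D, d in zip(Deltas, backprop_single(x, y, Thetas)):
            D += d
    return [(D + lam * T) / m for D, T in zip(Deltas, Thetas)]
```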
'Data Science > Neural Network' 카테고리의 다른 글
[Tensorflow] Overfitting and Underfitting (0) | 2022.09.21 |
---|---|
[Tensorflow] Stochastic Gradient Descent (0) | 2022.09.21 |
[Tensorflow] Deep Neural Networks (0) | 2022.09.21 |
[Tensorflow] A Single Neuron (0) | 2022.09.20 |
[Theorem] Optimizing Neural Network (0) | 2022.09.19 |