This article explains what goes on inside an artificial neural network: those black boxes where, most of the time, we simply define a structure, feed in an input value and, as if by magic, obtain a prediction that is more or less close to reality.

Behind that magic, however, there is a series of mathematical operations that follow a very structured logic. In the following sections I explain them in a practical way, using both the mathematics and the Python programming language, and we build our own neural network that is able to learn a classification task.

What is a neural network?

  • A neural network is a system made up of interconnected elements called 'neurons', organized in layers, that process information through their dynamic state response to external inputs.
  • In this structure, the input layer, which has one neuron for each component of the input data, feeds patterns into the network and passes them on to one or more hidden layers (every layer of the network other than the input and output layers). The processing happens in the hidden layers, through a system of connections characterized by weights and biases (commonly written W and b).

How does it work?

  • Each neuron computes a weighted sum of the input values it receives and adds the bias. The result is passed through a pre-established activation function (described below), which decides how strongly the neuron is activated. The neuron then transmits this value to the neurons it is connected to, in a process called the forward pass ("PassForward"). At the end of this process, the last hidden layer is connected to the output layer, which has one neuron for each possible desired output.


z = \sum_{i=1}^{n} X_i W_i + b

a = F(z)
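As a minimal sketch of the two formulas above, the following plain-Python snippet computes the weighted sum and the activation of a single neuron. The function name `neuron_output` and the input, weight and bias values are assumptions chosen only for illustration, and the sigmoid used as F is described later in the article.

```python
import math

def sigmoid(z):
    """Sigmoid activation, used here as the function F."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Return (z, a) for one neuron: z = sum(x_i * w_i) + b and a = F(z)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return z, activation(z)

# Hypothetical values, chosen only for illustration.
z, a = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)
print(z, a)  # z = 0.0, a = 0.5
```

A full layer simply repeats this computation once per neuron, each with its own weights and bias.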

Activation functions

An activation function determines the output of a neuron, in the simplest case as a yes or no. It maps the resulting values into a range such as 0 to 1 or -1 to 1. Activation functions can be divided into two basic types:

  • Linear activation functions: these are rarely used in the hidden layers of deep learning models, because their output is not confined to any range and combining linear functions (by sum or composition) still gives a linear function, which limits what the network can represent.
  • Nonlinear activation functions: these are the ones used in neural networks. As we will see below, they allow the output to be confined to a range. Examples are the sigmoid and the hyperbolic tangent functions.
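The limitation of linear activations can be checked with a few lines of code. This sketch stacks two one-dimensional linear "layers" and shows that they behave exactly like a single linear layer; the parameter values and the function names are hypothetical.

```python
def linear_layer(x, w, b):
    """A one-input neuron with a linear (identity) activation."""
    return w * x + b

# Hypothetical parameters for two stacked linear layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def stacked(x):
    """Output of layer 2 applied to the output of layer 1."""
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# The same mapping written as ONE linear layer:
# w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_eq = w2 * w1
b_eq = w2 * b1 + b2

def collapsed(x):
    return linear_layer(x, w_eq, b_eq)

for x in (-2.0, 0.0, 1.5):
    print(x, stacked(x), collapsed(x))
```

However many linear layers we stack, the result can always be rewritten this way, so the extra layers add no expressive power.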

Sigmoid function

  • The main reason to use the sigmoid function is that its output lies between 0 and 1. It is therefore used especially in models that must predict a probability, since a probability only takes values between 0 and 1.

sigmoid(x) = \frac{1}{1 + e^{-x}}

sigmoid'(x) = sigmoid(x) (1 - sigmoid(x))

Hyperbolic tangent (tanh) function

  • It is similar to the sigmoid but produces outputs in the range [-1, +1]. It is also a continuous function; in other words, it produces a result for every value of x.

cosh(x) = \frac{e^{x} + e^{-x}}{2}

tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

tanh'(x) = \frac{1}{\cosh^2{x}}

ReLU (Rectified Linear Unit) function

  • ReLU is currently one of the most widely used activation functions; it appears in almost all convolutional neural networks and deep learning models.
  • ReLU is half rectified (from below): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.
  • It is used in the hidden layers of our neural network, NOT in the output layer.

relu(x) = \max(0, x)

relu'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}

We create the functions in Python
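A minimal sketch of these functions in plain Python, one scalar at a time, could look as follows. The function names are assumptions made for this example. Two details go slightly beyond the formulas: the sigmoid is written in two branches so that very negative inputs do not overflow, and the tanh derivative uses the identity 1 - tanh(x)^2 = 1 / cosh(x)^2 for the same reason.

```python
import math

def sigmoid(x):
    """Sigmoid, split in two branches to avoid overflow in math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_prime(x):
    """Derivative of the sigmoid with respect to its input x."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    return math.tanh(x)

def tanh_prime(x):
    """Equal to 1 / cosh(x)**2, written so that large |x| cannot overflow."""
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def relu_prime(x):
    """1 for x > 0, otherwise 0 (the value at x == 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

print(sigmoid(0.0), sigmoid_prime(0.0))  # 0.5 0.25
print(tanh(0.0), tanh_prime(0.0))        # 0.0 1.0
print(relu(-2.0), relu(3.0))             # 0.0 3.0
```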

How does our neural network learn?

The final step of the forward pass is to evaluate the predicted output (Yp) against the expected output (Yr). The output Yr is part of the training data set (x, y), where x is the input (as we saw in the previous section). Yp and Yr are compared by means of a cost function. For this exercise we have used two cost functions:

  • MSE (mean squared error)
  • Binary cross entropy

We call this cost function C and denote it as follows:

C = cost(Y_p, Y_r)

Y_p = \text{value predicted by our network}

Y_r = \text{real value of the data}

Here cost can be MSE, cross entropy or any other cost function. Based on the value of C, the model "knows" how much to adjust its parameters (weights and biases) to approach the expected output Yr. This is done with the backpropagation algorithm.

Cost functions and their derivatives

MSE (mean squared error)

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2

MSE' = \frac{\partial MSE}{\partial \hat{Y_i}} = -\frac{2}{n} (Y_i - \hat{Y_i})

Binary cross entropy

H(y, P) = \begin{cases} -\log(P) & \text{if } y = 1 \\ -\log(1 - P) & \text{if } y = 0 \end{cases}

These two expressions can be joined into one to obtain a single cost function

H(y, P) = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log(P_i) + (1 - y_i) \log(1 - P_i) \right)

H'(y, P) = -\left( \frac{y}{P} - \frac{1 - y}{1 - P} \right)

Python loss functions
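The two cost functions and their derivatives can be sketched in plain Python as below. The function names and the `eps` clipping constant are assumptions made for this example. Note that both derivative functions return the gradient of the averaged cost with respect to each prediction, so the per-sample cross-entropy derivative shown above appears here divided by n.

```python
import math

def mse(y_real, y_pred):
    """Mean squared error over n samples."""
    n = len(y_real)
    return sum((yr - yp) ** 2 for yr, yp in zip(y_real, y_pred)) / n

def mse_prime(y_real, y_pred):
    """Gradient of the MSE with respect to each prediction."""
    n = len(y_real)
    return [2.0 * (yp - yr) / n for yr, yp in zip(y_real, y_pred)]

def binary_cross_entropy(y_real, y_pred, eps=1e-12):
    """Binary cross entropy; predictions are clipped to avoid log(0)."""
    n = len(y_real)
    total = 0.0
    for yr, yp in zip(y_real, y_pred):
        p = min(max(yp, eps), 1.0 - eps)
        total += yr * math.log(p) + (1.0 - yr) * math.log(1.0 - p)
    return -total / n

def binary_cross_entropy_prime(y_real, y_pred, eps=1e-12):
    """Gradient of the averaged cross entropy with respect to each prediction."""
    n = len(y_real)
    grads = []
    for yr, yp in zip(y_real, y_pred):
        p = min(max(yp, eps), 1.0 - eps)
        grads.append(-(yr / p - (1.0 - yr) / (1.0 - p)) / n)
    return grads

# Hypothetical labels and predictions, for illustration only.
y_real = [1.0, 0.0, 1.0]
y_pred = [0.9, 0.2, 0.6]
print(mse(y_real, y_pred), binary_cross_entropy(y_real, y_pred))
```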

Backpropagation and gradient descent

Backpropagation aims to minimize the cost function by adjusting the weights (w) and biases (b) of the network. The size of each adjustment is determined by the gradients (derivatives) of the cost function with respect to those parameters.

The derivative of a function C measures how sensitive the value of the function (the output) is to a change in its argument x (the input). In other words, the derivative tells us in which direction C is changing.

The gradient shows in which direction, and by how much, the parameter x must change (positive or negative) to minimize C.

To calculate these gradients we use the chain rule.

Derivative of the cost function with respect to the weight

  • It can be expressed with the chain rule, multiplying the derivative of the cost with respect to the weighted sum (z) by the derivative of z with respect to the weight (w).

\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}}

By definition we know that:

z^l_j = \sum_{k=1}^{m} w^l_{jk} a^{l-1}_k + b^l_j

Calculating the derivative, we can say that:

\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k

Final value:

\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} a^{l-1}_k

l = \text{layer number}

j = \text{index of the neuron in layer } l

k = \text{index of the neuron in layer } l-1

Derivative of the cost function with respect to the bias

  • It can be expressed with the chain rule, multiplying the derivative of the cost with respect to the weighted sum (z) by the derivative of z with respect to the bias (b).

\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j}

Calculating the derivative, we can say that:

\frac{\partial z^l_j}{\partial b^l_j} = 1

Final value:

\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \cdot 1

The common part in both equations is often called "local gradient" and is expressed as follows:

\delta^l_j = \frac{\partial C}{\partial z^l_j}

Chain rule:

\delta^l_j = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j}

Compute the error of the previous layer

\delta^{l-1} = \delta^l w^l \frac{\partial a^{l-1}}{\partial z^{l-1}}

Once we have worked out the partial derivatives, we can adjust the parameters of our network.
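To make these formulas concrete, here is a sketch of backpropagation on the smallest possible network: one input, one hidden neuron and one output neuron, both with sigmoid activations and a squared-error cost on a single sample. The function names, parameter values and the finite-difference check are assumptions added for illustration. The check compares each backpropagated gradient with a numerical estimate, which is a common way to confirm that the chain rule has been applied correctly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    """Forward pass; returns the activations of both layers."""
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return a1, a2

def cost(a2, y):
    return (a2 - y) ** 2

def backward(x, y, w1, b1, w2, b2):
    """Gradients of the cost with respect to every parameter."""
    a1, a2 = forward(x, w1, b1, w2, b2)
    # Output layer: delta = dC/da * da/dz
    delta2 = 2.0 * (a2 - y) * a2 * (1.0 - a2)
    # Previous layer: delta^{l-1} = delta^l * w^l * da^{l-1}/dz^{l-1}
    delta1 = delta2 * w2 * a1 * (1.0 - a1)
    # dC/dw = delta * (activation feeding the weight), dC/db = delta
    return {"w1": delta1 * x, "b1": delta1, "w2": delta2 * a1, "b2": delta2}

# Hypothetical parameters and training sample.
params = {"w1": 0.4, "b1": -0.1, "w2": -0.7, "b2": 0.2}
x, y = 1.5, 1.0
grads = backward(x, y, **params)

def numeric_grad(name, h=1e-6):
    """Central finite-difference estimate of dC/d(name)."""
    plus = dict(params)
    plus[name] += h
    minus = dict(params)
    minus[name] -= h
    c_plus = cost(forward(x, **plus)[1], y)
    c_minus = cost(forward(x, **minus)[1], y)
    return (c_plus - c_minus) / (2.0 * h)

for name in params:
    print(name, grads[name], numeric_grad(name))
```

The two columns printed for each parameter should agree to several decimal places.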

We update the bias parameter using the gradient vector

b_{l+1} = b_l - \epsilon \frac{\partial C}{\partial b}

\epsilon = \text{learning rate (how far we move along the cost function in each iteration)}

Here the subscript denotes the update step, not the layer.

We update the weight parameter using the gradient vector

w_{l+1} = w_l - \epsilon \frac{\partial C}{\partial w}

\epsilon = \text{learning rate (how far we move along the cost function in each iteration)}

We repeat this process iteratively until we manage to minimize the error in our cost function. This is the gradient descent algorithm.
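The update rule can be seen in isolation on a one-parameter problem. In this sketch the cost C(w) = (w - 3)^2 is a hypothetical stand-in for the cost of a network, chosen because its minimum (w = 3) is known in advance; the function names, the starting point and the learning rate are likewise assumptions.

```python
def cost(w):
    """Toy cost with a known minimum at w = 3."""
    return (w - 3.0) ** 2

def cost_prime(w):
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    """Apply w <- w - learning_rate * dC/dw repeatedly; return w and the cost history."""
    history = [cost(w)]
    for _ in range(steps):
        w = w - learning_rate * cost_prime(w)
        history.append(cost(w))
    return w, history

w_final, history = gradient_descent(w=0.0, learning_rate=0.1, steps=100)
print(w_final, history[0], history[-1])
```

With this cost, each step shrinks the distance to the minimum by a factor of (1 - 2 * learning_rate), so a learning rate of 0.1 converges steadily, while a value above 1.0 would make the iterations diverge.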

We program the Neural Network in Python
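The following is not the code from the project repository but a minimal sketch, in plain Python, that puts the previous pieces together: forward pass, MSE cost, backpropagation and gradient descent. The class names `Layer` and `Network`, the layer sizes, the learning rate, the number of epochs and the toy data set (the logical OR function) are all assumptions made for this example.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class Layer:
    """Fully connected layer with a sigmoid activation."""

    def __init__(self, n_inputs, n_neurons, rng):
        self.w = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
                  for _ in range(n_neurons)]
        self.b = [0.0] * n_neurons

    def forward(self, inputs):
        self.inputs = inputs
        self.a = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
                  for row, b in zip(self.w, self.b)]
        return self.a

    def backward(self, grad_a, lr):
        """Take dC/da for this layer, update w and b, return dC/da of the previous layer."""
        delta = [g * a * (1.0 - a) for g, a in zip(grad_a, self.a)]
        # Computed before the update so that it uses the current weights.
        grad_inputs = [sum(d * row[k] for d, row in zip(delta, self.w))
                       for k in range(len(self.inputs))]
        for j, d in enumerate(delta):
            for k, x in enumerate(self.inputs):
                self.w[j][k] -= lr * d * x
            self.b[j] -= lr * d
        return grad_inputs

class Network:
    def __init__(self, sizes, seed=0):
        rng = random.Random(seed)
        self.layers = [Layer(n_in, n_out, rng)
                       for n_in, n_out in zip(sizes[:-1], sizes[1:])]

    def predict(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def train_step(self, x, y, lr):
        """One forward and backward pass on a single sample; returns its MSE."""
        pred = self.predict(x)
        loss = sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
        grad = [2.0 * (p - t) / len(y) for p, t in zip(pred, y)]
        for layer in reversed(self.layers):
            grad = layer.backward(grad, lr)
        return loss

# Toy classification data: logical OR.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]

net = Network([2, 3, 1], seed=0)
history = []
for epoch in range(2000):
    epoch_loss = sum(net.train_step(x, y, lr=0.5) for x, y in data) / len(data)
    history.append(epoch_loss)

print(f"loss: {history[0]:.4f} -> {history[-1]:.4f}")
for x, y in data:
    print(x, y, net.predict(x))
```

This version updates the parameters after every sample and loops over Python lists for readability; a practical implementation would normally vectorize these operations with a numerical library.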

The complete project is available at:

https://github.com/jmcalvomartin/python/tree/master/projects/Create_ANN