1024programmer Blog Deep learning_ideal not flash fire blog


What is deep learning?

Deep learning can be understood as the combination of the two words "deep" and "learning". "Deep" refers to the number of layers in the neural network: generally speaking, the more layers a network has (the deeper it is), the better it can learn. "Learning" refers to the network's ability to automatically adjust parameters such as weights and biases to achieve a better fit.

[Figure: an example of a 2-layer neural network]
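As a concrete illustration, here is a minimal sketch of the forward pass of such a 2-layer network; the layer sizes, weights, and input values are arbitrary assumptions chosen for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # layer 1 parameters
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # layer 2 parameters

x = np.array([0.5, -0.2])        # one input sample
h = sigmoid(x @ W1 + b1)         # hidden layer: affine transform + activation
y = h @ W2 + b2                  # output layer (identity activation)
print(y.shape)                   # (1,)
```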
What can deep learning be applied to?

The most typical and widespread application of deep learning is image recognition. In addition, deep learning can also be applied to speech, natural language and other fields.

What deep learning frameworks are there now?

The current mainstream deep learning frameworks include: TensorFlow, Keras, Caffe, PyTorch, etc.

What is the relationship between deep learning and neural networks?

Deep learning is learning carried out with deep neural networks, i.e., networks with many layers. Within deep learning there are various kinds of neural networks, the most famous of which is the convolutional neural network (CNN).

What is the relationship between a neural network and a perceptron?

The perceptron is the origin of the neural network; a neural network can be seen as a multi-layer perceptron that uses a nonlinear activation function (such as the sigmoid function).

What activation functions are there in neural networks?

Activation functions are divided into two categories:

Activation function of the hidden layer: sigmoid function, ReLU function, etc.

Activation function of the output layer: softmax function, identity function, etc.
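Each of these functions is only a few lines of NumPy; a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def identity(x):
    return x

a = np.array([0.3, 2.9, 4.0])
print(softmax(a))   # non-negative probabilities that sum to 1
```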

What are the parameters in a neural network?

The parameters in the neural network are mainly divided into two categories:

Common parameters (updated automatically): weights, biases

Hyperparameters (manually set): learning rate, batch size, number of neurons in each layer, weight decay coefficient, etc.

What is generalization ability?

Generalization ability refers to the ability of a neural network to process unobserved data (data not included in the training set). Obtaining generalization ability is the ultimate goal of deep learning.

What is overfitting?

The opposite of generalization ability is overfitting: the network performs well only on the training data set but cannot handle other, similar data. Overfitting must be avoided.

Typical manifestations of overfitting are as follows:

[Figure: training-set vs. test-set recognition rates over epochs]
As the figure shows, the recognition rate on the training set reaches 100% (under normal circumstances it would not reach 100%), while the recognition rate on the test set is only about 70%. This gap indicates that overfitting has occurred.
How to suppress overfitting?

Overfitting needs to be avoided. Generally, there are two ways to suppress overfitting:

Weight decay

Dropout
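A minimal sketch of both techniques (the function names here are illustrative assumptions, not from any particular framework):

```python
import numpy as np

def weight_decay_grad(W, grad_W, lam=1e-4):
    """Weight decay adds an L2 penalty (lam/2 * ||W||^2) to the loss,
    which simply adds lam * W to the weight gradient."""
    return grad_W + lam * W

def dropout(h, p=0.5, training=True, rng=None):
    """Inverted dropout: during training, zero each activation with
    probability p and rescale, so the expected activation is unchanged.
    At test time the input passes through untouched."""
    if not training:
        return h
    rng = rng or np.random.default_rng(0)
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)
```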

What is a loss function?

Neural network learning is the process of automatically finding the optimal weight and bias parameters from the training data. But how do we evaluate whether parameters are optimal? This is where the loss function comes in: it serves as the indicator of learning. The smaller the value of the loss function, the better the network has learned, meaning the current parameters are closer to optimal.

What are the main loss functions?

There are two commonly used loss functions:

Mean squared error

Cross-entropy error

In general, the cross-entropy error is used as the loss function when the output layer uses the softmax function, and the mean squared error when it uses the identity function.
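Both losses can be written directly, assuming one-hot labels `t` and network output `y`:

```python
import numpy as np

def mean_squared_error(y, t):
    # y: predicted values, t: target values
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy_error(y, t):
    # y: softmax output probabilities, t: one-hot labels;
    # a small epsilon avoids taking log(0)
    return -np.sum(t * np.log(y + 1e-7))

t = np.array([0, 0, 1, 0])               # one-hot label: class 2
y = np.array([0.1, 0.05, 0.8, 0.05])     # network output
print(cross_entropy_error(y, t))         # ≈ 0.223 (i.e. -ln 0.8)
```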

What are the methods for finding the optimal parameters?

The purpose of neural network learning is to find the weight and bias parameters that make the loss function as small as possible, i.e., the optimal parameters. There are four main methods for finding them:

Stochastic Gradient Descent (SGD): The most widely used

Momentum

AdaGrad

Adam: The overall performance is the best

The parameter update paths of these four methods can be compared visually:

[Figure: parameter update paths of SGD, Momentum, AdaGrad, and Adam]
What is the main idea of ​​stochastic gradient descent (SGD)?

Using the gradient of the loss function with respect to the weight and bias parameters as a clue, update the parameters in the direction opposite to the gradient, and repeat this step many times to gradually approach the optimal parameters.
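The SGD update rule is one line per parameter; a minimal sketch (the parameter names and values are illustrative):

```python
import numpy as np

def sgd_update(params, grads, lr=0.01):
    """One SGD step: move each parameter a small step against its gradient."""
    for key in params:
        params[key] -= lr * grads[key]

params = {"W": np.array([1.0, 2.0]), "b": np.array([0.5])}
grads  = {"W": np.array([0.1, -0.2]), "b": np.array([0.3])}
sgd_update(params, grads, lr=0.1)
print(params["W"])   # [0.99 2.02]
```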

What methods are there to find the gradient in the stochastic gradient descent method?

The gradient here refers specifically to the gradient of the loss function with respect to the weight and bias parameters. There are two main ways to compute it:

Numerical differentiation

Error backpropagation method: faster
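Numerical differentiation approximates each partial derivative with a central difference; a minimal sketch (slow, but useful for checking backpropagation):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    """Central difference: df/dx_i ≈ (f(x + h·e_i) - f(x - h·e_i)) / (2h)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        orig = x.flat[i]
        x.flat[i] = orig + h
        fxh1 = f(x)
        x.flat[i] = orig - h
        fxh2 = f(x)
        grad.flat[i] = (fxh1 - fxh2) / (2 * h)
        x.flat[i] = orig          # restore the original value
    return grad

f = lambda v: np.sum(v ** 2)      # gradient of sum of squares is 2x
print(numerical_gradient(f, np.array([3.0, 4.0])))   # ≈ [6. 8.]
```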

What is the overall process of neural network learning?

Neural network learning consists of the following four steps:

Step 1 (mini-batch): randomly select a part of the data from the training data

Step 2 (calculate the gradient): use the error backpropagation method to calculate the gradient of the loss function with respect to each weight parameter

Step 3 (update parameters): make a small update to the weight parameters in the direction opposite to the gradient

Step 4 (repeat): Repeat steps 1, 2, 3.
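The four steps can be sketched end-to-end on a toy linear model; the model, data, and hyperparameters here are illustrative assumptions, and the gradient in step 2 is computed analytically rather than by backpropagation through layers:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 2))
true_W = np.array([[2.0], [-1.0]])
t_train = x_train @ true_W          # targets from a known linear rule

W = np.zeros((2, 1))                # parameters to learn
lr, batch_size = 0.1, 20
for _ in range(300):
    # Step 1 (mini-batch): randomly select part of the training data
    idx = rng.choice(len(x_train), batch_size, replace=False)
    xb, tb = x_train[idx], t_train[idx]
    # Step 2 (gradient): gradient of the MSE loss w.r.t. W
    grad = xb.T @ (xb @ W - tb) / batch_size
    # Step 3 (update): small step against the gradient
    W -= lr * grad
    # Step 4 (repeat): the loop continues

print(W.round(2))   # close to [[2.], [-1.]]
```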

How to set the initial value of the weight?

Although the weight parameters are updated automatically, their initial values are still very important: a good initial value can speed up learning.

There are three main weight initial values:

Gaussian distribution with a standard deviation of 0.01

Xavier initial value: suitable when the activation function is the sigmoid function

He initial value: suitable when the activation function is the ReLU function
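The three initializations differ only in the standard deviation of the Gaussian from which the weights are drawn; a minimal sketch (the function name is an illustrative assumption):

```python
import numpy as np

def init_weight(n_in, n_out, method="he", rng=None):
    """Draw an (n_in, n_out) weight matrix from a zero-mean Gaussian whose
    standard deviation depends on the chosen initialization scheme."""
    rng = rng or np.random.default_rng(0)
    if method == "small":       # fixed standard deviation of 0.01
        std = 0.01
    elif method == "xavier":    # Xavier: 1 / sqrt(n_in), for sigmoid layers
        std = 1.0 / np.sqrt(n_in)
    elif method == "he":        # He: sqrt(2 / n_in), for ReLU layers
        std = np.sqrt(2.0 / n_in)
    else:
        raise ValueError(f"unknown method: {method}")
    return rng.normal(0.0, std, size=(n_in, n_out))

W = init_weight(100, 50, method="he")
print(W.std())   # ≈ sqrt(2/100) ≈ 0.14
```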

Using Adam to update the parameters, the three weight initializations can be compared on the MNIST data set:

[Figure: learning curves on MNIST for the three initializations]

The He initial value shows the best overall performance.

This article is from the internet and does not represent 1024programmer's position. Please indicate the source when reprinting: https://www.1024programmer.com/deep-learning_ideal-not-flash-fire-blog/

author: admin
