Neural networks: types, principles of operation, and areas of application

    In the first half of 2016, the world heard about many developments in the field of neural networks: Google (AlphaGo, a network that plays Go), Microsoft (a number of image-identification services), and the startups MSQRD, Prisma, and others demonstrated their algorithms.


    The editors of the site explain what neural networks are, what they are needed for, why they have taken over the planet now rather than years earlier or later, how much can be earned on them, and who the main market players are. Experts from MIPT, Yandex, Mail.Ru Group, and Microsoft also shared their opinions.

    What are neural networks and what problems can they solve?

    Neural networks are one of the directions of development of artificial intelligence systems. The idea is to simulate as closely as possible the workings of the human nervous system, namely its ability to learn and correct mistakes. This is the main feature of any neural network: it is capable of learning independently and acting on the basis of previous experience, making fewer mistakes each time.

    The neural network imitates not only the activity but also the structure of the human nervous system. Such a network consists of a large number of individual computing elements (“neurons”). In most cases, each “neuron” belongs to a specific layer of the network. The input data is processed sequentially at every layer of the network. The parameters of each “neuron” can change depending on the results obtained on previous sets of input data, thereby changing the behavior of the entire system.

    The head of the Mail.ru Search department at Mail.Ru Group, Andrey Kalinin, notes that neural networks are capable of solving the same problems as other machine learning algorithms, the difference lies only in the approach to training.

    All tasks that neural networks can solve are somehow related to learning. Among the main areas of application of neural networks are forecasting, decision making, pattern recognition, optimization, and data analysis.

    Director of technological cooperation programs at Microsoft in Russia Vlad Shershulsky notes that neural networks are now used everywhere: “For example, many large Internet sites use them to make reactions to user behavior more natural and useful to their audience. Neural networks underlie most modern systems for speech recognition and synthesis, as well as image recognition and processing. They are used in some navigation systems, whether industrial robots or driverless cars. Neural network-based algorithms protect information systems from malicious attacks and help identify illegal content on the network.”

    In the near future (5-10 years), Shershulsky believes, neural networks will be used even more widely:

    Imagine an agricultural combine whose actuators are equipped with many video cameras. It takes five thousand pictures per minute of each plant along its path and, using a neural network, analyzes whether it is a weed and whether it is affected by disease or pests. Each plant is treated individually. Fiction? Not anymore, really. And in five years it may become the norm. - Vlad Shershulsky, Microsoft

    Mikhail Burtsev, head of the laboratory of neural systems and deep learning at the MIPT Center for Living Systems, provides a tentative map of the development of neural networks for 2016-2018:

    • systems for recognizing and classifying objects in images;
    • voice interaction interfaces for the Internet of things;
    • service quality monitoring systems in call centers;
    • systems for identifying problems (including predicting maintenance time), anomalies, cyber-physical threats;
    • intellectual security and monitoring systems;
    • replacing some of the functions of call center operators with bots;
    • video analytics systems;
    • self-learning systems that optimize the control of material flows or the placement of objects (in warehouses, transport);
    • intelligent, self-learning control systems for production processes and devices (including robotics);
    • the emergence of universal on-the-fly translation systems for conferences and personal use;
    • the emergence of technical support bot consultants or personal assistants with functions similar to a person's.

    Director of Technology Distribution at Yandex Grigory Bakunov believes that the basis for the spread of neural networks in the next five years will be the ability of such systems to make various decisions: “The main thing that neural networks now do for a person is save him from unnecessary decision-making. They can be used almost anywhere a living person makes not particularly intelligent decisions. In the next five years it is this skill that will be exploited, replacing human decision-making with a simple machine.”

    Why have neural networks become so popular right now?

    Scientists have been developing artificial neural networks for more than 70 years. The first attempt to formalize a neural network dates back to 1943, when two American scientists, Warren McCulloch and Walter Pitts, presented a paper on the logical calculus of ideas immanent in nervous activity.

    However, until recently, says Andrey Kalinin from Mail.Ru Group, the speed of neural networks was too low for them to become widespread, and therefore such systems were mainly used in developments related to computer vision, and in other areas other machine learning algorithms were used.

    A labor-intensive and time-consuming part of the neural network development process is its training. In order for a neural network to correctly solve the assigned problems, it is required to “run” its work on tens of millions of sets of input data. It is with the advent of various accelerated learning technologies that Andrei Kalinin and Grigory Bakunov associate the spread of neural networks.

    The main thing that has happened now is that various tricks have appeared that make it possible to create neural networks that are much less susceptible to overfitting. - Grigory Bakunov, Yandex

    “Firstly, a large and publicly available collection of labeled images (ImageNet) has appeared, on which networks can be trained. Secondly, modern video cards make it possible to train neural networks and use them hundreds of times faster. Thirdly, ready-made, pre-trained neural networks that recognize images have appeared, on the basis of which you can create your own applications without having to spend a long time preparing a neural network for work. All this ensures very powerful development of neural networks specifically in the field of image recognition,” notes Kalinin.

    What is the size of the neural network market?

    “Very easy to calculate. You can take any field that uses low-skill labor, such as call center agents, and simply subtract all human resources. I would say that we are talking about a multi-billion dollar market, even within a single country. It is easy to understand how many people in the world are employed in low-skilled jobs. So, even speaking very abstractly, I think we are talking about a hundred-billion-dollar market all over the world,” says Grigory Bakunov, director of technology distribution at Yandex.

    According to some estimates, more than half of the professions will be automated - this is the maximum volume by which the market for machine learning algorithms (and neural networks in particular) can be increased. - Andrey Kalinin, Mail.Ru Group

    “Machine learning algorithms are the next step in automating any processes, in developing any software. Therefore, the market at least coincides with the entire software market, but rather exceeds it, because it becomes possible to make new intelligent solutions that are inaccessible to old software,” continues Andrey Kalinin, head of the Mail.ru Search department at Mail.Ru Group.

    Why neural network developers create mobile applications for the mass market

    In the last few months, several high-profile entertainment projects using neural networks have appeared on the market: the popular video service MSQRD, acquired by the social network Facebook; the Russian image-processing application Prisma (which received investment from Mail.Ru Group in June); and others.

    Both Google (the AlphaGo technology beat the champion at Go; in March 2016 the corporation auctioned off 29 paintings drawn by neural networks, and so on), Microsoft (the CaptionBot project, which recognizes images in photographs and automatically generates captions for them; the WhatDog project, which determines a dog's breed from a photograph; the HowOld service, which determines a person's age from a photo, and so on), and Yandex (in June the team integrated a service for recognizing cars in photographs into the Avto.ru application; presented a music album recorded by neural networks; and in May created the LikeMo.net project for drawing in the style of famous artists) have demonstrated the abilities of their own neural networks.

    Such entertainment services are created not to solve the global problems that neural networks are aimed at, but to demonstrate a neural network's capabilities and carry out its training.

    "Games - characteristic feature our behavior as a species. On the one hand, almost everything can be simulated using game situations. typical scenarios human behavior, and on the other hand, game creators and, especially, players can get a lot of pleasure from the process. There is also a purely utilitarian aspect. A well-designed game not only brings satisfaction to the players: as they play, they train the neural network algorithm. After all, neural networks are based on learning by example,” says Vlad Shershulsky from Microsoft.

    “First of all, this is done to show the capabilities of the technology. There is really no other reason. If we are talking about Prisma, then it is clear why they did it. The guys built some kind of pipeline that allows them to work with pictures. To demonstrate this, they chose a fairly simple method of creating stylizations. Why not? This is just a demonstration of how the algorithms work,” says Grigory Bakunov from Yandex.

    Andrey Kalinin from Mail.Ru Group has a different opinion: “Of course, this is impressive from the public’s point of view. On the other hand, I wouldn't say that entertainment products can't be applied to more useful areas. For example, the task of stylizing images is extremely relevant for a number of industries (design, computer games, animation are just a few examples), and full use of neural networks can significantly optimize the cost and methods of creating content for them.”

    Major players in the neural networks market

    As Andrey Kalinin notes, by and large most of the neural networks on the market are not much different from each other. “Everyone's technology is approximately the same. But using neural networks is a pleasure that not everyone can afford. To independently train a neural network and run many experiments on it, you need large training sets and a fleet of machines with expensive video cards. It is obvious that only large companies have such opportunities,” he says.

    Among the main market players, Kalinin mentions Google and its division Google DeepMind, which created the AlphaGo network, and Google Brain. Microsoft has its own developments in this area - they are carried out by the Microsoft Research laboratory. The creation of neural networks is carried out at IBM, Facebook (a division of Facebook AI Research), Baidu (Baidu Institute of Deep Learning) and others. Many developments are being carried out at technical universities around the world.

    Yandex Technology Distribution Director Grigory Bakunov notes that interesting developments in the field of neural networks are also found among startups. “I would mention, for example, the company ClarifAI. It is a small startup founded by people from Google. Now they are perhaps the best in the world at identifying the contents of a picture.” Such startups also include MSQRD, Prisma, and others.

    In Russia, developments in the field of neural networks are carried out not only by startups, but also by large technology companies - for example, the Mail.Ru Group holding uses neural networks for processing and classifying texts in Search and image analysis. The company is also conducting experimental developments related to bots and conversational systems.

    Yandex is also creating its own neural networks: “Basically, such networks are already used in working with images and sound, but we are exploring their capabilities in other areas. Now we are doing a lot of experiments in using neural networks in working with text.” Developments are being carried out at universities: Skoltech, MIPT, Moscow State University, Higher School of Economics and others.

    The tutorial below works through a classic toy problem, exclusive or (XOR): the neural network takes two numbers as input and must output a third number, the answer. Now about the neural networks themselves.

    What is a neural network?


    A neural network is a sequence of neurons connected by synapses. The structure of a neural network came to the world of programming straight from biology. Thanks to this structure, the machine gains the ability to analyze and even memorize various information. Neural networks are also capable not only of analyzing incoming information, but also of reproducing it from their memory. In other words, a neural network is a machine interpretation of the human brain, which contains millions of neurons transmitting information in the form of electrical impulses.

    What types of neural networks are there?

    For now, we will consider examples using the most basic type of neural network, the feedforward network. In subsequent articles I will introduce more concepts and tell you about recurrent neural networks. A feedforward network, as the name suggests, connects its neural layers in series, and information in it always flows in only one direction.

    What are neural networks for?

    Neural networks are used to solve complex problems that require analytical calculations similar to what the human brain does. The most common applications of neural networks are:

    Classification: distribution of data by parameters. For example, you are given a set of people as input and need to decide which of them to give credit to and which not. This work can be done by a neural network that analyzes information such as age, solvency, credit history, and so on.

    Prediction: the ability to predict the next step. For example, the rise or fall of a stock based on the situation in the stock market.

    Recognition: currently the most widespread application of neural networks. It is used in Google when you search by photo, or in phone cameras when they detect the position of your face and highlight it, and much more.

    Now, to understand how neural networks work, let's take a look at its components and their parameters.

    What is a neuron?


    A neuron is a computational unit that receives information, performs simple calculations on it, and passes it on. Neurons are divided into three main types: input (blue), hidden (red), and output (green). There is also a bias neuron and a context neuron, which we will talk about in the next article. When a neural network consists of a large number of neurons, the term layer is introduced. Accordingly, there is an input layer that receives information, n hidden layers (usually no more than 3) that process it, and an output layer that outputs the result. Each neuron has 2 main parameters: input data and output data. In the case of an input neuron, input = output. For the rest, the input field contains the combined information from all neurons of the previous layer, after which it is normalized using the activation function (for now just imagine it as f(x)) and ends up in the output field.


    It is important to remember that neurons operate on numbers in the range [0,1] or [-1,1]. But how, you ask, do you then handle numbers that fall outside this range? At this stage, the simplest answer is to divide 1 by that number. This process is called normalization, and it is used very often in neural networks. More on this a little later.
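    As a quick sketch (my addition, not from the article): the 1/x trick maps a large positive number into (0, 1], while min-max rescaling, a more conventional alternative, squashes a whole list of values into [0, 1].

```python
def reciprocal_squash(x):
    # The article's quick trick: divide 1 by a number larger than 1
    # to land inside (0, 1]. Only meaningful for positive inputs.
    return 1.0 / x

def min_max_normalize(values):
    # A more conventional alternative (my addition): rescale a whole
    # list of numbers into the range [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(reciprocal_squash(4))           # 0.25
print(min_max_normalize([0, 5, 10]))  # [0.0, 0.5, 1.0]
```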

    What is a synapse?


    A synapse is a connection between two neurons. A synapse has one parameter: weight. Thanks to it, input information changes as it is transmitted from one neuron to another. Say there are 3 neurons that pass information to the next one. Then we have 3 weights, one for each of these neurons, and the information from the neuron with the larger weight will dominate in the next neuron (an example is color mixing). In fact, the set of weights of a neural network, the weight matrix, is a kind of brain of the entire system. It is thanks to these weights that the input information is processed and turned into a result.

    It is important to remember that during initialization of the neural network the weights are set randomly.

    How does a neural network work?


    This example shows part of a neural network, where the letter I denotes input neurons, the letter H a hidden neuron, and the letter w the weights. The formula shows that the input information is the sum of all input data multiplied by their corresponding weights. Now let's feed in 1 and 0 as input, and let w1 = 0.4 and w2 = 0.7. The input of neuron H1 will then be: 1*0.4 + 0*0.7 = 0.4. Now that we have the input, we can get the output by plugging the input into the activation function (more on that later). Once we have the output, we pass it on. We repeat this for all layers until we reach the output neuron. Running such a network for the first time, we will see that the answer is far from correct, because the network is not trained. To improve its results, we will train it. But before we learn how to do that, let's introduce a few terms and properties of a neural network.
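    The weighted sum described above can be sketched in a few lines of Python (my illustration of the text's numbers, not code from the article):

```python
def weighted_sum(inputs, weights):
    # A neuron's input is the sum of each incoming value
    # multiplied by the weight of its synapse.
    return sum(i * w for i, w in zip(inputs, weights))

# The example from the text: inputs 1 and 0, weights w1 = 0.4 and w2 = 0.7.
h1_input = weighted_sum([1, 0], [0.4, 0.7])
print(h1_input)  # 0.4
```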

    Activation function

    An activation function is a way of normalizing input data (we talked about this earlier). That is, if you have a large number at the input, passing it through the activation function, you will get an output in the range you need. There are quite a lot of activation functions, so we will consider the most basic ones: Linear, Sigmoid (Logistic) and Hyperbolic tangent. Their main differences are the range of values.

    Linear function


    This function is almost never used, except when you need to test a neural network or pass a value without conversion.

    Sigmoid


    This is the most common activation function; its range of values is (0,1). It is the one shown in most examples on the web, and it is also sometimes called the logistic function. Accordingly, if your case involves negative values (for example, stocks can go not only up but also down), then you will need a function that also captures negative values.

    Hyperbolic tangent


    It only makes sense to use the hyperbolic tangent when your values can be both negative and positive, since the range of the function is (-1,1). It is not advisable to use this function with only positive values, as this will significantly worsen the results of your neural network.
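    The three functions can be sketched as follows (a minimal illustration using the standard formulas; the function names are mine):

```python
import math

def linear(x):
    # Passes the value through unchanged; range (-inf, +inf).
    return x

def sigmoid(x):
    # Logistic function; squashes any input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Hyperbolic tangent; squashes any input into the range (-1, 1).
    return math.tanh(x)

print(linear(2.5))  # 2.5
print(sigmoid(0))   # 0.5
print(tanh(0))      # 0.0
```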

    Training set

    A training set is a sequence of data that a neural network operates on. In our case of exclusive or (XOR), there are only 4 different outcomes, so we will have 4 training sets: 0xor0=0, 0xor1=1, 1xor0=1, 1xor1=0.
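    Written out as data (my sketch), the XOR training set is simply four input pairs with their expected answers:

```python
# The four input/output pairs of the XOR problem.
training_set = [
    ((0, 0), 0),
    ((0, 1), 1),
    ((1, 0), 1),
    ((1, 1), 0),
]

# Sanity check against Python's bitwise XOR operator.
for (a, b), expected in training_set:
    assert (a ^ b) == expected

print(len(training_set))  # 4
```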

    Iteration

    This is a kind of counter that increases every time the neural network goes through one training set. In other words, this is the total number of training sets completed by the neural network.

    Epoch

    When the neural network is initialized, this value is set to 0 and has a manually set ceiling. The more epochs, the better trained the network and, accordingly, the better its result. The epoch counter increases each time we go through the entire collection of training sets, in our case 4 sets, or 4 iterations.


    It is important not to confuse iteration with epoch and to understand the order in which they increment: first the iteration counter increases n times, and only then does the epoch increase, not vice versa. In other words, you cannot first train a neural network on only one set, then on another, and so on. You need to train on each set once per epoch. That way you avoid errors in the calculations.
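    The order of the two counters can be sketched like this (a toy loop with hypothetical counters; no actual training happens here):

```python
training_set = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
iterations = 0
epochs = 0

for _ in range(5):               # train for 5 epochs
    for sample in training_set:  # each set is seen once per epoch
        # ... a forward pass and weight update would go here ...
        iterations += 1          # grows with every training set
    epochs += 1                  # grows only after the whole collection

print(iterations, epochs)  # 20 5
```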

    Error

    Error is a percentage that reflects the difference between the expected and received responses. The error is computed every epoch and must decline. If it doesn't, you are doing something wrong. The error can be calculated in different ways, but we will consider only three main methods: Mean Squared Error (hereinafter MSE), Root MSE, and Arctan. There is no restriction on use like there is with the activation function, and you are free to choose whichever method gives you the best results. You just have to keep in mind that each method counts errors differently. With Arctan, the error will almost always be larger, since it works on the principle: the greater the difference, the larger the error. Root MSE will give the smallest error, which is why MSE, which maintains a balance in error calculation, is used most often.
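    The three error measures can be sketched as follows (my implementation of the formulas as described; for a single sample n = 1):

```python
import math

def mse(ideal, actual):
    # Mean squared error: average of the squared differences.
    return sum((i - a) ** 2 for i, a in zip(ideal, actual)) / len(ideal)

def root_mse(ideal, actual):
    # Square root of the mean squared error.
    return math.sqrt(mse(ideal, actual))

def arctan_error(ideal, actual):
    # Average of the squared arctangents of the differences.
    return sum(math.atan(i - a) ** 2 for i, a in zip(ideal, actual)) / len(ideal)

# Single sample used later in the text: ideal answer 1, network output 0.33.
print(round(mse([1], [0.33]), 4))  # 0.4489
```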

    Welcome to the second part of the Neural Networks tutorial. I would like to apologize right away to everyone who expected the second part much earlier. For certain reasons I had to put off writing it. I honestly didn't expect the first article to be in such demand, or that so many people would be interested in this topic. Taking your comments into account, I will try to provide as much information as possible while keeping it as clear as possible. In this article I will talk about ways to teach/train neural networks (in particular, the backpropagation method), and if for some reason you have not yet read the first part, I highly recommend starting with it. While writing this article, I also wanted to talk about other types of neural networks and training methods, but when I started writing about them, I realized that this would go against my method of presentation. I understand that you are eager to get as much information as possible, but these topics are very broad and require detailed analysis, and my main goal is not to write yet another article with a superficial explanation, but to convey to you every aspect of the topic and make the article as easy to understand as possible. I hasten to disappoint those who like to “code”: I still will not resort to a programming language and will explain everything “on my fingers”. Enough introduction; let's continue studying neural networks.

    What is a displacement neuron?


    Before we begin our main topic, we must introduce the concept of another type of neuron: the bias neuron. The bias neuron is the third type of neuron, used in most neural networks. Its peculiarity is that its input and output always equal 1 and it never has input synapses. Bias neurons can either be present in the neural network one per layer or be completely absent; a 50/50 arrangement is impossible (the weights and neurons that cannot be placed are shown in red in the diagram). Bias neurons connect just like ordinary neurons, to all neurons of the next level, except that there can be no synapse between two bias neurons. Consequently, they can be placed on the input layer and on all hidden layers, but not on the output layer, since there they would have nothing to form a connection with.

    What is a bias neuron used for?



    A bias neuron is needed in order to obtain the desired output by shifting the graph of the activation function to the right or left. If this sounds confusing, let's look at a simple example with one input neuron and one output neuron. Then we can establish that the output of O2 equals the input of H1 multiplied by its weight and passed through the activation function (formula in the image on the left). In our specific case we will use the sigmoid.

    From a school mathematics course, we know that if we take the function y = ax+b and change the value of “a”, the slope of the function will change (the colors of the lines on the graph on the left), and if we change “b”, we will shift the function to the right or left (the colors of the lines on the graph on the right). So “a” is the weight of H1, and “b” is the weight of the bias neuron B1. This is a rough example, but that is pretty much how it works (if you look at the activation function on the right in the image, you will notice a very strong similarity between the formulas). That is, when we adjust the weights of the hidden and output neurons during training, we change the slope of the activation function. However, adjusting the weight of the bias neurons gives us the opportunity to shift the activation function along the X axis and capture new regions. In other words, if the point responsible for your solution is located as shown in the graph on the left, then your neural network will never be able to solve the problem without using bias neurons. Therefore, you will rarely see neural networks without bias neurons.

    Bias neurons also help when all input neurons receive 0 as input: no matter what weights they have, they will all pass 0 to the next layer, which is not the case when a bias neuron is present. The presence or absence of bias neurons is a hyperparameter (more on that later). In short, you must decide for yourself whether to use bias neurons by running the network with and without them and comparing the results.

    It is important to be aware that bias neurons are sometimes not shown on diagrams; their weights are simply taken into account when calculating the input value, for example:

    Input = H1*w1+H2*w2+b3
    b3 = bias*w3

    Since its output is always equal to 1, we can simply imagine that we have an additional synapse with a weight and add this weight to the sum without mentioning the neuron itself.
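    As a small sketch (with hypothetical numbers of my own): the bias simply contributes its weight times a constant 1 to the sum.

```python
def neuron_input(outputs, weights, bias_weight):
    # Weighted sum of the previous layer's outputs, plus the bias term.
    # The bias neuron's output is always 1, so its contribution
    # is simply 1 * bias_weight.
    return sum(o * w for o, w in zip(outputs, weights)) + 1 * bias_weight

# Hypothetical values: H1 = 0.5, H2 = 0.2, w1 = 1.0, w2 = -0.5, bias w3 = 0.3.
print(round(neuron_input([0.5, 0.2], [1.0, -0.5], 0.3), 2))  # 0.7
```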

    How do you make the neural network give correct answers?

    The answer is simple: you need to train it. However, no matter how simple the answer is, its implementation leaves much to be desired in terms of simplicity. There are several methods for teaching neural networks, and I will highlight the 3 that are, in my opinion, the most interesting:
    • Backpropagation method
    • Resilient propagation or Rprop method
    • Genetic Algorithm
    Rprop and GA will be discussed in other articles; for now we will look at the basis of the basics: the backpropagation method, which uses the gradient descent algorithm.

    What is gradient descent?

    This is a way of finding the local minimum or maximum of a function by moving along a gradient. If you understand the concept of gradient descent, then you should not have any questions while using the backpropagation method. First, let's figure out what a gradient is and where it is present in our neural network. Let's build a graph where the x-axis will be the neuron weight values ​​(w) and the y-axis will be the error corresponding to this weight (e).


    Looking at this graph, we will understand that the graph of the function f(w) is the dependence of the error on the selected weight. In this graph, we are interested in the global minimum - the point (w2,e2) or, in other words, the place where the graph comes closest to the x-axis. This point will mean that by choosing the weight w2 we will get the smallest error - e2 and, as a consequence, the best result of all possible. The gradient descent method will help us find this point (the gradient is indicated in yellow on the graph). Accordingly, each weight in the neural network will have its own graph and gradient, and for each it is necessary to find a global minimum.

    So what is this gradient? A gradient is a vector that determines the steepness of a slope and indicates its direction relative to any point on a surface or graph. To find the gradient you need to take the derivative of the graph at a given point (as shown in the graph). Moving in the direction of this gradient, we will smoothly slide into the valley. Now imagine that the error is a skier, and the graph of the function is a mountain. Accordingly, if the error is 100%, then the skier is at the very top of the mountain, and if the error is 0%, then at the bottom. Like all skiers, the error strives to go down as quickly as possible and reduce its value. In the end, we should get the following result:


    Imagine that a skier is dropped onto the mountain by helicopter. How high or low is a matter of chance (just as the weights of a neural network are set randomly during initialization). Let's say the error is 90% and this is our starting point. Now the skier needs to descend using the gradient. On the way down, at each point we will calculate the gradient, which shows us the direction of descent, and correct our course when the slope changes. If the slope is straight, then after the nth number of such actions we will reach the lowland. But in most cases the slope (function graph) will be wavy, and our skier will face a very serious problem: a local minimum. I think everyone knows what a local and global minimum of a function is; to refresh your memory, here is an example. Getting into a local minimum means our skier will forever remain in this lowland and never slide down the mountain, and therefore we will never be able to get the correct answer. But we can avoid this by equipping our skier with a jetpack called momentum. Here is a brief illustration of momentum:

    As you have probably already guessed, this jetpack will give the skier the acceleration needed to overcome the hill that keeps us in the local minimum, but there is one BUT. Imagine that we set a certain value for the momentum parameter and were able to easily overcome all local minima and reach the global minimum. Since we cannot simply turn off the jetpack, we may overshoot the global minimum if there are more lows near it. In the end this is not critical, since sooner or later we will still return to the global minimum, but it is worth remembering that the greater the momentum, the wider the swings with which the skier will oscillate around the lowlands. Along with momentum, the backpropagation method also uses a parameter called the learning rate. As many will probably think, the higher the learning rate, the faster we will train the neural network. No. The learning rate, like momentum, is a hyperparameter: a value chosen through trial and error. The learning rate can be directly related to the skier's speed, and here slower often gets you further. There are caveats here too, though: if we give the skier no speed at all, he will not go anywhere, and if we give him a very low speed, the journey can stretch over a very, very long period of time. What happens, then, if we give him too much speed?


    As you can see, nothing good. The skier will begin to slide down the wrong path and perhaps even in the opposite direction, which, as you understand, only distances us from finding the correct answer. Therefore, in all these parameters you need to find a happy medium to avoid non-convergence of the network (more on this later).
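    The skier story can be condensed into a one-dimensional sketch (my toy example, not from the article): gradient descent with momentum minimizing the parabola f(w) = (w - 2)^2.

```python
def grad(w):
    # Derivative of f(w) = (w - 2)**2.
    return 2 * (w - 2)

w = 0.0          # the spot where the helicopter drops the skier
velocity = 0.0   # speed accumulated by the momentum "jetpack"
learning_rate = 0.1
momentum = 0.9

for _ in range(200):
    # The momentum term carries part of the previous step forward,
    # helping the "skier" roll through shallow local dips.
    velocity = momentum * velocity - learning_rate * grad(w)
    w += velocity

print(round(w, 2))  # settles near the minimum at w = 2
```

Here the learning rate scales each gradient step and the momentum coefficient controls how much of the previous velocity is kept; too large a value for either makes `w` overshoot and oscillate, exactly as described above.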

    What is the backpropagation method?

    Now we have reached the point where we can discuss how to make sure that your neural network learns correctly and gives the right answers. Backpropagation is very well visualized in this GIF:


    Now let's look at each stage in detail. If you remember, in the previous article we calculated the output of the NN. This is called the forward pass: we sequentially transfer information from the input neurons to the output neurons. Afterwards we calculate the error and, based on it, perform a backward pass, which consists of sequentially changing the weights of the neural network, starting with the weights of the output neuron. The weights change in the direction that gives us the best result. In my calculations I will use the delta method, since it is the simplest and most understandable, and I will use the stochastic method for updating the weights (more on that later).

    Now let's continue from where we left off the calculations in the previous article.

    Task data from the previous article


    Data: I1=1, I2=0, w1=0.45, w2=0.78, w3=-0.12, w4=0.13, w5=1.5, w6=-2.3.

    H1input = 1*0.45+0*-0.12=0.45
    H1output = sigmoid(0.45)=0.61

    H2input = 1*0.78+0*0.13=0.78
    H2output = sigmoid(0.78)=0.69

    O1input = 0.61*1.5+0.69*-2.3=-0.672
    O1output = sigmoid(-0.672)=0.33

    O1ideal = 1 (0xor1=1)

    Error = ((1-0.33)^2)/1=0.45

    Result - 0.33, error - 45%.
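    The worked forward pass above can be reproduced directly in code. This is a sketch using the article's own inputs and weights; the variable names mirror the notation above, and the small discrepancies against the article's numbers come from the article rounding each intermediate value to two decimals.

```python
import math

# Forward pass of the 2-2-1 network from the example above,
# with inputs and weights taken directly from the article's data.

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

I1, I2 = 1, 0
w1, w2, w3, w4, w5, w6 = 0.45, 0.78, -0.12, 0.13, 1.5, -2.3

H1output = sigmoid(I1 * w1 + I2 * w3)              # sigmoid(0.45) ≈ 0.61
H2output = sigmoid(I1 * w2 + I2 * w4)              # sigmoid(0.78) ≈ 0.69
O1output = sigmoid(H1output * w5 + H2output * w6)  # ≈ 0.34 (0.33 with rounding)

error = (1 - O1output) ** 2 / 1                    # MSE vs ideal output 1 ≈ 0.43
```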


    Since we have already calculated the result of the NN and its error, we can proceed straight to backpropagation. As I mentioned earlier, the algorithm always starts with the output neuron. In that case, let's calculate its value δ (delta) according to formula 1.

    Since the output neuron has no outgoing synapses, we use the first formula (δ output); for the hidden neurons we will use the second formula (δ hidden). Everything is quite simple here: we calculate the difference between the desired and the obtained result and multiply it by the derivative of the activation function evaluated at the input value of the given neuron. Before we begin the calculations, I want to draw your attention to the derivative. First, as has probably become clear by now, backpropagation can only use activation functions that are differentiable. Second, to avoid unnecessary calculations, the derivative formula can be replaced with a friendlier and simpler form: for the sigmoid, f'(in) = (1 - out) * out.


    Thus, our calculations for point O1 will look like this.

    Solution

    O1output = 0.33
    O1ideal = 1
    Error = 0.45

    δO1 = (1 - 0.33) * ((1 - 0.33) * 0.33) = 0.148


    This completes the calculations for neuron O1. Remember that after calculating a neuron's delta, we must immediately update the weights of all of that neuron's outgoing synapses. Since O1 has none, we move on to the neurons of the hidden layer and do the same thing, except that we now use the second delta formula, whose essence is to multiply the derivative of the activation function at the input value by the sum of the products of all outgoing weights and the deltas of the neurons those synapses lead to. But why are the formulas different? The point of backpropagation is to spread the error of the output neurons back to all the weights of the NN. The error can be calculated only at the output layer, as we have already done, and we have also calculated the delta, which already contains this error. Consequently, instead of the error we will now pass the delta from neuron to neuron. With that in mind, let's find the delta for H1:

    Solution

    H1output = 0.61
    w5 = 1.5
    δO1 = 0.148

    δH1 = ((1 - 0.61) * 0.61) * (1.5 * 0.148) = 0.053
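    The two delta formulas just applied can be expressed in a few lines. This sketch plugs in the article's rounded values, so it reproduces the article's numbers (0.148, 0.053, -0.07) rather than fully precise ones.

```python
# The two delta formulas from the example, fed with the article's rounded
# values. For the sigmoid, the derivative can be computed from the output
# alone: f'(in) = (1 - out) * out.

def sigmoid_prime(out):
    return (1 - out) * out

O1output, O1ideal = 0.33, 1
H1output, H2output = 0.61, 0.69
w5, w6 = 1.5, -2.3

# Output neuron: delta = (ideal - actual) * f'(in).
delta_O1 = (O1ideal - O1output) * sigmoid_prime(O1output)   # ≈ 0.148

# Hidden neuron: delta = f'(in) * sum(outgoing weight * delta of its target).
delta_H1 = sigmoid_prime(H1output) * (w5 * delta_O1)        # ≈ 0.053
delta_H2 = sigmoid_prime(H2output) * (w6 * delta_O1)        # ≈ -0.07
```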


    Now we need to find the gradient for each outgoing synapse. This is usually where a three-story fraction with a pile of derivatives and other mathematical hell gets inserted, but that is the beauty of the delta method: in the end, your formula for finding the gradient looks like this:

    Here point A is the neuron at the start of the synapse and point B is the neuron at its end: the gradient of a synapse is the output of A multiplied by the delta of B. So we can calculate the gradient of w5 like this:

    Solution

    H1output = 0.61
    δO1 = 0.148

    GRADw5 = 0.61 * 0.148 = 0.09


    Now we have all the data needed to update the weight w5, and we will do so using the backpropagation update rule, which calculates the amount by which a given weight needs to change. It looks like this:


    I strongly recommend that you do not ignore the second part of the expression: use the momentum term, as it will help you avoid problems with local minima.

    Here we see the two constants we already discussed when we looked at the gradient descent algorithm: E (epsilon), the learning rate, and α (alpha), the momentum. Translating the formula into words: the change in a synapse's weight equals the learning rate multiplied by the gradient of that weight, plus the momentum multiplied by the previous change of that weight (which is 0 on the first iteration). With that, let's calculate the change of weight w5 and update its value by adding Δw5 to it.

    Solution

    E = 0.7
    α = 0.3
    w5 = 1.5
    GRADw5 = 0.09
    Δw5(i-1) = 0

    Δw5 = 0.7 * 0.09 + 0 * 0.3 = 0.063
    w5 = w5 + Δw5 = 1.563
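    The same update, written out as code. The variable names (`dw5`, `prev_dw5`, etc.) are my own shorthand for the article's Δw5 and Δw5(i-1); the hyperparameter values are the ones given above.

```python
# Weight update for w5 with the article's hyperparameters:
# Δw = E * GRAD + α * Δw(previous), then w = w + Δw.

E, alpha = 0.7, 0.3          # learning rate and momentum
w5 = 1.5
grad_w5 = 0.61 * 0.148       # output at the synapse's start * delta at its end
prev_dw5 = 0.0               # first iteration, so no previous change yet

dw5 = E * grad_w5 + alpha * prev_dw5   # ≈ 0.063
w5 = w5 + dw5                          # ≈ 1.563
```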


    Thus, after applying the algorithm, our weight increased by 0.063. Now I suggest you do the same for H2.

    Solution

    H2output = 0.69
    w6 = -2.3
    δO1 = 0.148
    E = 0.7
    α = 0.3
    Δw6(i-1) = 0

    δH2 = ((1 - 0.69) * 0.69) * (-2.3 * 0.148) = -0.07

    GRADw6 = 0.69 * 0.148 = 0.1

    Δw6 = 0.7 * 0.1 + 0 * 0.3 = 0.07

    w6 = w6 + Δw6 = -2.2


    And of course, don't forget about I1 and I2: they also have synapses whose weights we need to update. However, remember that we do not need to find deltas for the input neurons, since they have no incoming synapses.

    Solution

    w1 = 0.45, Δw1(i-1) = 0
    w2 = 0.78, Δw2(i-1) = 0
    w3 = -0.12, Δw3(i-1) = 0
    w4 = 0.13, Δw4(i-1) = 0
    δH1 = 0.053
    δH2 = -0.07
    E = 0.7
    α = 0.3

    GRADw1 = 1 * 0.053 = 0.053
    GRADw2 = 1 * -0.07 = -0.07
    GRADw3 = 0 * 0.053 = 0
    GRADw4 = 0 * -0.07 = 0

    Δw1 = 0.7 * 0.053 + 0 * 0.3 = 0.04
    Δw2 = 0.7 * -0.07 + 0 * 0.3 = -0.05
    Δw3 = 0.7 * 0 + 0 * 0.3 = 0
    Δw4 = 0.7 * 0 + 0 * 0.3 = 0

    w1 = w1 + Δw1 = 0.5
    w2 = w2 + Δw2 = 0.73
    w3 = w3 + Δw3 = -0.12
    w4 = w4 + Δw4 = 0.13


    Now let's make sure we did everything correctly by recalculating the output of the neural network, this time with the updated weights.

    Solution

    I1 = 1
    I2 = 0
    w1 = 0.5
    w2 = 0.73
    w3 = -0.12
    w4 = 0.13
    w5 = 1.563
    w6 = -2.2

    H1input = 1 * 0.5 + 0 * -0.12 = 0.5
    H1output = sigmoid(0.5) = 0.62

    H2input = 1 * 0.73 + 0 * 0.13 = 0.73
    H2output = sigmoid(0.73) = 0.675

    O1input = 0.62* 1.563 + 0.675 * -2.2 = -0.51
    O1output = sigmoid(-0.51) = 0.37

    O1ideal = 1 (0xor1=1)

    Error = ((1-0.37)^2)/1=0.39

    Result - 0.37, error - 39%.


    As we can see, after one iteration of backpropagation we managed to reduce the error by 0.06 (6 percentage points). Now you need to repeat this over and over again until the error is small enough.
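    The whole procedure above, repeated "over and over again", fits into one short loop. This is a compact sketch that trains on the single example from the article (input (1, 0), ideal output 1) with the article's hyperparameters E = 0.7 and α = 0.3, and records the error at each iteration; the dictionary-based bookkeeping is my own arrangement, not the article's.

```python
import math

# Repeat forward pass + backpropagation on the article's example and
# watch the error shrink iteration by iteration.

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

w = {"w1": 0.45, "w2": 0.78, "w3": -0.12, "w4": 0.13, "w5": 1.5, "w6": -2.3}
prev = {k: 0.0 for k in w}          # previous Δw for the momentum term
E, alpha = 0.7, 0.3                 # learning rate and momentum
I1, I2, ideal = 1, 0, 1

errors = []
for _ in range(20):
    # Forward pass.
    h1 = sigmoid(I1 * w["w1"] + I2 * w["w3"])
    h2 = sigmoid(I1 * w["w2"] + I2 * w["w4"])
    o1 = sigmoid(h1 * w["w5"] + h2 * w["w6"])
    errors.append((ideal - o1) ** 2)

    # Backward pass: deltas, then gradients, then weight updates.
    d_o1 = (ideal - o1) * (1 - o1) * o1
    d_h1 = (1 - h1) * h1 * (w["w5"] * d_o1)
    d_h2 = (1 - h2) * h2 * (w["w6"] * d_o1)
    grads = {"w1": I1 * d_h1, "w2": I1 * d_h2, "w3": I2 * d_h1,
             "w4": I2 * d_h2, "w5": h1 * d_o1, "w6": h2 * d_o1}
    for k in w:
        dw = E * grads[k] + alpha * prev[k]
        w[k] += dw
        prev[k] = dw
```

    The first recorded error matches the worked example (about 0.43-0.45), and each following iteration drives it lower.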

    What else do you need to know about the learning process?

    A neural network can be trained with or without a teacher (supervised and unsupervised learning).

    Supervised learning is the type of training used for problems such as regression and classification (and the one we used in the example above). In other words, here you act as the teacher and the NN as the student: you provide the input data and the desired result, and the student, looking at the input data, understands that it should strive toward the result you gave it.

    Unsupervised learning is used less often. Here there is no teacher, so the network is not given desired results, or is given very few of them. This type of training mostly suits neural networks whose task is to group data by certain parameters. Say you feed in 10,000 articles from Habr; after analyzing them all, the NN can sort them into categories based on, for example, frequently occurring words: articles that mention programming languages go to programming, and those with words like Photoshop go to design.

    There is also an interesting method called reinforcement learning. It deserves a separate article, but I will try to briefly describe its essence. This method is applicable when, based on the results the NN produces, we can give it a score. For example, we want to teach the NN to play PAC-MAN: every time the NN scores a lot of points, we reward it. In other words, we give the NN the right to find any way of achieving the goal, as long as it gives a good result. This way the network begins to understand what is wanted of it and tries to find the best way to achieve that goal without the teacher constantly supplying data.

    Training can also be carried out by three methods: the stochastic method, the batch method, and the mini-batch method. There are many articles and studies on which method is best, and no one has arrived at a general answer. I am a supporter of the stochastic method, but I do not deny that each method has its pros and cons.

    Briefly about each method:

    The stochastic method (sometimes also called online) works on the following principle: as soon as Δw is found, the corresponding weight is updated immediately.

    The batch method works differently: we sum up the Δw of every weight over the current iteration and only then update all the weights using those sums. One of the most important advantages of this approach is the significant saving in computation time, although accuracy can suffer greatly.

    The mini-batch method is the golden mean and tries to combine the advantages of both. It works like this: we randomly split the training sets into small groups and update the weights by the sum of the Δw accumulated over each group.
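    The difference between the first two schedules can be shown schematically. In this sketch, `backprop_dw` is a placeholder standing for one backpropagation pass that returns Δw for every weight; it is an assumption of the example, not a function from any library (mini-batch would be the same as `batch`, applied to each small group in turn).

```python
# A schematic comparison of the stochastic and batch update schedules.
# `backprop_dw(weights, sample)` is a stand-in returning {name: Δw}.

def stochastic(train_set, weights, backprop_dw):
    # Update every weight immediately after each training example.
    for sample in train_set:
        for k, dw in backprop_dw(weights, sample).items():
            weights[k] += dw
    return weights

def batch(train_set, weights, backprop_dw):
    # Accumulate Δw over the whole set, then apply the sums once.
    total = {k: 0.0 for k in weights}
    for sample in train_set:
        for k, dw in backprop_dw(weights, sample).items():
            total[k] += dw
    for k in weights:
        weights[k] += total[k]
    return weights
```

    With a fixed Δw per sample the two end in the same place; with real backpropagation they differ, because the stochastic method recomputes Δw from the already-updated weights after every example.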

    What are hyperparameters?

    Hyperparameters are values that must be selected manually, often by trial and error. Among them are:
    • The momentum and the learning rate
    • The number of hidden layers
    • The number of neurons in each layer
    • The presence or absence of bias neurons
    Other types of neural networks have additional hyperparameters, but we will not discuss them here. Choosing the right hyperparameters is very important and directly affects the convergence of your NN. Whether it is worth using bias neurons is fairly simple to work out. The number of hidden layers and of neurons in them can be found by brute force, guided by one simple rule: the more neurons, the more accurate the result and the exponentially more time you will spend training the network. However, remember that you should not build a neural network with 1000 neurons to solve simple problems. Choosing the momentum and the learning rate is a little more complicated. These hyperparameters vary with the task at hand and the architecture of the network. For example, for solving XOR the learning rate can lie in the range 0.3 - 0.7, while in a neural network that analyzes and predicts stock prices, a learning rate above 0.00001 leads to poor convergence. Don't focus on hyperparameters now or try to master choosing them; that will come with experience. For now I advise you simply to experiment and to look for examples of solving particular problems on the Internet.

    What is convergence?



    Convergence indicates whether the NN architecture is correct and whether the hyperparameters were chosen correctly for the task. Say our program logs the NN error at each iteration. If the error decreases with each iteration, we are on the right track and the NN is converging. If the error jumps up and down or freezes at a certain level, the NN is not converging. In 99% of cases this is solved by changing the hyperparameters; the remaining 1% means you have an error in the network architecture. Convergence can also be affected by overfitting.

    What is overfitting?

    Overfitting, as the name suggests, is the state of a neural network that is oversaturated with data. This problem occurs when the network is trained for too long on the same data. In other words, the network begins not to learn from the data but to memorize and "cram" it. Accordingly, when you feed new data into such a network, noise may appear in its output, affecting the accuracy of the result. For example, if we show the NN different photographs of apples (only red ones) and say that this is an apple, then when the NN sees a yellow or green apple it will not be able to identify it as an apple, since it has memorized that all apples must be red. Conversely, when the NN sees something red and apple-shaped, such as a peach, it will say it is an apple. This is noise. On a graph, the noise looks like this.


    You can see that the graph of the function fluctuates greatly from point to point, and these points are the output (result) of our NN. Ideally this graph should be smoother and straighter. To avoid overfitting, you should not train the neural network for a long time on the same or very similar data. Overfitting can also be caused by a large number of input parameters or by an overly complex architecture. So when you notice errors (noise) in the output after the training stage, use one of the regularization methods; in most cases, though, this will not be necessary.

    Conclusion

    I hope this article has clarified the key points of a subject as difficult as neural networks. However, I believe that no matter how many articles you read, it is impossible to master such a complex topic without practice. So if you are just at the beginning of your journey and want to explore this promising and growing field, I advise you to start practicing by writing your own neural network, and only then turn to the various frameworks and libraries. Also, if you like my way of presenting information and would like me to write articles on other machine learning topics, vote in the poll below for the topic that interests you. See you in future articles :)

    Today, people on every corner are shouting about the benefits of neural networks, but few really understand what they are. If you turn to Wikipedia for an explanation, your head will spin from the towering citadels of scientific terms and definitions built there. If you are far from genetic engineering, and the confusing, dry language of university textbooks only causes confusion without producing ideas, let's try to figure out the problem of neural networks together.

    To understand the problem, you need to find the root cause, which lies right on the surface. Remembering Sarah Connor with a shudder, we recall that the pioneers of computing, Warren McCulloch and Walter Pitts, once pursued the goal of creating the first artificial intelligence.

    Neural networks are an electronic prototype of a self-learning system. Like a child, a neural network absorbs information, chews it, gains experience and learns. During the learning process, such a network develops, grows and can draw its own conclusions and make decisions independently.

    If the human brain consists of neurons, then we can conditionally agree that an electronic neuron is some kind of imaginary box that has many input holes and one output hole. The internal algorithm of the neuron determines the order of processing and analysis of the received information and converting it into a single useful body of knowledge. Depending on how well the inputs and outputs work, the entire system either thinks quickly or, conversely, can slow down.

    Important: Typically, neural networks use analog information.

    Let us repeat: there can be many input streams of information (scientifically, a connection between incoming information and our "neuron" is called a synapse), and they are all of different natures and of unequal significance. For example, a person perceives the surrounding world through the organs of sight, touch and smell, and it is logical that sight is more important than smell. Depending on the situation, we rely on different senses: in complete darkness, touch and hearing come to the fore. By the same analogy, synapses in neural networks have different significance in different situations, which is usually denoted by the weight of the connection. When writing code, a minimum threshold for the passage of information is set. If the connection weight is higher than the set value, the neuron's test result is positive (equal to one in the binary system); if lower, negative. It is logical that the higher the bar is set, the more accurate the neural network's work will be, but the longer it will take.

    For a neural network to work correctly, you need to spend time training it - this is the main difference from simple programmable algorithms. Like a small child, a neural network needs an initial information base, but if the initial code is written correctly, then the neural network itself will be able to not only make the right choice from the available information, but also make independent assumptions.

    When writing the initial code, you need to explain your actions literally on your fingers. If we work, for example, with images, then at the first stage their size and class matter to us. The first characteristic tells us the number of inputs, while the second helps the neural network itself sort out the information. Ideally, having loaded the primary data and compared the class topology, the neural network will then be able to classify new information on its own. Say we decide to upload a 3x5 pixel image. Simple arithmetic tells us the number of inputs: 3*5=15. The classification itself determines the total number of outputs, i.e. neurons. Another example: the neural network needs to recognize the letter "C". The specified threshold is a complete match with the letter; this requires one neuron with a number of inputs equal to the size of the image.

    Let's assume the size will be the same 3x5 pixels. By feeding the program various pictures of letters or numbers, we teach it to identify the image of the symbol we need.

    As in any training, the student must be punished for a wrong answer, while for a correct answer we give nothing. If the program treats a correct answer as False, we increase the input weight at each synapse; if, on the contrary, it treats an incorrect result as True, we decrease the weight of each input to the neuron. It is more logical to start the training by introducing the symbol we need. The first result will be incorrect, but after slight adjustment the program will work correctly on further runs. The algorithm for building neural network code described here is called a perceptron.
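    The punish-on-mistake rule just described can be sketched as a tiny perceptron. Everything concrete here (15 binary inputs for a 3x5 image, the threshold of 0.5, the step size) is an illustrative assumption layered on top of the text, not a specification from it.

```python
# A minimal sketch of the perceptron rule described above: a 15-input
# threshold unit (3x5 binary image) whose weights are nudged only on mistakes.

def predict(weights, pixels, threshold=0.5):
    # Weighted sum of the inputs compared against the threshold.
    return sum(w * p for w, p in zip(weights, pixels)) > threshold

def train(samples, n_inputs=15, epochs=20, lr=0.1, threshold=0.5):
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for pixels, is_target in samples:
            guess = predict(weights, pixels, threshold)
            if guess and not is_target:        # said "yes" wrongly: punish
                weights = [w - lr * p for w, p in zip(weights, pixels)]
            elif not guess and is_target:      # said "no" wrongly: punish
                weights = [w + lr * p for w, p in zip(weights, pixels)]
    return weights
```

    After a few epochs on two distinct patterns, the unit fires on the target symbol and stays silent on the other.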


    There are also more complex modes of neural network operation, with the return of incorrect data, its analysis, and logical conclusions drawn by the network itself. For example, an online future predictor is quite literally a programmed neural network. Such programs are capable of learning both with and without a teacher and are called adaptive resonance networks. Their essence is that the neurons already have expectations about what kind of information they want to receive and in what form. Between expectation and reality lies a thin threshold of so-called neuron vigilance, which helps the network classify incoming information correctly and not miss a single pixel. The trick of an AR neural network is that it learns independently from the very beginning and determines the neuron vigilance threshold on its own, which in turn plays a role in classifying information: the more vigilant the network, the more meticulous it is.

    We now have the very basics of what neural networks are. Let's try to summarize. Neural networks are an electronic prototype of human thinking. They consist of electronic neurons and synapses, the flows of information at a neuron's input and output. Neural networks are programmed on the principle of learning with a teacher (a programmer who uploads the primary information) or independently (based on assumptions and expectations derived from the information received, as defined by the same programmer). With a neural network you can build almost any system: from simple recognition of drawings in pixel images to psychodiagnostics and economic analytics.

    The issues of artificial intelligence and neural networks are currently more popular than ever. Many users increasingly ask us how neural networks work, what they are, and what their principle of operation is.

    These questions are as complex as they are popular, since neural networks are intricate machine learning algorithms designed for various purposes, from analyzing change to modeling the risks associated with certain actions.

    What are neural networks and their types?

    The first question that arises among those interested: what is a neural network? By the classical definition, it is a sequence of neurons interconnected by synapses. Neural networks are a simplified model of their biological analogues.

    A program with a neural network structure allows the machine to analyze input data and remember the result obtained from certain sources. Subsequently, such an approach makes it possible to retrieve from memory the result corresponding to the current data set, if it was already available in the experience of network cycles.

    Many people perceive a neural network as an analogue of the human brain. On the one hand, this judgment can be considered close to the truth, but, on the other hand, the human brain is too complex a mechanism for it to be possible to recreate it with the help of a machine even by a fraction of a percent. A neural network is, first of all, a program based on the principle of the brain, but in no way its analogue.

    A neural network is a bunch of neurons, each of which receives information, processes it and transmits it to another neuron. Each neuron processes the signal in exactly the same way.

    How then do you get different results? It's all about the synapses that connect neurons to each other. One neuron can have a huge number of synapses that strengthen or weaken the signal, and they have the ability to change their characteristics over time.

    It is the correctly selected parameters of the synapses that make it possible to obtain the correct result of transforming the input data at the output.

    Having defined in general terms what a neural network is, we can identify the main types of their classification. Before proceeding with the classification, it is necessary to introduce one clarification. Each network has a first layer of neurons, called the input layer.

    It performs no calculations or transformations; its only task is to receive the input signals and distribute them to the other neurons. This is the only layer common to all types of neural networks; their further structure is the criterion for the main division.

    • Single-layer neural network. This is a structure of neuron interaction in which, after the input data enters the first (input) layer, the final result is immediately transferred to the output layer. The input layer is not counted, since it performs no actions beyond reception and distribution, as mentioned above. The second layer performs all the necessary calculations and processing and immediately produces the final result. The input neurons are connected to the main layer by synapses with different weight coefficients that determine the quality of the connections.
    • Multilayer neural network. As the name makes clear, this type of neural network has intermediate layers in addition to the input and output layers. Their number depends on the complexity of the network itself. A multilayer network more closely resembles the structure of a biological neural network. These networks were developed relatively recently; before that, all processes were implemented using single-layer networks. Accordingly, such a solution has far more capabilities than its ancestor. During information processing, each intermediate layer represents an intermediate stage of processing and distributing the information.

    Depending on the direction of information distribution across synapses from one neuron to another, networks can also be classified into two categories.

    • Feedforward (unidirectional) networks, i.e. structures in which the signal moves strictly from the input layer to the output layer. Signal movement in the opposite direction is impossible. Such designs are quite widespread at present and successfully solve problems such as recognition, prediction and clustering.
    • Feedback (recurrent) networks. Such networks allow the signal to travel not only in the forward but also in the reverse direction. What does this give? In such networks, the output can be fed back to the input: a neuron's output is determined by the weights and input signals and is supplemented by the previous outputs, which are returned to the input again. Such networks are characterized by a short-term-memory function, on the basis of which signals are restored and supplemented during processing.

    These are not the only options for classifying networks.

    They can be divided into homogeneous and hybrid, based on the types of neurons making up the network, and also into heteroassociative and autoassociative. Another division is by training method, with or without a teacher. Networks can also be classified by their purpose.

    Where are neural networks used?

    Neural networks are used to solve a variety of problems. If we rank tasks by complexity, an ordinary computer program is suitable for the simplest ones; for more complex problems requiring simple forecasting or approximate solution of equations, programs using statistical methods are used.

    But tasks of an even more complex level require a completely different approach. This applies in particular to pattern recognition, speech recognition or complex prediction. In a person’s head, such processes occur unconsciously, that is, while recognizing and remembering images, a person is not aware of how this process occurs, and accordingly cannot control it.

    It is precisely these problems that neural networks help solve, that is, they are created to carry out processes whose algorithms are unknown.

    Thus, neural networks are widely used in the following areas:

    • recognition, and this direction is currently the broadest;
    • predicting the next step, this feature is applicable in trading and stock markets;
    • classification of input data by parameters; this function is performed, for example, by credit robots, which can decide whether to approve a loan for a person based on an input set of various parameters.

    The capabilities of neural networks make them very popular. They can be taught many things, such as playing games, recognizing a certain voice, and so on. Based on the fact that artificial networks are built on the principle of biological networks, they can be taught all the processes that a person performs unconsciously.

    What is a neuron and a synapse?

    So what is a neuron in the context of artificial neural networks? This concept refers to a unit that performs calculations. It receives information from the input layer of the network, performs simple calculations with it, and feeds it to the next neuron.

    The network contains three types of neurons: input, hidden and output. If the network is single-layer, it contains no hidden neurons. In addition, there are special units known as bias neurons and context neurons.

    Each neuron has two types of data: input and output. In this case, the input data of the first layer is equal to the output data. In other cases, the neuron input receives the total information of the previous layers, then it goes through a normalization process, that is, all values ​​falling outside the desired range are transformed by the activation function.

    As mentioned above, a synapse is a connection between neurons, each of which has its own degree of weight. It is thanks to this feature that the input information changes during the transmission process. During processing, the information transmitted by the synapse with a large weight will be dominant.

    It turns out that the result is influenced not by neurons, but by synapses that give a certain set of weights to the input data, since the neurons themselves perform exactly the same calculations every time.

    In this case, the weights are set in random order.

    Scheme of operation of a neural network

    To imagine the principle of operation of a neural network, no special skills are required. The input layer of neurons receives certain information. It is transmitted through synapses to the next layer, with each synapse having its own weight coefficient, and each next neuron can have several incoming synapses.

    As a result, the information received by the next neuron is the sum of all incoming data, each multiplied by its own weight coefficient. The resulting value is substituted into the activation function to obtain the output, which is transmitted onward until it reaches the final output. The first run of the network does not give correct results, since the network has not yet been trained.
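    The scheme just described (weighted sum, then activation, per neuron, per layer) can be sketched in a few lines. The input values and weights below are purely illustrative assumptions.

```python
import math

# A toy illustration of the scheme above: each neuron sums its inputs
# weighted by the synapse coefficients, then applies the activation.

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def neuron_output(inputs, weights):
    # Weighted sum of all incoming signals, passed through the activation.
    total = sum(i * w for i, w in zip(inputs, weights))
    return sigmoid(total)

def layer_output(inputs, weight_rows):
    # One row of weights per neuron in the layer.
    return [neuron_output(inputs, row) for row in weight_rows]

# Two inputs feeding a layer of two neurons (illustrative weights).
out = layer_output([1.0, 0.5], [[0.4, -0.2], [0.1, 0.3]])
```

    Stacking `layer_output` calls, feeding each layer's result into the next, gives the full forward pass of a multilayer network.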

    The activation function is used to normalize the input data. There are many such functions, but there are several main ones that are most widely used. Their main difference is the range of values ​​in which they operate.

    • The linear function f(x) = x, the simplest of all, is used only for testing a newly created neural network or for transmitting data in its original form.
    • The sigmoid is considered the most common activation function and has the form f(x) = 1 / (1 + e^(-x)); its range of values is from 0 to 1. It is also called the logistic function.
    • To cover negative values, the hyperbolic tangent is used: f(x) = (e^(2x) - 1) / (e^(2x) + 1), with a range from -1 to 1. If the neural network does not involve negative values, it should not be used.
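    The three activation functions listed above, written out explicitly (the function names are my own):

```python
import math

# The three activation functions from the list above.

def linear(x):
    return x                                  # range (-inf, +inf)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))             # range (0, 1)

def tanh_act(x):
    # Hyperbolic tangent via the formula above; equivalent to math.tanh.
    return (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)   # range (-1, 1)
```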

    In order to give the network the data with which it will operate, training sets are needed.

    An iteration is a counter that increases with each training set.

    An epoch is an indicator of the neural network's training; it increases each time the network goes through a full cycle of training sets.

    Accordingly, to train the network correctly, you need to feed it training sets, consistently increasing the epoch counter.

    Errors will emerge during training. The error is the percentage difference between the obtained and the desired result. It should decrease as the epoch counter grows; otherwise there is a developer error somewhere.

    What is a bias neuron and what is it for?

    Neural networks contain one more type of neuron: the bias neuron. It differs from the other types in that its output is always equal to one, and it has no input synapses.

    Such neurons are placed at most one per layer, and they cannot form synapses with each other. It is not advisable to place them on the output layer.

    What are they for? There are situations in which a neural network simply cannot find the right solution because the desired point is out of reach. Bias neurons exist precisely to shift the domain of definition.

    That is, while the weight of a synapse changes the bend of the function's graph, the bias neuron provides a shift along the X axis, so that the neural network can capture a region inaccessible to it without the shift. The shift can go either right or left. Bias neurons are usually not drawn on diagrams; their weight is taken into account by default when calculating the input value.

    Bias neurons also make it possible to obtain a result when all other neurons output 0: in that case, regardless of a synapse's weight, exactly 0 would otherwise be transmitted to each subsequent layer.

    The presence of a bias neuron corrects this situation and yields a different result. The usefulness of bias neurons is determined by testing the network with and without them and comparing the results.
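    The shift a bias neuron provides can be seen in two lines of arithmetic. The weight and bias values below are illustrative assumptions chosen to make the shift visible.

```python
import math

# The synapse weight bends the sigmoid; the weight of the always-1 bias
# neuron slides the whole curve along the X axis.

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def neuron(x, weight, bias_weight=0.0):
    # The bias neuron always outputs 1, so it contributes bias_weight directly.
    return sigmoid(weight * x + bias_weight * 1)

# Without a bias, the output at x = 0 is pinned to sigmoid(0) = 0.5 ...
no_bias = neuron(0.0, weight=2.0)
# ... with a bias, the same point can be moved anywhere in (0, 1).
with_bias = neuron(0.0, weight=2.0, bias_weight=-3.0)   # sigmoid(-3) ≈ 0.05
```

    No choice of `weight` alone can move the output at x = 0 away from 0.5, which is exactly the "unreachable point" problem the bias neuron solves.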

    But it is important to remember that creating a neural network is not enough to achieve results. It also needs to be trained, which requires special approaches and algorithms of its own. This process can hardly be called simple, since it demands certain knowledge and effort.