• Recurrent neural networks

    N+1, together with MIPT, continues to introduce the reader to the most striking aspects of modern research in the field of artificial intelligence. In a previous article we wrote about the general principles of machine learning and, specifically, about the backpropagation method for training neural networks. Today our interlocutor is Valentin Malykh, a junior researcher at the Laboratory of Neural Systems and Deep Learning. Together with him, we will talk about an unusual class of these systems, recurrent neural networks: their features and prospects, both in the field of all kinds of DeepDream-style entertainment and in more "useful" areas. Let's go.

    What are recurrent neural networks (RNNs) and how do they differ from regular ones?

    Let's first recall what "ordinary" neural networks are; then it will immediately become clear how recurrent ones differ from them. Imagine the simplest neural network, a perceptron. It consists of one layer of neurons, each of which receives a piece of the input data (one or more bits, real numbers, pixels, etc.), transforms it according to its own weights and passes it on. In a single-layer perceptron the outputs of all neurons are combined in one way or another, and the network gives an answer, but the capabilities of such an architecture are very limited. If you want more advanced functionality, you can go several ways: for example, increase the number of layers and add a convolution operation, which "slices" the incoming data into pieces of different scales. This gives you deep convolutional neural networks, which excel at image processing and cat recognition. However, both a primitive perceptron and a convolutional neural network share a common limitation: both the input and the output data have a fixed, pre-designated size, for example a picture of 100 × 100 pixels or a sequence of 256 bits. From a mathematical point of view, a neural network behaves like an ordinary function, albeit a very complex one: it has a predetermined number of arguments and a designated format in which it produces an answer. A simple example is the function x², which takes one argument and produces one value.
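
    To make the "fixed-signature function" analogy concrete, here is a minimal sketch (not from the article; the sizes and weights are made up) of a single-layer perceptron: whatever happens inside, it always consumes an input of one predetermined size and produces an output of another predetermined size.

```python
# A hypothetical single-layer perceptron: fixed input size, fixed output size.
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: 4 input values, 3 neurons in the single layer.
W = rng.normal(size=(3, 4))   # one row of weights per neuron
b = np.zeros(3)               # biases

def perceptron(x):
    """Feedforward pass: weighted sum followed by a step activation."""
    return (W @ x + b > 0).astype(float)

print(perceptron(np.array([1.0, 0.0, 0.5, -1.0])))  # always exactly 4 inputs in, 3 outputs out
```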

    The above features do not present any great difficulty if we are talking about the same pictures or predetermined sequences of symbols. But what if you want to use a neural network to process text or music, or, in the general case, any conditionally infinite sequence in which not only the content matters but also the order in which the information arrives? It is for these tasks that recurrent neural networks were invented. Their opposites, which we called "ordinary," have a stricter name: feedforward neural networks, since in them information is passed only forward through the network, from layer to layer. In recurrent neural networks, neurons exchange information among themselves: for example, in addition to a new piece of incoming data, a neuron also receives some information about the previous state of the network. In this way the network implements a "memory," which fundamentally changes the nature of its operation and allows it to analyze any data sequence in which the order of the values matters, from sound recordings to stock quotes.
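
    For contrast, here is an equally minimal sketch of a recurrent step (the notation and sizes are our own, not tied to any particular library): the only structural change is that each step also receives the previous hidden state, so a sequence of any length can be processed.

```python
# A toy recurrent layer: the new state depends on the current input AND the old state.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 4, 8
W_x = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))

def rnn_forward(sequence):
    h = np.zeros(n_hidden)                   # the "memory" of the network
    for x_t in sequence:                     # the sequence may be arbitrarily long
        h = np.tanh(W_x @ x_t + W_h @ h)     # new piece of data + previous state
    return h

print(rnn_forward(rng.normal(size=(10, n_in))).shape)  # works for 10 steps or 10,000
```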

    Scheme of a single-layer recurrent neural network: at each cycle of operation, the internal layer of neurons receives a set of input data X and information about the previous state of the internal layer A, on the basis of which it generates a response h.

    The presence of memory in recurrent neural networks allows us to extend the analogy with x² somewhat. If we called feedforward neural networks a "simple" function, then recurrent neural networks can almost with a clear conscience be called a program. In fact, the memory of recurrent neural networks (although not a full-fledged one, but more on that later) makes them Turing-complete: if the weights are specified correctly, a neural network can successfully emulate the operation of computer programs.

    Let's delve a little deeper into history: when were RNNs invented, for what tasks, and what advantage were they expected to have over a conventional perceptron at the time?

    Probably the first RNN was the Hopfield network (first mentioned in 1974, finalized in 1982), which implemented an associative memory cell in practice. It differs from modern RNNs in that it works with sequences of a fixed size. In the simplest case, the Hopfield network has one layer of internal neurons connected to each other, and each connection is characterized by a certain weight that determines its significance. Associated with such a network is a certain equivalent of physical "energy," which depends on all the weights in the system. The network can be trained by gradient descent on this energy, where the minimum corresponds to a state in which the network has "remembered" a certain pattern, for example 10101. Now, if a distorted, noisy or incomplete template is given to its input, say 10000, it will "remember" and restore it, much as associative memory works in humans. The analogy is quite distant, so it should not be taken too seriously. Nevertheless, Hopfield networks successfully coped with their task and surpassed the capabilities of the perceptrons that existed at the time. Interestingly, John Hopfield's original publication in Proceedings of the National Academy of Sciences appeared in the "Biophysics" section.
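
    A rough illustration of the recall behaviour described above, under common textbook assumptions (bipolar ±1 coding, Hebbian weights, synchronous updates): the network stores the pattern 10101 and restores it from the noisy input 10000.

```python
# Toy Hopfield network: store one pattern, then recover it from a corrupted copy.
import numpy as np

pattern = np.array([1, -1, 1, -1, 1])          # "10101" coded as +1/-1
W = np.outer(pattern, pattern).astype(float)   # Hebbian rule: w_ij = x_i * x_j
np.fill_diagonal(W, 0)                         # no self-connections

def recall(state, steps=5):
    state = state.copy()
    for _ in range(steps):                     # a few synchronous updates toward an energy minimum
        field = W @ state
        state = np.where(field > 0, 1, np.where(field < 0, -1, state))  # keep value if field is 0
    return state

noisy = np.array([1, -1, -1, -1, -1])          # "10000": a distorted version of the pattern
print(recall(noisy))                           # -> [ 1 -1  1 -1  1], i.e. "10101" restored
```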


    The problem of long-term memory in simple RNNs: the more cycles have passed since a piece of information was received, the more likely it is that it will no longer play a significant role in the current cycle of operation.

    Christopher Olah / colah.github.io

    The next step in the evolution of RNNs was Jeff Elman's "simple recurrent network," described in 1990. In it, the author examined in detail the question of how (and whether) it is possible to train a neural network to recognize temporal sequences. For example, if there are inputs 1100 and 0110, can they be considered the same set shifted in time? Of course they can, but how do you train a neural network to see this? An ordinary perceptron will easily memorize this pattern for any examples offered to it, but each time it will be solving a task of comparing two different signals, not a task about the evolution or shift of the same signal. Elman's solution, building on previous work in this area, was to add another, "context" layer to a simple neural network, into which the state of the inner layer of neurons was simply copied at each cycle of the network's operation. The connections between the context and internal layers could then be trained. This architecture made it relatively easy to reproduce time series and to process sequences of arbitrary length, which sharply distinguished Elman's simple RNN from previous concepts. Moreover, this network was able to recognize and even classify nouns and verbs in a sentence based only on word order, which was a real breakthrough for its time and aroused great interest among both linguists and consciousness researchers.

    Elman's simple RNN was followed by more and more developments, and in 1997 Hochreiter and Schmidhuber published the article "Long Short-Term Memory," which laid the foundation for most modern RNNs. In their work, the authors described a modification that solved the problem of long-term memory in simple RNNs: their neurons "remember" recently received information well, but cannot retain for long something that was processed many cycles ago, no matter how important that information was. In LSTM networks, the internal neurons are "equipped" with a complex system of so-called gates, as well as with the concept of a cell state, which serves as a kind of long-term memory. The gates determine what information enters the cell state, what is erased from it, and what affects the result the RNN produces at the current step. We will not analyze LSTM in detail here; we only note that it is these variations of RNNs that are widely used now, for example in Google's machine translation.
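
    A schematic sketch of one LSTM step in the standard textbook formulation (production implementations may differ in details): the gates decide what enters the cell state, what is erased from it, and what part of it is exposed as the output.

```python
# One cycle of a toy LSTM cell with random, untrained parameters.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    z = np.concatenate([x, h_prev])              # current input and previous output together
    f = sigmoid(p["Wf"] @ z + p["bf"])           # forget gate: what to erase from the cell state
    i = sigmoid(p["Wi"] @ z + p["bi"])           # input gate: what to write into the cell state
    o = sigmoid(p["Wo"] @ z + p["bo"])           # output gate: what to reveal from the cell state
    c_tilde = np.tanh(p["Wc"] @ z + p["bc"])     # candidate new content
    c = f * c_prev + i * c_tilde                 # cell state: the long-term memory
    h = o * np.tanh(c)                           # output: the short-term memory
    return h, c

rng = np.random.default_rng(2)
n_in, n_hid = 3, 5
p = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for k in ("Wf", "Wi", "Wo", "Wc")}
p.update({k: np.zeros(n_hid) for k in ("bf", "bi", "bo", "bc")})

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(4, n_in)):             # feed a short random sequence
    h, c = lstm_step(x, h, c, p)
print(h)
```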


    The operating principle of an LSTM type RNN: neurons in the inner layers can read and change the cell state, which combines the functions of short-term and long-term memory.

    Christopher Olah / colah.github.io

    Everything sounds great in words, but what can RNNs actually do? Say we gave one a text to read or music to listen to: what then?

    One of the main areas of application of RNNs today is working with language models, in particular analyzing the context and the overall relationships between words in a text. For an RNN, the structure of a language is long-term information that must be remembered: grammar, as well as the stylistic features of the corpus of texts on which training is carried out. In effect, the RNN remembers the order in which words usually appear and can complete a sentence after receiving a seed. If the seed is random, the result may be a completely meaningless text that stylistically resembles the corpus the RNN learned from. If the source text was meaningful, the RNN will help stylize it, but in that case the RNN alone is not enough, since the result should be a "mixture" of the random but stylized text from the RNN and the meaningful but "uncolored" source. This task is already so reminiscent of the Monet and Van Gogh style transfer that is now popular for photo processing that the analogy suggests itself.

    Indeed, the task of transferring style from one image to another is solved using neural networks and the convolution operation, which splits the image into several scales and allows neural networks to analyze them independently of each other and subsequently mix them together. Similar operations can be performed with music (also using convolutional neural networks): in this case the melody is the content and the arrangement is the style. And RNNs cope successfully with composing music. Since both tasks, writing a melody and mixing it with an arbitrary style, have already been solved with neural networks, combining these solutions is just a matter of engineering.

    Finally, let's figure out why RNNs manage to write music, after a fashion, while full-fledged texts in the style of Tolstoy or Dostoevsky cause problems. The point is that instrumental music, however barbaric this may sound, does not carry meaning in the same sense that most texts do. That is, you may like or dislike a piece of music, but if it has no words, it carries no informational load (unless, of course, it is a secret code). It is precisely with giving their works meaning that RNNs have problems: they can learn the grammar of a language perfectly and remember what a text in a certain style should look like, but RNNs cannot (yet) create or convey any idea or information.


    Scheme of a biaxial recurrent neural network for writing musical fragments: unlike the simplest architecture, this system actually combines two RNNs, one describing the sequence in time and the other the combination of notes at each moment.

    Daniel Johnson / hexahedria.com

    A special case here is the automatic writing of program code. Indeed, since a programming language is by definition a language, an RNN can learn it. In practice it turns out that programs written by RNNs compile and run quite successfully, but they do not do anything useful unless they were given a task in advance. And the reason is the same as with literary texts: for an RNN, a programming language is nothing more than a stylization into which, unfortunately, it cannot put any meaning.

    "Generating nonsense" is funny but pointless; what real tasks are RNNs used for?

    Of course, RNNs should pursue more pragmatic goals besides entertainment. It follows directly from their design that their main areas of application are those where context and/or temporal dependence in the data matter, which are essentially the same thing. RNNs are therefore used, for example, for image analysis. This area is usually associated with convolutional neural networks, but there are tasks for RNNs here too: their architecture allows details to be recognized faster based on context and surroundings. RNNs work similarly in text analysis and generation. Among the more unusual problems, one can recall attempts to use early RNNs to classify the carbon nuclear magnetic resonance spectra of various benzene derivatives, and among modern ones, the analysis of the appearance of negative reviews about products.

    What are the successes of RNN in machine translation? Are they exactly what Google Translate uses?

    Currently, Google uses LSTM-type RNNs for machine translation, which has made it possible to achieve the greatest accuracy among existing approaches; however, by the authors' own admission, machine translation is still very far from the human level. The difficulties that neural networks face in translation tasks are due to several factors. First, in any task there is an inevitable trade-off between quality and speed, and humans are currently far ahead of artificial intelligence in quality. Since machine translation is most often used in online services, developers are forced to sacrifice accuracy for the sake of speed. In a recent Google publication on this topic, the developers describe in detail many of the solutions used to optimize the current version of Google Translate, but the problem still remains. For example, rare words, slang, or a deliberate distortion of a word (say, for a catchier headline) can confuse even a human translator, who will have to spend time finding the most adequate analogue in another language. Such a situation can bring the machine to a complete standstill, and the machine translator will be forced to simply "drop" the difficult word and leave it untranslated. As a result, the problem of machine translation is determined not so much by the architecture (RNNs cope successfully with the routine tasks in this area) as by how complex and diverse language is. The good news is that this problem is more of an engineering one than writing meaningful texts, which will probably require a radically new approach.


    Operating principle of the Google Translate machine translator, based on a combination of several recurrent neural networks.

    research.googleblog.com/Google

    Are there more unusual ways to use RNNs? Take the neural Turing machine, for example: what is the idea there?

    The Neural Turing Machine, proposed two years ago by a team from Google DeepMind, differs from other RNNs in that the latter do not actually store information explicitly: it is encoded in the weights of the neurons and connections, even in advanced variants like LSTM. In the neural Turing machine, the developers followed the more intuitive idea of a "memory tape," as in a classical Turing machine: information is explicitly written "onto the tape" and can be read back when needed. Keeping track of which information is needed is the job of a special neural network controller. In general, the idea of the NTM is truly fascinating in its simplicity and ease of understanding. On the other hand, because of the technical limitations of today's hardware, it is hard to apply the NTM in practice, since training such a network takes an extremely long time. In this sense, RNNs are an intermediate link between simpler neural networks and NTMs: they store a certain "snapshot" of information, which does not fatally limit their performance.

    What is the concept of attention in relation to RNNs? What new things does it allow you to do?


    The concept of attention is a way to "tell" the network where to spend more effort when processing data. In other words, attention in a recurrent neural network is a way of increasing the importance of some data over other data. Since a human cannot provide hints every time (this would negate all the benefits of an RNN), the network must learn to give hints to itself. In general, the concept of attention is a very powerful tool for working with RNNs, as it allows the network to be told quickly and efficiently which data deserve attention and which do not. This approach may also, in the future, solve the problem of performance in systems with large amounts of memory. To better understand how this works, we need to consider two models of attention: "soft" and "hard." In the first case, the network still accesses all the data available to it, but the significance (i.e., weight) of that data differs. This makes the RNN more accurate, but not faster. In the second case, of all the available data the network accesses only some (the rest receive zero weights), which solves both problems at once. The disadvantage of the "hard" model of attention is that it ceases to be continuous, and therefore differentiable, which greatly complicates training; however, there are solutions that correct this shortcoming. Since the concept of attention has been developing actively over the last couple of years, we can expect more news from this field in the near future.
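
    A toy illustration of the difference between the two models (the dot-product scoring used here is a stand-in; real systems learn the scoring function): soft attention weighs every stored item, while hard attention reads only the best one and zeroes out the rest.

```python
# Soft vs. hard attention over a small set of memory items.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
memory = rng.normal(size=(6, 4))      # six stored items of dimension 4
query = rng.normal(size=4)            # what the network is currently "interested in"

scores = memory @ query               # relevance of each item to the query

# Soft attention: every item contributes, just with a different weight (differentiable, but slow).
soft_weights = softmax(scores)
soft_read = soft_weights @ memory

# Hard attention: only the single best item is read, the rest get zero weight (faster,
# but argmax is not differentiable).
hard_weights = np.zeros_like(scores)
hard_weights[np.argmax(scores)] = 1.0
hard_read = hard_weights @ memory

print(soft_weights.round(2), hard_weights)
```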

    Finally, an example of a system that uses the concept of attention is Dynamic Memory Networks, a variation proposed by Facebook's research division. In it, the developers describe an "episodic memory module" that, based on memories of events given as input data and on a question about those events, creates "episodes" that ultimately help the network find the correct answer to the question. The architecture was tested on bAbI, a large database of generated tasks for simple logical inference (for example, a chain of three facts is given and the correct answer is required: "Mary is at home. She went out into the yard. Where is Mary? In the yard."), and showed results superior to classical architectures like LSTM.

    What else is happening in the world of recurrent neural networks right now?

    According to Andrej Karpathy, a neural network expert and author of an excellent blog, "the concept of attention is the most interesting recent architectural design in the world of neural networks." However, work on attention is not the only line of research on RNNs. If we try to formulate the main trend briefly, it is now the combination of different architectures and the use of developments from other areas to improve RNNs. Examples include the already mentioned neural networks from Google, which use methods borrowed from work on reinforcement learning, neural Turing machines, optimization algorithms like batch normalization, and much more, all of which deserves a separate article. In general, although RNNs have not attracted as much attention as the public favorites, convolutional neural networks, this is only because the objects and tasks RNNs work with are not as eye-catching as DeepDream or Prisma. It's like on social networks: if a post is published without a picture, there will be less excitement around it.

    So always post with a picture.


    Taras Molotilin

    Recurrent networks, which contain feedback connections, are among the more complex types of artificial neural networks (ANNs). In the first recurrent ANNs, the main idea was to let the network use its own output signal from the previous step. Recurrent networks implement nonlinear models that can be used for optimal control of time-varying processes; that is, the feedback allows past events to be memorized adaptively. Generalizing recurrent ANNs makes it possible to create a more flexible tool for constructing nonlinear models. Let us consider some architectures of recurrent ANNs.

    The Jordan network is based on a multilayer perceptron. Feedback is implemented by feeding the input layer not only the initial data but also the network's output signals delayed by one or several clock cycles, which makes it possible to take into account the history of the observed processes and to accumulate information for developing the correct control strategy.

    The Elman network, just like the Jordan network, is obtained from a multilayer perceptron by introducing feedback connections; only the signals to the input layer come not from the network outputs but from the outputs of the hidden layer neurons. An example of the Elman network architecture is shown in Fig. 1. The hidden layer outputs {c_1, c_2, …, c_k} are fed with a time delay to the input neurons with weighting coefficients {w_ij}^(-1), where i is the index of the neuron to which the signal is sent (i = 1, 2, …, n) and j is the index of the hidden layer neuron's output signal (j = 1, 2, …, k).

    Fig. 1. Example of the Elman network architecture

    To generalize recurrent ANNs, this article proposes delaying the hidden layer feedback signals by several clock cycles. To do this, a dynamic stack memory is added to the layer. An example of the architecture of such an ANN is shown in Fig. 2.

    Fig. 2. An example of the architecture of a recurrent ANN with a dynamic stack memory storing several previous hidden layer outputs

    The hidden layer outputs {c_1, c_2, …, c_k} are fed to the input neurons with weighting coefficients {w_ij}^(-t), where i is the index of the neuron to which the signal is sent (i = 1, 2, …, n), j is the index of the hidden layer neuron's output signal (j = 1, 2, …, k), and t is the time delay index (t = 1, 2, …, m). The number of time delays is varied from 1 to m; thus, at m = 1 the Elman network is obtained, and at m = 0 the multilayer perceptron.
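
    A minimal sketch of the described generalization (our own rendering, not the authors' NeuroNADS code): the input vector is extended with the hidden-layer outputs delayed by 1…m clock cycles, so m = 1 reproduces the Elman network and m = 0 a plain multilayer perceptron.

```python
# Recurrent network with a dynamic stack memory of depth m over the hidden-layer outputs.
import numpy as np
from collections import deque

class StackMemoryRNN:
    def __init__(self, n_in, n_hidden, n_out, m, seed=0):
        rng = np.random.default_rng(seed)
        self.memory = deque([np.zeros(n_hidden)] * m, maxlen=m)   # c^-1 ... c^-m
        self.W_hid = rng.normal(scale=0.1, size=(n_hidden, n_in + m * n_hidden))
        self.W_out = rng.normal(scale=0.1, size=(n_out, n_hidden))

    def step(self, x):
        extended = np.concatenate([x, *self.memory])   # inputs + delayed hidden outputs
        c = np.tanh(self.W_hid @ extended)             # hidden-layer outputs
        self.memory.appendleft(c)                      # push; the oldest cell is dropped
        return self.W_out @ c                          # network output

net = StackMemoryRNN(n_in=3, n_hidden=3, n_out=1, m=2)  # the configuration used further below
for x in np.random.default_rng(1).normal(size=(5, 3)):
    y = net.step(x)
print(y)
```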

    A detailed examination of the architecture of a recurrent network shows that feedback from the hidden layer or from the network output can be eliminated by adding feedback signals to the training set.

    Let's consider the process of transforming the training sample to solve the problem of time series forecasting using a recurrent ANN with dynamic stack memory. As an example, we will use the average monthly values of the solar radiation flux density at a wavelength of 10.7 cm for 2010-2012 (Table 1).

    Table 1. Data on solar radiation flux density at a wavelength of 10.7 cm for 2010-2012

    Example No.  Date  Radiation flux density, 10^-22 [W/m^2]
    1 January 2010 834,84
    2 February 2010 847,86
    3 March 2010 833,55
    4 April 2010 759,67
    5 May 2010 738,71
    6 June 2010 725,67
    7 July 2010 799,03
    8 August 2010 797,10
    9 September 2010 811,67
    10 October 2010 816,77
    11 November 2010 824,67
    12 December 2010 843,23
    13 January 2011 837,42
    14 February 2011 945,71
    15 March 2011 1153,87
    16 April 2011 1130,67
    17 May 2011 959,68
    18 June 2011 959,33
    19 July 2011 942,58
    20 August 2011 1017,74
    21 September 2011 1345,00
    22 October 2011 1372,90
    23 November 2011 1531,67
    24 December 2011 1413,55
    25 January 2012 1330,00
    26 February 2012 1067,93
    27 March 2012 1151,29
    28 April 2012 1131,67
    29 May 2012 1215,48
    30 June 2012 1204,00

    We transform the time series using the sliding window method, as shown in Table 2.

    Table 2. ANN training sample for solving the forecasting problem, obtained as a result of transforming a time series using the windowing method

    Example No.  ANN inputs (x)  ANN outputs (y)
    x_1  x_2  x_3  y_1
    1 834,84 847,86 833,55 759,67
    2 847,86 833,55 759,67 738,71
    3 833,55 759,67 738,71 725,67
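
    A short sketch of the sliding-window transformation that produces Table 2 (a window of three inputs and one output; the values are the first rows of Table 1):

```python
# Turn a time series into (inputs, target) examples with a sliding window.
series = [834.84, 847.86, 833.55, 759.67, 738.71, 725.67]   # F10.7 values, Jan-Jun 2010

def sliding_window(values, n_inputs=3):
    examples = []
    for i in range(len(values) - n_inputs):
        examples.append((values[i:i + n_inputs], values[i + n_inputs]))
    return examples

for x, y in sliding_window(series):
    print(x, "->", y)
# [834.84, 847.86, 833.55] -> 759.67   (example 1 of Table 2), and so on
```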

    Let the hidden layer of the recurrent ANN contain three neurons, the output layer one neuron, and the dynamic memory stack the feedback signals of the hidden layer with a delay of up to two clock cycles (Fig. 3).

    Fig. 3. Recurrent ANN with memory for the two previous output signals of the hidden layer

    Since the number of hidden layer neurons that have feedback to the input layer is three, the size of the input vector during ANN training will increase by three when the previous output signal is stored one step back, and by six when the two previous output signals are stored. Let us denote the input signals of the training sample that do not change during the transformation as (x_1, x_2, x_3), and the feedback signals as (x_4, x_5, x_6, x_7, x_8, x_9). Table 3 shows the transformed training set.

    Table 3. Adding output signals of the hidden layer to the training set of the recurrent ANN

    No.  ANN inputs (x)  ANN outputs (y)
    x_1  x_2  x_3  x_4  x_5  x_6  x_7  x_8  x_9  y_1
    1  834,84  847,86  833,55  0  0  0  0  0  0  759,67
    2  847,86  833,55  759,67  c_1^-1  c_2^-1  c_3^-1  0  0  0  738,71
    3  833,55  759,67  738,71  c_1^-1  c_2^-1  c_3^-1  c_1^-2  c_2^-2  c_3^-2  725,67

    The inputs (x_4, x_5, x_6) receive the output signals of the hidden layer delayed by one clock cycle (c_1^-1, c_2^-1, c_3^-1), and the inputs (x_7, x_8, x_9) receive the output signals of the hidden layer delayed by two clock cycles (c_1^-2, c_2^-2, c_3^-2).

    Thus, training a recurrent ANN with dynamic stack memory using the backpropagation method can be reduced to training a multilayer perceptron by transforming the training set. To implement the proposed methodology for training a recurrent ANN with dynamic stack memory, the capabilities of the NeuroNADS neuroemulator have been extended.

    The object-oriented model of a recurrent ANN with dynamic stack memory is presented in the class diagram (Fig. 4).

    Fig. 4. Diagram of the main classes that implement a recurrent ANN with dynamic stack memory

    Unlike the Layer class, which is a container for multilayer perceptron neurons, the LayerMemory class contains a memory stackOut, implemented as a stack of the layer's previous signals. The stack size is set by the stackSize property. In the diagram (Fig. 5), the layer memory is depicted as a stack of layer output signals (y^-1, y^-2, …, y^-n), where n is the stack size. Each stack cell y^-i consists of an array of the layer's neuron outputs (y_1, y_2, …, y_n). The stack is organized so that when the memory overflows, the last cell y^-n is deleted and the entire queue is shifted by one position, so that y^-i = y^-(i-1).
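
    A hypothetical rendering of this stack behaviour (the actual NeuroNADS classes are not reproduced here): on overflow the last cell y^-n is discarded and the queue shifts by one position.

```python
# Sketch of the stackOut behaviour described for LayerMemory.
class LayerMemoryStack:
    def __init__(self, stack_size):
        self.stack_size = stack_size     # the stackSize property
        self.stack_out = []              # the stackOut stack of previous layer outputs

    def push(self, layer_outputs):
        self.stack_out.insert(0, list(layer_outputs))   # the newest outputs become y^-1
        if len(self.stack_out) > self.stack_size:
            self.stack_out.pop()                        # y^-n is deleted on overflow

mem = LayerMemoryStack(stack_size=2)
for outputs in ([0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]):
    mem.push(outputs)
print(mem.stack_out)   # [[0.7, 0.8, 0.9], [0.4, 0.5, 0.6]]: the oldest cell was dropped
```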

    Fig. 5. Implementation of a layer with memory (LayerMemory) for recurrent ANNs with dynamic stack memory

    Let us forecast the average monthly solar activity flux density at a wavelength of 10.7 cm for the first six months of 2012 based on the data for 2010-2011 from Table 1. To do this, we will build and train a recurrent ANN with dynamic stack memory (Fig. 3) using the NeuroNADS neuroemulator. We take the first 24 examples of the time series as the training sample and the remaining six examples as the test sample.

    We will conduct training using a hybrid algorithm. Algorithm parameters: learning step – 0.3, maximum number of individuals in a generation – 10, mutation coefficient – 0.1. Criteria for stopping training: root mean square error – 0.001, number of epochs – 1000.

    One of the best results of ANN training is shown in Fig. 6 and Fig. 7. Indicators of the time series forecasting errors are presented in Table 4.

    blue – the original time series;
    red – network output values on the training set;
    green – predicted network values.

    Fig. 6. Results of querying the recurrent ANN with dynamic stack memory on the training and test samples (the x-axis is the example number, the y-axis is the value of the time series)

    Fig. 7. Graph of the change in the root-mean-square error of the recurrent ANN with dynamic stack memory during training (the x-axis is the number of epochs, the y-axis is the error value)

    Table 4. Indicators of time series forecasting errors

    Based on the training results, we can conclude that the recurrent ANN with dynamic stack memory coped with the task; the indicators of time series forecasting errors correspond to acceptable values. Thus, recurrent ANNs with dynamic stack memory can be trained using the proposed methodology, and the constructed ANN models can be used for forecasting time series.

    The study was carried out with financial support from the Russian Foundation for Basic Research within the framework of scientific project No. 14-01-00579 a.

    References:

    1. Bodyansky E.V., Rudenko O.G. Artificial neural networks: architectures, training, applications. – Kharkov: TELETECH, 2004. – 369 p.
    2. Osovsky S. Neural networks for information processing / Transl. from Polish I.D. Rudinsky. – M.: Finance and Statistics, 2002. – 344 p.
    3. Information and analytical system [Electronic resource]: data on solar and geomagnetic activity. – Access mode: http://moveinfo.ru/data/sun/select (free access). – Title from screen. – In Russian.
    4. Krug P.G. Neural networks and neurocomputers. M.: MPEI Publishing House, 2002 – 176 p.
    5. NeuroNADS neuroemulator [Electronic resource]: web service. – Access mode: http://www.service.. – Title from screen. – In Russian.
    6. Belyavsky G.I., Puchkov E.V., Lila V.B. Algorithm and software implementation of a hybrid method for training artificial neural networks // Software Products and Systems. Tver, 2012. No. 4. pp. 96-100.

    In today's material we will remind readers of the concept of an artificial neural network (ANN), recall what kinds of ANNs exist, and consider the problem of forecasting with ANNs in general and with recurrent ANNs in particular.

    Neural networks

    First, let's remember what an artificial neural network is. In one of the previous articles, we already discussed that an ANN is a network of artificial neurons (a “black box” with many inputs and one output) that transforms a vector of input signals (data) into a vector of output signals using a certain function called an activation function. In this case, between the layer of “receiving” neurons and the output layer there is at least one intermediate one.

    The structure of an ANN determines whether it has feedback: in a feedforward ANN, the signal passes sequentially from the input layer of neurons through the intermediate layers to the output layer; a recurrent structure implies the presence of feedback, where the signal from the output or intermediate neurons partially returns to the inputs of the input layer of neurons (or of one of the intermediate layers).

    Recurrent neural networks

    If we look at recurrent ANNs in a little more detail, it turns out that the most modern (and considered the most "successful") of them originate from a structure called a multilayer perceptron (a mathematical model of the brain: a feedforward ANN with intermediate layers). At the same time, they have undergone significant changes since they first appeared, and the "new generation" of ANNs is much simpler than its predecessors even though it successfully solves the problem of memorizing sequences. For example, the most popular network today, the Elman network, is designed so that the feedback signal from the internal layer is sent not to the "main" input neurons but to additional inputs, the so-called context. These neurons store information about the previous input vector (stimulus); it turns out that the output signal (the network's reaction) depends not only on the current stimulus but also on the previous one.

    Solution to the forecasting problem

    It is clear that Elman networks are potentially suitable for forecasting (in particular, of time series). However, it is also known that feedforward neural networks cope successfully with this task, although not in all cases. As an example, consider one of the most popular variations of the forecasting problem: time series (TS) forecasting. The problem statement comes down to choosing an arbitrary TS with N samples. The data are then divided into three samples (training, testing and control) and fed to the input of the ANN. The result is the value of the time series at the required point in time.

    In general, the task of forecasting time series using an ANN comes down to the following sequence of steps:

    • collecting data for training (a stage considered one of the most difficult);
    • preparation and normalization of data (reduction to TS form);
    • choice of ANN topology (at this stage a decision is made on the number of layers and the presence of feedback);
    • empirical (through experiment) selection of ANN characteristics;
    • empirical selection of training parameters;
    • ANN training;
    • checking training for adequacy to the task;
    • adjustment of parameters taking into account the previous step, final training;
    • verbalization of ANN (minimized description using several algebraic or logical functions) for further use.

    Why recurrent ANNs?

    It is clear that the decision about the ANN topology can affect the result; but let's return to the beginning of the conversation: why did we deliberately choose forecasting with a recurrent network as the topic of this article? After all, if you google it, TS forecasting in published work is usually carried out using multilayer perceptrons (recall that these are feedforward networks) and the backpropagation method. It is worth clarifying here: yes, indeed, in theory such ANNs solve the forecasting problem well, provided that the degree of noise (errors and omissions in the input data) of the original time series is minimal.

    In practice, time series are quite noisy, which naturally causes problems when trying to forecast. Using ensembles of feedforward networks can reduce the error; however, this significantly increases not only the complexity of the structure itself but also its training time.

    Using the Elman recurrent network makes it possible to solve the forecasting problem even on highly noisy time series (this is especially important for business). In general, this ANN is a structure of three layers plus a set of additional "context" elements (inputs). Feedback goes from the hidden layer to these elements; each such link has a fixed weight equal to one. At each time step, the input data are propagated through the neurons in the forward direction, and then the learning rule is applied. Thanks to the fixed feedback, the context elements always store a copy of the values of the hidden layer from the previous step (since they are sent back before the learning rule is applied). Thus the noise of the time series is gradually smoothed out and the error is minimized along with it: we get a forecast that, in the general case, is more accurate than the result of the classical approach, which Western studies confirm experimentally.

    Summary

    Having considered some aspects of the practical application of neural networks to solving the forecasting problem, we can conclude: the recurrent model is the future of forecasting. At least this applies to noisy time series - and, as you know, in practice, especially in business, things cannot be done without inaccuracies and omissions in the data. Western science, and after it enthusiastic practitioners, have already understood this. In the post-Soviet space, the general public has yet to reach these conclusions - we hope that this material will help our readers draw their conclusions today.

    Recurrent neural networks (RNNs) are popular models used in natural language processing (NLP). First, they let us score arbitrary sentences based on how likely they are to appear in real texts, which gives us a measure of grammatical and semantic correctness; such models are used in machine translation. Second, a language model allows us to generate new text. Training a model on Shakespeare's poems will let it generate new text similar to Shakespeare.

    What are recurrent neural networks?

    The idea of an RNN is to make use of sequential information. Traditional neural networks assume that all inputs and outputs are independent of each other, but for many tasks this does not hold. If you want to predict the next word in a sentence, it is better to take into account the words that precede it. RNNs are called recurrent because they perform the same task for each element of the sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a "memory" that takes previous information into account. In theory, RNNs can use information from arbitrarily long sequences, but in practice they are limited to looking back only a few steps (more on this later).

    The diagram above shows an RNN being unrolled into a full network. By unrolling we simply mean writing out the network for the complete sequence. For example, if the sequence is a sentence of 5 words, the unrolled network will consist of 5 layers, one layer per word. The formulas that define the computations in an RNN are as follows (a minimal code sketch of these computations follows the list):

    • x_t - input at time step t. For example, x_1 could be a one-hot vector corresponding to the second word of the sentence.
    • s_t is the hidden state at step t. This is the “memory” of the network. s_t depends, as a function, on previous states and the current input x_t: s_t=f(Ux_t+Ws_(t-1)). The function f is usually non-linear, such as tanh or ReLU. s_(-1), which is required to calculate the first hidden state, is usually initialized to zero (null vector).
    • o_t - output at step t. For example, if we want to predict a word in a sentence, the output could be a vector of probabilities in our dictionary. o_t = softmax(Vs_t)
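
    A direct, minimal implementation of the formulas above (the vocabulary and hidden sizes are made up for the example, and the weights are random rather than trained):

```python
# Forward pass of a vanilla RNN: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t).
import numpy as np

vocab_size, hidden_size = 8, 16
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(word_indices):
    """Return the output distribution o_t for every step of the input sequence."""
    s = np.zeros(hidden_size)                  # s_{-1}, initialized to the null vector
    outputs = []
    for idx in word_indices:
        x = np.zeros(vocab_size); x[idx] = 1   # one-hot encoding of the current word
        s = np.tanh(U @ x + W @ s)             # hidden state: the "memory" of the network
        outputs.append(softmax(V @ s))         # probability distribution over the vocabulary
    return outputs

print(forward([2, 5, 1])[-1].round(3))         # distribution for the word after the sequence
```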

    A few notes:

    • You can interpret s_t as network memory. s_t contains information about what happened in previous time steps. The output o_t is calculated solely based on the "memory" s_t. In practice, everything is a little more complicated: s_t cannot contain information from too many previous steps;
    • Unlike a traditional deep neural network, which uses different parameters in each layer, an RNN shares the same parameters (U, V, W) across all steps. This reflects the fact that we perform the same task at each step, just with different inputs, and it significantly reduces the total number of parameters we need to learn;
    • The diagram above has outputs at each step, but depending on the task, they may not be needed. For example, when determining the emotional coloring of a sentence, it is advisable to care only about the final result, and not about the coloring after each word. Likewise, we may not need to enter data at every step. The main feature of RNN is the hidden state, which contains some information about the sequence.

    Where are recurrent neural networks used?

    Recurrent neural networks have demonstrated great success in many NLP tasks. At this point it should be mentioned that the most commonly used type of RNN is the LSTM, which is much better at capturing (storing) long-term dependencies than a vanilla RNN. But don't worry: LSTMs are essentially the same as the RNNs we cover in this tutorial, they just compute the hidden state differently. We will look at LSTMs in more detail in another post. Here are some examples of RNN applications in NLP (by no means an exhaustive list).

    Language modeling and text generation

    Given a sequence of words, we want to predict the probability of each word (from the dictionary) given the words that precede it. Language models let us measure how likely a sentence is, which is an important input for machine translation (since high-probability sentences are typically correct). A side effect of being able to predict the next word is that we can generate new text by sampling from the output probabilities; depending on the training data, we can generate all sorts of things. In language modeling, our input is typically a sequence of words (for example, encoded as one-hot vectors) and our output is the sequence of predicted words. When training the network we set o_t = x_(t+1), since we want the output at step t to be the actual next word.
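
    A sketch of the generation loop just described, with the same made-up shapes and untrained weights as the earlier snippet: at each step we sample a word from the output distribution o_t and feed it back in as the next input.

```python
# Generating a sequence by sampling from the RNN's output distribution at each step.
import numpy as np

vocab_size, hidden_size = 8, 16
rng = np.random.default_rng(4)
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(start_idx, length=10):
    s, idx, sentence = np.zeros(hidden_size), start_idx, [start_idx]
    for _ in range(length):
        x = np.zeros(vocab_size); x[idx] = 1                  # previous output becomes the input
        s = np.tanh(U @ x + W @ s)
        idx = int(rng.choice(vocab_size, p=softmax(V @ s)))   # sample the next word from o_t
        sentence.append(idx)
    return sentence

print(generate(start_idx=0))   # word indices; with trained weights these would read as text
```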


    Machine translation

    Machine translation is similar to language modeling in that our input is a sequence of words in the source language (for example, German) and we want to output a sequence of words in the target language (for example, English). The key difference is that the output only begins after we have seen the entire input, since the first word of the translated sentence may require information from the whole input sequence.

    RNN for machine translation

    Speech recognition

    Given an input sequence of acoustic signals from a sound wave, we can predict the sequence of phonetic segments along with their probabilities.

    Generating image descriptions

    Together with convolutional neural networks, RNNs have been used as part of a model for generating descriptions of unlabeled images. It is amazing how well this works. The combined model aligns the generated words with the features found in the images.

    Deep visual-semantic alignments for generating image descriptions.