
    Parallel Processing

    Parallel processing is a model of executing an application process simultaneously on a group of processors. There are three basic ways to implement parallelism:
    1. The SIMD method works with one instruction stream and several data streams: all processors, running the same program under the control of a master processor, process their own data arrays;
    2. The MIMD method works with multiple instruction streams and multiple data streams: the processors run their own programs independently of each other and only occasionally communicate with one another;
    3. The MISD method works with multiple instruction streams and a single data stream.


    Ministry of Education and Science of the Russian Federation

    Federal Agency for Education

    South Russian State Technical University

    (Novocherkassk Polytechnic Institute)

    Shakhty Institute (branch) SRSTU (NPI)

    LECTURES ON THE DISCIPLINE

    "PARALLEL COMPUTING"

    Shakhty - 2010

    Introduction

    Basic Concepts

    1. General issues of solving “big problems”

    1.1 Modern problems of science and technology that require supercomputers for their solution

    1.2 Parallel data processing

    1.2.1 Fundamental possibility of parallel processing

    1.2.2 Abstract parallel computing models

    1.2.3 Methods of parallel data processing, calculation error

    1.3 The concept of a parallel process and parallelization granules

    1.4 Interaction of parallel processes, process synchronization

    1.5 Possible acceleration in parallel computing (Amdahl's law)

    2. Principles of constructing multiprocessor computing systems

    2.1 Architecture of multiprocessor computing systems

    2.2 Distribution of calculations and data in multiprocessor computing systems with distributed memory

    2.3 Classification of parallel computing systems

    2.4 Multiprocessor computing systems with distributed memory

    2.4.1 Massively parallel supercomputers of the Cray T3 series

    2.4.2 BEOWULF class cluster systems

    Conclusion

    References

    Introduction

    Even at the dawn of the computer era, around the middle of the last century, designers of electronic computers began to think about the possibility of using parallel computing. After all, increasing performance solely by improving a computer's electronic components is a rather expensive approach that, moreover, runs into limits imposed by physical laws. Thus, parallel data processing and instruction-level parallelism were introduced into the design of computers, and now any user of a personal computer, perhaps without knowing it, is working on a parallel computer.

    One of the noticeable trends in the development of mankind is the desire to model the processes of the surrounding reality as accurately as possible, both to improve living conditions in the present and to predict the future as reliably as possible. Mathematical methods and digital modeling techniques in many cases make it possible to solve such problems; however, over time, the technology for solving them becomes seriously more complex, both qualitatively and quantitatively. In many cases, the limiting factor is the lack of computing power of modern electronic computers, but the importance of the problems being solved has attracted huge financial resources to the field of creating highly complex electronic computers.

    For some time now, increasing the speed of computers of traditional (called “von Neumann”) architecture has become prohibitively expensive due to technological limitations in the production of processors, so developers have turned their attention to another way to increase productivity - combining electronic computers into multiprocessor computing systems. In this case, individual program fragments are executed in parallel (and simultaneously) on different processors, exchanging information via an internal computer network.

    The idea of combining electronic computers in order to increase both productivity and reliability has been known since the late fifties.

    The requirement to obtain maximum performance at minimum cost led to the development of multiprocessor computing systems; systems of this kind are known that combine the computing power of thousands of individual processors. The next stage is the attempt to unite millions of heterogeneous computers on the planet into a single computing complex with enormous performance through the Internet. Today, the use of parallel computing systems is a strategic direction in the development of computer technology. The development of hardware is necessarily supported by the improvement of the algorithmic and software components: parallel programming technologies.

    The method of parallelizing calculations has existed for a long time, but organizing the joint functioning of many independent processors requires serious theoretical and practical research, without which a complex and relatively expensive multiprocessor installation often not only fails to exceed, but is actually inferior in performance to, a traditional computer.

    The potential for parallelization varies across computational tasks of various types: it is significant for scientific programs containing many loops and lengthy calculations, and significantly less for engineering problems, which are characterized by calculations using empirical formulas.

    Let's consider two main questions:

    1. Multiprocessor computing systems (massively parallel supercomputers) Cray T3D(E) with 40 to 2176 processors. These are distributed-memory supercomputers built on RISC processors of the Alpha 21164A type, with a communication network whose topology is a three-dimensional torus, running a UNIX operating system with a microkernel and providing translators for the FORTRAN, HPF, and C/C++ languages. Supported programming models: MPI, PVM, HPF.

    2. Beowulf clusters of workstations. A workstation cluster is a collection of workstations connected to a local network. A cluster is a computing system with distributed memory and distributed control, and a cluster system can have performance comparable to that of supercomputers. Clusters of workstations connected by a local Ethernet network and running the Linux operating system are usually called Beowulf clusters (after the project of the same name).

    Basic Concepts

    The most common programming technology for cluster systems and distributed-memory parallel computers is currently MPI. The main way parallel processes interact in such systems is by passing messages to each other; this is reflected in the name of the technology, Message Passing Interface. The MPI standard defines an interface that must be followed both by the programming system on each computing platform and by the user when creating programs. MPI supports the Fortran and C languages. The full version of the interface contains descriptions of more than 125 procedures and functions.

    The MPI interface supports the creation of parallel programs in the MIMD (Multiple Instruction Multiple Data) style, which involves combining processes with different source codes. However, writing and debugging such programs is very difficult, so in practice programmers much more often use the SPMD (Single Program Multiple Data) model of parallel programming, within which the same code is used for all parallel processes. Nowadays, more and more MPI implementations support working with so-called "threads".

    Since MPI is a library, when compiling a program it is necessary to link the corresponding library modules.

    After obtaining the executable file, it must be launched on the required number of processors. After launch, the same program is executed by all the running processes; the result of execution, depending on the system, is output to the terminal or written to a file.
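
    As an illustration of the SPMD scheme just described, below is a minimal sketch of an MPI program in C (the build and launch commands vary between MPI implementations; mpicc and mpirun/mpiexec are the typical names):

        /* A minimal SPMD sketch: the same code runs in every process; each
         * process learns its own rank and the total number of processes.
         * Typical build/run (names vary by MPI implementation):
         *     mpicc hello.c -o hello
         *     mpirun -np 4 ./hello                                        */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);                  /* start the MPI runtime     */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's number     */
            MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
            printf("process %d of %d\n", rank, size);
            MPI_Finalize();                          /* shut down the MPI runtime */
            return 0;
        }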

    An MPI program is a set of parallel, interacting processes. All processes are spawned once, forming the parallel part of the program. During the execution of an MPI program, the creation of additional processes or the destruction of existing ones is not allowed (this possibility appeared in later versions of MPI). Each process operates in its own address space; there are no shared variables or data in MPI. The main way processes interact is by explicitly sending messages to one another.

    To localize the interaction of parallel program processes, you can create groups of processes and provide them with a separate communication environment: a communicator. The composition of the groups is arbitrary; groups can coincide completely, be contained in one another, not intersect, or intersect partially. Processes can only interact within a certain communicator; messages sent in different communicators do not intersect or interfere with each other. Communicators are of integer type in Fortran (in C, of the predefined type MPI_Comm).

    When a program starts, it is always assumed that all spawned processes operate within an all-encompassing communicator (MPI_COMM_WORLD in C), which always exists and serves for the interaction of all running processes of the MPI program. All interactions between processes take place within a specific communicator; messages transmitted in different communicators do not interfere with each other.
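
    A minimal sketch of explicit message passing within the all-encompassing communicator (an assumed two-process example; MPI_Send and MPI_Recv are the standard blocking point-to-point calls):

        /* Process 0 sends one integer to process 1 inside MPI_COMM_WORLD;
         * messages carry a tag (here 99) and are matched within the communicator. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, value = 0;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (rank == 0) {
                value = 42;
                MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("process 1 received %d\n", value);
            }

            MPI_Finalize();
            return 0;
        }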

    Reduced Instruction Set Computer (RISC) processors. The RISC (Reduced Instruction Set Computer) processor architecture is based on the idea of increasing processor speed by simplifying the instruction set.

    Research has shown that 33% of the instructions in a typical program are data transfers, 20% are conditional branches, and another 16% are arithmetic and logical operations. In the vast majority of instructions, address calculation can be performed quickly, in one cycle. More complex addressing modes are used in approximately 18% of cases. About 75% of the operands are scalar, that is, variables of integer, real, character, etc., and the rest are arrays and structures. 80% of scalar variables are local, and 90% of structural variables are global. Thus, most operands are local operands of scalar types. They can be stored in registers.

    According to statistics, most of the time is spent processing the “subroutine call” and “subroutine return” statements. When compiled, these statements produce long sequences of machine instructions with a large number of memory accesses, so even if the share of these statements is only 15%, they consume the bulk of the processor time. Only about 1% of routines have more than six parameters, and about 7% of routines contain more than six local variables.

    Studying these statistics led to the conclusion that simple operations dominate in a typical program: arithmetic, logic, and data transfers. Simple addressing modes also dominate. Most operands are scalar local variables. One of the most important resources for increasing performance is the optimization of these operations.

    The RISC architecture is based on the following principles and ideas. The instruction set should be limited and include only simple instructions whose execution time, after fetching and decoding, is one clock cycle or a little more. Pipelined processing is used. Simple RISC instructions can be implemented efficiently in hardware, while complex instructions can only be implemented by microprogramming. The design of the control unit in the RISC architecture is simplified, which allows the processor to operate at higher clock frequencies. The use of simple instructions makes it possible to implement effectively both pipelined data processing and pipelined instruction execution.

    Complex instructions take longer to execute on a RISC processor, but their number is relatively small. In RISC processors, few instructions address memory; fetching data from RAM requires more than one clock cycle, and most instructions work with operands located in registers. All instructions have a unified format and fixed length, which makes loading and decoding them easier and faster, since, for example, the opcode and the address field are always in the same position. Variables and intermediate results of calculations can be stored in registers. Taking into account the statistics on the use of variables, most local variables and procedure parameters can be placed in registers. When a new procedure is called, the contents of the registers are usually moved to RAM; however, if the number of registers is large enough, much of the time-consuming memory exchange can be avoided by replacing it with register operations. Thanks to the simplified architecture of a RISC processor, there is space on the chip for an additional set of registers.

    Currently, computing systems with RISC architecture occupy leading positions in the global computer market for workstations and servers. The development of RISC architecture is associated with the development of compilers, which must effectively take advantage of the large register file, pipelining, etc.

    1. General issues of solving “big problems”

    The term “big problems” usually refers to problems whose solution requires not only the construction of complex mathematical models but also the performance of a huge number of calculations, many orders of magnitude greater than is typical for ordinary electronic computers. Such problems are solved on electronic computers with correspondingly large resources: the size of main and external memory, the speed of the data transmission lines, and so on.

    The upper limit on the number of calculations for “big problems” is determined only by the performance of currently existing computing systems. When “running” computational tasks under real conditions, the question is not posed as “solve the problem at all” but as “solve it in an acceptable time” (hours or tens of hours).

    1.1. Modern problems of science and technology that require supercomputers for their solution

    Quite often one has to deal with problems that, while of considerable value to society, cannot be solved with the help of relatively slow office- or home-class computers. The only hope in this case rests on high-speed computers, which are commonly called supercomputers. Only machines of this class can cope with processing large volumes of information: for example, statistical data, the results of meteorological observations, or financial information. Sometimes the processing speed is critical; examples include weather forecasting and the modeling of climate change. The earlier a natural disaster is predicted, the greater the opportunity to prepare for it. Important tasks also include drug design, deciphering the human genome, tomography (including medical tomography), and the exploration of oil and gas fields. Many more examples could be given.

    Modeling the processes of the surrounding reality in order to both improve living conditions in the present and reliably predict the future is one of the trends in the development of mankind. Mathematical methods and digital modeling techniques in many cases make it possible to solve such problems, however, over time, the technology for solving such problems becomes more complex. In many cases, the limitation is the lack of computing power of modern electronic computers.

    The requirements to obtain maximum performance at minimum cost led to the development of multiprocessor computing systems; systems of this kind are known that combine the computing power of thousands of individual processors.

    Listed below are some areas of human activity that require supercomputer power and parallel computing for their solution:

    Predictions of weather, climate and global atmospheric changes

    Materials Science

    Construction of semiconductor devices

    Superconductivity

    Pharmaceutical development

    Human genetics

    Astronomy

    Transport problems of large dimension

    Hydro and gas dynamics

    Controlled thermonuclear fusion

    Oil and gas exploration

    Computational problems in ocean sciences

    Speech recognition and synthesis, image recognition

    One of the biggest challenges is climate system modeling and weather prediction. Here the equations of continuum dynamics and the equations of equilibrium thermodynamics are solved jointly by numerical methods. To simulate the development of atmospheric processes over 100 years with 2.6×10^6 discretization elements (a grid with a step of 1° in latitude and longitude over the entire surface of the planet, with 20 layers in height, the state of each element being described by 10 components), the state of the Earth's atmosphere at any moment in time is described by 2.6×10^7 numbers. With a time step of 10 minutes, about 5×10^6 sets of values must be determined over the simulated period (that is, on the order of 10^14 intermediate numerical values). If obtaining each intermediate result requires 10^2 to 10^3 computational operations, the total number of floating-point operations needed for a numerical experiment with a global atmospheric model reaches 10^16 to 10^17.

    A supercomputer with a performance of 10^12 operations per second would, in the ideal case (full load and efficient algorithms), carry out such an experiment within several hours; a complete modeling process requires running the program repeatedly (tens to hundreds of times).
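
    A rough back-of-the-envelope check of this estimate, using only the round figures quoted above:

        \[
        \frac{10^{16}\ \text{operations}}{10^{12}\ \text{operations/s}} = 10^{4}\ \text{s} \approx 2.8\ \text{hours},
        \qquad
        \frac{10^{17}\ \text{operations}}{10^{12}\ \text{operations/s}} = 10^{5}\ \text{s} \approx 28\ \text{hours}.
        \]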

    The problem of supercomputing is so important that many states supervise work in the field of supercomputer technologies.

    State support is directly related to the fact that independence in the production and use of computer technology is in the interests of national security, and a country's scientific potential is directly related to, and largely determined by, the level of development of its computer technology and software.

    For objectivity of comparison, the performance of supercomputers is measured by executing a previously agreed test task (a “benchmark”). Peak performance is determined by the maximum number of operations that can be performed per unit of time in the absence of dependencies between functional units; it characterizes the potential capabilities of the hardware and does not depend on the program being executed.

    The disadvantage of assessing performance as the number of instructions executed per unit of time (MIPS, Million Instructions Per Second) is that it gives only the most general idea of performance, since it does not take into account the specifics of particular programs (it is difficult to predict how many, and exactly which, processor instructions a user's program will compile into).

    It should be noted that there are arguments against the widespread practical use of parallel computing:

    Parallel computing systems are prohibitively expensive. According to Grosch's law, confirmed by practice, computer performance grows in proportion to the square of its cost; as a result, it is much more profitable to obtain the required computing power by purchasing one powerful processor than by using several slower processors.

    Counterargument. The increase in speed of sequential electronic computers cannot continue indefinitely; computers are subject to rapid obsolescence and frequent financial costs are required for the purchase of new models. The practice of creating parallel computing systems of the Beowulf class has clearly shown the cost-effectiveness of this particular path.

    When organizing parallelism, performance losses grow unnecessarily quickly. According to Marvin Minsky's hypothesis, the computational acceleration achieved when using a parallel system is proportional to the binary logarithm of the number of processors (with 1000 processors, the possible acceleration is only 10).

    Counterargument. The given speedup estimate is correct for parallelizing certain algorithms. However, there are a large number of tasks, when solved in parallel, close to 100% utilization of all available processors of a parallel computing system is achieved.

    Serial computers are constantly improving. According to the well-known Moore's Law, the complexity of sequential microprocessors doubles every 18 months, so the required performance can be achieved on “regular” sequential computers.

    Counterargument. A similar development is characteristic of parallel systems.

    However, the use of parallelism makes it possible to obtain the necessary acceleration of calculations without waiting for the development of new, faster processors. The efficiency of parallelism strongly depends on the characteristic properties of parallel systems. All modern sequential electronic computers operate in accordance with the classical von Neumann scheme, whereas parallel systems are distinguished by a significant variety of architectures, and the maximum effect from parallelism can be obtained only by making full use of all the features of the hardware (the consequence is that transferring parallel algorithms and programs between different types of systems is difficult and sometimes impossible).

    Counterargument. Despite the actual variety of architectures of parallel systems, certain “established” ways of ensuring parallelism do exist. The invariance of the created software is ensured by using standard software support for parallel computing (the PVM, MPI, DVM libraries, etc.). PVM and MPI are used in Cray T3 supercomputers.

    Over decades of operation of serial electronic computers, a huge amount of software focused on serial computers has been accumulated; reworking it for parallel computers is practically impossible.

    Counterargument. If these programs provide a solution to the assigned tasks, then reworking them is not necessary at all. However, if sequential programs do not allow problems to be solved in an acceptable time, or if there is a need to solve new problems, then new software must be developed, and it can be designed for parallel execution from the outset.

    There is a limitation on the speedup of calculations with parallel implementation of the algorithm compared to sequential implementation.

    Counterargument. In fact, algorithms with no (certain) fraction of sequential computations essentially do not exist. However, this is an inherent property of the algorithm and has nothing to do with the possibility of solving the problem in parallel in general. It is necessary to learn to apply new algorithms that are better suited to solving problems on parallel systems.

    Thus, for every critical consideration against the use of parallel computing technologies there is a more or less significant counterargument.

    1.2 Parallel data processing

    1.2.1 Fundamental possibility of parallel processing

    Almost all algorithms developed to date are sequential. For example, when evaluating the expression a + b × c, the multiplication must be performed first and only then the addition. If an electronic computer contains addition and multiplication units that can operate simultaneously, then in this case the addition unit will be idle, waiting for the multiplication unit to finish its work. Nevertheless, it can be proved that it is possible to build a machine that processes a given algorithm in parallel.

    It is possible to build m processors that, operating simultaneously, produce the desired result in a single clock cycle of the computer.

    Such "multiprocessor" machines could theoretically be built for each specific algorithm and seemingly "bypass" the sequential nature of the algorithms. However, not everything is so simple - there are infinitely many specific algorithms, so the abstract reasoning developed above is not so directly related to practical significance. Their development convinced us of the very possibility of parallelization, formed the basis of the concept of unlimited parallelism, and made it possible to consider from a general perspective the implementation of so-called computing environments - multiprocessor systems dynamically configured for a specific algorithm.

    1.2.2. Abstract parallel computing models

    A parallel computing model provides a high-level approach to characterizing and comparing the execution time of various programs while abstracting away from hardware and execution details. The first important parallel computing model was the Parallel Random Access Machine (PRAM), which provides an abstraction of a shared-memory machine (PRAM is an extension of the RAM, Random Access Machine, model); the BSP (Bulk Synchronous Parallel) model combines both shared- and distributed-memory abstractions. All processors are assumed to execute instructions synchronously; when the same instruction is executed everywhere, PRAM is an abstract SIMD machine (SIMD - Single Instruction stream / Multiple Data stream: a single stream of instructions along with a multiple data stream); however, the processors can also execute different instructions. The basic operations are reading from memory, writing to memory, and ordinary logical and arithmetic operations.

    The PRAM model is idealized in the sense that each processor can access any memory location at any time (write operations performed by one processor are visible to all other processors in the order in which they were performed, but writes performed by different processors may become visible in any order). For example, each processor in PRAM can read data from a memory cell or write data to the same cell. On real parallel machines this, of course, does not happen, since at the physical level the memory modules arbitrate access to the same memory cell. Moreover, memory access times vary on real machines due to the presence of caches and the possible hierarchical organization of memory modules.

    The basic PRAM model supports concurrent (in this context parallel) reads and writes. PRAM submodels are known that take into account rules that allow avoiding conflict situations when several processors simultaneously access shared memory.

    Brent's theorem makes it possible to model circuits of functional elements using parallel random access machines (PRAMs). The functional elements can be the four basic ones (performing the logical operations NOT, AND, OR, and XOR: negation, logical AND, logical OR, and exclusive OR, respectively), the more complex NAND and NOR elements (AND-NOT and OR-NOT), or elements of any complexity.

    In what follows, it is assumed that the delay (i.e., the response time: the time after which the intended signal values appear at the output of the element after the values at the inputs have been established) is the same for all functional elements.

    We consider a circuit of functional elements connected without forming cycles (we assume that the functional elements have any number of inputs, but exactly one output - an element with several outputs can be replaced by several elements with a single output). The number of inputs determines the input power of the element, and the number of inputs to which the output of the element is connected determines its output power. It is usually assumed that the input powers of all elements used are bounded above, while the output powers can be any. The size of a circuit refers to the number of elements in it; the largest number of elements on the paths from the inputs of the circuit to the output of the element is called the depth of this element (the depth of the circuit is equal to the largest of the depths of its constituent elements).

    Figure 1. Simulation of a size 15, depth 5 circuit with two processors using a parallel random access machine (PRAM machine)

    Figure 1 shows the result of simulating a circuit of size (total number of elements) n = 15 and depth (the largest number of elements on any path from an input of the circuit to an output) d = 5 with p = 2 processors (elements simulated at the same step are grouped into rectangular areas, and for each group the step at which its elements are simulated is indicated; simulation proceeds sequentially from top to bottom in order of increasing depth, p elements at a time at each depth level). According to Brent's theorem, simulating such a circuit will take no more than ⌈15/2⌉ + 1 = 9 steps.
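
    For reference, one commonly cited general form of Brent's bound (formulations in the literature differ slightly in their constant terms), applied to the circuit of Figure 1:

        \[
        T_p \le \left\lceil \frac{n}{p} \right\rceil + d,
        \qquad
        T_2 \le \left\lceil \frac{15}{2} \right\rceil + 5 = 13,
        \]

    so the 9 steps obtained for the particular level structure of Figure 1 fit comfortably within this bound.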

    1.2.3. Methods of parallel data processing, calculation error

    The following modes of execution of independent parts of the program are possible:

    Parallel execution – several data processing commands are executed at the same time; This mode of computing can be achieved not only by the presence of several processors, but also by using pipeline and vector processing devices.

    Distributed computing - this term is usually used to indicate a method of parallel data processing, which uses several processing devices that are sufficiently distant from each other and in which data transmission over communication lines leads to significant time delays.

    With this method of organizing calculations, data processing is effective only for parallel algorithms with low intensity of interprocessor data transfer flows; This is how, for example, multi-machine computing systems operate, formed by combining several individual electronic computers using communication channels of local or global information networks.

    Formally, this list can also include multitasking mode (time sharing mode), in which a single processor is used to execute processes; This mode is convenient for debugging parallel applications.

    There are two ways to process data in parallel: parallelism and pipelining.

    Parallelism presupposes the presence of p identical devices for processing data and an algorithm that allows each of them to perform an independent part of the computation; at the end of processing, the partial results are collected together to obtain the final result, and the process is thereby sped up by a factor of p. Not every algorithm can be successfully parallelized in this way (a natural condition for parallelization is the computation of independent parts of the output data using the same, or similar, procedures; iterativeness and recursiveness cause the biggest problems for parallelization).
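
    The following minimal sketch (an assumed example, not taken from the lecture text) shows this scheme in MPI: p identical processes each compute an independent partial result, and the partial results are then collected to form the final result.

        /* Each process sums its own contiguous block of terms; MPI_Reduce then
         * combines the partial sums on process 0. The series used here is an
         * arbitrary stand-in for "independent work".                          */
        #include <mpi.h>
        #include <stdio.h>

        #define N 1000000

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            long lo = (long)N * rank / size;          /* this process's block of work */
            long hi = (long)N * (rank + 1) / size;
            double partial = 0.0;
            for (long i = lo; i < hi; i++)
                partial += 1.0 / (i + 1);

            double total = 0.0;                       /* collect the partial results  */
            MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0)
                printf("total = %f\n", total);

            MPI_Finalize();
            return 0;
        }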

    The idea of pipeline processing is to isolate individual stages of performing a general operation; each stage, having completed its work, passes the result to the next stage while simultaneously accepting a new portion of input data. Each processing stage is performed by its own part of the data processing device (a pipeline stage), each performing a specific action (micro-operation); overall processing of a data item requires these parts to fire in sequence.

    Pipelined execution of instructions imitates the operation of an assembly line, on which a product sequentially passes a number of work stations, and at each of them a new operation is performed on the product. The acceleration effect is achieved through the simultaneous processing of several products at different workplaces.

    Acceleration of calculations is achieved by using all stages of the pipeline for stream processing of data (data is fed to the pipeline input and processed sequentially at all stages). Pipelines can be scalar or vector devices (the only difference is that in the latter case vector instructions can be used). For a pipeline of length l, processing n independent operations takes l + n − 1 cycles (each stage operates for one unit of time). Processing a single portion of input data on such a device takes time l × τ, and only for a stream of many portions do we obtain a computational speedup close to l.
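
    Written out explicitly (τ is the duration of one stage, as in Figure 2; the "sequential" reference time assumes the same n portions are processed one after another without any overlap, an assumption made here only for the sake of the comparison):

        \[
        T_{\text{pipe}} = (l + n - 1)\,\tau, \qquad T_{\text{seq}} = l\,n\,\tau,
        \qquad
        S = \frac{T_{\text{seq}}}{T_{\text{pipe}}} = \frac{l\,n}{l + n - 1} \xrightarrow[n \to \infty]{} l,
        \qquad
        E = \frac{n}{T_{\text{pipe}}} = \frac{n}{(l + n - 1)\,\tau} \xrightarrow[n \to \infty]{} \frac{1}{\tau}.
        \]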

    Figure 2 shows that the performance E of a pipeline device grows asymptotically with increasing length n of the data set at its input, tending to the theoretical maximum performance 1/τ.

    Figure 2. Pipeline device performance as a function of input data set length

    1.3. The concept of a parallel process and parallelization granules

    The most general scheme for performing sequential and parallel calculations is shown in Figure 3 (time points S and E are the beginning and end of the task, respectively).

    Figure 3. Diagrams of process execution during sequential computation - a), with close to ideal parallelization - b) and in the general case of parallelization - c)

    A process is the name given to a certain sequence of commands that, along with other processes, claims the use of the processor for its execution; commands (instructions) within a process are executed sequentially, and internal parallelism is not taken into account.

    In Figure 3, thin lines show the actions of creating processes and exchanging data between them, thick lines show the actual execution of processes (the x-axis is time). In the case of sequential calculations, a single process is created (Figure 3a) that performs the necessary actions; When executing an algorithm in parallel, several processes (parallel task branches) are required. In the ideal case, all processes are created simultaneously and terminated at the same time (Figure 3b); in the general case, the process diagram is much more complicated and represents a kind of graph (Figure 3c).

    The characteristic length of a sequentially executed group of instructions in each of the parallel processes is called the size of the granule (grain). In any case, it is advisable to strive for “coarse grain” (ideal – diagram 3b). Typically, the size of a grain (granule) is tens to hundreds of thousands of machine operations (which is orders of magnitude larger than the typical size of a Fortran or C/C++ operator). Depending on the size of the granules, they speak of fine-grained and coarse-grained parallelism.

    The size of the granule is also influenced by the convenience of programming - a certain logically complete fragment of the program is often designed in the form of a granule. It is advisable to strive for uniform loading of processors.

    Developing parallel programs is challenging due to the difficulty of identifying (usually hidden) parallelism in a program (that is, identifying parts of the algorithm that can execute independently of each other).

    Automating the detection of parallelism in algorithms is not easy and is unlikely to be fully solved in the near future. At the same time, over decades of operation of computer technology, such a number of sequential algorithms have been developed that there can be no question of manual parallelization.

    Even the simplest model of parallel computing reveals an important circumstance: the time spent on data exchange should be as small as possible relative to the execution time of the sequence of commands that forms the granule.

    For real problems (traditionally), the characteristic size of the parallelization grain is several orders of magnitude larger than the characteristic size of the operator of a traditional programming language (C/C++ or Fortran).

    1.4. Interaction of parallel processes, process synchronization

    The execution of program commands forms a computational process; when several programs execute on common or shared memory and exchange messages, it is customary to speak of a system of concurrently executing, interacting processes.

    In UNIX-like operating systems, the creation and destruction of processes is performed by means of system calls.

    Parallelism is often described in terms of macro-operators; in parallel languages, parallel branches are launched and then merged using FORK/JOIN-style operators.
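
    A minimal sketch (an assumed example, not from the lecture text) of the FORK/JOIN pattern using the standard UNIX system calls fork() and waitpid(): the parent process "forks" a parallel branch and later "joins" it.

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void)
        {
            pid_t pid = fork();            /* FORK: spawn a parallel branch       */
            if (pid < 0) {
                perror("fork");
                return EXIT_FAILURE;
            }
            if (pid == 0) {                /* child process: the parallel branch  */
                printf("branch: doing its part of the work\n");
                _exit(EXIT_SUCCESS);
            }
            printf("main: doing its own part of the work\n");
            waitpid(pid, NULL, 0);         /* JOIN: wait for the branch to finish */
            printf("main: branch has completed\n");
            return EXIT_SUCCESS;
        }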

    For multiprocessor computing systems, ensuring process synchronization is of particular importance. For example, the moment at which data exchange between processors begins is not agreed upon a priori and cannot be determined exactly, since it depends on many parameters of the functioning of a multiprocessor computing system that are difficult to estimate, while synchronization is simply necessary to arrange the “meeting” at which information is exchanged. In loosely coupled multiprocessor computing systems, one cannot hope for absolute synchronization of the machine clocks of individual processors at all; one can only speak of measuring time intervals in the time system of each processor.

    Synchronization is an effective way of preventing “deadlock”: a situation in which each of the interacting processes has received part of the resources it needs, but neither it nor any other process has enough resources to complete its processing and subsequently release the resources it holds.

    Parallel programming systems use high-level synchronization techniques. The MPI parallel programming technology uses a synchronization scheme for information exchange between processes.

    Synchronization can also be supported in hardware (for example, barrier synchronization in the Cray T3 supercomputer, with the help of which all processes wait for a certain point in the program, after which further work is possible).
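
    A minimal sketch (assumption: a software barrier via MPI, as an analogue of the hardware barrier mentioned above): no process continues past the barrier until every process of the communicator has reached it.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            printf("process %d: before the barrier\n", rank);
            MPI_Barrier(MPI_COMM_WORLD);   /* all processes must arrive here first */
            printf("process %d: after the barrier\n", rank);

            MPI_Finalize();
            return 0;
        }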

    1.5. Possible acceleration in parallel computing (Amdahl's law)

    It is of interest to estimate the magnitude of the possible increase in performance, taking into account the qualitative characteristics of the original sequential program itself.

    Figure 4. Scheme for derivation of Amdahl's law

    Amdahl's law (1967) relates the potential speedup from parallelization to the fraction of operations that are performed sequentially. Let f (0 < f < 1) be the fraction of sequential operations in the algorithm, and let t_s be the execution time of the entire algorithm on a single processor.

    When transferring the algorithm to a parallel machine, the calculation time will be distributed as follows:

    · f×t_s – the execution time of the part of the algorithm that cannot be parallelized,

    · (1−f)×t_s/n – the time spent executing the parallelized part of the algorithm.

    The time t_p required for the computation on a parallel machine with n processors is

    t_p = f×t_s + (1−f)×t_s/n .

    The speedup is S = t_s / t_p = 1 / (f + (1 − f)/n); thus, with a small fraction of sequential operations (f << 1), a speedup of the calculations of up to (but no more than) n times can be achieved.

    In the case f = 0.5, it is impossible to achieve S > 2 with any number of processors! Note that these restrictions are fundamental in nature (they cannot be circumvented for a given algorithm), while a practical a priori estimate of the fraction f of sequential operations is usually impossible.
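
    As a worked illustration of these bounds (a direct consequence of the formula above; the numerical values are chosen here purely for illustration):

        \[
        S(n) = \frac{t_s}{t_p} = \frac{1}{f + (1 - f)/n} \xrightarrow[n \to \infty]{} \frac{1}{f},
        \qquad
        f = 0.5 \;\Rightarrow\; S(n) < 2,
        \qquad
        f = 0.05,\ n = 100 \;\Rightarrow\; S \approx 16.8 .
        \]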

    Thus, the qualitative characteristics of the algorithm itself impose restrictions on the possible speedup from parallelization. For example, algorithms based on sequential formulas, characteristic of engineering calculations, parallelize poorly (the fraction f is significant), while algorithms that reduce to linear programming problems parallelize satisfactorily.

    It is not easy to estimate a priori the fraction of sequential operations f. However, one can try to formally use Amdahl's law to solve the inverse problem of determining f from experimental performance data; this makes it possible to quantitatively judge the achieved parallelization efficiency.
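
    A minimal sketch (an assumed example with hypothetical measurements) of this inverse problem: inverting Amdahl's formula S = 1/(f + (1 − f)/n) gives f = (n/S − 1)/(n − 1), so f can be estimated from a measured speedup S on n processors.

        #include <stdio.h>

        /* estimate the sequential fraction f from a measured speedup on n processors */
        static double estimate_f(double speedup, int n)
        {
            return ((double)n / speedup - 1.0) / (n - 1.0);
        }

        int main(void)
        {
            int    n[] = { 2, 4, 8, 16 };          /* hypothetical processor counts  */
            double S[] = { 1.9, 3.4, 5.8, 8.7 };   /* hypothetical measured speedups */

            for (int i = 0; i < 4; i++)
                printf("n = %2d  S = %4.1f  =>  f ~ %.3f\n",
                       n[i], S[i], estimate_f(S[i], n[i]));
            return 0;
        }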

    Figure 6. Performance of a computing cluster system using the matrix multiplication procedure (experiment and calculation using the Amdahl formula)

    Figure 6 shows the results of an experiment on the SCI-MAIN cluster of the Research Computing Center of Moscow State University for the problem of matrix multiplication using a striped scheme (matrices of dimension 10^3 × 10^3 of double-precision real numbers); the experimental data are best fitted (by the least-squares method) by Amdahl's formula with f = 0.051.

    Amdahl's law is convenient for qualitative analysis of the parallelization problem.

    2. Principles of constructing multiprocessor computing systems

    2.1. Architecture of multiprocessor computing systems

    The architecture of parallel computers has developed in a variety of directions practically from the very beginning of their creation and use. The most general considerations lead to two classes: computers with shared memory and computers with distributed memory. Shared-memory computers consist of several processors that have equal-priority access to a common memory with a single address space (Figure 7a).

    Figure 7. Parallel computers:

    a) with shared memory b) with distributed memory

    A typical example of such an architecture is SMP (Symmetric Multi Processors) class computers, which include several processors, but one memory, a set of input/output devices and an operating system. The advantage of computers with shared memory is the relative ease of programming parallel tasks; the disadvantage is insufficient scalability. Real SMP systems usually contain no more than 32 processors; NUMA technology is used to further increase the computing power of such systems.

    In computers with distributed memory (multicomputer systems), each computing node is a full-fledged computer and includes a processor, memory, input/output devices, operating system, etc. (Figure 7b). A typical example of such an architecture is MPP (Massively Parallel Processing) class computer systems, in which homogeneous computing nodes are combined using some communication medium. The advantage of computers with distributed memory is (almost unlimited) scalability; the disadvantage is the need to use specialized software (messaging libraries) to exchange information between computing nodes. For multiprocessor computing systems with shared and distributed memory, the terms tightly and loosely coupled machines are used, respectively.

    As has been shown, the performance of communication channels greatly affects the ability to effectively parallelize, and this is important for both considered architectures. The simplest communication system is a common bus (Figure 8a), which connects all processors and memory, however, even if the number of devices on the bus is more than several dozen, the bus performance drops catastrophically due to the mutual influence and competition of devices for monopoly ownership of the bus during data exchanges.

    Figure 8. Multiprocessor systems

    a) - with a common bus, b) - with a matrix switch

    c) - with cascading switches (Omega network)

    More sophisticated approaches are used to build more powerful systems. An effective scheme is matrix switching (Figure 8b), in which devices are connected to each other by bidirectional switches that allow or prohibit the transfer of information between the corresponding modules. An alternative is cascaded switches, for example according to the Omega network scheme (Figure 8c). In this scheme, each switch can connect either of its two inputs to either of its two outputs; to connect n processors with n memory blocks, n×log2(n)/2 switches are required. The disadvantage of schemes with cascaded switches is the delay introduced by the switches.

    For systems with distributed memory, almost all conceivable connection options are used (Figure 9). The quality parameter with respect to message transmission speed is the average length of the path connecting arbitrary processors; the physical path is meant here, since implementing a logical topology (in software) presents no difficulty.

    Figure 9. Options for processor communication topologies in multiprocessor computing systems

    The simplest, linear topology (Figure 9a) corresponds satisfactorily to many algorithms characterized by communication only between neighboring processes (one-dimensional problems of mathematical physics, and multidimensional problems reduced to one-dimensional ones); its disadvantage is the impossibility of transmitting messages if a break occurs anywhere along the line. Connecting the first node to the last halves the average path length and increases communication reliability (if a link is broken, messages can be transmitted in the opposite direction, although at a lower speed); this yields the “ring” topology (Figure 9b).

    The “star” topology (Figure 9c) best corresponds to the distribution of load between processes, characteristic of “client/server” systems (the master node “distributes” tasks and “collects” the results of calculations, while the slave nodes interact with each other minimally).

    The lattice topology (Figure 9d) was used as early as the beginning of the nineties in building the Intel Paragon supercomputer based on i860 processors; finding the minimum data transmission path between processors A and B for the “three-dimensional lattice” topology is illustrated in Figure 10. The “two-dimensional torus” topology (Figure 9e) extends the “two-dimensional lattice” with additional links that reduce the average path length (a “three-dimensional torus” extends the “three-dimensional lattice” in the same way) and is characteristic of the SCI network technology. A three-dimensional “clique” topology is also used (Figure 9e), characterized by every processor being connected to every other. Figure 9h shows the general view of a topology of complete connection of all processors with each other; this topology has the shortest average path length between processors, but it is practically impossible to implement in hardware for a significant number of processors because of the catastrophic growth in the number of links.

    The “hypercube” topology (Figure 9i) is characterized by a reduced average path length and proximity to the structures of many numerical calculation algorithms, which ensures high performance. An N-dimensional hypercube contains 2^N processors. A two-dimensional hypercube is a square, a three-dimensional hypercube forms a regular cube, and a four-dimensional hypercube is a cube within a cube. For the nCube family of supercomputers, a hypercube of maximum dimension 13 contains 8192 processors; in the nCube 3 system, the number of processors can reach 65536 (a 16-dimensional hypercube).

    The following indicators are often used as the main characteristics of the data network topology:

    Figure 10. Finding the minimum path for transmitting messages between processors in a three-dimensional lattice topology

    The diameter is defined as the maximum distance (usually the shortest path between processors) between two processors on the network; this value characterizes the maximum required time for data transfer between processors (transfer time is, to a first approximation, directly proportional to the path length).

    Connectivity is an indicator characterizing the presence of different data transfer routes between network processors; a specific type of indicator can be defined, for example, as the minimum number of arcs that must be removed to divide the data network into two disconnected areas.

    Bisection width is an indicator defined as the minimum number of arcs that must be removed to divide the data network into two disconnected areas of equal size.

    Cost is defined, for example, as the total number of data lines in a multiprocessor computing system.
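
    For orientation, the diameters of several of the topologies discussed above, for a network of N processors, follow directly from their structure (standard results, quoted here only for reference):

        \[
        D_{\text{complete graph}} = 1, \qquad
        D_{\text{ring}} = \lfloor N/2 \rfloor, \qquad
        D_{\text{2D torus}} = 2\lfloor \sqrt{N}/2 \rfloor \ (\text{for a } \sqrt{N}\times\sqrt{N}\ \text{torus}), \qquad
        D_{\text{hypercube}} = \log_2 N .
        \]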

    It seems natural to combine the advantages of systems with shared (relative ease of creating parallel programs) and distributed memory (high scalability); The solution to this issue was the creation of computers with NUMA (Non Uniform Memory Access) architecture.

    In this sense, classic SMP computers have a UMA (Uniform Memory Access) architecture. NUMA uses a mechanism (usually implemented at the hardware level, which is faster) that allows user programs to treat all the memory physically distributed among the processors as a single address space. Examples of NUMA computers are the Cm* system, built back in the seventies and consisting of a set of clusters connected by an intercluster bus, and the BBN Butterfly complex combining 256 processors (1981, BBN Advanced Computers).

    2.2. Distribution of calculations and data in multiprocessor computing systems with distributed memory

    If a multiprocessor computing system contains computing nodes with local RAM, then, in addition to distributing parts of the computation among the individual computing nodes, it is important to distribute the data rationally (for example, blocks of large matrices being processed) among the available computing nodes. The point is that the time spent on data exchange between the computing nodes that process the data and the computing nodes that store the data in their local RAM can slow down the computation by orders of magnitude.

    It is clear that placing a large pool of data on a single computing node is hardly advisable because of the inevitable and significant loss of time in organizing the transfer of individual blocks of this data to the processing computing nodes. On the other hand, a purely formal division of the data into a number of blocks equal to the number of computing nodes is fraught with the same problem.

    Rational distribution of data across the local RAM of a computing node should be carried out taking into account the frequency of access of each computing node to each block of data located on the corresponding computing nodes while striving to minimize the number of exchanges, which requires determining the fine information structure of the algorithm.

    It would seem that in the general case one could construct a certain cost function of the computation (for example, in terms of time) that takes into account both the resource costs of the computations themselves and the cost of exchanges for a given distribution of data across the computing nodes, and then minimize this function with respect to the parameter(s) of the data distribution; moreover, the distribution itself may change over time. In reality, constructing such a function is difficult because of the need to quantify the time parameters of the operations and to identify the significant optimization parameters. A candidate for the role of such a function could be, for example, the “network” form of Amdahl's law, an extension of the law described above.

    Standard packages for solving linear algebra problems use methods for distributing data across computing nodes, based on theoretical analysis and long-term practice.
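
    As an illustration only (one of the simplest possible schemes, not the method of any particular package), a one-dimensional block distribution divides the matrix rows into contiguous blocks of nearly equal size, one block per computing node:

        #include <stdio.h>

        /* rows [*lo, *hi) of an n-row matrix assigned to node `rank` out of `size` */
        static void block_range(int n, int rank, int size, int *lo, int *hi)
        {
            *lo = (int)((long)n * rank / size);
            *hi = (int)((long)n * (rank + 1) / size);
        }

        int main(void)
        {
            int n = 1000, size = 6;        /* hypothetical matrix size and node count */
            for (int rank = 0; rank < size; rank++) {
                int lo, hi;
                block_range(n, rank, size, &lo, &hi);
                printf("node %d: rows %d..%d (%d rows)\n", rank, lo, hi - 1, hi - lo);
            }
            return 0;
        }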

    2.3. Classification of parallel computing systems

    The classification of computing system architectures aims to identify both the main factors characterizing each architecture and the interrelations of parallel computing structures.

    SISD (Single Instruction stream / Single Data stream) – single command stream and single data stream; The SISD class includes classical sequential (von Neumann type) machines. In such machines there is only one stream of (sequentially processed) instructions, each of which initiates one scalar operation. Machines with conveyor processing technology also fall into this class.

    SIMD (Single Instruction stream / Multiple Data stream) – a single command stream together with multiple data streams. Here a single stream of commands is preserved, but it includes vector commands that perform one arithmetic operation on many data items at once. Vector operations can be performed by a matrix of processors (as in the ILLIAC IV machine) or by a pipeline method (Cray-1). In fact, Pentium 4 and Xeon microprocessors with the MMX, SSE, and SSE2 instruction sets are single-chip SIMD systems.

    Among SIMD systems from the CIS countries, the PS-2000 (1972–1975) should be mentioned: a highly parallel computing system for information processing with a performance of up to 200 million operations per second.

    MISD (Multiple Instruction stream/Single Data stream) – multiple command stream and single data stream. The architecture implies the presence of many processors processing the same data stream; it is believed that such machines do not exist.

    MIMD (Multiple Instruction stream / Multiple Data stream) – multiple streams of both commands and data. The MIMD class includes two types of machines: those controlled by the flow of commands and those controlled by the flow of data. Whereas computers of the first type execute commands sequentially according to their position in the program, the second type activates operators as they become ready.

    The class assumes the presence of several processors combined into a single complex, each working with its own flow of commands and data.

    A classic example is the Denelcor HEP (Heterogeneous Element Processor) system, which contains up to 16 processor modules connected through a multistage switch to 128 data memory modules; all processor modules can work independently of each other with their own command streams, and each processor module can support up to 50 user command streams.

    2.4. Multiprocessor computing systems with distributed memory

    Since the last decade of the 20th century, there has been a tendency for distributed-memory systems to monopolize supercomputer architectures, with readily available off-the-shelf devices increasingly used as the processors of the computing nodes. The main advantage of such systems is their enormous scalability (depending on the class of problems being solved and the budget, the user can order a system with anywhere from a few dozen to thousands of nodes); this led to a new name for the systems under consideration: massively parallel computers (computing systems of the MPP architecture, Massively Parallel Processing).

    The first supercomputer with massively parallel processing, the Connection Machine (CM-1), was equipped with 64,000 processors, each with its own memory. The CM-1 scanned 16,000 articles of breaking-news reports in 1/20 of a second and designed a processor integrated circuit with 4,000 transistors in three minutes. Representative examples of MPP systems are the supercomputers of the Cray T3 series.

    The Cray T3E (1350) is a distributed-memory multiprocessor computing system built from RISC processors and produced in 2000. The topology of the communication network is a three-dimensional torus. The operating system is UNICOS/mk (a UNIX operating system with a microkernel), with translators for the FORTRAN, HPF, and C/C++ languages. The clock frequency is 675 MHz, and the number of processors ranges from 40 to 2176. The maximum amount of RAM per node is 512 MB, and the maximum performance is 2938 Gflop/s. Unlike its predecessor, the Cray T3D, this system does not require a front-end computer.

    The system uses the Alpha 21164A processor; however, if necessary, it can easily be replaced with another, for example a faster, processor. Each processing element contains a central processing unit, a memory module, and a communication unit for communication with other processors. The capacity of the communication channel between processors is 325 MB/s.

    The MPI, PVM, and HPF programming models are supported, as well as Cray's own shmem communication library. The performance obtained when solving systems of linear algebraic equations reaches 1.12 Tflop/s.
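
    A minimal message-passing sketch in C using only standard MPI calls (MPI_Send/MPI_Recv; the Cray-specific shmem library is not shown) illustrates the programming model mentioned above. It assumes an MPI implementation is installed and the program is launched with at least two processes.

    /* Minimal MPI message-passing sketch: rank 0 sends a value to rank 1. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            if (rank == 0) {
                value = 42;
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 1 received %d from rank 0\n", value);
            }
        }
        MPI_Finalize();
        return 0;
    }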

    An MPP system consists of homogeneous computing nodes, each including:

    One, and sometimes several, central processors (usually of the RISC – Reduced Instruction Set Computing – architecture, characterized by a fixed-length instruction word, a reduced instruction set, and execution of most operations in one processor cycle),

    Local memory (direct access to the memory of other nodes is not possible),

    Communication processor (or network adapter),

    Hard drives (optional) and/or other input/output devices.

    Special I/O nodes and control nodes are added to the system. Computing nodes are connected by some communication medium (high-speed network, switches, etc.).

    Maintaining multiprocessor systems is not an easy task: with hundreds or thousands of computing nodes, the daily failure of several of them is inevitable. The resource management system (a software and hardware complex) of a massively parallel computer must handle such situations without resorting to a catastrophic general restart that would lose the context of the tasks currently executing.

    2.4.1. Massively parallel supercomputers of the CRAY T3 series

    Founded in 1972, Cray Research Inc. (now Cray Inc.), famous for developing the Cray-1 vector supercomputer, released the Cray T3D/T3E models in 1993 – 1995; these fully implement the principle of massively parallel systems (MPP architecture). In the maximum configuration these computers combine 32 – 2048 DEC Alpha 21064/150 MHz, 21164/600 MHz, or 21164A/675 MHz processors (depending on the model); all preprocessing and program preparation (for example, compilation) is performed on a control machine (host computer).

    The developers of the Cray T3D/T3E series took the path of creating virtual shared memory. Each processor can directly access only its local memory, but all nodes share a single address space. When an attempt is made to access an address belonging to the local memory of another processor, a specialized hardware interrupt is generated and the operating system transfers the page from one node to another; owing to the extremely high speed of the communication system (the peak data transfer rate between two nodes reaches 480 MB/s), this approach is generally justified. However, a "ping-pong" effect that sharply reduces performance has been observed: if variables modified by several processors end up on the same page, that page continuously migrates between nodes. Computing nodes execute user programs in exclusive (single-task) mode.

    The specific configuration of a Cray T3 series computer is characterized by a triple of numbers, for example 24/16/576 (control nodes / operating system nodes / computing nodes). With the "three-dimensional torus" topology used, each node (regardless of its location) has six immediate neighbors. When choosing a route between two nodes A and B (whose 3D coordinates are shown in Figure 11), the network routers, starting from the initial vertex A, first perform offsets along the X coordinate until the X coordinate of the current node equals that of node B; then similar actions are performed along the Y coordinate and then along Z (physically, routing proceeds simultaneously along all three coordinates). Offsets can also be negative; if one or more links fail, they can be bypassed.
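
    The dimension-ordered ("X, then Y, then Z") routing described above can be sketched in a few lines of C. The 8x8x8 torus size and the concrete node coordinates below are illustrative assumptions, not the parameters of a real Cray machine.

    /* Sketch of dimension-ordered routing on a 3D torus (hypothetical 8x8x8
       lattice; the real router works in hardware). For each coordinate the
       shorter of the two directions around the ring is chosen, then hops are
       made along X, then Y, then Z. */
    #include <stdio.h>

    #define DIM 8  /* assumed torus size in every dimension */

    /* signed offset from a to b along one ring of length DIM (may be negative) */
    static int ring_offset(int a, int b)
    {
        int d = (b - a + DIM) % DIM;         /* forward distance 0..DIM-1 */
        return (d <= DIM / 2) ? d : d - DIM; /* go backwards if that is shorter */
    }

    int main(void)
    {
        int A[3] = {1, 2, 3};   /* source node      */
        int B[3] = {6, 2, 7};   /* destination node */
        const char axis[3] = {'X', 'Y', 'Z'};

        int cur[3] = {A[0], A[1], A[2]};
        for (int k = 0; k < 3; ++k) {                 /* X first, then Y, then Z */
            int off  = ring_offset(cur[k], B[k]);
            int step = (off > 0) - (off < 0);         /* +1, -1 or 0 */
            while (cur[k] != B[k]) {
                cur[k] = (cur[k] + step + DIM) % DIM;
                printf("hop along %c -> (%d,%d,%d)\n", axis[k], cur[0], cur[1], cur[2]);
            }
        }
        return 0;
    }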

    Another interesting feature of the Cray T3 architecture is its support for barrier synchronization – a hardware-organized wait by all processes for a certain point in the program, upon reaching which further work becomes possible. T3E series computers demonstrated a performance of 1.8 – 2.5 Tflops (on 2048 Alpha/600 MHz microprocessors).
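
    In application code, barrier synchronization is usually reached through a library call rather than directly through the hardware. The hedged sketch below uses the standard MPI_Barrier function as a software analogue of the hardware barrier described above.

    /* Software analogue of barrier synchronization using standard MPI. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        printf("rank %d: finished its part of the work\n", rank);

        MPI_Barrier(MPI_COMM_WORLD);   /* nobody passes this point until all arrive */

        if (rank == 0)
            printf("all processes reached the barrier, continuing\n");

        MPI_Finalize();
        return 0;
    }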

    A further development of the Cray T3 line of massively parallel computers is the Cray XT3 supercomputer. Each Cray XT3 compute node includes an AMD Opteron processor, local memory (1 – 8 GB) and a HyperTransport channel providing communication via the Cray SeaStar communication unit; peak performance (for the AMD Opteron 2.4 GHz) ranges from 2.6 teraflops (548 processors, 4.3 TB of RAM, 6x12x8 topology) up to 147 teraflops. The Cray XT3 runs under the UNICOS/lc operating system, which allows up to 30,000 computing nodes to be combined, and uses Fortran 77/90/95 and C/C++ compilers, the MPI communication library (Message Passing Interface with support for the MPI 2.0 standard), ShMem (a library developed by Cray Research Inc. for working with shared memory), and standard libraries for numerical calculations.

    Despite achieving such high results in the field of MPP systems, Cray Inc. also produces vector-pipeline computers, and these models can be combined into an MPP system. The performance of each processor of the Cray SV1 computer reaches 4 Gflops (with a total peak system performance of 32 Gflops), and the total number of processors can reach 1000.

    Figure 11. Communication lattice "three-dimensional torus" of the Cray T3E computer

    2.4.2. BEOWULF class cluster systems

    The prohibitive cost of industrial massively parallel computers was a constant frustration for specialists who wanted to use computing systems of comparable power in their research but could not afford industrial supercomputers. Searches in this direction led to the development of computing clusters (not to be confused with database or web-server clusters); the technological basis for cluster computing was the widely available and relatively inexpensive microprocessors and communication (network) technologies that appeared on the mass market in the 1990s.

    A computing cluster is a collection of computing nodes (from tens to tens of thousands) connected by a high-speed network in order to solve a single computational problem. Each node of a computing cluster is in fact a complete computer (often a two- or four-processor/core SMP server) running its own operating system (in the vast majority of cases Linux); the interconnecting network is chosen based on the class of problems to be solved and the available budget, and remote access to the cluster via the Internet is almost always provided.

    The computing nodes and the control computer are usually connected by (at least) two independent networks: a control network (which serves to manage the computing nodes) and a (often higher-performance) communication network (for direct data exchange between processes running on the nodes). In addition, the control node has an Internet connection so that remote users can access cluster resources, and a file server stores user programs (Figure 12). The cluster is administered from the control computer (or via remote access); users may access cluster resources (in accordance with the rights assigned by the administrator) exclusively through the control computer.

    Windows clusters of significant power still remain exotic for well-known reasons (despite the Windows Compute Cluster Server, WCCS, class of solutions actively promoted by Microsoft).

    One of the first cluster projects was the BEOWULF project. It was founded in 1994 at the CESDIS (Center of Excellence in Space Data and Information Sciences) research center, created on the basis of NASA's GSFC (Goddard Space Flight Center), and started with the assembly at GSFC of a 16-node cluster (486DX4/100 MHz processors, 16 MB of memory, 3 network adapters on each node, and 3 parallel 10 Mbit Ethernet cables); the computing system was intended for work on the ESS (Earth and Space Sciences) project.

    Later, NASA departments assembled other BEOWULF-like clusters: for example, theHIVE (Highly-parallel Integrated Virtual Environment) of 64 dual-processor nodes (Pentium Pro/200 MHz, 4 GB of memory and 5 Fast Ethernet switches in each). It was within the framework of the Beowulf project that drivers implementing the Channel Bonding mode were developed.

    Figure 12. Enlarged diagram of a computing cluster

    "Beowulf" is a typical example of a multiprocessor MIMD (Multiple Instruction - Multiple Data) system, with several program branches running simultaneously, exchanging data at certain intervals. Many subsequent developments throughout the world are actually Beowulf clans.

    In 1998, at the Los Alamos National Laboratory, astrophysicist Michael Warren and members of the theoretical astrophysics group built the Avalon computing system, a Linux cluster based on DEC Alpha/533 MHz processors. Initially Avalon consisted of 68 processors and was later expanded to 140; each node had 256 MB of RAM, a 3.2 GB EIDE hard drive, and a Kingston network adapter.

    The nodes are connected using four Fast Ethernet switches and a central twelve-port Gigabit Ethernet switch from 3Com.

    Typical examples of massively parallel cluster computing systems are the MVS-1000M (communication network – Myrinet 2000, data exchange rate 120-170 MB/s, auxiliary networks – Fast and Gigabit Ethernet) and the MVS-15000BC.

    The requirement to use the computing resources (processor, RAM, and disk) of the individual processors in a cluster as efficiently as possible inevitably leads to reducing the "intelligence" of the computing nodes' operating system to the level of a monitor; on the other hand, distributed cluster operating systems are also offered – for example, Amoeba, Chorus, Mach, etc.

    Blade servers are produced specifically for assembling the hardware of computing clusters – narrow vertical boards that include a processor, RAM (usually 256 – 512 MB with an L2 cache of 128 – 256 KB), disk memory, and network support chips. These boards are installed in standard 3U "baskets" 19″ wide and 5.25″ high, up to 24 boards per basket (240 computing nodes per 180 cm rack). To reduce overall power consumption, Transmeta Crusoe TM 5x00 series processors with VLIW technology can be used, consuming only a few watts (versus 75 W for the P4 or 130 W for IA-64 architecture chips); in this case the total power consumption of 240 computing nodes does not exceed 1 kW.

    Conclusion

    Parallel programming is a whole area of related issues, including hardware support, analysis of the structure of algorithms to identify parallelism, and algorithmic languages for programming parallel tasks.

    Parallel computing technologies are currently rapidly developing in connection with the requirements of world science and technology.

    Quite often one has to deal with problems that, although of considerable value to society, cannot be solved using relatively slow office or home class computers. The only hope in this case rests on high-speed computers, which are commonly called supercomputers. Only machines of this class can cope with processing large amounts of information. This could be, for example, statistical data or the results of meteorological observations, financial information. Sometimes processing speed is critical. This includes weather forecasting and climate change modeling. The earlier a natural disaster is predicted, the greater the opportunity to prepare for it. An important task is the modeling of medicines, deciphering the human genome, exploration of oil and gas fields and other large-scale tasks, the solution of which is possible with the help of supercomputers and cluster systems.

    A pressing problem today is the clear lack of parallel computing hardware in educational and scientific institutions, which prevents the comprehensive development of the relevant technologies; the often-practiced purely theoretical study of the subject leads to negative rather than positive consequences. Only now are technologies emerging that make it possible to introduce the practice of parallel computing into most educational and industrial laboratories. Creating effective parallel programs requires a much more serious and in-depth analysis of the structure of algorithms than traditional sequential programming, and some approaches cannot be implemented without a serious change in the way programmers think. Along with theoretical analysis, constant practice in creating parallel programs is required to obtain practically significant results.


    By discipline "Parallelcalculations in optics and optoinformatics" in the 10th semester Forms...

  • Lectures on the discipline “Theoretical Foundations of Automated Control”

    Task

    ... ____________________________ “____”__________________200_ LecturesBydiscipline"Theoretical foundations... calculations: 1) completely centralized system - classic sequential algorithm 2) completely decentralized system - parallel ...

  • LECTURE NOTES on the discipline “NETWORK TECHNOLOGIES” (added version) for students of specialty 7

    Abstract

    Department of "Information systems in management" CONSPECT LECTURESBydiscipline"NETWORK TECHNOLOGIES" (extended version) for... a very important advantage is the ability to perform parallelcalculations. When in a computing system, the task...

  • Lecture notes on the discipline "neurocomputer systems" Shebakpol

    Abstract

    Abstract lecturesBydiscipline“Neurocomputer systems” Shebakpolsky M.F. Contents... about training, simplicity of the model parallelcalculations. There is no reason to believe ... computers have an inherent parallel nature calculations gets lost; every operation...

  • 1.2 Parallel data processing

    1.2.1 Fundamental possibility of parallel processing

    Almost all algorithms developed to date are sequential. For example, when evaluating the expression a + b × c, the multiplication must be performed first and only then the addition. If an electronic computer contains addition and multiplication units that can operate simultaneously, the addition unit will in this case sit idle waiting for the multiplication unit to finish its work. Nevertheless, it can be proved that it is possible to build a machine that will process a given algorithm in parallel.

    It is possible to build m processors that, when operated simultaneously, produce the desired result in one single clock cycle of the computer.

    Such "multiprocessor" machines could theoretically be built for each specific algorithm and seemingly "bypass" the sequential nature of the algorithms. However, not everything is so simple - there are infinitely many specific algorithms, so the abstract reasoning developed above is not so directly related to practical significance. Their development convinced us of the very possibility of parallelization, formed the basis of the concept of unlimited parallelism, and made it possible to consider from a general perspective the implementation of so-called computing environments - multiprocessor systems dynamically configured for a specific algorithm.

    1.2.2 Abstract parallel computing models

    A parallel computing model provides a high-level approach to characterizing and comparing the execution times of different programs while abstracting away hardware and execution details. The first important parallel computing model was the Parallel Random Access Machine (PRAM), which provides an abstraction of a shared-memory machine (PRAM is an extension of the RAM – Random Access Machine – model); the BSP (Bulk Synchronous Parallel) model combines both shared- and distributed-memory abstractions. In PRAM all processors are assumed to execute instructions synchronously; if they all execute the same instruction, PRAM is an abstract SIMD machine (SIMD – Single Instruction stream / Multiple Data stream), but the processors may also execute different instructions. The main operations are reading from memory, writing to memory, and ordinary logical and arithmetic operations.

    The PRAM model is idealized in the sense that each processor can access any memory location at any time (write operations performed by one processor are visible to all other processors in the order in which they were performed, but writes performed by different processors may become visible in any order). For example, each processor in PRAM can read data from a memory cell or write data to the same cell. On real parallel machines this, of course, does not happen, since memory modules serialize access to the same memory cell at the physical level. Moreover, memory access times vary on real machines because of caches and the possible hierarchical organization of memory modules.

    The basic PRAM model supports concurrent (in this context, parallel) reads and writes. PRAM submodels are known that impose rules allowing conflict situations to be avoided when several processors simultaneously access shared memory.
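
    A classical PRAM-style computation is summing n numbers in ceil(log2 n) parallel steps by pairwise doubling. The sketch below simulates the parallel steps sequentially in C; within one step every virtual processor reads only values produced in the previous step, so the accesses do not conflict (an EREW-like pattern). The array size 8 is an arbitrary illustrative choice.

    /* PRAM-style pairwise summation, parallel steps simulated sequentially. */
    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};

        for (int stride = 1; stride < N; stride *= 2) {       /* one PRAM step      */
            for (int i = 0; i + stride < N; i += 2 * stride)  /* all "processors"   */
                a[i] += a[i + stride];                        /* act simultaneously */
        }
        printf("sum = %.1f\n", a[0]);   /* 36.0 after log2(8) = 3 steps */
        return 0;
    }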

    Brent's theorem allows circuits of functional elements to be modeled by parallel random access machines (PRAMs). The functional elements may be the 4 basic ones (performing the logical operations NOT, AND, OR, XOR – negation, logical AND, logical OR, and exclusive OR, respectively), the more complex NAND and NOR (AND-NOT and OR-NOT), or elements of any complexity.

    In what follows, it is assumed that the delay (i.e., the response time – the time after which the intended signal values appear at the output of an element after the values at its inputs have been established) is the same for all functional elements.

    We consider a circuit of functional elements connected without forming cycles (we assume that a functional element may have any number of inputs but exactly one output; an element with several outputs can be replaced by several elements with a single output). The number of inputs determines the fan-in of an element, and the number of inputs to which the output of the element is connected determines its fan-out. It is usually assumed that the fan-ins of all elements used are bounded from above, while the fan-outs may be arbitrary. The size of a circuit is the number of elements in it; the largest number of elements on the paths from the inputs of the circuit to the output of an element is called the depth of that element (the depth of the circuit equals the largest of the depths of its constituent elements).

    Figure 1. Simulation of a size 15, depth 5 circuit with two processors using a parallel random access machine (PRAM machine)

    Figure 1 shows the result of modeling a circuit of size (total number of elements) n = 15 and depth (number of depth levels) d = 5 with p = 2 processors (elements simulated simultaneously are grouped into rectangular areas, and for each group the step at which its elements are modeled is indicated; modeling proceeds sequentially from top to bottom in order of increasing depth, at most p elements at a time within each depth level). According to Brent's theorem, modeling such a circuit takes no more than ceil(15/2 + 1) = 9 steps.
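
    The step count follows directly from the simulation rule "in order of increasing depth, at most p elements per step": it is the sum over all depth levels of ceil(n_level / p). The sketch below computes this sum; the per-level sizes {5, 4, 3, 2, 1} are only an assumed layout of a size-15, depth-5 circuit, not the exact circuit of Figure 1.

    /* Level-by-level simulation bound: sum of ceil(n_level / p) over all levels. */
    #include <stdio.h>

    int main(void)
    {
        int level_size[] = {5, 4, 3, 2, 1};  /* assumed elements per depth level */
        int levels = 5;
        int p = 2;                           /* number of PRAM processors        */

        int steps = 0;
        for (int l = 0; l < levels; ++l)
            steps += (level_size[l] + p - 1) / p;   /* ceil(n_level / p) */

        printf("simulation of the circuit takes %d steps\n", steps);  /* 9 */
        return 0;
    }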

    Throughout the history of the development of computer technology, attempts have been made to find some kind of general classification under which all possible directions of development of computer architectures would fall. None of these classifications could cover the entire variety of architectural solutions being developed and did not stand the test of time. Nevertheless, a number of terms have come into scientific circulation and are widely used that are useful to know not only for developers, but also for computer users.

    Any computing system (be it a supercomputer or a personal computer) achieves its highest performance through the use of high-speed elements and the parallel execution of a large number of operations. It is the possibility of parallel operation of various system devices (working with overlap) that is the basis for accelerating basic operations.

    Parallel computers are often divided, according to Flynn's classification, into SIMD (Single Instruction Multiple Data) and MIMD (Multiple Instruction Multiple Data) machines. Like any other, this classification is imperfect: there are machines that do not fit directly into it, and there are important features that it does not take into account. In particular, vector processors are often classified as SIMD machines, although their high performance depends on another form of parallelism – the pipelined organization of the machine. Multiprocessor vector systems, such as the Cray Y-MP, consist of several vector processors and can therefore be called MSIMD (Multiple SIMD) machines.

    Flynn's classification also does not distinguish other characteristics important for computational models, such as the granularity of parallel computations and the synchronization methods used.

    There are four main types of architecture for parallel processing systems:

    1) Pipeline and vector processing.

    The basis of pipeline processing is the division of an operation into several stages, with data passed from one stage to the next. Throughput increases because several operations are in progress simultaneously at different stages of the pipeline. Pipelining is effective only when the pipeline is close to fully loaded and the rate at which new operands arrive matches the pipeline's maximum throughput. If delays occur, fewer operations are executed in parallel and overall performance decreases. Vector operations provide an ideal opportunity to keep the computational pipeline fully loaded.



    When a vector instruction is executed, the same operation is applied to all elements of a vector (or, most often, to the corresponding elements of a pair of vectors). Some setup time may be needed to configure the pipeline for a particular operation, but operands can then flow into the pipeline at the maximum rate allowed by the memory system. There are no pauses either for fetching a new instruction or for determining a branch of the computation during a conditional jump. Thus, the main principle of computing on a vector machine is to perform some elementary operation, or combination of elementary operations, repeatedly over a block of data. Such operations in the original program correspond to small compact loops.
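
    The "small compact loop" that maps well onto vector or pipelined hardware looks like the following C sketch: the same elementary operation is applied to every element, and no iteration depends on the result of another (an illustrative SAXPY-like example, not code from any particular vector machine).

    /* Independent iterations: ideal for a vector instruction or a full pipeline. */
    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        static double a[N], b[N];
        double s = 2.5;

        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

        /* the same operation a[i] = a[i] + s * b[i] over the whole block of data */
        for (int i = 0; i < N; ++i)
            a[i] = a[i] + s * b[i];

        printf("a[10] = %.1f\n", a[10]);   /* 10 + 2.5 * 20 = 60.0 */
        return 0;
    }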

    2) SIMD type machines. SIMD machines consist of a large number of identical processing elements that have their own memory. All processing elements in such a machine execute the same program. Obviously, such a machine, composed of a large number of processors, can provide very high performance only on those tasks in which all processors can do the same job. The computation model for a SIMD machine is very similar to the computation model for a vector processor: a single operation is performed on a large block of data.

    Unlike the limited pipelined operation of a vector processor, a matrix processor (a synonym for most SIMD machines) can be considerably more flexible. The processing elements of such processors are universal programmable computers, so a problem solved in parallel can be quite complex and contain branches. The typical manifestation of this computational model in a source program is roughly the same as in the case of vector operations: loops over array elements in which the values produced in one iteration of the loop are not used in another iteration.

    The computing models on vector and matrix computers are so similar that these computers are often discussed as equivalent.

    3) MIMD type machines. The term "multiprocessor" covers most MIMD machines and (just as the term "matrix processor" is applied to SIMD machines) is often used as a synonym for MIMD machines. In a multiprocessor system, each processing element (PE) executes its program completely independently of the other processing elements. The processing elements, of course, must somehow communicate with each other, which necessitates a more detailed classification of MIMD machines. In shared-memory multiprocessors (tightly coupled multiprocessors) there is a data and instruction memory available to all PEs; the PEs communicate with the shared memory via a common bus or an interconnection network. In contrast, in loosely coupled multiprocessor systems (machines with local memory) all memory is divided among the processing elements, and each memory block is accessible only to the processor associated with it; an interconnection network links the processing elements to each other.

    The basic model of computation on a MIMD multiprocessor is a set of independent processes that periodically access shared data. There is a large number of variants of this model. At one end of the spectrum is the distributed-computing model, in which a program is divided into a fairly large number of parallel tasks consisting of many subroutines. At the other end is the dataflow model, in which each operation in a program can be treated as a separate process: such an operation waits for its input data (operands), which must be passed to it by other processes; once they are received, the operation is performed and the resulting value is passed on to the processes that need it. In coarse- and medium-grained dataflow models, processes contain large numbers of operations and are executed in a dataflow manner.

    4) Multiprocessor machines with SIMD processors.

    Many modern supercomputers are multiprocessor systems that use vector processors or SIMD processors as their processing elements. Such machines belong to the MSIMD class.

    Programming languages and the corresponding compilers for MSIMD machines typically provide language constructs that allow the programmer to describe "coarse-grained" parallelism. Within each task, the compiler automatically vectorizes suitable loops. MSIMD machines thus make it possible to exploit the better of the two decomposition principles: vector operations ("fine-grained" parallelism) for the parts of the program that suit them, and the flexible capabilities of the MIMD architecture for the remaining parts.

    Over the years of development of computing technology, multiprocessor systems have undergone a number of stages of development. Historically, SIMD technology was the first to be mastered. However, there is currently a steady interest in MIMD architectures. This interest is mainly determined by two factors:

    1. The MIMD architecture provides great flexibility: with adequate hardware and software support, MIMD can operate as a single-user system providing high-performance data processing for a single application task, as a multi-program machine running multiple tasks in parallel, or as some combination of these capabilities.
    2. MIMD architecture can take full advantage of modern microprocessor technology based on strict cost/performance considerations. In fact, virtually all modern multiprocessor systems are built on the same microprocessors found in personal computers, workstations, and small single-processor servers.

    One of the distinctive features of a multiprocessor computing system is the communication network through which the processors are connected to each other or to memory. The communication model is so important for a multiprocessor system that many performance characteristics and other estimates are expressed as the ratio of processing time to communication time for the tasks being solved. There are two main models of interprocessor communication: one is based on message passing, the other on the use of shared memory. In a shared-memory multiprocessor system, one processor writes to a particular memory location and another processor reads from it. To ensure data consistency and process synchronization, the exchange is often implemented on the principle of mutually exclusive access to shared memory, using a "mailbox" mechanism.
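
    The shared-memory exchange just described can be sketched with POSIX threads (an illustrative stand-in for two processors sharing memory): one thread writes a value into a shared cell, the other reads it, a mutex provides the mutually exclusive access, and a simple "ready" flag plays the role of a mailbox.

    /* Shared-memory exchange with mutual exclusion (POSIX threads assumed). */
    #include <stdio.h>
    #include <pthread.h>

    static int shared_cell = 0;
    static int ready = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *writer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        shared_cell = 123;          /* "one processor writes to a memory location" */
        ready = 1;                  /* raise the mailbox flag                      */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    static void *reader(void *arg)
    {
        (void)arg;
        int done = 0, value = 0;
        while (!done) {             /* poll the mailbox flag under the lock */
            pthread_mutex_lock(&lock);
            if (ready) { value = shared_cell; done = 1; }
            pthread_mutex_unlock(&lock);
        }
        printf("reader saw %d\n", value);
        return NULL;
    }

    int main(void)
    {
        pthread_t tw, tr;
        pthread_create(&tr, NULL, reader, NULL);
        pthread_create(&tw, NULL, writer, NULL);
        pthread_join(tw, NULL);
        pthread_join(tr, NULL);
        return 0;
    }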

    In local-memory architectures, direct memory sharing is not possible. Instead, processors access shared data by passing messages over an interconnection network. The effectiveness of the communication scheme depends on the communication protocols, the underlying communication networks, and the bandwidth of memory and of the communication channels.

    Often, and unjustifiably, the costs of communication in shared-memory and vector machines are not taken into account, since they are largely hidden from the programmer. However, communication overhead exists in these machines as well and is determined by contention for buses, memory, and processors. As more processors are added to a system, more processes compete for the same data and the same bus, leading to saturation. The shared-memory system model is very convenient for programming and is sometimes regarded as a high-level means of assessing the impact of communication on system operation, even if the underlying system is actually implemented with local memory and message passing.

    In circuit-switched and packet-switched networks, the possibility of network congestion must be considered as traffic demands grow. Here interprocessor communication ties up network resources: channels, processors, message buffers. The volume of transmitted information can be reduced through careful functional decomposition of the task and careful scheduling of the functions performed.

    Thus, existing MIMD machines fall into two main classes depending on the number of processors being combined, which determines both the method of organizing memory and the method of their interconnections.

    The first group includes machines with common (shared) main memory, combining up to several dozen (usually less than 32) processors. The relatively small number of processors in such machines makes it possible to have one centralized shared memory and combine processors and memory using a single bus. If processors have sufficient cache memory, the high-performance bus and shared memory can accommodate memory accesses from multiple processors. Because there is a single memory with the same access time, these machines are sometimes called UMA (Uniform Memory Access). This method of organization with relatively small shared memory is currently the most popular. The structure of such a system is shown in Fig. 10.1.

    Fig. 10.1. Typical architecture of a multiprocessor system with shared memory.

    The second group of machines consists of large-scale distributed-memory systems. In order to support a large number of processors, main memory must be distributed among them; otherwise the memory bandwidth may simply be insufficient to satisfy requests coming from a very large number of processors. Naturally, this approach also requires implementing communication between the processors. Fig. 10.2 shows the structure of such a system.

    As the number of processors grows, there is simply no way around the need to implement a distributed memory model with a high-speed network to communicate between processors. With the rapid growth in processor performance and the associated increased demands for increased memory bandwidth, the scale of systems (i.e., the number of processors in a system) that require distributed memory is decreasing, as well as the number of processors that can be supported on one shared bus and shared memory.

    Distributing memory among individual system nodes has two main advantages. First, it is a cost-effective way to increase memory bandwidth because most accesses can be performed in parallel to local memory in each node. Secondly, it reduces the access latency (access time) to local memory. These two advantages further reduce the number of processors for which a distributed memory architecture makes sense.

    Typically, I/O devices, as well as memory, are distributed across nodes, and in reality nodes may consist of a small number (2-8) of processors interconnected in some other way. While this clustering of multiple processors with memory and network interface can be quite useful in terms of cost efficiency, it is not very significant for understanding how such a machine works, so we will stick to systems with one processor per node for now. The main architectural difference to note in distributed memory machines is how communication is handled and what the logical memory model is.

    Fig. 10.2. Typical distributed memory machine architecture.

    Ministry of Education and Science of the Russian Federation

    FSBEI HPE "Bryansk State Engineering and Technological

    academy"

    Department of Information Technologies

    Sequential and parallel processing of information

    Calculation and graphic work No. 1

    by discipline

    "Information Processing Technologies"

    Option No. 16

    RGR-02068025.230400.084

    Bryansk 2015

    Introduction

    Parallel information processing

    Shared Memory Systems

    Parallel SQL Processing

    Sequential information processing

    Simple Batch Systems

    References

    Introduction

    This computational and graphical study examines sequential and parallel information processing. Examples are given for each of them.

    Sequential information processing is the sequential passage of information from input to output through a series of transformations (stages), so that in each period of time (specific to a given block), the transformation is carried out in only one functional block, and information comes to it only from the previous block.

    Parallel information processing is a model of information processing, according to which information undergoes a series of transformations in certain functional blocks, so that at any given time it is processed simultaneously (in parallel) in several blocks.

    Parallel information processing

    Parallel data processing, embodying the idea of simultaneous execution of several actions, has two varieties: pipelining and parallelism proper.

    Parallel Processing. If a certain device performs one operation per unit of time, then it will perform a thousand operations in a thousand units. If we assume that there are five identical independent devices capable of operating simultaneously, then a system of five devices can perform the same thousand operations not in a thousand, but in two hundred units of time. Similarly, a system of N devices will perform the same work in 1000/N units of time. Similar analogies can be found in life: if one soldier digs up a garden in 10 hours, then a company of fifty soldiers with the same abilities, working simultaneously, will cope with the same work in 12 minutes - the principle of parallelism in action!

    Pipeline Processing. What is needed to add two real numbers represented in floating-point form? A whole series of small operations such as comparing exponents, aligning exponents, adding mantissas, normalization, etc. The processors of the first computers performed all these "micro-operations" for each pair of arguments, one after another, until they reached the final result, and only then proceeded to process the next pair of operands.

    The idea of pipeline processing is to break the execution of a general operation into separate stages, so that each stage, having completed its work, passes the result to the next stage while simultaneously receiving a new portion of input data. We gain an obvious increase in processing speed by overlapping operations that were previously spaced out in time. Suppose an operation consists of five micro-operations, each performed in one unit of time. With one indivisible serial device, 100 pairs of arguments are processed in 500 units. If each micro-operation is separated into its own stage of a pipeline device, then by the fifth unit of time the first five pairs of arguments will be at different stages of processing, and the whole set of one hundred pairs will be processed in 5 + 99 = 104 units of time; the speedup compared with a serial device is almost fivefold (equal to the number of pipeline stages).

    It would seem that pipeline processing could successfully be replaced by ordinary parallelism, by duplicating the main device as many times as there are pipeline stages. Indeed, the five devices of the previous example would process 100 pairs of arguments in 100 units of time, which is faster than the pipeline device! But by increasing the number of devices fivefold, we significantly increase both the amount of equipment and its cost. Imagine that a car plant decided to remove its assembly line while maintaining the rate of car production. If previously a thousand cars were on the assembly line at the same time, then, acting by analogy with the example above, it would be necessary to recruit a thousand teams, each of which can completely assemble a car from start to finish, performing hundreds of different kinds of operations, and do so in the same time that a car previously spent on the assembly line.
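
    The arithmetic of this comparison can be written out explicitly; the little C program below reproduces the numbers used above (5 stages, 100 pairs of arguments, 5 replicated devices).

    /* Serial device, one 5-stage pipeline, and five replicated serial devices. */
    #include <stdio.h>

    int main(void)
    {
        int stages  = 5;     /* micro-operations per addition          */
        int pairs   = 100;   /* pairs of arguments to process          */
        int devices = 5;     /* replicated serial devices (third case) */

        int t_serial     = stages * pairs;                              /* 500 units */
        int t_pipeline   = stages + (pairs - 1);                        /* 104 units */
        int t_replicated = ((pairs + devices - 1) / devices) * stages;  /* 100 units */

        printf("serial: %d, pipelined: %d, %d replicated devices: %d\n",
               t_serial, t_pipeline, devices, t_replicated);
        return 0;
    }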

    Today, parallelism in computer architecture will surprise few people. All modern microprocessors use some form of parallel processing. The Pentium 4 core can simultaneously contain up to 126 micro-operations at different stages of execution. At the same time, these ideas themselves appeared a very long time ago. Initially, they were implemented in the most advanced, and therefore single, computers of their time. Then, after proper development of the technology and cheaper production, they went down to middle-class computers, and finally today all this is fully embodied in workstations and personal computers.

    The performance of many applications running on single-processor computer systems can be significantly improved by using parallel processing tools. The following introduces the basic concepts of parallel processing and multiprocessor computer architecture.

    When multiple applications request that their jobs be processed on a single-processor computer, its single processor has to do all the work. The purpose of parallel processing is usually to improve application performance. When an application issues a job request to a multiprocessor computer, the computer breaks the job into logical subtasks and then processes them using several processors in parallel, reducing the time needed to complete the job. The number of subtasks resulting from splitting one large task is called the degree of parallelism. The reduction in processing time required to complete a task is directly proportional to the degree of parallelism. Designers try to increase the performance of systems with parallel processing so as to ensure the maximum performance of each processor in the system.