• Sequential and parallel processing of information. Types of parallel information processing


    Lecture 1

    Parallel data processing

    Plan

    1. Tiered-parallel form of the algorithm.

    2. Automatic concurrency detection.

    3. Degree and levels of parallelism.

    4. Types of parallelism.

Parallelism is the ability to perform several arithmetic, logical or service operations simultaneously. The operations can be both coarse-grained (large-block) and fine-grained (small-block).

The performance of parallel computers depends on many factors and, to a large extent, on the architecture and structure of the system (draw the structure of a parallel system and explain):

on the degree and level of parallelism in the system;

on the organization of data transfer between processors working in parallel;

on the switching system;

on the interaction between processors and memory;

on the relationship between the hardware and software implementation of macro-operations.

    Parallel processing can be based on various principles:

    Spatial parallelism;

    Temporal parallelism:

    1. Pipelining.
    2. Vectorization.
    3. Matrix.
    4. Systolic.
    5. Organization of the data flow processing structure.
    6. Organization of the system based on the hypercube structure.
    7. Dynamic restructuring of the structure of the computing system.

The description of any algorithm is hierarchical and based on the nesting property. When programming, the following nesting levels are distinguished: jobs, tasks, subtasks (processes), macro-operations, operations. Nesting determines the depth of parallelization and is one of the important properties of algorithms in the analysis of parallel computing models.

    1. Tiered-parallel form of the algorithm

The most general form of representing an algorithm is its information-control graph, which reflects the data dependences between the operators of the algorithm as well as the unconditional and conditional transitions in the program. Such a graph implicitly contains all types of parallelism for the chosen method of solving the problem. A more specific form of representing the parallelism of a task is the apparatus of the tiered-parallel form (TPF).

An algorithm in tiered-parallel form is represented as a set of tiers, with the zero tier containing the operators (branches) that are independent of one another.

Transitions can be marked on the graph; they denote the transfer of the result of a primitive operation from one tier to an operation of the next tier. Tiers are separated by transitions. There may be "empty" transitions and "empty" primitive operations. An empty operation corresponds to keeping the result obtained at the previous tier. In a sequential chain of operations, an empty operation can be placed in any tier.

When constructing the TPF, one relies on a basic set of primitive operations (BNO). The tiered-parallel form is characterized by the following parameters (a small sketch in code follows this list):

1. Length of the graph (number of tiers) L.

2. Width of the i-th tier b_i.

3. Width of the tiered-parallel graph B = max(b_i).

4. Average width of the TPF graph B_avg.

5. Fill factor of the i-th tier k_i.

6. Scatter coefficient of operations in the graph Q_i^j, j ∈ BNO, defined in terms of the number of operations of the j-th type in the i-th tier.

7. The minimum number of computing units (from the BNO) required to implement the algorithm represented by this graph in the TPF.

8. The minimum time to execute the algorithm (the sum of the response times of the computing units with the maximum amount of computation in each tier) T_min.

9. Connectivity of the algorithm (the number of intermediate results that must be stored during the execution of the algorithm) C.
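To make these definitions concrete, here is a minimal Python sketch (the toy dependence graph and operation names are invented for illustration, not taken from the lecture) that levels a data-dependence graph into tiers and then reads off the length L, the tier widths b_i and the width B of the resulting tiered-parallel form:

```python
def tiered_parallel_form(deps):
    """Assign each operation to the earliest tier in which all of its inputs are ready."""
    placed = {}          # operation -> tier number
    remaining = dict(deps)
    tiers = []
    while remaining:
        ready = [op for op, d in remaining.items() if d <= set(placed)]
        if not ready:
            raise ValueError("cycle in the dependence graph")
        for op in ready:
            placed[op] = len(tiers)
            del remaining[op]
        tiers.append(ready)
    return tiers

# Hypothetical data-dependence graph: op -> set of ops whose results it needs.
deps = {
    "t1": set(),            # t1 = a + b
    "t2": set(),            # t2 = c + d
    "t3": {"t1", "t2"},     # t3 = t1 * t2
    "t4": {"t2"},           # t4 = t2 - e
    "t5": {"t3", "t4"},     # t5 = t3 / t4
}

tiers = tiered_parallel_form(deps)
L = len(tiers)                      # length of the graph (number of tiers)
widths = [len(t) for t in tiers]    # b_i, width of each tier
B = max(widths)                     # width of the tiered-parallel graph
print(tiers, L, widths, B)
```

Operations placed in the same tier have no dependences on one another and could, in principle, be executed in parallel.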

    2. Automatic concurrency detection

    There are two possible ways to construct a parallel algorithm: directly from the problem statement or by transforming a sequential algorithm.

Methods for constructing a parallel algorithm from a sequential one are based on identifying typical, frequently occurring constructions in the sequential algorithm, which are replaced by parallel ones according to certain rules. (This makes it possible, to a certain extent, to recover the degree of parallelism lost by the algorithm when it was programmed in a sequential language.)

    The nature of changes in the degree of parallelism during the preparation of a machine program is shown in Fig. 2.2.

Fig. 2.2. Change of potential parallelism during program development (potential parallelism versus stage: solution method, source program, machine program):

1 - parallel programming system;

2 - sequential programming language and vectorizing compiler

Despite the lower level of parallelism achieved when a parallel algorithm is constructed by conversion from a sequential one, this method is widely used, since it makes it possible to reuse expensive application programs developed and debugged for sequential computers.

In a sequential program one distinguishes explicit and hidden parallel processing.

When analyzing a program, a data-flow graph is constructed. To detect explicit parallelism of processes, the sets of input (read) variables R and output (written) variables W of each process are analyzed.

Explicit parallel processing can be detected between processes i and j (i ≠ j) that satisfy the following conditions:

the input data of one process must not be modified (written) by the other process;

no two processes may modify shared variables;

formally:

a) R_i ∩ W_j = ∅;

b) W_i ∩ R_j = ∅;

c) W_i ∩ W_j = ∅.
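A minimal Python sketch of this check, assuming the read and write sets of each process have already been extracted from the data-flow graph (the variable names are purely illustrative):

```python
def explicit_parallel(R_i, W_i, R_j, W_j):
    """Bernstein-style check: processes i and j may run in parallel."""
    return (R_i.isdisjoint(W_j) and   # a) R_i ∩ W_j = ∅
            W_i.isdisjoint(R_j) and   # b) W_i ∩ R_j = ∅
            W_i.isdisjoint(W_j))      # c) W_i ∩ W_j = ∅

# Hypothetical read/write sets of two program fragments.
R1, W1 = {"a", "b"}, {"x"}
R2, W2 = {"c", "d"}, {"y"}
print(explicit_parallel(R1, W1, R2, W2))  # True: the fragments are independent
```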

Hidden parallel processing requires some transformation procedure applied to the sequential program so that it can be executed in parallel. The transformation can be as follows:

a) reducing the height of the trees of arithmetic expressions (Fig. 2.3; see the sketch after this list). For an arithmetic expression with n variables or constants, reducing the tree height gives a speedup of order O(n / log2 n) when O(n) processors are used;

    b) transformation of linear recurrence relations;

Fig. 2.3. Reducing the tree height: ((a + b) + c) + d is replaced by (a + b) + (c + d)

    c) replacement of operators;

d) converting blocks of conditional branches and loops to canonical form;

e) loop distribution.
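As promised under item a), here is a small Python sketch of tree-height reduction for summation: the left-associative chain has depth n - 1, while pairwise (balanced) summation has depth of about log2 n, and the additions within each level are independent of one another (the function names are illustrative):

```python
from functools import reduce

def left_assoc_sum(xs):
    """((x0 + x1) + x2) + ... : a chain of depth n - 1, no parallelism."""
    return reduce(lambda acc, x: acc + x, xs)

def balanced_sum(xs):
    """Pairwise (tree) summation: depth about log2(n); the additions on each
    level are independent and could be executed in parallel."""
    while len(xs) > 1:
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)] + \
             ([xs[-1]] if len(xs) % 2 else [])
    return xs[0]

vals = [1.0, 2.0, 3.0, 4.0]
print(left_assoc_sum(vals), balanced_sum(vals))  # same value, different tree height
```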

Parallel architectures achieve high performance if the parallelization takes into account the features of the architecture of the computer on which the algorithm is to be executed.

When parallelizing a program, the following are taken into account: 1) the layout of data in memory; 2) memory addressing (indexing); 3) the choice of data route (the way processors and storage are connected).

As an example of taking the memory layout into account, consider diagonally addressed memory. To allow parallel processing of matrices, the elements of their rows and columns must be distributed among the processors' storage devices in such a way that they can be read and processed simultaneously. For this, the matrix is stored with a shift (Fig. 2.4).

Fig. 2.4. Storage of matrices with a shift
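A small Python sketch of the idea of skewed storage, assuming n memory banks, an n×n matrix and one common skewing rule, bank(i, j) = (i + j) mod n, which is not necessarily the exact scheme of Fig. 2.4: every row and every column then touches each bank exactly once, so its elements can be fetched in a single conflict-free parallel access.

```python
n = 4  # number of memory banks = matrix order (assumption for this sketch)

def bank(i, j):
    """Bank holding element A[i][j] under a simple diagonal (skewed) layout."""
    return (i + j) % n

# Every row and every column touches each bank exactly once -> conflict-free access.
for i in range(n):
    assert len({bank(i, j) for j in range(n)}) == n   # row i
    assert len({bank(j, i) for j in range(n)}) == n   # column i
print("rows and columns are conflict-free across", n, "banks")
```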

    Any algorithm contains sequential (scalar) sections. It has been proven that the length of these scalar sections is the determining factor when implementing the algorithm on a parallel computer.
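This observation is essentially Amdahl's law: if a fraction f of the work is strictly sequential (scalar), the speedup on p processors cannot exceed S(p) = 1 / (f + (1 - f) / p). A quick numerical illustration:

```python
def amdahl_speedup(f, p):
    """Upper bound on speedup when a fraction f of the work is strictly sequential."""
    return 1.0 / (f + (1.0 - f) / p)

for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(0.05, p), 1))
# Even with only 5% scalar code the speedup saturates near 1/f = 20.
```

Even a small scalar fraction caps the achievable speedup near 1/f, which is why the length of the scalar sections is the determining factor.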

    3. Degree and levels of parallelism

The degree of parallelism (D) is the order of the number of devices working in parallel in the system when the task's algorithm is implemented, assuming that the number of processors (processing devices) is unlimited. (There is another definition: the degree of parallelism is the number of processors of a multiprocessor system participating in parallel in the execution of the program at each moment of time t.)

    1) Low degree: from 2 to 10 processors.

    2) Medium degree: from 10 to 100 processors.

3) High degree: from 100 to 10^4 processors.

4) Ultra-high degree: from 10^4 to 10^6 processors.

Fig. 2.5. Parallelism profile

The graphical representation of the parameter D(t) as a function of time is called the parallelism profile of the program. Changes in the level of processor utilization during the observation period depend on many factors (the algorithm, the available resources, the degree of optimization provided by the compiler, etc.). Fig. 2.5 shows a typical parallelism profile.
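As a rough illustration, a parallelism profile D(t) and the average parallelism over the observation period can be computed from a trace of operation start and finish times; the trace below is made up purely for the example:

```python
# Each entry: (start_time, finish_time) of one operation, in arbitrary time units.
trace = [(0, 4), (0, 2), (1, 3), (2, 6), (4, 6)]  # hypothetical trace

T = max(end for _, end in trace)
profile = [sum(1 for s, e in trace if s <= t < e) for t in range(T)]  # D(t)
avg_parallelism = sum(profile) / T

print("D(t) =", profile)              # [2, 3, 3, 2, 2, 2] for this trace
print("average parallelism =", avg_parallelism)
```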

In application programs there is a wide range of potential parallelism. In computationally intensive programs, from 500 to 3500 arithmetic operations can be executed in parallel in each cycle if a suitable computing environment exists for this. However, even a properly designed superscalar processor can sustain only 2 to 5.8 instructions per cycle. This drop is primarily due to communication and system overhead.

    The degree of parallelism significantly depends on: the architecture of the computer, especially the switching system, the organization of interaction between parallel processors, and methods of data exchange between processors and memory. The level of parallelism has a stronger impact on computing performance than the degree of parallelism.

Algorithmic and circuit (hardware) levels of parallelism are considered.

The following algorithmic levels of parallelism are distinguished:

    1. Task level:

    a) between tasks;

    b) between task phases.

    2. Software level:

a) between parts of the program (parts of one task are executed on many computers);

    b) within cycles.

(This applies if the individual iterations of a loop do not depend on each other, for example: For I := 1 to N do A(I) := B(I) + C(I); see the sketch after this list.)

    3. Command level:

    a) between phases of command execution.

    4. Arithmetic and bit level:

    a) between elements of a vector operation;

b) inside the logic circuits of the ALU.
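Returning to the loop example from the software level above (For I := 1 to N do A(I) := B(I) + C(I)): since each iteration touches only its own elements, the iterations can be handed to parallel workers. A hedged Python equivalent using a thread pool (the array names follow the lecture's example; the pool and data sizes are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

N = 8
B = list(range(N))
C = list(range(N, 2 * N))
A = [0] * N

def body(i):
    # Each iteration touches only its own A[i], B[i], C[i] -> no dependences.
    A[i] = B[i] + C[i]

with ThreadPoolExecutor() as pool:
    list(pool.map(body, range(N)))   # iterations may run in any order / in parallel

print(A)  # [8, 10, 12, 14, 16, 18, 20, 22]
```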

    Each level is characterized by certain properties, based on which special structures of computing tools have been developed. The command level is implemented in any modern computers, including personal computers.

The circuit level of parallelism is the hardware level at which parallelization of data processing or the organization of parallel computations is carried out.

    Parallel processing can be implemented at the following circuit levels:

1. The level of logic gates and memory elements. This is the lowest, transistor level. Here parallel logic circuits are built from logic gates (for example, a parallel adder).

2. The level of logic circuits and simple automata with memory. Here a parallel elementary automaton (EA) is built.

3. The level of registers and memory integrated circuits. From elementary automata, parallel circuits of microprocessors (MP) are obtained.

4. The level of elementary microprocessors. From microprocessors, parallel macroprocessors (MAP) are built to perform medium-grained operations.

5. The level of macroprocessors implementing large operations. Here parallelism of macro-operations is realized. From macroprocessors, parallel multiprocessor systems (MPS) are built.

6. The level of computers, processors and programs. The highest level of parallelism: from multiprocessor systems, parallel computing systems (CS) are obtained.

    4. Types of parallelism

    4.1. Natural parallelism and

    parallelism of multiple objects

    In the information graph, “vertical” independent subgraphs can be identified that do not mutually use any intermediate results obtained when implementing primitive operations of another subgraph. This type of parallelism is called natural parallelism of independent tasks.

A task has natural parallelism if in its original formulation it reduces to an operation on multidimensional vectors, multidimensional matrices or lattice functions (Fig. 2.6). Intermediate task results are not used here. Each task is programmed independently of the others. This type of parallelism does not require combining computers into complexes; however, increasing the number of independent tasks in a computing system increases the system's throughput. Example: processing of database transactions on multiprocessor servers.

Fig. 2.6. Information graph of a task characterized by natural parallelism (independent subgraphs: task 1 and task 2)

Fig. 2.7. Information graph of a task characterized by parallelism of multiple objects (the same chain of operations applied to each of several objects)

Parallelism of multiple objects is a special case of natural parallelism. Its essence is that the task is to process information about different but similar objects using the same or almost the same program (Fig. 2.7).

Here the so-called integral operations have a comparatively small weight. The initial operands of integral operations are vectors, functions or sets of objects, and the result is a number. For example, computing the dot product of n-dimensional vectors, (x, y) = x_1 y_1 + x_2 y_2 + ... + x_n y_n, includes two kinds of operations: the pairwise products of the vector components and then the "integral operation" (an operation on an n-dimensional vector) that sums all the components of the resulting vector.
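A sketch of this decomposition of the dot product: the n pairwise products are independent of one another and are natural candidates for parallel execution, while the final "integral operation" reduces the n-component vector of products to a single number:

```python
import math

x = [1.0, 2.0, 3.0, 4.0]
y = [4.0, 3.0, 2.0, 1.0]

# Stage 1: n independent pairwise products (candidates for parallel execution).
products = [xi * yi for xi, yi in zip(x, y)]

# Stage 2: the "integral operation": summing the n-component vector into one number.
dot = math.fsum(products)
print(dot)  # 20.0
```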

When parallelizing over multiple objects, situations arise more often than in the general case in which separate sections of the computation must be performed differently for different objects.

For example, when finding the values of some function restricted to a certain region, the values at all interior points of the region are computed by one formula, and at the boundary points by another.

    Parallelism of multiple objects is characterized by the following parameters:

1. Total program length L: the lengths of the operators over all branches are summed.

2. Average program length L_avg: computed from the rank of the task.

The basic quantitative characteristic of a parallelized task is the task rank r, the number of parameters over which parallel processing is carried out (for example, the number of vector components, or the number of points at which the function is specified).

3. Magnitude of the task's divergence D.

If the information-processing program is exactly the same for all r objects, then D = 1; the more the programs for different objects differ from one another, the larger D is.

    4.2. Parallelism of independent branches

The essence of the parallelism of independent branches is that in the program for solving a problem one can distinguish independent parts, called branches. If the computer has the appropriate hardware, the branches can be executed in parallel (Fig. 2.8).

Fig. 2.8. Information graph of a task characterized by parallelism of independent branches

A program branch Y is independent of a branch X if:

1. there are no functional connections between them, i.e. none of the input variables of branch Y is an output variable of branch X or of any branch that depends on X;
2. there are no connections between them through working memory fields;
3. they are executed according to different programs;
4. they are independent in control, i.e. the condition for executing branch Y does not depend on attributes produced during the execution of branch X or of a branch that depends on it.

    4.3. Parallelism of adjacent operations or

    local parallelism

Parallelism of adjacent operations arises when the input data for the current operations are obtained at earlier stages of the computation, and the design of the computing facilities allows the execution of several operations that are not linked to one another through output data and results to be overlapped.

    Local parallelism is characterized by the following parameters:

1. The connectivity indicator of adjacent operations: the probability that the result of an operation will be used in the operation that follows it. The less connected the operations are, the greater the depth of parallelism of adjacent operations. Typically this indicator takes values of 0.1 to 0.5.

2. The probability that, starting from a given operation, there is a chain of operations of length at least l.

3. The probability that, starting from any operation of the program, there is a chain of exactly l operations that can be executed simultaneously.

4. The depth of parallelism of adjacent operations L_PSO: the mathematical expectation of the length of a chain of operations that can be executed simultaneously.

    Local optimization of programs consists of looking through several instructions that must be executed in a row and changing the order of some of them, possibly changing the numbers of registers and memory cells to ensure the maximum possible parallelism of adjacent operations.

    In most cases, the indicator of connectivity of adjacent operations depends not so much on the problem as on the quality of local optimization.


    ">5. Problem model

The problem model is built for comparative analysis of the structures of parallel computers. Therefore, it should be quite general in nature and describe only the composition of the forms of parallelism and the types of connections.

As a rule, any problem model is built on the basis of an analysis of the modeled class of problems. Based on the results of the analysis, the algorithms are converted to parallel form. The algorithm under study can be represented as a program consisting of a sequence of sections of three types (Fig. 6.15):

    1. scalar sections (SC);
    2. sections with parallelism of independent branches (BT);
    3. vector sections (VC).

A task model is a set of parameters characterizing a parallel program.

When constructing the model of a task, the main goal is to determine the relative time of its execution when it is implemented by the algorithm under study.

    ">Fig. 6.15. The ratio of the total number of calculations per different sections of the algorithm in the problem model

    " xml:lang="en-US" lang="en-US">W">sk

    " xml:lang="en-US" lang="en-US">Ww

    " xml:lang="en-US" lang="en-US">W">vk

    " xml:lang="en-US" lang="en-US">m;vertical-align:sub">sk

    " xml:lang="en-US" lang="en-US">m;vertical-align:sub" xml:lang="en-US" lang="en-US">tu

    " xml:lang="en-US" lang="en-US">m;vertical-align:sub">vk

    " xml:lang="en-US" lang="en-US">A

    " xml:lang="en-US" lang="en-US">В

    " xml:lang="en-US" lang="en-US">C

    volume of calculations

    relative length
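As a rough illustration only (the lecture does not give the formula), the relative execution time of such a program could be estimated by assuming that the branch-parallel section scales with the number of branches p, the vector section with the vector length v, and the scalar section not at all; the relative lengths below are hypothetical:

```python
def relative_time(m_sk, m_vt, m_vk, p, v):
    """Crude estimate under the stated assumptions: the scalar part runs as is,
    the branch-parallel part speeds up by p, the vector part by v."""
    return m_sk + m_vt / p + m_vk / v

# Hypothetical relative lengths of the scalar, branch and vector sections (sum to 1).
print(relative_time(0.1, 0.4, 0.5, p=8, v=16))  # about 0.18 of the sequential time
```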

Sequential and parallel information processing

Calculation and graphic work No. 1 in the discipline "Information Processing Technologies", option No. 16 (RGR-02068025.230400.084). Ministry of Education and Science of the Russian Federation, FSBEI HPE "Bryansk State Engineering and Technological Academy", Department of Information Technologies, Bryansk, 2015.

Contents

Introduction

Parallel information processing

Shared Memory Systems

Parallel SQL Processing

Sequential information processing

Simple Batch Systems

References

    Introduction

    This computational and graphical study examines sequential and parallel processing of information. Examples are given for each of them.

    Sequential information processing is the sequential passage of information from input to output through a series of transformations (stages), so that in each period of time (specific to a given block), the transformation is carried out in only one functional block, and information comes to it only from the previous block.

    Parallel information processing is a model of information processing, according to which information undergoes a series of transformations in certain functional blocks, so that at any given time it is processed simultaneously (in parallel) in several blocks.

    Parallel information processing

Parallel data processing, embodying the idea of the simultaneous execution of several actions, has two varieties: pipelining and parallelism proper.

    Parallel Processing. If a certain device performs one operation per unit of time, then it will perform a thousand operations in a thousand units. If we assume that there are five identical independent devices capable of operating simultaneously, then a system of five devices can perform the same thousand operations not in a thousand, but in two hundred units of time. Similarly, a system of N devices will perform the same work in 1000/N units of time. Similar analogies can be found in life: if one soldier digs up a garden in 10 hours, then a company of fifty soldiers with the same abilities, working simultaneously, will cope with the same work in 12 minutes - the principle of parallelism in action!

Pipeline processing. What is needed to add two real numbers represented in floating-point form? A whole set of small operations such as comparing exponents, aligning exponents, adding mantissas, normalization, etc. The processors of the first computers performed all these "micro-operations" for each pair of arguments, one after another, until they reached the final result, and only then proceeded to the next pair of summands.

The idea of pipeline processing is to separate out the individual stages of performing the overall operation so that each stage, having completed its work, passes the result to the next stage while simultaneously receiving a new portion of input data. We get an obvious gain in processing speed by overlapping previously separated operations. Suppose an operation consists of five micro-operations, each performed in one unit of time. A single indivisible sequential device will then process 100 pairs of arguments in 500 units. If each micro-operation is separated into its own stage of a pipeline device, then by the fifth unit of time the first five pairs of arguments will be at different stages of processing in such a device, and the whole set of one hundred pairs will be processed in 5 + 99 = 104 units of time, a speedup over the sequential device of almost five times (equal to the number of pipeline stages).
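The arithmetic of this example, generalized: with s pipeline stages of one time unit each and n pairs of arguments, a sequential device needs s * n units, while a pipeline needs s + (n - 1) units (a sketch under exactly these assumptions):

```python
def sequential_time(stages, pairs):
    return stages * pairs

def pipeline_time(stages, pairs):
    # First result appears after `stages` units, then one new result every unit.
    return stages + (pairs - 1)

s, n = 5, 100
print(sequential_time(s, n), pipeline_time(s, n))  # 500 104
```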

It would seem that pipeline processing could be successfully replaced by ordinary parallelism: simply duplicate the main device as many times as there are pipeline stages. Indeed, the five devices of the previous example will process 100 pairs of arguments in 100 units of time, which is faster than the pipeline device! However, by increasing the number of devices fivefold, we significantly increase both the amount of equipment and its cost. Imagine that a car plant decided to remove the assembly line while maintaining the rate of car production. If previously a thousand cars were on the assembly line at the same time, then, acting by analogy with the previous example, it would be necessary to recruit a thousand teams, each able to assemble a car completely from start to finish, performing hundreds of different kinds of operations, and to do this in the same time that a car previously spent on the assembly line.

Today, parallelism in computer architecture surprises few people. All modern microprocessors use some form of parallel processing. In the Pentium 4 core, up to 126 micro-operations can be at different stages of execution simultaneously. At the same time, these ideas themselves appeared a very long time ago. Initially they were implemented in the most advanced, and therefore one-of-a-kind, computers of their time. Then, after proper development of the technology and cheaper production, they moved down to middle-class computers, and today all of this is fully embodied in workstations and personal computers.

    The performance of many applications running on single-processor computer systems can be significantly improved by using parallel processing tools. The following introduces the basic concepts of parallel processing and multiprocessor computer architecture.

When multiple applications request that their jobs be processed on a single-processor computer, its single processor has to do all the work. The purpose of parallel processing is usually to improve application performance. When an application issues a job request to a multiprocessor computer, the computer breaks the job into logical subtasks and then processes them using multiple processors in parallel, reducing the time it takes to complete the job. The number of subtasks resulting from splitting one large task is called the degree of parallelism. The reduction in the processing time needed to complete a task is, ideally, proportional to the degree of parallelism. Systems with parallel processing are designed so as to ensure the maximum possible performance of each processor in the system.
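A hedged sketch of this idea in Python: the job is split into subtasks (their number is the degree of parallelism), and a pool of workers processes them concurrently; the job itself and the chunking are made up for illustration:

```python
from concurrent.futures import ProcessPoolExecutor

def subtask(chunk):
    """One logical subtask: here, just summing the squares of its own slice."""
    return sum(i * i for i in chunk)

if __name__ == "__main__":
    data = range(1_000_000)
    degree = 4                                    # degree of parallelism
    size = len(data) // degree
    chunks = [data[k * size:(k + 1) * size] for k in range(degree)]

    with ProcessPoolExecutor(max_workers=degree) as pool:
        total = sum(pool.map(subtask, chunks))    # subtasks run in parallel
    print(total)
```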


4th year, 1st and 2nd streams, 7th semester

Lectures (34 hours), test

Department responsible for the course: ASVK

Program compiled by: Corresponding Member of the RAS, D.Sc. (Phys.-Math.) Vl. V. Voevodin

Lecturer: Corresponding Member of the RAS, D.Sc. (Phys.-Math.) Vl. V. Voevodin

    Annotation

The course discusses general questions of organizing parallel computations. The features of the architectures of modern parallel computing systems are considered, and the basic methods and paradigms of programming in parallel environments are studied.

For the 1st and 2nd streams, approaches to matching the architecture of parallel systems with the structure of algorithms are discussed, as well as questions of the theory of analysis of the structure of programs and algorithms, and models in parallel computing.

    Program

1. Large tasks and supercomputers. Parallel and pipeline data processing. Parallelism and pipelining in the architecture of modern high-performance computers. Scalar and vector commands. Scalar, pipeline and vector devices. The memory hierarchy in computers as a means of increasing the speed of program execution; locality of computations and locality of data use. Amdahl's law and its consequences; superlinear speedup.

2. Main classes of modern parallel computing systems. Shared-memory computers, examples, reasons for the decreased performance of real programs. SMP, NUMA and ccNUMA architectures. Switching of processors and memory modules: bus, matrix switch, omega network. Vector-pipeline computing systems, examples, reasons for decreased performance. Distributed-memory computers, examples, reasons for decreased performance. Topologies of communication between processors: star, lattice, three-dimensional torus, binary hypercube, and their properties. Computing clusters, examples, latency and throughput of various communication technologies. Architectures with machine-instruction-level parallelism, VLIW, superscalarity.

3. Parallel programming technologies. Traditional sequential languages and parallelizing compilers; associated problems. Special comments and compiler directives, extensions of existing languages. Special parallel programming languages. Programming using libraries and message-passing interfaces. Parallel subject libraries, specialized packages and high-level software systems. The parallel programming technologies MPI, OpenMP, Linda.

    4. Performance of parallel computing systems. The versatility and specialization of computers, the performance of special processors. Moore's Law. Performance assessment methods. Introduction of a single numeric parameter, Mflops, MIPS. Peak and real computer performance. Linpack test and its variants. Sets of complementary test programs, STREAM and NPB.

5. Graph models of programs. The control graph and the information graph of a program. The information and operational history of program execution. The algorithm graph as a compact parametric form of representing the information history. Information independence of operations and the possibility of their parallel execution. The length of the critical path of the algorithm graph as a measure of the degree of parallelism. Finite and massive parallelism, coordinate and skewed parallelism. Equivalent program transformations, elementary loop transformations.

    6. Heterogeneous distributed computing systems. Metacomputers and metacomputing, existing metacomputer projects. Distinctive properties of metacomputers. The concept of GRID, basic components and services, existing projects of GRID segments, the concept of a virtual organization.

    Literature

1. Voevodin V.V., Voevodin Vl.V. Parallel Computing. St. Petersburg: BHV-Petersburg, 2002. 608 p.

2. Korolev L.N. Architecture of Electronic Computer Processors. Moscow: Publishing House of the Faculty of Computational Mathematics and Cybernetics, Moscow State University, 2003.

3. Korneev V.V. Parallel Computing Systems. Moscow: Knowledge Publishing House, 1999. 320 p.

    4. Materials of the information and analytical center for parallel computing Parallel.ru.

    Further reading

1. Antonov A.S. Parallel Programming Using MPI Technology: A Textbook. Moscow: Moscow State University Publishing House, 2004. 71 p.

Ways to improve the performance of a computing system (CS) are embedded in its architecture. On the one hand, this is the set of processors, memory units, input/output devices and, of course, the ways of connecting them, i.e. the communication medium. On the other hand, these are the actual actions of the CS in solving a certain problem, i.e. operations on commands and data. This is essentially the entire basis of parallel processing. Parallel processing, embodying the idea of the simultaneous execution of several actions, has several varieties: superscalarity, pipelining, SIMD extensions, Hyper-Threading, multi-core. These types of parallel processing are mostly intuitive, so we give only brief explanations. If a certain device performs one operation per unit of time, it will perform a thousand operations in a thousand units. If we assume there are five identical independent devices capable of operating simultaneously, a system of five devices can perform the same thousand operations not in a thousand but in two hundred units of time. Similarly, a system of N devices will perform the same work in 1000/N units of time. Similar analogies can be found in life: if one soldier digs up a garden in 10 hours, then a company of fifty soldiers with the same abilities, working simultaneously, will cope with the same work in 12 minutes (parallel data processing), and even with songs (parallel command processing).

Pipeline processing. What is needed to add two real numbers represented in floating-point form? A whole set of small operations such as comparing exponents, aligning exponents, adding mantissas, normalization, etc. The processors of the first computers performed all these "micro-operations" for each pair of arguments, one after another, until they reached the final result, and only then proceeded to the next pair of summands. The idea of pipeline processing is to separate out the individual stages of performing the overall operation so that each stage, having completed its work, passes the result to the next stage while simultaneously receiving a new portion of input data. We get an obvious gain in processing speed by overlapping previously separated operations.

Superscalarity. As in the previous example, but the pipeline is built using several hardware/software implementations of the functional devices, for example two or three ALUs and three or four fetch units.

Hyper-Threading. A promising direction in the development of modern microprocessors, based on a multithreaded architecture. The main obstacle to increasing performance by increasing the number of functional devices is organizing the efficient loading of those devices. If today's program codes cannot keep all the functional devices busy, the processor can be allowed to execute more than one task (thread), so that the additional threads load all the functional units (much like multitasking).

Multi-core. It is possible, of course, to implement multiprocessing at the chip level, i.e. to place several processors on a single chip (Power 4). But if we take a microprocessor together with memory as a system core, then several such cores on one chip create a multi-core structure. In this case, functions for which chipsets are usually used (for example, interfaces of network and telecommunication systems) are integrated on the chip (Motorola MPC8260, Power 4 processors).

    The implementation of high-performance computing technology is currently proceeding in four main directions.

1. Vector-pipeline computers. Pipelined functional devices and a vector instruction set are the two features of such machines. Unlike the traditional approach, vector commands operate on whole arrays of independent data, which allows the available pipelines to be loaded efficiently; i.e., a command like A = B + C can mean the addition of two arrays rather than two numbers. A typical representative of this direction is the CRAY family of vector-pipeline computers, which includes, for example, CRAY EL, CRAY J90 and CRAY T90 (in March 2000 the American company TERA bought the CRAY division from Silicon Graphics, Inc.).

2. Massively parallel computers with distributed memory. The idea of building computers of this class is trivial: take serial microprocessors, provide each with its own local memory, connect them through some communication medium, and that is all. This architecture has many advantages: if high performance is needed, more processors can be added; if finances are limited or the required computing power is known in advance, it is easy to select the optimal configuration; and so on.

However, there is also a decisive "minus" that reduces many of the "pluses" to nothing: in computers of this class, interaction between processors goes through the communication medium and is far slower than each processor's access to its own local memory, so organizing efficient data exchange between nodes becomes the main difficulty.

3. Parallel computers with shared memory. All of the RAM of such computers is shared by several identical processors. This removes the problems of the previous class but adds new ones: for purely technical reasons, the number of processors with access to the shared memory cannot be made large. This direction includes many modern multiprocessor SMP computers and, for example, individual nodes of the HP Exemplar and Sun StarFire computers.

    4. Cluster systems. The last direction, strictly speaking, is not independent, but rather a combination of the previous three. We will form a computing node from several processors (traditional or vector-pipeline) and their common memory. If the received computing power is not enough, then we will combine several nodes with high-speed channels. This kind of architecture is called cluster, and CRAYs are built on this principle SV1, HP Exemplar,Sun StarFire, N.E.C. SX-5, latest IBM models SP2 and others. It is this direction that is currently the most promising for the design of computers with record performance indicators.