• Computing systems with shared memory. Computer memory

    Memory organization of a microprocessor system (MPS). Memory segmentation. Address calculation. Internal cache memory.

    The memory of a microprocessor system performs the function of temporary or permanent storage of data and commands. The amount of memory determines the permissible complexity of the algorithms executed by the system, as well as, to some extent, the speed of the system as a whole. Memory modules are implemented on memory chips (RAM or ROM). Increasingly, microprocessor systems are using flash memory, which is a non-volatile memory with the ability to rewrite its contents multiple times.

    To connect the memory module to the system bus, interface blocks are used, which include an address decoder (selector), a circuit for processing bus control signals, and data buffers (Figure 7.4.1).

    Figure 7.4.1. Memory module connection diagram.

    Several special areas performing dedicated functions are usually allocated in the memory space of a microprocessor system. These include:

    – startup program memory, implemented in ROM or flash memory;

    – stack memory (Stack), a part of RAM intended for temporary storage of data;

    – interrupt vector table containing the start addresses of interrupt processing programs;

    – memory of devices connected to the system bus.

    All other parts of the memory space, as a rule, have a universal purpose. They can contain both data and programs (of course, in the case of a single-bus architecture).

    Often the memory space is divided into segments with a programmable segment start address and a fixed segment size. For example, in the Intel 8086 processor, memory segmentation is organized as follows.

    The entire system memory is represented not as a continuous space but as several pieces, segments of a given size (64 KB each), whose position in the memory space can be changed programmatically.

    Memory address codes are stored not in individual registers but in pairs of registers:

    – the segment register specifies the start address of the segment (that is, the segment's position in memory);

    – the pointer register (offset register) determines the position of the working address within the segment.

    In this case, the physical 20-bit memory address exposed to the external address bus is formed as shown in Figure 7.4.2, that is, by adding the offset to the segment address shifted left by 4 bits.

    Figure 7.4.2. Formation of a physical memory address from a segment address and offset.

    The location of this address in memory is shown in Figure 7.4.3.

    Figure 7.4.3. Location of the physical address in memory.

    A segment can only start at a 16-byte memory boundary (since the segment start address always has four low-order zero bits, as seen in Figure 7.4.2), that is, at an address that is a multiple of 16. These valid segment boundaries are called paragraph boundaries.

    Note that the introduction of segmentation is primarily due to the fact that the internal registers of the processor are 16-bit, and the physical memory address is 20-bit (a 16-bit address allows the use of only 64 KB of memory, which is clearly not enough).
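    As a minimal sketch of the address calculation in Figure 7.4.2 (the function name and example values here are illustrative, not taken from any real toolchain):

        def physical_address(segment: int, offset: int) -> int:
            """Form a 20-bit 8086 physical address from 16-bit segment and offset."""
            assert 0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF
            # Shift the segment left by 4 bits (multiply by 16) and add the offset;
            # the result is truncated to 20 bits, as on the external address bus.
            return ((segment << 4) + offset) & 0xFFFFF

        # Example: segment 0x1234 and offset 0x0010 give address 0x12350,
        # i.e. a paragraph (16-byte) boundary plus the offset.
        print(hex(physical_address(0x1234, 0x0010)))   # -> 0x12350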

    Cache memory is located between main memory (RAM) and the central processor to reduce the time spent on CPU accesses to main memory.

    The idea of cache memory is based on predicting the most likely CPU accesses to main memory. The most likely data and instructions are copied into the fast cache memory, which operates at the CPU's pace, before the CPU actually uses them, so that the data and instructions currently in use can be accessed quickly, without accessing main memory. This approach relies on the principle of program locality (also called the nested nature of accesses), which means that the addresses of successive accesses to main memory usually form a compact group. When main memory is accessed, it is not individual data items that are copied into the cache but blocks of information, including the data most likely to be used by the CPU in subsequent work steps. As a result, subsequent instructions are fetched by the CPU not from main memory but from the fast cache. When the CPU needs to read or write some data, it first checks whether the data is present in the cache. The efficiency of a cache system depends on the block size and on the program's algorithm.
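    As an illustration of block copying and locality, here is a toy model of a direct-mapped cache (the sizes and names are assumptions made for this sketch, not parameters of any real processor):

        BLOCK_SIZE = 16    # bytes copied from main memory on every miss
        NUM_LINES = 64     # number of cache lines

        class ToyCache:
            def __init__(self, memory: bytearray):
                self.memory = memory
                self.tags = [None] * NUM_LINES   # which memory block each line holds
                self.lines = [b""] * NUM_LINES

            def read(self, addr: int) -> int:
                block = addr // BLOCK_SIZE
                line = block % NUM_LINES
                if self.tags[line] != block:
                    # Miss: copy the whole block, not just the requested byte,
                    # so that neighbouring addresses hit on subsequent reads.
                    start = block * BLOCK_SIZE
                    self.lines[line] = bytes(self.memory[start:start + BLOCK_SIZE])
                    self.tags[line] = block
                return self.lines[line][addr % BLOCK_SIZE]

        cache = ToyCache(bytearray(range(256)) * 16)
        cache.read(100)    # miss: the block with addresses 96-111 is copied
        cache.read(101)    # hit: served from the cache thanks to locality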

    Chapter 11

    Memory organization of computing systems

    In computing systems that combine many parallel processors or machines, the task of properly organizing memory is one of the most important. The difference between processor and memory speed has always been a stumbling block in single-processor machines. Making computers multiprocessor introduces another problem: simultaneous access to memory by several processors.

    Depending on how the memory of multiprocessor (multi-machine) systems is organized, a distinction is made between computing systems with shared memory and computing systems with distributed memory. In shared-memory systems (often called common-memory systems), the memory is treated as a common resource, and each processor has full access to the entire address space. Systems with shared memory are called tightly coupled (closely coupled systems). A similar construction of computing systems occurs both in the SIMD class and in the MIMD class. Sometimes, to emphasize this circumstance, special subclasses are introduced, denoted SM-SIMD (Shared Memory SIMD) and SM-MIMD (Shared Memory MIMD).

    In the distributed-memory option, each processor has its own memory. Processors are combined into a network and can, if necessary, exchange data stored in their memories by passing so-called messages to each other. This type of computing system is called loosely coupled (loosely coupled systems). Loosely coupled systems are also found in both the SIMD class and the MIMD class, and sometimes, to emphasize this feature, the subclasses DM-SIMD (Distributed Memory SIMD) and DM-MIMD (Distributed Memory MIMD) are introduced.

    In some cases, shared-memory computing systems are called multiprocessors, and systems with distributed memory multicomputers.

    The difference between shared and distributed memory is a difference in the structure of virtual memory, that is, in how the memory looks from the processor's side. Physically, almost every memory system is divided into autonomous components that can be accessed independently. What separates shared memory from distributed memory is how the memory subsystem interprets a cell address received from a processor. As an example, suppose a processor executes the instruction load R0, i, meaning "load register R0 with the contents of cell i." In the case of shared memory, i is a global address, and it points to the same cell for any processor. In a distributed-memory system, i is a local address: if two processors execute load R0, i, each of them accesses the i-th cell in its own local memory, that is, different cells, and different values may be loaded into their R0 registers.

    The difference between the two memory systems must be taken into account by the programmer because it determines the way the parts of a parallelized program interact. In the shared memory option, it is enough to create a data structure in memory and pass references to this structure to parallel subroutines. In a distributed memory system, it is necessary to have a copy of the shared data in each local memory. These copies are created by embedding the shared data in messages sent to other processors.
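    A schematic sketch of the load R0, i example above (a deliberately simplified model; no real instruction set is assumed):

        # Shared memory: a single global array; address i names the same cell
        # no matter which processor issues the access.
        shared_mem = [0] * 1024

        def load_shared(i: int) -> int:
            return shared_mem[i]        # same cell, hence same value, for all processors

        # Distributed memory: each processor owns a private array, so the same
        # address i names a different physical cell on each processor.
        local_mem = {0: [0] * 1024, 1: [0] * 1024}

        def load_distributed(proc: int, i: int) -> int:
            return local_mem[proc][i]   # may differ between processors

        local_mem[0][5] = 7             # processor 0 writes its own cell 5
        print(load_distributed(0, 5), load_distributed(1, 5))   # -> 7 0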

    Interleaved memory

    Physically, the memory of a computing system consists of several modules (banks), and an important question is how the address space (the set of all addresses the processor can generate) is distributed among them. One way to distribute virtual addresses across memory modules is to divide the address space into sequential blocks. If the memory consists of n banks of B cells each, then under block partitioning the cell with address i is located in the bank numbered i div B. In an interleaved memory system, consecutive addresses are located in different banks: the cell with address i is in the bank numbered i mod n. Suppose, for example, that the memory consists of four banks of 256 bytes each. In the block-oriented scheme, the first bank is allocated virtual addresses 0-255, the second 256-511, and so on. In the interleaved scheme, consecutive cells in the first bank have virtual addresses 0, 4, 8, ..., and in the second bank 1, 5, 9, and so on (Fig. 11.1, a).
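    Both placement rules reduce to one-line formulas; the sketch below uses the four-bank, 256-byte example above (the names are illustrative):

        N_BANKS = 4
        BANK_SIZE = 256

        def bank_block(addr: int) -> int:
            """Block scheme: a bank is filled with consecutive addresses."""
            return addr // BANK_SIZE

        def bank_interleaved(addr: int) -> int:
            """Interleaved scheme: consecutive addresses go to consecutive banks."""
            return addr % N_BANKS

        print(bank_block(300))      # -> 1: address 300 falls in the range 256-511
        print([bank_interleaved(a) for a in (0, 1, 2, 3, 4, 5)])   # -> [0, 1, 2, 3, 0, 1]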

    Distributing the address space across modules makes it possible to process memory access requests simultaneously if the corresponding addresses belong to different banks. The processor can request access to cell i in one cycle and to cell j in the next. If i and j are in different banks, the information will be transmitted in successive cycles. Here a cycle means a processor cycle, while a complete memory cycle takes several processor cycles. Thus, in this case the processor does not have to wait until a full access cycle to cell i is completed. This technique increases bandwidth: if the memory system consists of a sufficient number of banks, information can be exchanged between the processor and memory at a rate of one word per processor cycle, regardless of the duration of the memory cycle.

    Fig. 11.1. Interleaved memory: a - address distribution; b - elements extracted with a stride of 9 from an 8 x 8 array

    The choice between the two address-distribution schemes (block-based or interleaved) depends on the expected order of access to information. Programs are compiled so that sequential instructions are located in cells with sequential addresses, so there is a high probability that after the instruction fetched from the cell with address i, the instruction from the cell with address i + 1 will be executed. The compiler also places vector elements in sequential cells, so vector operations can take advantage of interleaving. For this reason, vector processors typically use some form of address interleaving. Shared-memory multiprocessors, in contrast, use block addressing, because memory access patterns in MIMD systems can vary greatly. In such systems, the goal is to connect a processor to a memory block and use as much of the information in it as possible before switching to another memory block.

    Memory systems often provide additional flexibility when extracting vector elements. On some systems it is possible to load every n-th element of a vector at once: for example, when extracting elements of a vector V stored in consecutive memory cells with n = 4, memory returns the elements V0, V4, V8, .... The spacing between elements is called the index step, or stride. One interesting use of this property is matrix access. If the stride is one greater than the number of rows in the matrix, a single memory access request will return all the diagonal elements of the matrix (Fig. 11.1, b). Responsibility for ensuring that all the extracted matrix elements land in different banks rests with the programmer.
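    The diagonal example of Fig. 11.1, b can be verified with a short sketch (eight banks, an 8 x 8 row-major matrix; the stride of 9 is one greater than the row length):

        ROWS = COLS = 8
        N_BANKS = 8
        STRIDE = COLS + 1                 # one greater than the row length

        # Addresses of the diagonal elements of a row-major 8 x 8 matrix.
        diagonal = [k * STRIDE for k in range(ROWS)]    # 0, 9, 18, ..., 63
        banks = [addr % N_BANKS for addr in diagonal]   # interleaved placement

        print(banks)    # -> [0, 1, 2, 3, 4, 5, 6, 7]: all banks are distinct,
                        # so one interleaved request can return the whole diagonal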

    Models of memory architecture of computer systems

    Several models of memory system architectures are implemented within both shared and distributed memory.

    Fig. 11.2. Classification of memory architecture models of computing systems

    Fig. 11.2 shows the classification of such models used in MIMD-class computing systems (it also holds for the SIMD class).

    Shared Memory Architecture Models

    In systems with shared memory, all processors have equal access to a single address space. A single memory can be built as single-block or modular, but the second option is usually used.

    Computing systems with shared memory, in which any processor accesses memory uniformly and in the same time, are called systems with uniform memory access, designated by the abbreviation UMA (Uniform Memory Access). This is the most common memory architecture of parallel shared-memory computing systems.

    Technically, UMA systems assume the presence of a node connecting each of the n processors with each of the m memory modules. The simplest way to build such a system is to combine several processors (Pi) with a single memory (M) via a common bus, as shown in Fig. 11.3, a. In this case, however, only one of the processors can use the bus at any time, that is, the processors must compete for access to it. When processor Pi fetches an instruction from memory, the other processors must wait until the bus is free. If the system includes only two processors, they are able to work at close to maximum performance, since their bus accesses can be interleaved: while one processor decodes and executes an instruction, the other may use the bus to fetch its next instruction from memory. However, when a third processor is added, performance begins to drop. With ten processors on the bus, the performance curve (Fig. 11.3, c) becomes horizontal, so adding an eleventh processor no longer increases performance. The bottom curve in this figure illustrates the fact that the memory and the bus have a fixed bandwidth, determined by the combination of the memory cycle time and the bus protocol; in a multiprocessor system with a shared bus, this bandwidth is distributed among several processors. If the processor cycle time is longer than the memory cycle, many processors can be connected to the bus. However, in reality a processor is usually much faster than memory, so this scheme is not widely used.

    Fig. 11.3. Shared memory: a - combining processors using a bus; b - system with local caches; c - system performance as a function of the number of processors on the bus; d - multiprocessor system with shared memory consisting of individual modules

    An alternative way of building a multiprocessor system with UMA shared memory is shown in Fig. 11.3, d. Here the bus is replaced by a switch that routes processor requests to one of several memory modules. Even though there are several memory modules, they all belong to a single virtual address space. The advantage of this approach is that the switch can serve multiple requests in parallel. Each processor can be connected to its own memory module and access it at the maximum allowed speed. Contention between processors can arise when two processors attempt to access the same memory module at the same time; in this case, only one processor gets access and the others are blocked.

    Unfortunately, the UMA architecture does not scale well. The most common systems contain 4-8 processors, much less often 32-64 processors. In addition, such systems cannot be classified as fault-tolerant, since the failure of one processor or memory module causes the entire system to fail.

    Another approach to building shared-memory systems is non-uniform memory access, denoted NUMA (Non-Uniform Memory Access). Here a single address space is still involved, but each processor has local memory. The processor accesses its own local memory directly, which is much faster than accessing remote memory through a switch or network. Such a system can be supplemented with global memory; the local storage devices then act as fast cache memory for the global memory. Such a scheme can improve system performance, but it cannot postpone indefinitely the flattening of the performance curve. If each processor has a local cache (Fig. 11.3, b), there is a high probability (p > 0.9) that the required instruction or data item is already in local memory. A reasonable probability of hitting local memory significantly reduces the number of processor accesses to global memory and thus increases efficiency. The inflection point of the performance curve (the upper curve in Fig. 11.3, c), that is, the point at which adding processors is still effective, now moves to the region of 20 processors, and the point where the curve becomes horizontal moves to the region of 30 processors.
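    The effect of a high local-hit probability can be estimated with the usual weighted average of access times (the formula is standard; the timing numbers below are purely illustrative):

        def avg_access_time(p_local: float, t_local: float, t_remote: float) -> float:
            """Average access time given the probability of hitting local memory."""
            return p_local * t_local + (1.0 - p_local) * t_remote

        # Illustrative assumption: local access 10 ns, remote access 100 ns.
        print(avg_access_time(0.9, 10.0, 100.0))    # -> 19.0 ns instead of 100 ns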

    Within the NUMA concept, several different approaches are implemented, denoted by the acronyms COMA, CC-NUMA, and NCC-NUMA.

    In a cache-only memory architecture (COMA, Cache-Only Memory Architecture), the local memory of each processor is built as a large cache memory for fast access by "its own" processor. The caches of all processors are collectively considered as the global system memory; there is no actual global memory. The fundamental feature of the COMA concept is its dynamism: data is not statically bound to a specific memory module and does not have a unique address that remains unchanged throughout the lifetime of a variable. In the COMA architecture, data is transferred to the cache memory of the processor that last requested it; a variable is not fixed to a unique address and can reside in any physical cell at any time. Moving data from one local cache to another does not require the involvement of the operating system, but it does involve complex and expensive memory management hardware. So-called cache directories are used to organize this regime. Note also that the last copy of a data item is never removed from the cache.

    Since in the COMA architecture data is moved into the local cache memory of the owner processor, such systems have a significant performance advantage over other NUMA architectures. On the other hand, if a single variable, or two different variables stored in the same cache line, are required by two processors, the cache line must move back and forth between the processors with each data access. Such effects may depend on the details of memory allocation and lead to unpredictable situations.

    The cache-coherent non-uniform memory access model (CC-NUMA, Cache Coherent Non-Uniform Memory Architecture) is fundamentally different from the COMA model. A CC-NUMA system uses not cache memory but ordinary physically distributed memory. There is no copying of pages or data between memory cells and no software-implemented message passing; there is simply one memory map, with the parts physically connected by copper cable, plus "smart" hardware. Hardware-implemented cache coherence means that no software is needed to store multiple copies of updated data or to transmit them: the hardware handles all of this. Accesses to the local memory modules of different nodes of the system can be performed simultaneously and are faster than accesses to remote memory modules.

    The difference between the cache-incoherent non-uniform memory access model (NCC-NUMA, Non-Cache Coherent Non-Uniform Memory Architecture) and CC-NUMA is obvious from the name. This memory architecture assumes a single address space but does not provide hardware-level global data consistency; managing the use of such data rests entirely on software (applications or compilers). Although this circumstance seems to be a drawback of the architecture, it turns out to be very useful in increasing the performance of computing systems with a DSM-type memory architecture, discussed in the section "Models of distributed memory architectures".

    In general, shared-memory systems built according to the NUMA scheme are called virtual shared memory architectures. This type of architecture, in particular CC-NUMA, has recently been considered an independent and rather promising type of MIMD-class computing system, so such systems will be discussed in more detail below.

    Models of distributed memory architectures

    In a distributed-memory system, each processor has its own memory and can address only it. Some authors call this type of system a multi-machine computing system, or multicomputer, emphasizing that the blocks from which the system is built are themselves small computing systems with a processor and memory. Distributed-memory architecture models are usually denoted as architectures without direct access to remote memory (NORMA, No Remote Memory Access). The name reflects the fact that each processor has access only to its local memory. Access to remote memory (the local memory of another processor) is possible only by exchanging messages with the processor that owns the addressed memory.

    Such an organization has a number of advantages. First, when accessing data there is no contention for a bus or switches: each processor can fully use the bandwidth of the communication path to its own local memory. Second, the absence of a common bus means there are no associated restrictions on the number of processors: the size of the system is limited only by the network connecting the processors. Third, the cache coherence problem is eliminated: each processor may independently change its data without worrying about keeping copies of data in its own local cache consistent with the caches of other processors.

    The main disadvantage of computers with distributed memory is the complexity of information exchange between processors. If one of the processors needs data from the memory of another processor, it must exchange messages with that processor. This results in two types of costs:

      – it takes time to form and send a message from one processor to another;

      – to respond to a message from another processor, the receiving processor must accept an interrupt request and execute the interrupt handling procedure.

    The structure of a distributed-memory system is shown in Fig. 11.4. The left part (Fig. 11.4, a) shows one processing element (PE). It includes the processor itself (P), local memory (M), and two input/output controllers (Ko and Ki). The right part (Fig. 11.4, b) shows a four-processor system, illustrating how messages are sent from one processor to another. With respect to each PE, all other processing elements can be considered simply as input/output devices. To send a message to another PE, the processor forms a data block in its local memory and notifies its local controller of the need to transfer information to an external device. Over the interconnection network, this message is sent to the receiving I/O controller of the destination PE. The latter finds a place for the message in its own local memory and notifies the source processor that the message has been received.
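    A sketch of the send/receive sequence just described, with simple queues standing in for the I/O controllers and the interconnection network (all names are illustrative):

        from queue import Queue

        class ProcessingElement:
            def __init__(self, pe_id: int, network: dict):
                self.pe_id = pe_id
                self.local_memory = {}
                self.inbox = Queue()      # stands in for the receiving I/O controller
                network[pe_id] = self.inbox

            def send(self, network: dict, dst: int, key: str, value) -> None:
                # Form a data block in local memory, then hand it to the network.
                self.local_memory[key] = value
                network[dst].put((self.pe_id, key, value))

            def receive(self) -> None:
                # The receiving controller finds a place for the message
                # in its own local memory.
                src, key, value = self.inbox.get()
                self.local_memory[key] = value

        network = {}
        pe0 = ProcessingElement(0, network)
        pe1 = ProcessingElement(1, network)
        pe0.send(network, dst=1, key="x", value=42)
        pe1.receive()
        print(pe1.local_memory["x"])      # -> 42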

    Fig. 11.4. Computing system with distributed memory: a - processing element; b - combining processing elements

    An interesting variant of a distributed-memory system is the distributed shared memory model (DSM, Distributed Shared Memory), also known as an architecture with non-uniform memory access and software coherence (SC-NUMA, Software-Coherent Non-Uniform Memory Architecture). The idea of this model is that a computer that is physically a distributed-memory system is presented to the user as a shared-memory system by the operating system. That is, the operating system offers the user a single address space, even though actual access to the memory of a "foreign" computer in the system is still provided through message exchange.

    Multiprocessor cache coherence

    A shared-memory multiprocessor system consists of two or more independent processors, each executing either part of a larger program or an independent program. All processors access instructions and data stored in a common main memory. Since memory is a shared resource, contention between processors arises when accessing it, increasing the average memory access latency. To reduce this latency, each processor is given a local cache, which, by serving local memory accesses, in many cases eliminates the need to access shared main memory. In turn, equipping each processor with a local cache leads to the so-called coherence (consistency) problem of cache memory. A system is coherent if every read operation at any address, performed by any of the processors, returns the value entered during the last write operation at that address, regardless of which processor performed the last write.

    In its simplest form, the cache coherence problem can be explained as follows (Fig. 11.5). Let two processors P1 and P2 be connected to the shared memory via a bus. First, both processors read the variable X. Copies of the blocks containing this variable are transferred from main memory to the local caches of both processors (Fig. 11.5, a). Next, processor P1 increases the value of the variable X by one. Since a copy of the variable is already in this processor's cache, a cache hit occurs and the value is changed only in cache memory 1. If processor P2 now performs a read of X again, a cache hit also occurs, and P2 receives the "old" value of X stored in its cache (Fig. 11.5, b).
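    The stale-read scenario of Fig. 11.5 can be reproduced with a toy model of two caches over one shared memory (illustrative only; no coherence protocol is modelled):

        shared_memory = {"X": 0}
        cache1, cache2 = {}, {}

        # Both processors read X: copies land in both local caches (Fig. 11.5, a).
        cache1["X"] = shared_memory["X"]
        cache2["X"] = shared_memory["X"]

        # P1 increments X; on a cache hit only its own copy is changed.
        cache1["X"] += 1

        # P2 reads X again and hits its own cache (Fig. 11.5, b):
        print(cache2["X"])    # -> 0, the stale value
        print(cache1["X"])    # -> 1, visible only to P1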

    Maintaining consistency requires that when a data item is changed by one of the processors, the corresponding changes are made in the caches of the other processors that hold a copy of the changed data item, as well as in shared memory. A similar problem arises, incidentally, in single-processor systems with several levels of cache memory, where the contents of the different cache levels must be kept consistent.

    There are two approaches to solving the coherence problem: software and hardware. Some systems use strategies that combine both approaches.

    Software solutions to the coherence problem

    Software techniques for solving the coherence problem make it possible to do without additional hardware or to reduce it to a minimum.