• SMP – uniform memory access (shared memory). SMP architecture. Advantages and disadvantages. Scope, examples of computing systems based on SMP

    Unleash the power of Linux on SMP systems

    The performance of Linux systems can be increased in a number of ways, and one of the most popular is to increase processor performance. The obvious approach is to use a processor with a higher clock frequency, but for any technology there is a physical limit beyond which the clock simply cannot go faster. Once that limit is reached, a much better approach is to use multiple processors. Unfortunately, performance does not scale as a linear function of the number and power of the individual processors.

    Before discussing the use of multiprocessing in Linux, let's take a look at its history.

    History of multiprocessing

    Multiprocessing began in the mid-1950s at a number of companies, some you know and some you may have forgotten (IBM, Digital Equipment Corporation, Control Data Corporation). In the early 1960s, Burroughs Corporation introduced a symmetric MIMD multiprocessor with four CPUs and up to sixteen memory modules connected by a crossbar (the first SMP architecture). The widely known and successful CDC 6600 was introduced in 1964 and supplemented its CPU with ten peripheral processors (subprocessors). In the late 1960s, Honeywell released another symmetric multiprocessor system: an eight-CPU Multics machine.

    While multiprocessor systems evolved, the underlying technology also advanced, shrinking processors and letting them run at much higher clock frequencies. In the 1980s, companies such as Cray Research introduced multiprocessor systems and UNIX®-like operating systems that could exploit them (CX-OS).

    The late 1980s saw a decline in multiprocessor systems with the rising popularity of single-processor personal computers such as the IBM PC. But now, twenty years later, multiprocessing has returned to those same personal computers in the form of symmetric multiprocessing.

    Amdahl's law

    Gene Amdahl, a computer architect, developed computer architectures at IBM and later founded the eponymous Amdahl Corporation. But what made him famous is his law, which predicts the maximum possible improvement of a system when only part of it is improved. The law is used primarily to calculate the maximum theoretical performance improvement from using multiple processors (see Figure 1).

    Figure 1. Amdahl's law for parallelizing processes
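    The figure itself is not reproduced here; in its standard form the equation it shows is (this is the textbook statement of Amdahl's law, not a reconstruction of the original figure):

    \[ S(N) = \frac{1}{F + \dfrac{1 - F}{N}} \]

    where S is the speedup, N is the number of processors, and F is the fraction of the work that is serial. For example, with F = 0.1 and N = 10 this gives S = 1/(0.1 + 0.09) ≈ 5.3, which is why, in Figure 2 below, ten processors perform only slightly better than five on a 90%-parallelizable task.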

    Using the equation shown in Figure 1, you can calculate the maximum performance improvement of a system with N processors and a factor F that specifies the portion of the system that cannot be parallelized (the part that is serial in nature). The result is shown in Figure 2.

    Figure 2. Amdahl's law for a system with up to ten CPUs

    The top line in Figure 2 shows linear speedup: the ideal case, in which performance grows directly with the number of processors added to the problem. Unfortunately, because not everything in a problem can be parallelized and there is overhead in managing the processors, the actual speedup is somewhat less. At the bottom (purple line) is a problem that is 90% serial. The best case on this graph is the brown line, a task that is 10% serial and therefore 90% parallelizable. Even so, ten processors perform only slightly better than five.

    Multiprocessing and PC

    The SMP architecture is one in which two or more identical processors are connected to one another through shared memory. Each has the same access to the shared memory (the same latency to the memory space). Its opposite is the Non-Uniform Memory Access (NUMA) architecture, in which each processor has its own local memory and accessing the memory of other processors carries a different, higher latency.

    Loosely coupled multiprocessing

    Early Linux SMP systems were loosely coupled multiprocessor systems, that is, they were built from several separate systems connected by a high-speed interconnect (such as 10G Ethernet, Fibre Channel, or InfiniBand). Another name for this type of architecture is a cluster (see Figure 3), for which the Linux Beowulf project remains a popular solution. Beowulf clusters can be built from commodity hardware and an ordinary network interconnect such as Ethernet.

    Figure 3. Loosely coupled multiprocessor architecture

    Building loosely coupled multiprocessor systems is easy (thanks to projects like Beowulf), but they have their limitations. A large multiprocessor network can require considerable power and floor space. A more serious obstacle is the communication fabric: even with a high-speed network such as 10G Ethernet, there is a limit to how far the system can scale.

    Tightly coupled multiprocessing

    Tightly coupled multiprocessing refers to chip-level multiprocessing (CMP). Imagine a loosely coupled architecture scaled down onto a single die: that is the idea of tightly coupled multiprocessing (also called multi-core computing). On one integrated circuit, several cores, shared memory, and an interconnect form a tightly integrated unit for multiprocessing (see Figure 4).

    Figure 4. Tightly coupled multiprocessing architecture

    In CMP, several CPUs are connected by a shared bus to shared memory (a second-level cache). Each processor also has its own fast local memory (an L1 cache). The tightly coupled nature of CMP allows very short physical distances between processors and memory, and therefore minimal memory-access latency and higher performance. This type of architecture works well for multi-threaded applications, whose threads can be distributed across the processors and executed in parallel. This is called thread-level parallelism (TLP).

    Given the popularity of this multiprocessor architecture, many manufacturers are releasing CMP devices. Table 1 lists some popular Linux-enabled options.

    Table 1. Selected CMP devices
    Manufacturer   Device           Description
    IBM            POWER4           SMP, two CPUs
    IBM            POWER5           SMP, two CPUs, four parallel threads
    AMD            AMD X2           SMP, two CPUs
    Intel®         Xeon             SMP, two or four CPUs
    Intel          Core 2 Duo       SMP, two CPUs
    ARM            MPCore           SMP, up to four CPUs
    IBM            Xenon            SMP, three PowerPC CPUs
    IBM            Cell processor   Asymmetric multiprocessing (ASMP), nine CPUs

    Kernel configuration

    To use SMP with Linux on SMP-capable hardware, the kernel must be configured correctly: the CONFIG_SMP option must be enabled during kernel configuration for the kernel to be SMP-aware. When such a kernel runs on a multiprocessor host, you can determine the number and type of processors through the proc file system.

    First, you obtain the number of processors from the cpuinfo file in /proc using grep. As Listing 1 shows, you use the --count (-c) option to count lines beginning with the word processor. Part of the contents of the cpuinfo file is also shown. The example uses a motherboard with two Xeon chips.

    Listing 1. Using the proc file system to obtain CPU information
    mtj@camus:~$ grep -c ^processor /proc/cpuinfo
    8
    mtj@camus:~$ cat /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 6
    model name      : Intel(R) Xeon(TM) CPU 3.73GHz
    stepping        : 4
    cpu MHz         : 3724.219
    cache size      : 2048 KB
    physical id     : 0
    siblings        : 4
    core id         : 0
    cpu cores       : 2
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 6
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm pni monitor ds_cpl est cid xtpr
    bogomips        : 7389.18
    ...
    processor       : 7
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 6
    model name      : Intel(R) Xeon(TM) CPU 3.73GHz
    stepping        : 4
    cpu MHz         : 3724.219
    cache size      : 2048 KB
    physical id     : 1
    siblings        : 4
    core id         : 3
    cpu cores       : 2
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 6
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm pni monitor ds_cpl est cid xtpr
    bogomips        : 7438.33
    mtj@camus:~$

    SMP and the Linux kernel

    When Linux 2.0 first appeared, SMP support consisted of a coarse-grained locking system that serialized access across the system. Later, small improvements were made to SMP support, but it wasn't until the 2.6 kernel that the full power of SMP was finally revealed.

    The 2.6 kernel introduced the new O(1) scheduler, which included better support for SMP systems. Key was its ability to balance load across all available CPUs while avoiding, as far as possible, moving processes between processors, so that caches are used more effectively. Regarding cache performance, recall from Figure 4 that when a task has been running on one CPU, its working data sits in that CPU's cache; after the task moves to another CPU, its memory accesses are slower until its data has been loaded into the new CPU's cache.

    The 2.6 kernel keeps a pair of runqueues per processor (an active and an expired runqueue). Each runqueue supports 140 priorities, 100 of which are used for real-time tasks and the remaining 40 for user tasks. Tasks are given timeslices in which to run, and when a task uses up its timeslice, it is moved from the active runqueue to the expired one. This gives all tasks fair access to the CPU and requires locking only per CPU rather than across the whole system.
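    As an illustration only - simplified types standing in for the kernel's own structures, not actual kernel code - the two-array idea can be sketched in C as follows:

    #define NUM_PRIO 140              /* 100 real-time + 40 user priorities */

    struct task {
        int prio;                     /* 0..139, lower = higher priority */
        int time_slice;               /* remaining timeslice, in ticks   */
        struct task *next;            /* simple intrusive list link      */
    };

    /* A priority array: one FIFO list per priority level. */
    struct prio_array {
        struct task *queue[NUM_PRIO];
    };

    /* Per-CPU pair of runqueues, as in the 2.6 O(1) scheduler. */
    struct runqueue {
        struct prio_array *active;
        struct prio_array *expired;
        struct prio_array arrays[2];
    };

    /* Called when a task exhausts its timeslice: move it to expired. */
    static void task_expired(struct runqueue *rq, struct task *t)
    {
        t->next = rq->expired->queue[t->prio];
        rq->expired->queue[t->prio] = t;
    }

    /* When no runnable task is left in 'active', swap the two arrays:
     * the expired tasks become the new active set in O(1) time. */
    static void swap_arrays(struct runqueue *rq)
    {
        struct prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }

    The pointer swap is what makes the scheduler O(1): no task ever has to be walked or re-sorted when the active queue empties.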

    With a runqueue per CPU, work can be balanced to give a weighted load to every CPU in the system. Every 200 milliseconds, the scheduler performs load balancing, redistributing tasks to keep the processor complex in balance. More detail on the Linux 2.6 scheduler can be found in the literature.

    Userspace threads: building on the power of SMP

    A great deal of work has gone into SMP support in the Linux kernel, but the operating system by itself is not enough. Remember that the power of SMP lies in TLP. Individual monolithic (single-threaded) programs cannot exploit SMP, but SMP can be exploited by programs composed of many threads that can be distributed among the cores. While one thread waits for an I/O operation to complete, another can do useful work: the threads overlap one another's waiting time.

    Portable Operating System Interface (POSIX) threads are an excellent way to build threaded applications that take advantage of SMP. POSIX threads provide the threading mechanism as well as shared memory. When a program is invoked, a number of threads are created, each maintaining its own stack (local variables and state) but sharing the data space of the parent. All created threads share this same data space, and that is where the problem lies.
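    A minimal sketch of this model (the worker function and thread count are invented for illustration; the pthread calls are the real POSIX API). The global counter lives in the data space shared by all threads, while each thread gets its own stack:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static int shared_counter = 0;   /* lives in the shared data space */

    static void *worker(void *arg)
    {
        long id = (long)arg;         /* local variable: on this thread's own stack */
        shared_counter++;            /* unsynchronized access - see Listing 2 */
        printf("thread %ld running\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        long i;

        for (i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);

        printf("counter = %d\n", shared_counter);
        return 0;
    }

    Compile with -lpthread. The unsynchronized increment of shared_counter is exactly the hazard the next paragraph addresses.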

    Supporting multi-threaded access to shared memory requires coordination mechanisms. POSIX provides mutex functions to create critical sections, which grant exclusive access to an object (a memory region) to a single thread at a time. Without this, memory can be corrupted by unsynchronized manipulations performed by multiple threads. Listing 2 illustrates creating a critical section with a POSIX mutex.

    Listing 2. Using pthread_mutex_lock and unlock to create critical sections
    pthread_mutex_t crit_section_mutex = PTHREAD_MUTEX_INITIALIZER;

    ...

    pthread_mutex_lock(&crit_section_mutex);

    /* Inside the critical section. Memory access here is safe
     * for memory protected by crit_section_mutex.
     */

    pthread_mutex_unlock(&crit_section_mutex);

    If other threads attempt to lock the mutex after the initial lock call at the top, they block and their requests are queued until the pthread_mutex_unlock call is made.

    Kernel variable protection for SMP

    When multiple cores run kernel code in parallel, it is desirable to avoid sharing data that is specific to a given processor core. For this reason, the 2.6 kernel introduced per-CPU variables, which are associated with individual CPUs. They let you declare variables that are used almost exclusively by one particular CPU, which minimizes locking requirements and improves performance.

    Per-CPU kernel variables are defined with the DEFINE_PER_CPU macro, to which you pass the type and name of the variable. Because the macro behaves like an l-value, you can also initialize it there. The following example (from ./arch/i386/kernel/smpboot.c) defines a variable representing the state of each CPU in the system.

    /* State of each CPU. */
    DEFINE_PER_CPU(int, cpu_state) = { 0 };

    The macro creates an array of variables, one per CPU instance. To reach the variable of an individual CPU, use the per_cpu macro together with smp_processor_id, a function that returns the ID of the CPU on which the code is currently running.

    per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;

    The kernel provides further functions for per-CPU locking and for dynamically allocating per-CPU variables. These can be found in ./include/linux/percpu.h.
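    A hedged sketch of how those interfaces are typically used in 2.6-era kernel code (my_counter and the surrounding functions are hypothetical; the macros are those declared in ./include/linux/percpu.h):

    #include <linux/percpu.h>
    #include <linux/errno.h>

    /* Hypothetical per-CPU counter, one instance per CPU. */
    DEFINE_PER_CPU(long, my_counter);

    static void count_event(void)
    {
        /* get_cpu_var disables preemption so we stay on this CPU
         * while touching our private copy; put_cpu_var re-enables it. */
        get_cpu_var(my_counter)++;
        put_cpu_var(my_counter);
    }

    /* Dynamic allocation: one long per CPU, reached via per_cpu_ptr. */
    static long *dyn_counters;

    static int setup_dynamic(void)
    {
        dyn_counters = alloc_percpu(long);
        if (!dyn_counters)
            return -ENOMEM;
        return 0;
    }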

    Conclusion

    When processor frequency reaches its limit, the usual way to gain performance is simply to add more processors. Previously that meant putting more processors on the motherboard or combining several independent computers into a cluster. Today, chip-level multiprocessing puts more processors on a single chip, delivering even greater performance through lower memory latency.

    You will find SMP systems not only in servers but also on the desktop, especially with the arrival of virtualization. Like many advanced technologies, SMP is well supported by Linux. The kernel does its part in distributing the load across the available CPUs (from threads to virtualized operating systems); all that remains is to ensure that the application can be threaded enough to exploit the power of SMP.

    5.2. Symmetric multiprocessor SMP architecture

    SMP (symmetric multiprocessing) is the symmetric multiprocessor architecture. The defining feature of systems with the SMP architecture (Fig. 5.5) is the presence of a common physical memory shared by all processors.

    Figure 5.5 – Schematic view of the SMP architecture

    The memory serves, among other things, to pass messages between the processors, and all computing devices have equal rights when accessing it and use the same addressing for all memory cells. That is why the SMP architecture is called symmetric. This property allows data to be exchanged between computing devices very efficiently.

    An SMP system is built around a high-speed system bus (SGI PowerPath, Sun Gigaplane, DEC TurboLaser), into whose slots functional blocks of various types are plugged: processors (CPU), input/output subsystems (I/O), and so on. Slower buses (PCI, VME64) are used to connect I/O modules.

    The best-known SMP systems are SMP servers and workstations based on Intel processors (IBM, HP, Compaq, Dell, ALR, Unisys, DG, Fujitsu, and others). The entire system runs under a single OS (usually UNIX-like, though Windows NT is supported on Intel platforms). The OS automatically distributes processes among the processors at run time, but explicit binding of a process to particular processors is also sometimes possible.
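    On Linux, for example, such explicit binding can be requested with the sched_setaffinity system call; a minimal sketch (binding to CPU 0 is an arbitrary choice for illustration):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);               /* allow only CPU 0 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* From here on, the scheduler keeps this process on CPU 0. */
        return 0;
    }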

    The main advantages of SMP systems:

    - simplicity and versatility of programming. The SMP architecture imposes no restrictions on the programming model used to create an application: typically a model of parallel branches is used, in which all processors work independently of one another. However, models that use interprocessor communication can also be implemented, and the shared memory speeds up such exchange; the user also has access to the entire memory at once. Quite effective automatic parallelization tools exist for SMP systems;

    - ease of operation. SMP systems typically use air cooling, which makes them easier to maintain;

    - relatively low price.

    Flaws:

    - shared memory systems do not scale well.

    This significant drawback of SMP systems prevents them from being considered truly future-proof. The reason for the poor scalability is that the bus can handle only one transaction at a time, which creates conflict-resolution problems when several processors access the same areas of shared physical memory simultaneously.

    In practice, such conflicts begin to appear with 8-24 processors. All of this obviously keeps performance from growing with the number of processors and the number of connected users. In real systems, no more than about 32 processors can be used. To build scalable systems on an SMP basis, cluster or NUMA architectures are used. Work with SMP systems follows the so-called shared-memory paradigm.

    02/15/1995 V. Pyatenok

    Contents: Single-processor architecture. Modified single-processor architecture. SMP architecture. The SMP router architecture proposed by Wellfleet. Architecture overview. Packet processing details. Summary. Literature.

    Routers have evolved through three different architectures: single-processor, modified single-processor, and symmetric multiprocessor. All three were designed to support highly critical applications. However, they are not equally able to satisfy the main requirements placed on them, namely high, scalable performance and a high level of availability, including full fault tolerance and "hot standby" replacement of failed components. This article discusses the advantages of the symmetric multiprocessor architecture.

    Single-processor architecture

    A single-processor architecture uses several network interface modules, which provides extra flexibility in configuring nodes. The network interface modules are connected to a single central processor over a common system bus, and this one processor handles all of the processing tasks. Given the current state of corporate networks, those tasks are complex and varied: filtering and forwarding packets, modifying packet headers as needed, updating routing tables and network addresses, interpreting service control packets, answering SNMP requests, generating control packets, and providing other specific services such as spoofing, that is, installing special filters to improve the security and performance of the network.

    This traditional architectural solution is the easiest to implement. However, it is not difficult to imagine the limitations to which the performance and availability of such a system will be subject.

    Indeed, all packets from all network interfaces must be handled by the single central processor, so performance degrades noticeably as network interfaces are added. Moreover, each packet must cross the bus twice: from the "source" module to the processor, and then from the processor to the "destination" module. The packet takes this path even if it is destined for the very network interface it arrived on. This too causes a significant drop in performance as the number of network interface modules grows. The result is a classic bottleneck.

    Reliability is also low. If the central processor fails, the router as a whole stops functioning. Furthermore, this architecture cannot support "hot" replacement of failed system elements from a standby reserve.

    Modern implementations of this router architecture typically use a fairly powerful RISC processor and a high-speed system bus to overcome the performance limitations. This is a purely brute-force attack on the problem: more performance for a large initial investment. Such implementations still do not provide performance scaling, and their reliability is bounded by the reliability of the single processor.

    Modified single-processor architecture

    To overcome some of the shortcomings of the uniprocessor architecture described above, a modification of it was devised. The underlying architecture is preserved: the interface modules are connected to a single processor over a common system bus, but each network interface module now includes a dedicated peripheral processor to at least partially offload the central processor.

    The peripheral processors are, as a rule, bit-slice or general-purpose microprocessors that filter and route packets destined for a network interface of the same module through which they entered the router. (Unfortunately, in many current implementations this works only for certain types of packets, such as Ethernet frames but not IEEE 802.3.)

    The central processor is still responsible for everything that cannot be delegated to the peripheral processors (including routing between modules, system-wide operations, administration, and management). The performance gain achieved this way is therefore rather limited (to be fair, with proper network design good results can be achieved in some cases). And although the number of packets crossing the system bus is somewhat reduced, the bus remains a serious bottleneck.

    Including peripheral processors in the architecture does not improve the availability of the router as a whole.

    SMP architecture

    The symmetric multiprocessor architecture is free of the disadvantages inherent in the architectures described above. Here, computing power is fully distributed among the network interface modules.

    Each network interface module has its own dedicated processor module that performs all routing-related tasks. All routing tables, other necessary information, and the software implementing the protocols are replicated (that is, copied) to every processor module. When a processor module receives routing information, it updates its own table and then propagates the update to all the other processor modules.
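    The update protocol just described might look like the following sketch (types and function names are entirely hypothetical; the article does not publish Wellfleet's internals):

    #include <stdio.h>

    #define MAX_MODULES 8   /* hypothetical number of processor modules */

    struct route_update {
        unsigned int dest_net;   /* destination network            */
        unsigned int next_hop;   /* new next hop for that network  */
    };

    /* Stand-in for transmission over the interprocessor connection. */
    static void send_to_module(int module, const struct route_update *u)
    {
        printf("replicating route to net %u -> module %d\n", u->dest_net, module);
    }

    /* Stand-in for updating this module's local copy of the table. */
    static void local_table_update(const struct route_update *u)
    {
        printf("local table: net %u via %u\n", u->dest_net, u->next_hop);
    }

    /* A module first updates its own replicated table, then
     * propagates the change to every other processor module. */
    void handle_route_update(int self, const struct route_update *u)
    {
        local_table_update(u);
        for (int m = 0; m < MAX_MODULES; m++)
            if (m != self)
                send_to_module(m, u);
    }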

    This architecture naturally provides almost linear scalability (neglecting the cost of replication and the bandwidth of the inter-module communication channel). That, in turn, means a network can be expanded substantially without a noticeable drop in performance: when needed, you simply add another network interface module, since there is no central processor in this architecture at all.

    All packets are processed by local processors. External packets (those destined for other modules) cross the interprocessor communication channel only once, which greatly reduces traffic inside the router.

    As for availability, the system does not fail when a single processor module fails; the failure affects only the network segments attached to the damaged module. In addition, a damaged module can be replaced with a working one without switching the router off and without any impact on the other modules.

    The advantages of the SMP architecture have been recognized by computer manufacturers. Many such platforms have appeared over the past few years, and only the limited number of standard operating systems able to fully exploit the hardware has held back their spread. Other manufacturers, including makers of active network devices, also use the SMP architecture to build specialized computing devices.

    In the remainder of this section, we'll take a closer look at the technical details of Wellfleet's SMP router architecture.

    SMP router architecture proposed by Wellfleet

    Wellfleet, one of the leading manufacturers of routers and bridges, has certainly spared no expense in evaluating and testing various router architectures supporting a variety of WAN and LAN protocols, over different physical media, and under different traffic conditions. The results of these studies were formulated as a list of requirements that guided the design of routers intended for enterprise network environments running highly critical applications. Here are some of those requirements - the ones that, in our view, justify a multiprocessor architecture.

    1. The need for scalable performance, a high level of availability, and configuration flexibility dictates the use of SMP architecture.

    2. The computing power demanded by multiprotocol routing (especially with modern routing protocols such as OSPF in TCP/IP networks) can only be supplied by modern, powerful 32-bit microprocessors. Moreover, because routing means servicing a large number of similar requests in parallel, it requires fast switching between processes, and therefore exceptionally low context-switch latency as well as integrated cache memory.

    3. Storing the protocol and control software, routing and address tables, statistics, and other information requires a fairly large memory capacity.

    4. Achieving maximum transfer rates between the networks and the router's processing modules requires high-speed network interface controllers and interprocessor controllers with integrated direct memory access (DMA).

    5. Minimizing latency requires high-bandwidth 32-bit data and address paths to all resources.

    6. Availability requirements include distributed computing power, redundant power subsystems, and, as an additional but very important feature, redundant interprocessor communication channels.

    7. The need to cover a wide range of network environments - from a single remote node or workgroup network to a high-performance, highly available backbone - requires the use of a scalable multiprocessor architecture.

    Architecture overview

    Figure 2 shows a schematic diagram of the symmetric multiprocessor architecture used in all Wellfleet modular routers. It has three main architectural elements: communication modules, processor modules, and the interprocessor connection.

    Communication modules provide the physical network interfaces for connecting to local and wide area networks of virtually any type. Each communication module is directly connected to its dedicated processor module through an Intelligent Link Interface (ILI). Packets received by a communication module travel over this direct connection to its processor module. The processor determines which network interface the packets are destined for and either redirects them to another network interface of the same communication module, or sends them over the high-speed interprocessor connection to another processor module, which forwards them to its attached communication module.

    Let us dwell in more detail on the structure of each of the components.

    The processor module includes:

    The central processing unit itself;

    Local memory, which holds the protocol software, routing tables, address tables, and other information used locally by the CPU;

    Global memory, which acts as a buffer for "transit" data packets arriving from the attached communication module or from other processor modules (it is called global because it is visible and accessible to all processor modules);

    A DMA processor, which provides direct memory access for transferring packets between global memory buffers located in different processor modules;

    A communication interface providing the connection to the corresponding communication module;

    Internal data channels, 32 bits wide, connecting all of the above resources and designed for maximum throughput and minimum delay; several channels are provided, so that multiple computing devices (such as the CPU and the DMA processor) can operate simultaneously and no bottleneck slows packet forwarding and processing.

    Various Wellfleet router models use ACE (Advanced Communication Engine) processor modules based on the Motorola 68020 or 68030 processors, or Fast Routing Engine (FRE) modules based on the MC68040.

    The communication module includes:

    Connectors providing the interface to specific networks (for example, synchronous links, Ethernet, Token Ring, FDDI);

    Communication controllers that move packets between the physical network interface and global memory over a DMA channel; the controllers are likewise specific to a type of network interface and can transfer packets at wire speed;

    Filters (an optional feature of the FDDI and Ethernet communication modules) that pre-filter incoming packets, saving computing resources for useful packet processing.

    The standard VMEbus is often used as an interprocessor communication channel, providing a total throughput of 320 Mbit/s.

    The high-end models use the Parallel Packet Express (PPX) interconnect, developed by Wellfleet itself, with a bandwidth of 1 Gbit/s: four independent, redundant 256 Mbit/s data channels with dynamic load distribution. This gives high overall throughput and removes any single point of failure from the architecture. Each processor module is connected to all four channels and can choose any of them; a channel is chosen at random for each packet, which should spread traffic evenly across all available channels. If one of the PPX data channels fails, the load is automatically distributed over the remaining ones.
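    The per-packet channel choice described here can be sketched as follows (hypothetical code; the article does not document PPX's actual selection logic):

    #include <stdlib.h>

    #define PPX_CHANNELS 4

    /* 1 = channel usable, 0 = failed; updated elsewhere on link failure. */
    static int channel_up[PPX_CHANNELS] = { 1, 1, 1, 1 };

    /* Pick a random channel among those still available, so traffic
     * spreads evenly and shifts automatically when a channel fails. */
    int pick_ppx_channel(void)
    {
        int live[PPX_CHANNELS];
        int n = 0;

        for (int i = 0; i < PPX_CHANNELS; i++)
            if (channel_up[i])
                live[n++] = i;

        if (n == 0)
            return -1;               /* no channel available */
        return live[rand() % n];
    }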

    Packet processing details

    Depending on the network, incoming packets arrive at one or another communication controller. If the communication module is configured with the optional filter, some packets are discarded and the rest accepted. Accepted packets are placed by the communication controller into the global memory buffer of its directly attached processor module; for fast packet transfer, every communication controller includes a direct memory access channel.

    Once in global memory, packets are picked up by the CPU for routing. The CPU determines the output network interface, modifies the packet accordingly, and returns it to global memory. Then one of two things happens:

    1. The packet is destined for a network interface of the directly attached communication module. The communication controller of the output network interface is instructed by the CPU to take the packets from global memory and send them to the network.

    2. The packet is destined for a network interface of another communication module. The DMA processor is instructed by the CPU to send the packets to the other processor module, and loads them over the interprocessor connection into the global memory of the processor module attached to the output network interface. That module's communication controller takes the packets from global memory and sends them to the network.

    Routing decisions are made by each CPU independently of the other processor modules. Each processor module maintains an independent routing and address database in its local memory, updated whenever the module receives information about changes in routing tables or addresses (the changes are also sent to all the other processor modules).

    The simultaneous operation of the communication controller, the CPU, and the DMA processor yields high overall performance. (Remember that all this happens in a device whose processing is parallelized across several processor modules.) For example, the communication controller may be placing packets into global memory while the CPU updates a routing table in local memory and the DMA processor pushes a packet onto the interprocessor connection.

    Summary

    The migration of computing technologies from one application area into related ones is nothing new, yet every concrete example attracts specialists' attention. Besides the idea of symmetric multiprocessing, intended to provide scalable performance and a high level of availability, the router architecture discussed in this article uses redundant data channels between processors (for the same ends), as well as the idea of data replication, whose use is more typical of the distributed DBMS industry.

    Literature

    Symmetric Multiprocessor Architecture. Wellfleet Communications, 10/1993.

    G.G. Baron, G.M. Ladyzhensky. "Technology for Data Replication in Distributed Systems." Open Systems, Spring 1994.

    *) Last fall Wellfleet merged with SynOptics Communications, another leader in network technologies. The merger created a new networking giant: Bay Networks. (Editor's note)



    5. SYMMETRIC MULTIPROCESSOR SYSTEMS

    5.1. Distinctive features and advantages of symmetric multiprocessor systems

    The class of symmetric multiprocessor (SMP) systems is characterized by the following distinctive features:

      the presence of two or more processors, identical or similar in characteristics;

      the processors have access to a shared memory, to which they are connected either by a common system bus or by some other interconnection mechanism, but in any case the memory access time is approximately the same for every processor;

      processors have access to common input/output facilities either through the same channel or through separate channels;

      all processors are capable of performing the same set of functions (hence the term symmetric system);

      the entire complex is controlled by a common operating system, which ensures interaction between processors and programs at the level of jobs, files and data elements.

    The first four features in this list hardly need comment. The fifth reveals the most important difference between SMP systems and clusters, in which the components interact, as a rule, at the level of whole messages or files. In an SMP system, the components can also exchange information at the level of individual data elements, so much closer interaction between processes can be organized. In an SMP system, the distribution of processes or task threads among the individual processors is handled by the operating system.

    The most significant advantages of SMP systems over single-processor systems are as follows.

    Increased performance. If individual application tasks can execute in parallel, a system with several processors performs faster than a system with one processor of the same type.

    Reliability. Since all processors in an SMP system are of the same type and can perform the same tasks, if one of them fails, the task scheduled for it can be transferred to another processor. Consequently, the failure of one of the processors will not lead to loss of functionality of the entire system.

    Possibility of functional expansion. The user can increase system performance by adding additional processors.

    A family of similar systems with different performance. A computer manufacturer can offer customers a range of systems with the same architecture but different cost and performance, differing in the number of processors.

    It should be noted that all these advantages are most often potential and are not always realized in practice.

    A very attractive feature of SMP systems is their transparency to the user: the operating system takes on all the work of distributing tasks among the individual processors and synchronizing them.

    5.2. Structural organization of SMP systems

    Figure 5.1 shows a generalized block diagram of a multiprocessor system.

    Fig. 5.1. Generalized diagram of a multiprocessor system

    The system contains two or more processors, each with the full set of necessary units: a control unit, an ALU, registers, and a cache. Each processor has access to the system's main memory and to the I/O devices through some interconnection subsystem. The processors can exchange data and messages through main memory (a separate communication area is set aside in it for this purpose). The system may also support direct signalling between individual processors. Shared memory is often organized so that the processors can access different blocks of it simultaneously. In some systems, the processors have their own local memory blocks and their own I/O channels in addition to the shared resources.
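    As a rough illustration only (the structure layout is hypothetical, not taken from any particular machine), the communication area mentioned above can be pictured as a shared structure through which one processor posts a message for another:

    /* Hypothetical layout of a communication area in shared memory. */
    struct comm_area {
        volatile int  ready;       /* 0 = empty, 1 = message posted */
        volatile int  sender;      /* ID of the posting processor   */
        volatile int  receiver;    /* ID of the intended recipient  */
        unsigned char payload[256];
    };

    /* Busy-wait send: a real system would use locks or interrupts. */
    void post_message(struct comm_area *a, int from, int to,
                      const unsigned char *msg, int len)
    {
        while (a->ready)
            ;                      /* wait for the previous message to drain */
        a->sender = from;
        a->receiver = to;
        for (int i = 0; i < len && i < 256; i++)
            a->payload[i] = msg[i];
        a->ready = 1;              /* publish the message */
    }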

    Options for the structural organization of multiprocessor systems can be classified as follows:

      systems with a common, time-shared backbone;

      systems with multiport memory;

      systems with a central control device.

    5.2.1. Systems with a common backbone

    Using a common backbone in time-sharing mode is the simplest way to organize cooperation among the processors of an SMP system (Fig. 5.2). The structure and interface of the backbone are practically the same as in a single-processor system: it includes data, address, and control-signal lines. To support the direct-memory-access mechanism of the I/O modules, the following provisions are made.

    Addressing. Addressing is organized so that modules can be distinguished by address code when determining the sources and destinations of data.

    Arbitration. Any I/O module can temporarily become the backbone master. An arbiter resolves conflicts between competing requests for control of the backbone using some priority mechanism.

    Time sharing. While one module controls the backbone, the other modules are locked out and must, if necessary, suspend their operations and wait until access is granted to them.

    These functions, common in single-processor systems, can be used in a multiprocessor system without special modification. The main difference is that not only the I/O modules but also the processors compete for access to memory.


    Fig. 5.2. Organization of an SMP system with a common backbone

    The backbone interconnect has several advantages over other approaches to implementing the interaction subsystem.

    Simplicity. This is the simplest option, because the physical interface, the addressing scheme, the arbitration mechanism, and the resource-sharing logic of the backbone remain essentially the same as in a uniprocessor system.

    Flexibility. A backbone system can be easily reconfigured by adding new processors.

    Reliability. The backbone is a passive medium, so the failure of any device connected to it does not bring down the system as a whole.

    The main disadvantage of a common-backbone system is limited performance. All main memory accesses must pass along the single path through the common backbone, so the speed of the system is limited by the backbone's cycle time. The problem can be partly mitigated by giving each processor its own cache memory, which reduces the number of accesses to main memory. As a rule, a two-level cache organization is used: the L1 cache is located inside the processor LSI (the internal cache), and the L2 cache is external.

    However, the use of cache memory in a multiprocessor system raises the problem of the coherence, or information integrity, of the caches of the individual processors.

    5.2.2. Multiport Memory Systems

    Multiport memory in SMP systems lets each processor and each I/O module access a common pool of information directly and independently of all the others (Fig. 5.3). Each memory module must then be equipped with logic to resolve possible conflicts; for this, the ports are usually assigned fixed priorities. The electrical and physical interface of each port is typically identical to what the attached device would see on a single-port memory module, so connecting a processor or I/O module to multiport memory requires practically no changes to its circuitry.


    Fig. 5.3. Multiport memory organization

    Only the design of the shared memory block becomes significantly more complicated, but this pays off in overall system performance, since each processor has its own channel for accessing shared information. Another advantage of this organization is the ability to dedicate memory areas to the exclusive use of a particular processor (or group of processors). This simplifies protection against unauthorized access and makes it possible to keep recovery programs in memory areas that other processors cannot modify.

    One more point matters when working with multiport memory: when information is updated in any processor's cache, a write-through to main memory must be performed, because there is no other way to notify the other processors that the data has changed.

    5.2.3. Systems with a central control device

    The central control device routes separate data flows between the independent modules: processors, memory, and I/O modules. The controller can queue requests and act as arbiter and resource allocator. It is also responsible for carrying status information and control messages and for notifying the processors about changes in cached data.

    Since all the logic for coordinating the system's components is concentrated in one central control device, the interfaces of the processors, memory, and I/O modules remain virtually unchanged. This gives the system almost the same flexibility and simplicity as a common backbone. The main disadvantage of this approach is the much more complicated control device, which can itself become a performance bottleneck.

    Structures with a central control device were once widespread in multiprocessor systems built on large machines. Today they are very rare.

    5.3. SMP systems based on large computers

    5.3.1. Structure of SMP systems based on large computers

    Most personal SMP systems and workstations use a system backbone to interconnect their components. Complexes based on large computers (mainframes) take an alternative approach, whose block diagram is shown in Fig. 5.4. The family includes machines of different classes, from single-processor models with a single main memory card to high-performance systems with a dozen processors and four main memory blocks. The configuration also includes additional processors that act as I/O modules. The main components of such complexes are as follows.

    Processor (PR): a CISC microprocessor in which the most frequently used instructions are executed under hardwired control and the rest by microcode. The LSI of each processor includes a 64 KB L1 cache that holds both instructions and data.

    L2 cache of 384 KB. The L2 cache blocks are grouped in pairs, each pair serving three processors and providing access to the entire main memory address space.

    Bus-switching network adapter (BSN), which connects the L2 cache blocks to one of the four main memory blocks. Each BSN also includes a 2 MB L3 cache.

    A single-board main memory unit of 8 GB. The complex includes four such units, for a total main memory capacity of 32 GB.

    There are several features in this structure that are worth dwelling on in more detail:

      switchable interconnection subsystem;

      shared L2 cache;

      L3 cache.


    Fig. 5.4. Block diagram of an SMP system based on large machines

    5.3.2. Switchable interconnection subsystem

    SMP systems for personal use and workstations generally adopt a single system backbone. There the backbone can in time become the bottleneck that prevents further expansion of the system, that is, the addition of new components. The designers of mainframe-based SMP systems attacked this problem in two ways.

    First, they split the main memory subsystem into four single-board units, each with its own controller capable of processing memory requests at high speed. This quadruples the total bandwidth of the memory access path.

    Second, the connection between each processor (more precisely, its L2 cache) and a given memory block is implemented not as a shared backbone but as point-to-point links: each link connects a group of three processors, through their L2 cache, to a BSN module. The BSN in turn acts as a switch joining five links (four to L2 caches and one to a memory unit), combining the four physical channels into one logical data transmission path. A signal arriving on any of the four L2-cache links is duplicated onto the other three, which preserves the information integrity of the caches.

    Although the system has four separate memory blocks, each processor and each L2 cache block has only two physical ports to the main memory subsystem. The reason is that each L2 cache block can hold data from only half of the total memory address space; a pair of cache blocks covers the whole address space, and each processor must therefore reach both blocks of the pair.

    5.3.3. Sharing L2 cache blocks

    In the typical SMP structure, each processor has its own cache blocks (usually two levels). In recent years, the idea of sharing L2 cache blocks has attracted growing interest among system designers. An early version of the mainframe-based SMP system used 12 L2 cache blocks, each at the disposal of one specific processor. In later versions, the L2 cache blocks are shared by several processors, for the following reasons.

    The new versions use processors that are twice as fast as those of the first version. Keeping the old cache-block structure would significantly increase the flow of information through the backbone subsystem. At the same time, the designers were required to reuse ready-made blocks designed for the old version; had the backbone subsystem not been upgraded, the BSN blocks could in time have become a bottleneck.

    Analysis of typical applications running on the system showed that a fairly large share of the instructions and data is common to several processors.

    The developers of the new version therefore considered sharing one or several L2 cache blocks among multiple processors (each processor still has its own internal L1 cache). At first glance, the idea of sharing the L2 cache looks unattractive, since processors must now compete for access to it, which could cost performance. But if a significant portion of the cached data and instructions is needed by several processors, a shared cache can increase system throughput rather than reduce it: data needed by several processors is found in the shared cache faster than it could be fetched across the backbone subsystem.

    The developers also considered including a single large cache shared by all processors. Although that structure promised an even greater performance gain, it was abandoned because it would have required a complete rework of the existing interconnect organization. Analysis of the data flows showed that sharing the cache blocks associated with each of the existing BSNs already gives a very noticeable performance improvement; compared with individual caches, the cache hit rate rises significantly and the number of main memory accesses drops accordingly.

    5.3.4. L3 cache

    Another feature of the mainframe-based SMP system is the inclusion of a third cache level, L3, in its structure. The L3 cache sits in each BSN block and thus acts as a buffer between the L2 caches and one of the main memory blocks. Its use reduces the latency of data that is absent from the L1 and L2 caches: if any processor has needed the data before, it is present in the L3 cache and can be handed to the requesting processor faster than a main memory block could supply it, which yields a performance gain.

    Table 5.1 presents data from a performance study of a typical SMP system based on the IBM S/390. The "access delay" figure is the time needed to fetch data when it resides in a given element of the memory subsystem. When a processor requests new information, in 89% of cases it is found in its own L1 cache. In the remaining 11% of cases, the caches of the next levels, or main memory, must be consulted: in 5% of all cases the information is found in the L2 cache, and so on. Only 3% of requests end up reaching a main memory block; without the L3 cache this figure would be twice as high.

    Table 5.1

    Efficiency indicators of memory subsystem elements in an SMP system based on the IBM S/390

    Memory subsystem element   Access delay (processor cycles)   Hit percentage
    L1 cache                   …                                 89
    L2 cache                   …                                 5
    L3 cache                   …                                 3
    Main memory                …                                 3
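    Using the hit percentages quoted above (89% for L1, 5% for L2, 3% for main memory, and, by subtraction, roughly 3% for L3 - the L3 share is an inference from the text, not a figure it states), the effective memory access time is the usual weighted sum:

    \[ T_{avg} = \sum_i h_i t_i = 0.89\,t_{L1} + 0.05\,t_{L2} + 0.03\,t_{L3} + 0.03\,t_{mem} \]

    where the t_i are the access delays of Table 5.1. Because t_mem is much larger than t_L3, moving 3% of requests from main memory into the L3 cache (the "twice as high" remark above) removes a disproportionate share of the total latency.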

    5.4. Cache information integrity and the MESI protocol

    5.4.1. Ways to solve the information integrity problem

    Modern computing systems normally attach one or two levels of cache to each processor. This organization gives high performance, but raises the problem of the information integrity of the data in the caches of different processors. Its essence is that the caches of different processors may hold copies of the same data from main memory; if one processor updates an element of its copy, the copies held by the other processors, and the contents of main memory itself, become stale. Two schemes can be used to propagate changes to main memory:

    Write-back. The processor makes changes only in the contents of its own cache. A line's contents are written to main memory only when the modified cache line must be freed to accept a new block of data.

    Write-through. Every cache write is immediately duplicated in main memory, without waiting for the affected cache line to be replaced. As a result, main memory is guaranteed to hold the most recent, and therefore valid, copy of the data at all times.

    Obviously, the write-back technique can violate the information integrity of the caches, since until the updated data is written back to main memory, the caches of other processors hold stale data. But even write-through does not guarantee integrity by itself, because the changes must be propagated not only to main memory but also to every cache block holding a copy of the original data.
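    A minimal C sketch contrasting the two policies on a toy, single-line cache (all structures here are hypothetical simplifications, not any real controller's design):

    #include <string.h>

    #define LINE_SIZE 64

    struct cache_line {
        unsigned long tag;
        int valid;
        int dirty;                       /* used only by write-back */
        unsigned char data[LINE_SIZE];
    };

    unsigned char main_memory[1 << 20];  /* toy main memory */

    /* Write-through: every store is duplicated in main memory at once,
     * so memory always holds the latest copy (other caches must still
     * be told to invalidate or update theirs). */
    void write_through(struct cache_line *l, unsigned long addr,
                       unsigned char value)
    {
        l->data[addr % LINE_SIZE] = value;
        main_memory[addr] = value;
    }

    /* Write-back: the store touches only the cache; memory is updated
     * later, when the dirty line is evicted. */
    void write_back(struct cache_line *l, unsigned long addr,
                    unsigned char value)
    {
        l->data[addr % LINE_SIZE] = value;
        l->dirty = 1;                    /* memory is now stale */
    }

    void evict(struct cache_line *l, unsigned long base_addr)
    {
        if (l->dirty) {                  /* flush the modified line */
            memcpy(&main_memory[base_addr], l->data, LINE_SIZE);
            l->dirty = 0;
        }
        l->valid = 0;
    }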
