Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!csd4.csd.uwm.edu!bionet!ames!sgi!...@maddog.sgi.com
From: j...@maddog.sgi.com (Jim Barton)
Newsgroups: comp.sys.sgi
Subject: Overview of Parallel Synchronization
Keywords: multiprocessor parallel programming
Message-ID: <41394@sgi.sgi.com>
Date: 7 Sep 89 16:17:00 GMT
Sender: j...@maddog.sgi.com
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 996

Here is a formatted version of a section of the original IRIX software
specification, written by yours truly.  I can supply the raw MM source
to anyone who wants it.

-- Jim Barton
   Silicon Graphics Computer Systems
   "UNIX: Live Free Or Die!"
   j...@sgi.sgi.com, sgi!...@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb

--------------- cut here --------------------

             Section_2:_Software_Synchronization
     ($Revision: 1.12 $ $Date: 86/09/23 17:27:33 $)

1.  UNIX_On_Multiprocessors

Much of the discussion in this section follows that in [1].

This section assumes that the Clover 2 multiprocessor structure will
consist of two or more CPU's that share common memory and peripherals.
Such a structure can potentially provide much greater throughput,
since processes can run concurrently on different processors.  Each
CPU executes independently, but all of them execute a single copy of
the kernel.  In general, processes are unaware that they are running
on a multiprocessor and can migrate between processors transparently.
An individual process, however, does not consume less CPU time.

Allowing several processors to execute simultaneously in the kernel
causes integrity problems unless protection mechanisms are used.  The
original design of UNIX assumed that only a single processor would
ever execute the kernel, so kernel data structures could be protected
simply by never preempting a process running in the kernel.  A process
in the kernel gives up control of the processor voluntarily, through
the normal UNIX sleep/wakeup mechanism.

There are three ways in which corruption of kernel data structures can
be prevented on a multiprocessor:

   1.  Execute the kernel on one processor only, which protects all
       kernel data structures but serializes all system functions.

   2.  Serialize small sections of code (critical regions) with
       locking primitives.  This approach adds locking overhead and
       may increase context switching.

   3.  Redesign algorithms to avoid contention for data structures.
       In many cases this may be impossible to do while still
       maintaining reasonable performance.

AT&T currently sells a multiprocessor version of UNIX for the 3B20A
attached processor system.  In this UNIX kernel, locking primitives
are used to create short critical sections where contention for kernel
data structures may occur.  This approach was chosen because, if
properly designed and implemented, the problems of such locking can be
managed easily without sacrificing system throughput or causing major
redesigns of kernel algorithms.  By partially porting and otherwise
using this code as an example, it will be possible to implement a
multiprocessor kernel in a short period of time that meets the
response and throughput requirements of Clover 2.

The remainder of this section describes the locking primitives, how
they are used in general, and how they will be used in the Clover 2
kernel.

2.  Semaphores

2.1  Spinlocks

2.1.1  General_Description

The class of locking primitives used in the UNIX kernel goes under the
general name of semaphores.  The simplest semaphore is a spinlock.
To access some critical resource, a processor attempts to acquire the
lock using the simple algorithm:

        loop:   if lock is set
                        goto loop;
                set lock;

The lock is released simply by clearing it, which allows at most one
other processor to obtain it.  If the testing and setting of the lock
are not atomic, then it is possible that two processors may believe
they have obtained the lock at the same time, or that neither has
obtained it.  The first case can result in corrupted data structures,
while the second results in deadlock.  Spinlocks can be implemented
atomically in software (see [1]); however, this solution is slow and
may waste a substantial number of cycles.  Instead, most modern
computers provide atomic test-and-set instructions, which allow a
processor to obtain the lock in a single operation.

2.1.2  Spinlock_Usage

Spinlocks are used whenever the time for which the lock is actually
held by any one processor is very short.  In these cases it is more
efficient simply to wait for the lock rather than attempting to find
some other useful work to do; the time spent searching for other work
may exceed the time spent waiting for the lock.

In a multiprocessor system, it is difficult to know in advance when a
spinlock will be more efficient than a mutual exclusion semaphore.
The use of spinlocks must be determined empirically during system
tuning, when the effects can be measured.  Any very short operation is
a candidate for spin locking.  Example uses of spinlocks are
controlling free list access or performing linear scans of system
tables.

2.1.3  Spinlocks_and_Interrupts

One important aspect of locking that becomes apparent in any
interrupt-driven system is the possibility that an asynchronous
routine started by the hardware may take control of the CPU while it
is attempting to obtain the lock.  If the interrupt occurs after the
lock has been obtained, then the time for which the lock is held can
increase dramatically, seriously impacting system performance.

If a spinlock is used to protect a resource to which an interrupt
routine needs access, then similar problems arise.  The interrupt
routine remains active while spinning, blocking other important
interrupt processing and degrading system performance.  Finally, if an
interrupt routine attempts to obtain the same lock that normal
processing has just obtained, the result is guaranteed deadlock of the
system.

To avoid all these problems, interrupts are disabled while attempting
to obtain and while holding a spinlock.  This reinforces the necessity
for spinlocks to be held for as short a time as possible.

2.1.4  Hardware_Support

The Clover 2 hardware will supply a large set of "virtual" locks,
which will provide atomic test-and-set capabilities between the
processors in the system.  This provides a powerful mechanism for
implementing spinlocks, since traffic on the multiprocessor bus can be
avoided.  The importance of having a very large set of these locks
will become apparent in the following sections.

2.2  Mutual_Exclusion_Semaphores

Once spinlocks are available, it becomes possible to implement more
sophisticated types of semaphores.  One of these is the mutual
exclusion semaphore, which is used to protect critical sections of
code in a more intelligent manner than a spinlock.  Such a semaphore
is a data structure consisting of a counter and a queue of processes.
The counter is initialized to the value one.
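As an illustration, such a semaphore might be declared in C roughly as
follows.  This is only a sketch for this overview; the field names and
the lock_t definition are assumptions, not the actual 3B20A or Clover
2 declarations.

        /* Sketch only -- not the actual kernel layout.               */
        typedef volatile unsigned int lock_t;  /* assumed spinlock word */
        struct proc;                           /* a process table entry */

        typedef struct sema {
                lock_t       s_lock;   /* spinlock guarding count/queue */
                int          s_count;  /* initialized to 1              */
                struct proc *s_queue;  /* FIFO of blocked processes     */
        } sema_t;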
There are two operations possible on such a semaphore: allocation,
often called the acquire operation, and deallocation with wakeup,
often called the release operation.  A process attempting to acquire
the semaphore does so using the following (abbreviated) algorithm:

        if (counter <= 0)
        then
                add this process to end of queue of processes;
                block this process (allow another to run);
                on return, the counter has already been decremented
                        on this process's behalf by the releaser;
                continue with critical section;
        else
                decrement counter;
                continue with critical section;
        fi

The release operation is performed in the following manner:

        increment counter;
        if (a process is waiting on the queue) {
                decrement counter;
                unblock the first process on the queue;
        }

Obviously the above operations must be atomic; a spinlock is used to
protect the semaphore, and it is locked before and unlocked after each
of the above operations.

2.2.1  Usage

Mutual exclusion semaphores are used where the critical section to be
protected is long or variable in length.  This allows other processes
to run and do useful work even though certain work must wait.
Examples of such usage are holding region entries while managing
segments of virtual memory, or holding an inode while accessing disk
files or devices.

2.2.2  Counting_Semaphores

The mutual exclusion semaphore is a special case of a general counting
semaphore.  A counting semaphore is initialized to a value greater
than one, allowing more than one process to acquire the semaphore at
once.  Counting semaphores are often used in resource allocation,
where there is a fixed number of resources available.  Another use is
to provide atomic counters; the semaphore is released each time the
counter should be incremented, but never acquired.  The AT&T
multiprocessor code uses many atomic counters but no counting
semaphores; the Clover 2 product may use counting semaphores for
controlling certain resources.

2.3  Synchronization_Semaphores

This third type of semaphore is used to synchronize different
processes (possibly on different processors).  A synchronization
semaphore is structurally a mutual exclusion semaphore; however, the
acquire and release operations are performed by different processes,
effecting a rendezvous between them.  Example usage is in awakening
the paging daemon from the clock handler when memory is in short
supply and a sufficient time interval has elapsed.

3.  Multiprocessor_Semaphore_Primitives

The AT&T 3B20A kernel introduced several new primitives to provide
semaphore facilities.  One of the stated goals of the implementation
was to provide a kernel that could run on either uniprocessors or
multiprocessors with no modification to the kernel code, and to
achieve this with only a minor decrease in uniprocessor performance.
This goal was met.  Each set of semaphore primitives is described
below.

   Spinlock Primitives

        lock_t lock;

        spsema(&lock);  - acquire the lock, spinning until acquired
        svsema(&lock);  - release the lock

On a uniprocessor the spinlock primitives are removed by the C
preprocessor, since spinlocks are meaningless on a uniprocessor:

        # define spsema
        # define svsema

The actual implementation of the spinlock is done using a 3B20
microcoded instruction for speed.  On the MIPS processor there is no
test-and-set instruction.  Thus, hardware support must be provided to
ensure that a high-performance path exists for this operation.
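For illustration, the spinlock pair could be written in C as below on
a machine that does provide an atomic test-and-set.  The tas() routine
is an assumed primitive (atomically set the word and return its
previous value); it stands in for the 3B20 microcoded instruction, and
in the kernel these calls would be bracketed by the interrupt masking
described in section 2.1.3.

        typedef volatile unsigned int lock_t;  /* as in the sketch above */

        /* tas() is an assumed primitive: atomically set *lp to 1 and  */
        /* return its previous value.  On Clover 2 it would map onto   */
        /* one of the hardware locks described below.                  */
        extern unsigned int tas(lock_t *lp);

        void
        spsema(lock_t *lp)
        {
                while (tas(lp) != 0)
                        ;       /* spin; interrupts masked by caller */
        }

        void
        svsema(lock_t *lp)
        {
                *lp = 0;        /* clearing the word releases the lock */
        }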
The Clover 2 system will provide a separate, high-speed bus for
synchronization among the processors in the complex.  This bus will
provide functionality similar to the SLIC VLSI chip done for the
Sequent Balance multiprocessor [2].  The Sequent implementation
provides 64 test-and-set variables which are global to all processors.
In addition, a number of 32-bit registers were available that could be
transferred between processors in an atomic fashion, providing a
general message facility.  This facility was used for distributing
interrupts from I/O devices and for communication between the
processors.

The Clover 2 implementation will be based on cheaper gate-array
technology, and will be called the SYNC bus for brevity.  Access to
the SYNC bus will be through normal load and store operations in the
physical address space of each processor.  The SYNC bus will provide
4096 test-and-set locks.  Each lock will be atomic in nature; thus a
processor can use these locks for access control, busy waiting, or
other uses.  The locks will be grouped into 128 pages, each having 32
locks.  This allows a group of locks to be mapped into the virtual
address space of a processor, allowing user access to certain locks if
so desired.  Other facilities will be provided by the SYNC bus; they
are not of interest here.(1)

        __________
        (1) The SYNC bus will provide 4 8-bit registers that can be
            incremented, decremented or set in an atomic fashion,
            allowing certain algorithms to be sped up.  In addition,
            the SYNC bus will be used for messaging interrupts both
            from the I/O system and between processors, eliminating
            this load on the system memory bus.

Unlike the Sequent system, such a large number of locks allows each
semaphore in the kernel to have a private lock, improving the overall
performance of the system.  Since no memory cycles are used once the
busy-wait instructions have been fetched, spinning adds no traffic to
the system memory bus.

With high-speed spinlocks available, the semaphore operations can be
defined.

   Synchronization Primitives

        sema_t sema;

        psema(&sema, PRI);      - acquire the semaphore
        vsema(&sema);           - release the semaphore
        cpsema(&sema, PRI);     - conditionally acquire the semaphore

Algorithmically, psema() is implemented as:

        BEGIN
        acquire semaphore lock;
        if (count <= 0) {
                acquire semaphore queue lock;
                release semaphore lock;
                add process to semaphore queue;
                release semaphore queue lock;
                switch to another process;
                on return, the semaphore is held (the releaser has
                        decremented the count on this process's
                        behalf); return to caller;
        } else
                decrement count;
        release semaphore lock;
        return to caller;
        END

The routine vsema() is implemented as:

        BEGIN
        acquire semaphore lock;
        increment count;
        if (count > 0) {
                acquire semaphore queue lock;
                if (a process is queued) {
                        remove the first process from the queue;
                        unblock that process;
                        decrement count;
                }
                release semaphore lock;
                release semaphore queue lock;
                return;
        }
        release semaphore lock;
        END

The routine cpsema() is used by a process which does not wish to block
if the semaphore cannot be acquired.  This is used in two instances.
First, if the caller is an interrupt routine wishing access to a data
structure, then it may not block.  Thus, it uses cpsema() to acquire
the semaphore, and takes some other action if the critical section is
in use.  Second, some algorithms require that multiple semaphores be
held when updating related data structures.  In this case, it is
necessary to avoid blocking so that the system does not deadlock.
This type of algorithm usually appears as:

        loop:
                psema(&sem1);
                . . .
                if (!cpsema(&sem2)) {   /* could not get sem2 */
                        vsema(&sem1);   /* back off and retry */
                        goto loop;
                }
                . . .
                vsema(&sem2);
                vsema(&sem1);

Cpsema() is also used in cases where semaphores must be acquired in a
different order than they will be released.  This is also a case where
only a conditional operation will avoid deadlock.  Cpsema() is
implemented as:

        BEGIN
        acquire semaphore lock;
        if (count <= 0) {
                release semaphore lock;
                return failure;
        } else
                decrement count;
        release semaphore lock;
        return success;
        END

   Mutual Exclusion Primitives

        sema_t sema;

        appsema(&sema, PRI);    - acquire the semaphore
        apvsema(&sema);         - release the semaphore
        apcpsema(&sema);        - conditionally acquire the semaphore

These primitives are implemented differently depending on whether the
target system is a uniprocessor or a multiprocessor.  The following C
preprocessor definitions control this:

        # ifdef multiprocessor
        # define appsema        psema
        # define apvsema        vsema
        # define apcpsema       cpsema
        # else
        # define appsema
        # define apvsema
        # define apcpsema
        # endif

Thus, on a uniprocessor these operations are not performed, since they
would only add overhead and serve no useful purpose.  On a
multiprocessor they are equivalent to the previously defined psema()
and vsema() operations.

   Miscellaneous Primitives

        sema_t sema;

        initsema(&sema, value); - initialize the given semaphore
        valusema(&sema);        - return the value of the semaphore

4.  Uniprocessor_UNIX_Semaphore_Usage

Although not explicitly shown in the code, normal UNIX uses several
implicit semaphores to manage kernel data space in a crude manner.

First, a single implicit semaphore is used to protect all kernel data
structures from access by other processes: the processor mode (kernel
or user).  When in kernel mode, the current process may not be
preempted, which protects kernel data structures.  This is obviously a
crude method of protection, but it has the advantage of elegance and
simplicity.  This semaphore is a mutual exclusion semaphore.

Second, the interrupt level of the processor is used to protect data
structures from access by interrupt routines.  There are as many of
these semaphores as there are interrupt levels in the processor.  The
kernel routines used to acquire and release these semaphores are the
spl routines.  These semaphores are spinlocks, since the interrupt is
blocked by the processor only for short periods of time, and other
processing (except higher priority interrupts) cannot continue while
the lock is held.

Third, process synchronization is performed using the sleep and wakeup
primitives, which provide an inefficient yet elegant method of
synchronization.  The usual sequence for acquiring the semaphore is:

        while (flag & BUSY) {
                flag |= WANT;
                sleep(&flag);
        }
        flag |= BUSY;

The sequence for releasing the semaphore is:

        flag &= ~BUSY;
        if (flag & WANT) {
                flag &= ~WANT;
                wakeup(&flag);
        }

Thus, if more than one process is waiting on the semaphore, they will
all awake at once and must contend for the semaphore once again.

5.  Multiprocessor_UNIX_Semaphore_Usage

A multiprocessor UNIX kernel must abandon the mode semaphore, since
multiple processors may be executing in the kernel at once.  The sleep
and wakeup primitives are replaced with synchronization semaphores.
This provides a more efficient implementation, since only the first
process waiting on the semaphore will awake, and that process will
hold the semaphore once it does.
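For comparison, the BUSY/WANT sleep-lock shown above maps onto the
semaphore primitives roughly as follows.  This is only a sketch;
bp_sema and PRIBUF are hypothetical names for illustration, not taken
from the AT&T or Clover 2 sources.

        /* Sketch: replacing a BUSY/WANT sleep-lock with a mutual     */
        /* exclusion semaphore.  bp_sema and PRIBUF are hypothetical. */

        sema_t bp_sema;

        initsema(&bp_sema, 1);          /* one holder at a time       */

        /* in place of the BUSY/WANT acquisition loop: */
        psema(&bp_sema, PRIBUF);        /* only this process blocks   */

        /* ... critical section ... */

        /* in place of clearing BUSY and calling wakeup(): */
        vsema(&bp_sema);                /* wakes at most one waiter   */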
Even though the multiprocessor UNIX kernel abandons the mode
semaphore, it still disallows preemption of kernel processes on the
same processor.  This is for performance reasons.  For example,
consider the case where a kernel process is preempted, but the next
process chosen to run needs access to the same resource.  Eventually
control will pass back to the preempted process, but only after a
substantial amount of time and several unnecessary context switches.

At points where the uniprocessor kernel used interrupt blocking to
protect against interrupt routines, the multiprocessor kernel must
also provide spinlocks to protect against other processors.  Thus, a
common sequence in the multiprocessor kernel is:

        x = spl6();
        spsema(&lock);          /* spinlock primitives from section 3 */
        . . .
        svsema(&lock);
        splx(x);

The most difficult change is the use of mutual exclusion semaphores to
protect kernel data structures that previously were protected by the
mode semaphore.  Thus, areas such as the buffer cache, process
management and virtual memory management must be modified to use
semaphores to ensure exclusive access to structures.

5.1  Semaphores_and_Signals

One aspect of the normal sleep/wakeup mechanism is that signal
handling may be performed within these primitives if desired by the
caller.  This allows long waits to be interrupted by the user or
application and proper cleanup to be performed.  Thus, both mutual
exclusion and synchronization semaphores provide the means by which a
signal can be ignored or handled, in the same manner as with
sleep/wakeup.  Example calls on a mutual exclusion semaphore are:

        psema(&sema, PPRI);
        psema(&sema, PPRI | PCATCH);

In the first case, if the priority PPRI is greater than PZERO, then a
signal sent to the process will cause the process to wake up, clean up
its entry on the semaphore queue, and use longjmp() to return to the
system call handler for return to the user.  If the priority is less
than PZERO, then such signals are always ignored, and the process can
only be awakened by:

        vsema(&sema);

In the second case, the PCATCH flag causes control to be returned to
the caller when a signal occurs (if the priority is greater than
PZERO, as before).  Thus, the caller can take any necessary cleanup
actions before returning to the application.

6.  Semaphore_Performance

In a semaphored system, a number of different factors affect the
performance of the system.  Obviously, the more semaphore operations
performed, the fewer cycles are devoted to real work.  The speed with
which a semaphore can be accessed and updated is also critical.  AT&T
dealt with this problem by microcoding special instructions to handle
the most common semaphore operations.

There is one factor, however, that has an overriding effect on
performance: the hit ratio on a semaphore.  This is defined as the
ratio of the number of times a semaphore could be acquired immediately
to the number of attempts to acquire it.  By examining the above
algorithms, the reader can see that the optimal path is to acquire the
semaphore immediately; blocking a process is an expensive and
time-consuming operation.  During construction of the multiprocessor
kernel, AT&T discovered that a hit ratio of at least 95% on all
semaphores was necessary to achieve adequate performance of the
system.
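As an illustration of how the hit ratio might be watched during
tuning, the following sketch counts acquisition attempts and immediate
successes.  The s_tries and s_hits counters and the hit_ratio()
routine are hypothetical instrumentation, not part of the AT&T
semaphore code; a ratio of 95% or better is the goal.

        /* Hypothetical per-semaphore instrumentation (sketch only).  */
        struct sema_stats {
                unsigned long s_tries;  /* psema() calls               */
                unsigned long s_hits;   /* acquired without blocking   */
        };

        /* An instrumented psema() would bump s_tries on every call,  */
        /* and bump s_hits only when it takes the non-blocking path.  */

        int
        hit_ratio(struct sema_stats *sp)
        {
                if (sp->s_tries == 0)
                        return 100;     /* no attempts yet             */
                return (int)((sp->s_hits * 100) / sp->s_tries);
        }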
To achieve this goal, it is often necessary to break up data structure
accesses into smaller chunks, and to isolate accesses as much as
possible.  For instance, the buffer cache is broken up into a large
set of hash buckets that can be searched individually, reducing the
contention on any one semaphore.  Fortunately, AT&T has already done
most of this work, providing the Clover 2 effort with
high-performance, debugged algorithms for handling most of these
cases.

7.  Driver_Semaphore_Use

Clover 2 will obtain most of its UNIX drivers from the Clover I
system.  Converting these drivers to semaphore use would be a long,
costly and painful process, perhaps not worth the effort.  To avoid
the same situation for the 3B20A port, AT&T added the concept of
driver semaphores to the kernel.  Since a driver can only be entered
in certain well-defined ways, it was possible to place a semaphore
sequence around each entry to the driver.  In general, these entry
points are the open, close, read, write and ioctl entries.  This is
much like the mode semaphore in normal UNIX, in that it prevents any
other process from entering the driver simultaneously.  Three separate
types of protection are provided.

First, the entire driver can be semaphored on its major device number.
Before the kernel invokes the driver it calls psema() to lock the
driver semaphore, and after return from the driver it calls vsema() to
unlock the semaphore.

Second, each minor device number access to the driver can be
semaphored.  Instead of locking a single semaphore, each possible
minor device number has a semaphore associated with it.  The kernel
locks the semaphore associated with the minor device before entering
the driver and unlocks it after return.  For example, this is used to
protect the tty driver, which maintains independent structures for
each minor device supported.

Third, a driver can be given no protection, in which case it is
responsible for protecting its own data structures.  This means that
structures within the driver that are shared between processes or with
interrupt routines must be protected with semaphores.

To protect against interrupt access while a driver semaphore is
locked, the kernel checks the driver semaphore before launching the
interrupt routine.  If the semaphore is locked, the kernel queues the
interrupt and returns.  The code which unlocks the driver semaphore
checks for a queued interrupt, and if one exists it launches the
interrupt routine at that time.  Note that this behavior is only
possible with major device locking.  A driver that uses minor device
locking must be modified to block the interrupt routine explicitly (if
it does not already do so).

If a driver chooses to access kernel data structures, it will not be
possible to port it directly to the multiprocessor environment,
because it will not follow the kernel's locking strategy for the
structures accessed.  Such drivers must be modified to use the proper
locking strategy before they can be used in the multiprocessor kernel.

The Clover 2 kernel will implement the 3B20A driver concept,
maximizing the use of currently existing drivers.  In most cases it
should be possible to use drivers written for Clover I directly.  It
is expected that some drivers will require a porting effort.

8.  Further_Reading

Much of this discussion is based on the actual code for the AT&T 3B20A
implementation, as well as on [3] and [1].
                            REFERENCES

   1.  Bach, Maurice J., The Design of the UNIX Operating System,
       Prentice-Hall, Englewood Cliffs, N.J., 1986.

   2.  Beck, B., and Kasten, B., "VLSI Assist in Building a
       Multiprocessor UNIX System," Sequent Computer Systems,
       Portland, Oregon, not dated.

   3.  Bach, M. J., and Buroff, S. J., "Multiprocessor UNIX Operating
       Systems," AT&T Bell Laboratories Technical Journal, Vol. 63,
       No. 8, October 1984.