Overview of Parallel Synchronization

Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!
csd4.csd.uwm.edu!bionet!ames!sgi!...@maddog.sgi.com
From: j...@maddog.sgi.com (Jim Barton)
Newsgroups: comp.sys.sgi
Subject: Overview of Parallel Synchronization
Keywords: multiprocessor parallel programming
Message-ID: <41394@sgi.sgi.com>
Date: 7 Sep 89 16:17:00 GMT
Sender: j...@maddog.sgi.com
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 996

Here is a formatted version of a section of the original IRIX software
specification, written by yours truly.  I can supply the raw MM source to
anyone who wants it.

-- Jim Barton
Silicon Graphics Computer Systems    "UNIX: Live Free Or Die!"
j...@sgi.sgi.com, sgi!...@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb


--------------- cut here --------------------



       Section 2					      -	1 -



       Section_2:_Software_Synchronization
	      ($Revision: 1.12 $ $Date:	86/09/23 17:27:33 $)


       1.  UNIX_On_Multiprocessors

       Much of the discussion in this section follows that in [1].
       This section assumes that the Clover 2 multiprocessor
       structure will consist of two or	more CPU's that	share
       common memory and peripherals.  Such a structure	can
       potentially provide much	greater	throughput since processes
       can run concurrently on different processors.  Each CPU
       executes	independently, but all of them execute a single
       copy of the kernel.  In general,	processes are unaware that
       they are	running	on a multiprocessor, and can migrate
       between processors transparently.  Unfortunately, an
       individual process does not consume any less CPU time;
       only overall system throughput improves.

       Allowing	several	processors to execute simultaneously in	the
       kernel causes integrity problems	unless protection
       mechanisms are used.  The original design of UNIX assumed
       that only a single processor would ever execute the kernel.
       Thus, kernel data structures could be protected by not
       preempting a process in the kernel.  A kernel process gives
       up control of the processor voluntarily,	through	the normal
       UNIX sleep/wakeup mechanism.

       There are three ways in which corruption	of kernel data
       structures can be prevented on a	multiprocessor:

	 1.  Execute the kernel	on one processor only, which
	     protects all kernel data structures but serializes	all
	     system functions.

	 2.  Serialize small sections of code (critical	regions)
	     with locking primitives.  This approach adds locking
	     overhead, and may increase	context	switching.

	 3.  Redesign algorithms to avoid contention for data
	     structures.  In many cases, this may be impossible	to
	     do	and still maintain reasonable performance.

       AT&T currently sells a multiprocessor version of	UNIX for
       the 3B20A attached processor system.  In	this UNIX kernel,
       locking primitives are used to create short critical
       sections	where contention for kernel data structures may
       occur.  This approach was chosen	because, if properly
       designed	and implemented, the problems of such locking can
       be managed easily without sacrificing system throughput or
       causing major redesigns of kernel algorithms.  By partially
       porting and otherwise using this	code as	an example, it will



       September 7, 1989		       Company Confidential







       be possible to implement	a multiprocessor kernel	in a short
       period of time that meets the response and throughput
       requirements of Clover 2.

       The remainder of	this section is	devoted	to describing the
       locking primitives, how they are	used in	general	and how
       they will be used in the	Clover 2 kernel.


       2.  Semaphores

       2.1  Spinlocks

       2.1.1  General_Description  The class of	locking	primitives
       used in the UNIX	kernel goes under the general name of
       semaphores.  The	simplest semaphore is a	spinlock.  To
       access some critical resource, a	processor attempts to
       acquire the lock	using the simple algorithm:

	    loop:
		    if lock is set goto	loop;
		    set	lock;

       The lock	is obviously released simply by	clearing it, which
       will allow at most one other processor to obtain	it.

       If the testing and setting of the lock are not atomic, then
       it is possible that two processors may believe they have
       obtained	the lock at the	same time, or that neither has
       obtained	the lock.  The first case can result in	corrupted
       data structures,	while the second results in deadlock.

       Spinlocks can be implemented atomically in software (see
       [1]); however, this solution is slow and may waste a
       substantial number of cycles.  Instead, most modern
       computers provide atomic test-and-set instructions, which
       allow a processor to obtain the lock in a single operation.
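
       As a minimal sketch of the acquire and release steps above,
       the following uses the C11 atomic_flag type, whose
       test-and-set operation is atomic.  The names spin_t,
       SPIN_INIT, spin_acquire(), spin_try() and spin_release() are
       illustrative assumptions; they are not the kernel primitives
       described later in this section.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type; the kernel's lock_t is not defined in the text. */
typedef struct {
    atomic_flag held;           /* set while some processor owns the lock */
} spin_t;

#define SPIN_INIT { ATOMIC_FLAG_INIT }

/* Spin until the flag was previously clear; test and set are one atomic step. */
static void spin_acquire(spin_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;                       /* busy-wait: another processor holds the lock */
}

/* One attempt only: returns true if the lock was obtained. */
static bool spin_try(spin_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Release by clearing the flag, which lets one waiter obtain the lock. */
static void spin_release(spin_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

       A caller would declare "spin_t l = SPIN_INIT;" and bracket
       the critical region with spin_acquire(&l) and
       spin_release(&l).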

       2.1.2  Spinlock_Usage  Spinlocks	are used whenever the time
       in which	the lock is actually held by any one processor is
       very short.  In these cases it is more efficient	simply to
       wait for	the lock rather	than attempting	to find	some other
       useful work to do; the time spent searching for other work
       may exceed the time spent waiting for the lock.

       In a multiprocessor system, it is difficult to know when	a
       spinlock	may be more efficient than a mutual exclusion
       semaphore.  The use of spinlocks	must be	empirically
       determined during system	tuning,	when the effects can be
       measured.  Any very short operation is a	candidate for spin
       locking.



       Example uses of spinlocks are for controlling free list
       access or performing linear scans of system tables.

       2.1.3  Spinlocks_and_Interrupts	One important aspect of
       locking that becomes apparent in	any interrupt-driven system
       is the possibility that an asynchronous routine started by
       the hardware may	take control of	the CPU	while it is
       attempting to obtain the	lock.  If the interrupt	occurs
       after the lock has been obtained, then the time for which
       the lock	is held	can increase dramatically, seriously
       impacting system	performance.

       If a spinlock is	used to	protect	a resource to which an
       interrupt routine needs access, then similar problems arise.
       The interrupt routine remains active while spinning,
       blocking	other important	interrupt processing and degrading
       system performance.

       Finally,	the consequence	if an interrupt	routine	attempts to
       obtain the same lock that normal	processing has just
       obtained	is guaranteed deadlock of the system.

       To avoid	all these problems, interrupts are disabled while
       attempting to obtain and while holding a spinlock.  This
       reinforces the necessity for spinlocks to be held for as
       short a time as possible.

       2.1.4  Hardware_Support	The Clover 2 hardware will supply a
       large set of "virtual" locks, which will	provide	atomic
       test-and-set capabilities between the processors	in the
       system.	This provides a	powerful mechanism for implementing
       spinlocks, since	traffic	on the multiprocessor bus can be
       avoided.	 The importance	of having a very large set of these
       locks will become apparent in the following sections.

       2.2  Mutual_Exclusion_Semaphores

       Once spinlocks are available, it	becomes	possible to
       implement more sophisticated types of semaphores.  One of
       these is	the mutual exclusion semaphore,	which is used to
       protect critical	sections of code in a more intelligent
       manner than a spinlock.	Such a semaphore is a data
       structure which consists	of a counter and a queue of
       processes.  The counter is initialized to the value one.

       There are two operations	possible on such a semaphore,
       allocation, often called	the acquire operation, and de-
       allocation and wakeup, often called the release operation.
       A process attempting to acquire the semaphore does so using
       the following (abbreviated) algorithm:




            if (counter <= 0) then
                    add this process to end of queue of processes;
                    block this process (allow another to run);
                    on return, the semaphore is already held (release
                        decremented the counter on this process's behalf);
                    continue with critical section;
            else
                    decrement counter;
                    continue with critical section;
            fi

       The release operation is	performed in the following manner:

	    increment counter;
	    if (a process is waiting on	the queue) {
		    decrement counter;
		    unblock the	first process on the queue;
	    }

       Obviously the above operations must be atomic in	nature;	a
       spinlock	is used	to protect the semaphore, and is locked
       before and unlocked after each of the above operations.
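
       The acquire and release algorithms can be modelled in a few
       lines of C.  This is only a single-threaded sketch: processes
       are represented by integer ids, "blocking" is represented by
       a return value, and the protecting spinlock is omitted.  In
       this sketch the releasing side hands the semaphore directly
       to the first waiter, so a woken process does not touch the
       counter again.  The names msem_t, msem_acquire() and
       msem_release(), and the queue size MAXWAIT, are assumptions
       made for illustration.

```c
#define MAXWAIT 8               /* hypothetical queue capacity */

typedef struct {
    int count;                  /* 1 = free, 0 = held */
    int queue[MAXWAIT];         /* FIFO of blocked process ids */
    int head, len;
} msem_t;

static void msem_init(msem_t *s)
{
    s->count = 1;
    s->head = 0;
    s->len = 0;
}

/* Returns 1 if pid now holds the semaphore, 0 if pid was queued (a real
 * kernel would block it here), or -1 if the model's queue is full. */
static int msem_acquire(msem_t *s, int pid)
{
    if (s->count <= 0) {
        if (s->len == MAXWAIT)
            return -1;
        s->queue[(s->head + s->len) % MAXWAIT] = pid;
        s->len++;
        return 0;
    }
    s->count--;
    return 1;
}

/* Returns the pid that was unblocked, or -1 if nobody was waiting. */
static int msem_release(msem_t *s)
{
    int pid = -1;

    s->count++;
    if (s->len > 0) {
        s->count--;             /* hand the semaphore to the first waiter */
        pid = s->queue[s->head];
        s->head = (s->head + 1) % MAXWAIT;
        s->len--;
    }
    return pid;
}
```

       Waiters are released strictly in arrival order, one per
       release, and the counter returns to one only when the queue
       is empty.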

       2.2.1  Usage  Mutual exclusion semaphores are used where	the
       length of the critical section to be protected is long or
       variable	in length.  This allows	other processes	to run and
       do useful work even though certain work must wait.

       Examples	of such	usage are when holding region entries for
       managing	segments of virtual memory, or when using an inode
       to access disk files or devices.

       2.2.2  Counting_Semaphores  The mutual exclusion	semaphore
       is a special case of a general counting semaphore.  A
       counting	semaphore is initialized to a value greater than
       one, allowing more than one process to acquire the semaphore
       at once.	 Counting semaphores are often used in resource
       allocation, where there is a fixed number of resources
       available.  Another use is to provide atomic counters; the
       semaphore is released each time the counter should be
       incremented, but	never acquired.
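
       Both uses can be sketched with a C11 atomic integer.  This
       is an illustration only: csem_t and its functions are
       hypothetical names, and the acquire shown here fails instead
       of blocking, so the queue of waiting processes is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int count;           /* number of units currently available */
} csem_t;

static void csem_init(csem_t *s, int n)
{
    atomic_init(&s->count, n);
}

/* Take one unit if any is left; never blocks. */
static bool csem_try_acquire(csem_t *s)
{
    int c = atomic_load(&s->count);

    while (c > 0) {
        /* On failure c is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return true;
    }
    return false;
}

/* Return one unit; when used alone this acts as an atomic counter. */
static void csem_release(csem_t *s)
{
    atomic_fetch_add(&s->count, 1);
}

static int csem_value(csem_t *s)
{
    return atomic_load(&s->count);
}
```

       Initialized to the number of resources, the semaphore admits
       that many holders at once.  Initialized to zero and only
       ever released, it simply counts events.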

       The AT&T	multiprocessor code uses many atomic counters but
       no counting semaphores; the Clover 2 product may	include	the
       use of counting semaphores for controlling certain
       resources.









       2.3  Synchronization_Semaphores

       This third type of semaphore is used to synchronize
       different processes (possibly on	different processors).	A
       synchronization semaphore is actually a mutual exclusion
       semaphore; however, the acquire and release operations are
       performed by different processes, effecting a rendezvous
       operation.

       Example usage is	in awakening the paging	daemon from the
       clock handler when memory is in short supply and	a
       sufficient time interval	has elapsed.


       3.  Multiprocessor_Semaphore_Primitives

       The AT&T	3B20A kernel introduced	several	new primitives to
       the kernel to provide semaphore facilities.  One	of the
       stated goals of the implementation was to provide a kernel
       that could run on either	uniprocessors or multiprocessors
       with no modification to the kernel code,	and to achieve this
       goal with only a	minor decrease in uniprocessor performance.
       This goal was met.

       Each set	of semaphore primitives	is described below.

       Spinlock	Primitives

	       lock_t  lock;

	       spsema(&lock);  - acquire the lock, spin	until acquired
	       svsema(&lock);  - release the lock

       On a uniprocessor the spinlock primitives are commented out
       using the C preprocessor	(since spinlocks are meaningless on
       a uniprocessor):

	    # define spsema
	    # define svsema

       The actual implementation of the spinlock is done using a
       3B20 microcoded instruction for speed.  On the MIPS
       processor, there is no test-and-set instruction.  Thus,
       hardware support must be provided to ensure that a
       high-performance path exists for performing this operation.
       The Clover 2 system will provide a separate, high-speed bus
       for synchronization among the processors in the complex.
       This bus will provide similar functionality to the SLIC VLSI
       chip done for the Sequent Balance multi-processor [2].  The
       Sequent implementation provides 64 test-and-set variables
       which are global to all processors.  In addition, a number
       of 32-bit registers were available that could be transferred
       between processors in an atomic fashion, providing a general
       message facility.  This facility was used for distributing
       interrupts from I/O devices and communication between the
       processors.

       The Clover 2 implementation will	be based on cheaper gate-
       array technology, and will be called the	SYNC bus for
       brevity.	 Access	to the SYNC bus	will be	through	normal load
       and store operations in the physical address space of each
       processor.  The SYNC bus	will provide 4096 test-and-set
       locks.  Each lock will be atomic	in nature, thus	a processor
       can use these locks for access control, busy waiting, or
       other uses.  The	locks will be grouped into 128 pages, each
       having 32 locks.	 This allows a group of	locks to be mapped
       into the	virtual	address	space of the processor,	allowing
       user access to certain locks if so desired.  Other
       facilities will be provided by the SYNC bus; they are not of
       interest here (see footnote 1).
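
       The grouping arithmetic is straightforward: 4096 locks at 32
       per page gives 128 pages.  The sketch below shows one way a
       lock number could be split into a page and a slot within the
       page.  The numbering scheme and the function names are
       assumptions; the text does not specify how locks are
       numbered or addressed.

```c
enum {
    SYNC_NLOCKS         = 4096, /* from the text */
    SYNC_LOCKS_PER_PAGE = 32,   /* from the text */
    SYNC_NPAGES         = SYNC_NLOCKS / SYNC_LOCKS_PER_PAGE
};

/* Page that holds a given lock number (assumed contiguous numbering). */
static unsigned sync_page(unsigned lock)
{
    return lock / SYNC_LOCKS_PER_PAGE;
}

/* Position of the lock inside its page. */
static unsigned sync_slot(unsigned lock)
{
    return lock % SYNC_LOCKS_PER_PAGE;
}

/* Inverse mapping: rebuild the lock number from page and slot. */
static unsigned sync_lock(unsigned page, unsigned slot)
{
    return page * SYNC_LOCKS_PER_PAGE + slot;
}
```

       Under this assumed numbering, mapping one page into a user
       address space exposes exactly 32 consecutive locks and no
       others.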

       Unlike the Sequent system, such a large number of locks
       allows each semaphore in the kernel to have a private lock,
       improving the overall performance of the system.  Since no
       memory cycles are used once the busy-wait instruction has
       been loaded from memory, spinning on a lock generates no
       further memory traffic.

       With high-speed spinlocks available, the	semaphore
       operations can be defined.

       Synchronization Primitives

	       sema_t  sema;

	       psema(&sema, PRI);      - acquire the semaphore
	       vsema(&sema);	       - release the semaphore
	       cpsema(&sema, PRI);     - conditionally acquire the semaphore

       Algorithmically,	psema()	is implemented as:



       __________

        1. The SYNC bus will provide four 8-bit registers that can
           be incremented, decremented or set in an atomic fashion.
           This allows certain algorithms to be sped up.  In
	   addition, the SYNC bus will be used for messaging
	   interrupts both from	the I/O	system and between
	   processors, eliminating this	load on	the system memory
	   bus.




            BEGIN
               acquire semaphore lock;
               if (count <= 0) {
                  acquire semaphore queue lock;
                  release semaphore lock;
                  add process to semaphore queue;
                  release semaphore queue lock;
                  switch to another process;
                  on return, return to caller (vsema() has already
                     decremented count on this process's behalf);
               }
               else
                  decrement count;
               release semaphore lock;
               return to caller;
            END

       The routine vsema() is implemented as:

	    BEGIN
	       acquire semaphore lock;
	       increment count;
	       if (count > 0) {
		  acquire semaphore queue lock;
		  if (a	process	is queued) {
		     remove process from queue;
		     unblock process;
		     decrement count;
		  }
		  release semaphore lock;
		  release semaphore queue lock;
		  return;
	       }
	       release semaphore lock;
	    END

       The routine cpsema() is used by a process which does not
       wish to block if	the semaphore cannot be	acquired.  This	is
       used in two instances.  First, if the caller is an interrupt
       routine wishing access to a data structure, then it may not
       block.  Thus, it	uses cpsema() to acquire the semaphore,	and
       takes some other	action if the critical section is in use.
       Second, some algorithms require that multiple semaphores	be
       used when updating related data structures.  In this case,
       it is necessary to avoid	blocking so that the system does
       not deadlock.  This type	of algorithm usually appears as:








	    loop:
		    psema(&sem1);
		       .
		       .
		       .
                    if (!cpsema(&sem2)) {   /* not acquired: back off */
			    vsema(&sem1);
			    goto loop;
		    }
		       .
		       .
		       .
		    vsema(&sem2);
		    vsema(&sem1);

       Cpsema()	is also	used in	cases where semaphores must be
       acquired	in a different order than they will be released.
       This is also a case where only a	conditional operation will
       avoid deadlock.	Cpsema() is implemented	as:

            BEGIN
               acquire semaphore lock;
               if (count <= 0) {
                  release semaphore lock;
                  return failure;
               }
               else
                  decrement count;
               release semaphore lock;
               return success;
            END
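
       The back-off pattern for taking two semaphores can be
       sketched with a conditional acquire built on C11
       atomic_flag.  This is an illustration under assumptions:
       tlock_t stands in for sema_t, tl_lock() spins where psema()
       would block, and lock_pair() gives up after max_tries rounds
       only so that the sketch always terminates.  None of these
       names come from the kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag held;           /* hypothetical stand-in for sema_t */
} tlock_t;

/* Unconditional acquire; spins where psema() would block. */
static void tl_lock(tlock_t *l)
{
    while (atomic_flag_test_and_set(&l->held))
        ;
}

/* Conditional acquire: true if obtained, false if already held. */
static bool tl_trylock(tlock_t *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

static void tl_unlock(tlock_t *l)
{
    atomic_flag_clear(&l->held);
}

/* Returns true with both locks held, or false with neither held. */
static bool lock_pair(tlock_t *first, tlock_t *second, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        tl_lock(first);
        if (tl_trylock(second))
            return true;
        tl_unlock(first);       /* back off so the other holder can finish */
    }
    return false;
}
```

       Because the second lock is never waited for while the first
       is held, two processes taking the pair in opposite orders
       cannot deadlock each other.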

       Mutual Exclusion	Primitives

	       sema_t  sema;

	       appsema(&sema, PRI);    - acquire the semaphore
	       apvsema(&sema, PRI);    - release the semaphore
	       apcpsema(&sema);	       - conditionally acquire semaphore

       These primitives	are implemented	differently depending on
       whether the target system is a uniprocessor or a
       multiprocessor.	The following C	preprocessor definitions
       control this:












	    # ifdef multiprocessor
	    #	    define appsema  psema
	    #	    define apvsema  vsema
	    #	    define apcpsema cpsema
	    # else
	    #	    define appsema
	    #	    define apvsema
	    #	    define apcpsema
	    # endif

       Thus, on a uniprocessor, these operations are not performed,
       since they would only add overhead and serve no useful
       purpose.  On a multiprocessor, these operations are
       equivalent to the previously defined psema(), vsema() and
       cpsema() operations.
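
       One detail worth making explicit is what a conditional
       acquire evaluates to when it is compiled out: callers test
       its result, so it must still read as "acquired".  The sketch
       below shows a hypothetical function-like variant of the
       uniprocessor definitions.  The macro bodies, the
       MULTIPROCESSOR symbol and the one-field sema_t are
       assumptions for illustration and are not the 3B20A
       definitions.

```c
#include <stdbool.h>

typedef struct {
    int count;                  /* hypothetical minimal semaphore */
} sema_t;

/* Conditional acquire: fails instead of blocking. */
static bool cpsema(sema_t *s)
{
    if (s->count <= 0)
        return false;
    s->count--;
    return true;
}

static void vsema(sema_t *s)
{
    s->count++;
}

#ifdef MULTIPROCESSOR
#  define apcpsema(s)   cpsema(s)
#  define apvsema(s)    vsema(s)
#else
#  define apcpsema(s)   ((void)(s), true)   /* always reads as acquired */
#  define apvsema(s)    ((void)(s))         /* expands to no operation */
#endif
```

       Built without MULTIPROCESSOR, a test such as
       "if (apcpsema(&sema))" always takes the acquired branch and
       the semaphore count is never touched.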

       Miscellaneous Primitives

	       sema_t  sema;

	       initsema(&sema, value); - initialize the	given semaphore
	       valusema(&sema)	       - return	the value of the semaphore


       4.  Uniprocessor_UNIX_Semaphore_Usage

       Although	not explicitly shown in	the code, normal UNIX uses
       several implicit	semaphores to manage kernel data space in a
       crude manner.

       First, a	single semaphore is used to protect all	data
       structures from access by other processes: the mode, i.e.,
       kernel or user.	When in	kernel mode, the current process
       may not be preempted, protecting	kernel data structures.
       This is obviously a crude method	of protection, but has the
       advantage of elegance and simplicity.  This semaphore is	a
       mutual exclusion	semaphore.

       Second, the interrupt level of the processor is used to
       protect from access by interrupt	routines.  There are as
       many of these semaphores	as there are interrupt levels in
       the processor.  The kernel routines used	to acquire and
       release these semaphores	are the	spl routines.  These
       semaphores are spinlocks, since the interrupt is	blocked	by
       the processor only for short periods of time, and other
       processing (except higher priority interrupts) cannot
       continue	while the lock is held.

       Third, process synchronization is performed using the sleep
       and wakeup primitives, which provide an inefficient yet
       elegant method for synchronization.  The usual sequence for
       acquiring the semaphore is:

	    while (flag	& BUSY)	{
		    flag |= WANT;
		    sleep(&flag);
	    }
	    flag |= BUSY;

       The sequence for	releasing the semaphore	is:

	    flag &= ~BUSY;
	    if (flag & WANT) {
		    flag &= ~WANT;
		    wakeup(&flag);
	    }

       Thus, if	more than one process is waiting on the	semaphore,
       they will all awake at once, and	must contend for the
       semaphore once again.
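
       The wake-all behaviour can be shown with a small model of a
       process table.  This is a sketch, not kernel code:
       struct proc here has only the two fields needed, and
       model_sleep() and model_wakeup() are hypothetical names that
       record state changes without switching context.

```c
#include <stddef.h>

enum pstate { RUNNABLE, SLEEPING };

struct proc {
    enum pstate state;
    const void *chan;           /* address slept on, as in sleep(&flag) */
};

/* Mark p as sleeping on chan (a real kernel would also switch context). */
static void model_sleep(struct proc *p, const void *chan)
{
    p->state = SLEEPING;
    p->chan = chan;
}

/* Wake every process sleeping on chan; returns how many were woken. */
static int model_wakeup(struct proc *tab, size_t n, const void *chan)
{
    int woken = 0;

    for (size_t i = 0; i < n; i++) {
        if (tab[i].state == SLEEPING && tab[i].chan == chan) {
            tab[i].state = RUNNABLE;
            tab[i].chan = NULL;
            woken++;
        }
    }
    return woken;
}
```

       Every process woken this way re-runs the "while (flag &
       BUSY)" test; one of them sets BUSY and the rest go back to
       sleep.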


       5.  Multiprocessor_UNIX_Semaphore_Usage

       A multiprocessor	UNIX kernel must abandon the mode
       semaphore, since	multiple processors may	be executing in	the
       kernel at once.	The sleep and wakeup primitives	are
       replaced	with synchronization semaphores.  This provides	a
       more efficient implementation, since only the first process
       waiting on the semaphore	will awake, and	that process will
       hold the	semaphore once it does.

       Even though the multiprocessor UNIX kernel abandons the
       mode semaphore, it still	disallows preemption of	kernel
       processes on the	same processor.	 This is for performance
       reasons.	 For example, consider the case	where a	kernel
       process is preempted, but the next process chosen to run
       needs access to the same	resource.  Eventually control will
       pass back to the	preempted process, but only after a
       substantial amount of time and several unnecessary context
       switches.

       At points where the uniprocessor	kernel used interrupt
       blocking	to protect from	interrupt routines, the
       multiprocessor kernel must also provide spinlocks to protect
       from other processors.  Thus, a common sequence in the
       multiprocessor kernel is:








            x = spl6();
            spsema(&lock);  /* spsema() is described in Section 3 */
                .
                .
                .
            svsema(&lock);  /* svsema() is described in Section 3 */
            splx(x);
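
       The ordering in this sequence matters: interrupts are
       blocked before spinning and restored only after the lock is
       dropped.  The sketch below wraps the two steps into a pair
       of helpers.  It is a user-space simulation under
       assumptions: cur_ipl is a stand-in for the processor's
       interrupt level, and sim_spl6(), sim_splx(), lock_irq() and
       unlock_irq() are hypothetical names.

```c
#include <stdatomic.h>

static int cur_ipl = 0;         /* simulated interrupt priority level */

/* Simulated spl6(): raise the level to 6, return the previous level. */
static int sim_spl6(void)
{
    int old = cur_ipl;

    if (cur_ipl < 6)
        cur_ipl = 6;
    return old;
}

/* Simulated splx(): restore a previously saved level. */
static void sim_splx(int x)
{
    cur_ipl = x;
}

typedef struct {
    atomic_flag held;
} slock_t;

/* Block interrupts first, then spin; returns the level to restore. */
static int lock_irq(slock_t *l)
{
    int x = sim_spl6();

    while (atomic_flag_test_and_set(&l->held))
        ;
    return x;
}

/* Opposite order on the way out: drop the lock, then restore the level. */
static void unlock_irq(slock_t *l, int x)
{
    atomic_flag_clear(&l->held);
    sim_splx(x);
}
```

       Pairing the two steps in helpers makes it harder to take the
       lock with interrupts still enabled, which is the case that
       can deadlock against an interrupt routine.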

       The most	difficult change is the	use of mutual exclusion
       semaphores to protect kernel data structures that previously
       were protected by the mode semaphore.  Thus, areas such as
       the buffer cache, process management and	virtual	memory
       management must be modified to use semaphores to ensure
       exclusive access	to structures.

       5.1  Semaphores_and_Signals

       One aspect of the normal	sleep/wakeup mechanism is that
       signal handling may be performed	within these primitives	if
       desired by the caller.  This allows long	waits to be
       interrupted by the user or application and proper cleanup to
       be performed.

       Thus, both mutual exclusion and synchronization semaphores
       provide the means by which a signal can be ignored or
       handled,	in the same manner as that used	by sleep/wakeup.
       Example calls to a mutual exclusion semaphore are:

	    psema(&sema, PPRI);

	    psema(&sema, PPRI |	PCATCH);

       In the first case, if the priority PPRI is greater than
       PZERO, then a signal which is sent to the process will cause
       the process to wake up, clean up its entry on the semaphore
       queue, and use longjmp() to return to the system call
       handler for return to the user.  If the priority is less
       than PZERO, then such signals will always be ignored, and
       the process can only be woken by:

	    vsema(&sema);

       In the second case, the PCATCH flag causes control to be
       returned	to the caller when a signal occurs (if the priority
       is greater than PZERO, as before).  Thus, the caller can
       take any	necessary cleanup actions before returning to the
       application.







       6.  Semaphore_Performance

       In a semaphored system, there are a number of different
       factors that affect the performance of the system.
       Obviously, the more semaphore operations performed, the
       fewer cycles will be devoted to real work.  The speed
       with which a semaphore can be accessed and updated is also
       critical.  AT&T dealt with this problem by micro-coding
       special instructions to handle the most common semaphore
       operations.

       There is	one factor, however, that has an overriding effect
       on performance.	This is	the hit	ratio on a semaphore.  This
       is defined as the ratio of the number of	times a	semaphore
       could be	acquired immediately to	the number of attempts to
       acquire the semaphore.  By examining the	above algorithms,
       the reader can see that the optimal path	is to immediately
       acquire the semaphore - blocking	a process is an	expensive
       and time-consuming operation.
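
       The hit ratio is simple to instrument.  As a sketch, assume
       each semaphore carries two counters that the acquire path
       updates; sema_stats_t and its functions are hypothetical
       names, not part of the primitives described above.

```c
typedef struct {
    unsigned long attempts;     /* calls to the acquire operation */
    unsigned long hits;         /* acquisitions that did not block */
} sema_stats_t;

/* Record one acquire attempt and whether it succeeded immediately. */
static void stats_record(sema_stats_t *st, int acquired_immediately)
{
    st->attempts++;
    if (acquired_immediately)
        st->hits++;
}

/* Hit ratio in percent; defined as 100 when nothing was attempted. */
static double stats_hit_pct(const sema_stats_t *st)
{
    if (st->attempts == 0)
        return 100.0;
    return 100.0 * (double)st->hits / (double)st->attempts;
}
```

       For example, 19 immediate acquisitions out of 20 attempts
       gives a hit ratio of 95 percent.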

       During construction of the multiprocessor kernel, AT&T
       discovered that a hit ratio of at least 95% on all
       semaphores was necessary	to achieve adequate performance	of
       the system.  To achieve this goal, it is	often necessary	to
       break up	data structure accesses	into smaller chunks, and to
       isolate accesses	as much	as possible.  For instance, the
       buffer cache is broken up into a	large set of hash buckets
       that can	be searched individually, reducing the contention
       on any one semaphore.
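
       The idea of one lock per hash bucket can be sketched as
       follows.  This is an illustration under assumptions: the
       bucket count, the hash function and the use of a spin flag
       in place of a semaphore are all choices made for the sketch,
       and nbufs stands in for the bucket's chain of buffers.

```c
#include <stdatomic.h>

#define NBUCKETS 64             /* hypothetical; must be a power of two */

struct bucket {
    atomic_flag lock;           /* protects only this bucket */
    int nbufs;                  /* stand-in for the chain of buffers */
};

static struct bucket cache[NBUCKETS];

/* Hypothetical hash of (device, block number) to a bucket index. */
static unsigned bucket_of(unsigned dev, unsigned long blkno)
{
    return (unsigned)((blkno ^ (dev * 31u)) & (NBUCKETS - 1));
}

static void cache_init(void)
{
    for (int i = 0; i < NBUCKETS; i++) {
        atomic_flag_clear(&cache[i].lock);
        cache[i].nbufs = 0;
    }
}

/* Lock only the bucket the block hashes to, update it, and unlock. */
static unsigned cache_insert(unsigned dev, unsigned long blkno)
{
    unsigned b = bucket_of(dev, blkno);

    while (atomic_flag_test_and_set(&cache[b].lock))
        ;
    cache[b].nbufs++;
    atomic_flag_clear(&cache[b].lock);
    return b;
}
```

       Two processors working on blocks that hash to different
       buckets never touch the same lock, which is what keeps the
       hit ratio on each one high.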

       Fortunately, AT&T has already done most of this work,
       providing the Clover 2 effort with high performance debugged
       algorithms for handling most of these cases.


       7.  Driver_Semaphore_Use

       Clover 2 will obtain most of its UNIX drivers from the
       Clover I	system.	 Converting drivers to semaphore use would
       be a long, costly and painful process, perhaps not worth	the
       effort.	To avoid the same situation for	the 3B20A port,
       AT&T added the concept of driver	semaphores to the kernel.
       Since a driver can only be entered in certain well-defined
       ways, it	was possible to	place a	semaphore sequence around
       each entry to the driver.  In general, these entry points
       are the open, close, read, write	and ioctl entries.  This is
       much like the mode semaphore in normal UNIX, in that it
       prevents	any other process from entering	the driver
       simultaneously.	Three separate types of	protection are
       provided.




       First, the entire driver can be semaphored on its major
       device number.  Before the kernel invokes the driver, it
       calls psema() to lock the driver semaphore, and after
       return from the driver it calls vsema() to unlock the
       semaphore.

       Second, each minor device number	access to the driver can be
       semaphored.  Instead of locking a single	semaphore, each
       possible	minor device number has	a semaphore associated with
       it.  The	kernel locks the semaphore associated with the
       minor device before entering the	driver and unlocks it after
       return.  For example, this is used in protecting the tty
       driver, which maintains independent structures for each
       minor device supported.

       Third, a	driver can be given no protection, in which case it
       is responsible for protecting its own data structures.
       This means that structures within the driver that are shared
       between processes or with interrupt routines must be
       protected with semaphores.

       To protect from interrupt access	while a	driver semaphore is
       locked, the kernel checks the driver semaphore before
       launching the interrupt routine.	 If the	semaphore is
       locked, the kernel queues the interrupt and returns.  The
       code which unlocks the driver semaphore checks for the
       queued interrupt, and if	one exists it launches the
       interrupt routine at that time.	Note that this behavior	is
       only possible with major	device locking.	 A driver that uses
       minor device locking must be modified to	block the interrupt
       routine explicitly (if it does not already do so).
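
       The queued-interrupt behaviour for major device locking can
       be modelled in a few lines.  This is a single-threaded
       sketch under assumptions: struct drvsem, the drv[] table and
       the drv_*() functions are hypothetical names, a boolean
       stands in for the driver semaphore, and only one pending
       interrupt per driver is remembered.

```c
#include <stdbool.h>

#define NMAJOR 4                /* hypothetical number of drivers */

struct drvsem {
    bool locked;                /* driver semaphore held by a process */
    bool intr_pending;          /* an interrupt arrived while locked */
    void (*intr)(void);         /* the driver's interrupt routine */
};

static struct drvsem drv[NMAJOR];
static int intr_runs;           /* counts handler invocations (demo only) */

static void demo_intr(void)
{
    intr_runs++;
}

/* Kernel wrapper: lock the driver semaphore before entering the driver. */
static void drv_enter(int major)
{
    drv[major].locked = true;
}

/* Interrupt arrival: run the routine now, or queue it if the driver is busy. */
static void drv_interrupt(int major)
{
    if (drv[major].locked)
        drv[major].intr_pending = true;
    else
        drv[major].intr();
}

/* Kernel wrapper: unlock on return, then launch any queued interrupt. */
static void drv_leave(int major)
{
    drv[major].locked = false;
    if (drv[major].intr_pending) {
        drv[major].intr_pending = false;
        drv[major].intr();
    }
}
```

       The interrupt routine therefore never runs while a process
       is inside the driver, which is the guarantee an unmodified
       uniprocessor driver relies on.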

       If a driver chooses to access kernel data structures, it will
       not be possible to port it directly to the multiprocessor
       environment.  This is because it	will not follow	the
       kernel's	locking	strategy for the structures accessed.  Such
       drivers must be modified	to use the proper locking strategy
       before they can be used in the multiprocessor kernel.

       The Clover 2 kernel will	implement the 3B20A driver concept,
       maximizing the use of currently existing	drivers.  In most
       cases, it should	be possible to directly	use drivers written
       for Clover I.  It is expected that some drivers will require
       a porting effort.










       8.  Further_Reading

       Much of this discussion is based	on the actual code for the
       AT&T 3B20A implementation, as well as [3] and [1].



















































				REFERENCES



	1. Bach, Maurice J., The Design	of the UNIX Operating
	   System, Prentice-Hall, Englewood Cliffs, N. J., 1986.

	2. Beck, B., and Kasten, B., VLSI Assist in Building a
	   Multiprocessor UNIX System, Sequent Computer	Systems,
	   Portland Oregon, not	dated.

	3. Bach, M. J.,	and Buroff, S. J., Multiprocessor UNIX
	   Operating Systems, AT&T Bell	Laboratories Technical
	   Journal, Vol. 63, No. 8, October 1984.






































