Path: utzoo!mnetor!uunet!husc6!cmcl2!nrl-cmf!mailrus!ames!sgi!jmb From: j...@patton.SGI.COM (Jim Barton) Newsgroups: comp.sys.sgi Subject: USENIX Paper on SGI's Threads Message-ID: <12192@sgi.SGI.COM> Date: 4 Mar 88 20:12:54 GMT Sender: dae...@sgi.SGI.COM Organization: Silicon Graphics Inc, Mountain View, CA Lines: 1203 Keywords: threads multiprocessing As I have had quite a few requests for these papers, I'll post them here for all to see. This paper concerns our work extending the UNIX process model to accommodate parallel programming. -- Jim Barton From the UNIX asylum at SiliconGraphics, Inc. j...@sgi.sgi.com, sgi!...@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb ---------- cut here ---------------- .\" .\" Converted to the 'ms' macros (blehhhh...). .\" Run it through 'pic' first ... .\" .TL Beyond Threads: Resource Sharing in UNIX .AU J. M. Barton J. C. Wagner .AI Silicon Graphics, Incorporated 2011 Stirlin Road Mountain View, California 94039 j...@sgi.com, j...@sgi.com .AB UNIX provides a programming model for the user which gives an illusion of multiprocessing. On uniprocessors, this illusion works well, providing communication paths, firewalls between processes and a simple programming environment. Unfortunately, the normal UNIX model does not provide the capabilities to take full advantage of modern multiprocessor hardware. This results from a design which relies on data queueing and preemption to provide the multiprocessing illusion, mechanisms that are unnecessary on a true multiprocessor. .LP The recently proposed addition of .I threads to CMU's MACH kernel shows that there are other effective programming models that may be supported in UNIX as well. This model provides for sharing the virtual address space of a process by breaking the process into several lightweight contexts that may be manipulated more quickly than a normal process. Unfortunately, this model raises questions about support for normal UNIX semantics, as well as thread scheduling performance and kernel overhead. .LP This paper describes a programming model which goes beyond the simple concept of threads, providing a much broader range of resource sharing while retaining the key parts of the UNIX process model. Instead of .I threads, the concept of a process is retained along with most of its semantics. As usual, the fundamental resource shared is the virtual address space. In addition, open file descriptors, user and group ID's, the current directory, and certain other elements of the process environment may be shared as well. This makes for easy construction of useful servers and simple concurrent applications, while still providing high-performance support for more computationally intensive applications. .LP This implementation, labelled .I process share groups, .R allows normal process actions to take place easily, such as system calls, page faulting, signalling, pausing, and other actions which are ill-defined within the .I threads model. The paper describes the philosophy which drove the design, as well as details of the implementation, which is based on, and upwardly compatible with, AT&T 5.3 UNIX. Performance of normal UNIX processes is maintained while providing high-performance share group support. .AE .NH 1 Introduction .LP The success of the UNIX\** kernel is owed in part to the .FS Standard Declaration: "UNIX is a trademark of AT&T Bell Laboratories". .FE effective way in which it hides the resource sharing necessary in a multiprogrammed environment.
A UNIX process is a highly independent entity, having its own address space and environment. Communication paths are restricted to low-bandwidth mechanisms, such as pipes, sockets and messages. .LP When moved to a multiprocessor, this model is overly restrictive. We would like to allow several processors to work on a problem in parallel using high-bandwidth communications (shared memory, for instance). Even though this explicit sharing of data is thought of most often when working with parallel programming, there are other resources that can be shared usefully. For instance, the file descriptors associated with a process could be shared among several processes. .LP In order to provide a conceptual framework for resource sharing among concurrent processes, we have introduced an additional level of functionality into the normal UNIX process model. A process may gain access to this additional level of functionality through an interface called .I process share groups. .R Members of a .I share group .R can potentially share many resources previously thought of as private to a single process. .LP This interface was first implemented in an AT&T System V.3 kernel, and relies on modified region handling to share a virtual address space. New file opens can be propagated to all processes, as well as modifications to the environment of a process (such as the .I ulimit(2) values). By extending the semantics of the UNIX process in an upwardly compatible way, a powerful new programming model has been developed. .NH 2 The Road to Resource Sharing .LP In order to provide a simple model for multi-programming, the designers of UNIX chose a model of independent processes, shared access to a file-system, and limited communications using low-bandwidth paths, such as signals or pipes (Figure 1). Such a decision was appropriate, since the inherent queueing of data implied at many levels of the UNIX interface simplifies multi-programming support. .KS .DS B .PS # # Some globals # addrht = 1.5 addrwid = .75 elht = addrht / 8 ellipsewid = addrwid * (2/3) ellipseht = (addrht/4) * (2/3) # # Address space # A1: box wid addrwid ht addrht box wid addrwid ht elht with .nw at A1.nw "\s-2stack\s0" B1: box wid addrwid ht elht with .sw at A1.sw "\s-2text\s0" box wid addrwid ht elht with .sw at B1.nw "\s-2data/bss\s0" # # Environment # B2: box wid addrwid * 2 ht elht with .c at B1.s - (0.0, 0.25) " File I/O" at B2.e ljust move to B2.sw + (addrwid/2, -0.25) L1: line -> up .25 "pipes" at L1.start below move to B2.se - (addrwid/2, 0.25) L2: line -> up .25 "files" at L2.start below move to A1.w - (0.35, 0.0) L3: line -> right .25 "signals " at L3.start above rjust # # State # move to A1.e + (0.1, 0.0) E1: ellipse with .w "\s-2state\s0" line -> left .25 from E1.w .PE .DE .DS C \fBFigure 1.\fP Version 7 Process Environment .DE .KE To meet the growing need for high-performance and concurrent programming support, some new mechanisms were introduced. Berkeley UNIX introduced the .I socket interface, while System V introduced a local inter-process communication interface providing shared memory support (Figure 2). No new interfaces for managing files were proposed, nor were changes to improve concurrent programming abilities put forward. .KS .DS B .PS # # Draw the address space again.
# A1: box wid addrwid ht addrht box wid addrwid ht elht with .nw at A1.nw "\s-2stack\s0" B1: box wid addrwid ht elht with .sw at A1.sw "\s-2text\s0" B2: box wid addrwid ht elht with .sw at B1.nw "\s-2data/bss\s0" B3: box wid addrwid ht elht with .sw at B2.nw + (0.0, 0.4) \ "\s-2shared memory\s0" # # Draw the environment again. # B4: box wid addrwid * 2 ht elht with .c at B1.s - (0.0, 0.25) " File I/O" at B4.e ljust move to B4.sw + (addrwid/2, -0.25) L1: line -> up .25 "pipes" at L1.start below move to B4.se - (addrwid/2, 0.25) L2: line -> up .25 "files" at L2.start below move to A1.w - (0.35, 0.0) L3: line -> right .25 "signals " at L3.start rjust move to A1.w - (0.35, -0.2) L4: line -> right .25 "messages " at L4.start rjust move to A1.w - (0.35, 0.2) L5: line -> right .25 "semaphores " at L5.start rjust # # Some state # move to A1.e + (0.1, 0.0) E1: ellipse with .w "\s-2state\s0" line -> left .25 from E1.w # # Add the legend # "\fBSystem V Environment\fP" at A1.n + (0.0, 0.1) # # Now draw the BSD layout. # right move to A1.e + (2.5, 0.0) A2: box wid addrwid ht addrht box wid addrwid ht elht with .nw at A2.nw "\s-2stack\s0" BB1: box wid addrwid ht elht with .sw at A2.sw "\s-2text\s0" box wid addrwid ht elht with .sw at BB1.nw "\s-2data/bss\s0" # # Draw its environment # BB2: box wid addrwid * 2 ht elht with .c at BB1.s - (0.0, 0.25) " File I/O" at BB2.e ljust move to BB2.sw + (addrwid/4, -0.25) LL1: line -> up .25 "pipes" at LL1.start below move to BB2.se - (addrwid/4, 0.25) LL2: line -> up .25 "files" at LL2.start below move to BB2.s + (0.0, -0.25) LL4: line -> up .25 "sockets" at LL4.start below move to A2.w - (0.35, 0.0) LL3: line -> right .25 "signals " at LL3.start rjust # # Some state # move to A2.e + (0.1, 0.0) E2: ellipse with .w "\s-2state\s0" line -> left .25 from E2.w # # Add the legend # "\fBBSD 4.2 Environment\fP" at A2.n + (0.0, 0.1) .PE .DE .DS C \fBFigure 2.\fP System V and BSD Process Environments .DE .KE Most recently, the Mach[Acce86] kernel introduced an implementation of tasking labeled \fIthreads\fP[Teva87]. This model allows an address space to be shared among several processes. Each process can execute independently in both kernel and user mode. Although independent execution adds cost to a threaded process, such as kernel context (the user area) and a kernel stack for each thread, a useful concurrent programming environment is provided (Figure 3). .KS .DS B .PS # # Now draw the Mach layout. # right A2: box wid addrwid ht addrht BB0: box wid addrwid ht elht with .nw at A2.nw "\s-2stack\s0" BB1: box wid addrwid ht elht with .sw at A2.sw "\s-2text\s0" BB9: box wid addrwid ht elht with .sw at BB1.nw "\s-2data/bss\s0" B3: box wid addrwid ht elht with .sw at BB9.nw + (0.0, 0.2) \ "\s-2shared memory\s0" box wid addrwid ht elht with .nw at BB0.sw - (0.0, 0.2) \ "\s-2mapped file\s0" # # Draw its environment # BB2: box wid addrwid * 2 ht elht with .c at BB1.s - (0.0, 0.25) " File I/O" at BB2.e ljust "\s-2BSD 4.3 kernel\s0" at BB2.c move to BB2.sw + (addrwid/4, -0.25) LL1: line -> up .25 "pipes" at LL1.start below move to BB2.se - (addrwid/4, 0.25) LL2: line -> up .25 "files" at LL2.start below move to BB2.s + (0.0, -0.25) LL4: line -> up .25 "sockets" at LL4.start below move to A2.w - (0.35, 0.1) LL3: line -> right .25 "signals " at LL3.start rjust move to A2.w - (0.35, -0.1) LL5: line -> right .25 "messages " at LL5.start rjust # # Draw the Mach performance killer.
# mht = (addrht * 0.75) move to A2.e + (0.15, 0.0) BB4: box wid elht ht mht with .w "\s-2Mach kernel\s0" at BB4.n above # # Some state # eoff = 0.1 move to BB4.e + (eoff, 0.0) move by (0.0, mht/4) E1: ellipse with .w "\s-2state 0\s0" line -> left .25 from E1.w right move to BB4.e + (eoff, 0.0) E2: ellipse with .w "\s-2state 1\s0" line -> left .25 from E2.w right move to BB4.e + (eoff, 0.0) move by (0.0, -mht/4) E3: ellipse with .w "\s-2state 2\s0" line -> left .25 from E3.w # # The message line # line <-> down .35 from BB4.s "\s-2 messages\s0" ljust .PE .DE .DS C \fBFigure 3.\fP Mach Process Environment .DE .KE .LP As part of the research for share groups, a true threaded process implementation on System V.3 was constructed[Wagn87]. This implementation of threaded execution went beyond that used in the Mach kernel. A thread was realized as an actual entity within a process, manifested only as the minimum machine context which could be scheduled. This version of threads provided a good environment for numerical or other computationally intensive algorithms, but its range of applications proved limited. Recognizing the limitations of this approach, we looked for a more powerful model. .NH 1 Parallel Programming .LP To take maximum advantage of multi-processors, it must be possible to create and execute parallel algorithms, and to get truly parallel execution of application code. Because of the data-sharing bandwidth necessary for applications to gain performance from parallelism, shared memory is usually the main data path. Sharing memory isn't enough, however. The sharing mechanism and other aspects of the environment must all work together to provide a useful programming model[Bart87]. .LP Shared memory is not useful without a mechanism for synchronizing access to it. For instance, pipes, System V messages or semaphores, sockets or signals can be used to synchronize memory access between processes, but all are much lower bandwidth mechanisms than the memory itself, which can be a performance liability. With hardware supported locking, synchronization speeds can approach memory access speeds. .LP Two equally important aspects of the environment are scheduling and context switching. Even if a multi-process application limits itself to the number of processors available, the best performance will occur only if all the processes are actually running in parallel. The kernel needs to take steps to ensure that these processes run in parallel whenever possible. Since this isn't always possible in UNIX, fast context switching is also important. For high-performance semaphores, it is necessary that blocking or unblocking a process be extremely quick. Fast context switching is one element of this; the kernel must provide facilities for the actual blocking operations to be executed quickly as well. .LP Dynamic creation and destruction of processes must be possible, and getting an existing process started on a new task must have little overhead. In a threaded model, creation of a new thread is often an order of magnitude faster than creation of a process; in practice, however, this advantage is less important than it seems. If a process model is used instead (such as Sequent's[Beck87]), then the speed penalties of process creation are eliminated by creating a pool of processes before entering parallel sections of code, each of which then waits to be dispatched quickly as needed. If enough processes are not available, a new one can be dynamically created, but this problem can be tuned out of the application.
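.LP
To make the pool approach concrete, the following minimal sketch pre-creates worker processes and dispatches them through a System V semaphore. It uses only standard interfaces; the names \fCNWORKERS\fP and \fCdo_chunk()\fP are illustrative, and real code would check the return values of the IPC calls.
.DS
\fC
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

#define NWORKERS 4

int semid;                      /* dispatch semaphore, initially 0 */

void
sem_add(n)                      /* wait when n < 0, post when n > 0 */
int n;
{
    struct sembuf op;

    op.sem_num = 0;
    op.sem_op = n;
    op.sem_flg = 0;
    semop(semid, &op, 1);
}

void
do_chunk(id)                    /* stand-in for the real parallel work */
int id;
{
}

main()
{
    int i;

    semid = semget(IPC_PRIVATE, 1, 0600);
    semctl(semid, 0, SETVAL, 0);

    /* Create the pool before any parallel section is reached. */
    for (i = 0; i < NWORKERS; i++)
        if (fork() == 0) {
            sem_add(-1);        /* block until dispatched */
            do_chunk(i);
            exit(0);
        }

    /* Dispatching the pool is now a single cheap semaphore post. */
    sem_add(NWORKERS);
    for (i = 0; i < NWORKERS; i++)
        wait((int *)0);
    semctl(semid, 0, IPC_RMID, 0);
    return 0;
}
\fP
.DE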
.LP If we add to this model by introducing sharing of other process resources, we approach a useful superset of normal UNIX process semantics. Signals, system calls, traps and other process events happen in an expected way, since they deal with actual processes. By sharing other resources, such as file descriptors, user ID's, and the like, we can develop a powerful model that gives the performance of threads with greater functionality (Figure 4). .KS .DS B .PS addrht = 2.0 addrwid = .75 ellipsewid = addrwid * (2/3) ellipseht = (addrht/4) * (2/3) elht = addrht / 16 # # Draw the address space one more time # A1: box wid addrwid ht addrht C1: box wid addrwid ht elht with .nw at A1.nw "\s-2stack\s0" C2: box wid addrwid ht elht with .nw at C1.sw - (0.0, elht/2) "\s-2stack\s0" box wid addrwid ht elht with .nw at C2.sw - (0.0, elht/2) "\s-2stack\s0" B1: box wid addrwid ht elht with .sw at A1.sw "\s-2text\s0" C3: box wid addrwid ht elht with .sw at B1.nw + (0.0, elht/2) \ "\s-2shared lib text\s0" C4: box wid addrwid ht elht with .sw at C3.nw + (0.0, elht/2) \ "\s-2shared lib data\s0" C5: box wid addrwid ht elht with .sw at C4.nw "\s-2data/bss\s0" C6: box wid addrwid ht elht with .sw at C5.nw + (0.0, elht) \ "\s-2mapped file\s0" C7: box wid addrwid ht elht with .sw at C6.nw + (0.0, elht) \ "\s-2shared memory\s0" # # Draw the environment again. # B4: box wid addrwid * 2 ht elht with .c at B1.s - (0.0, 0.25) " File I/O" at B4.e ljust move to B4.sw + (addrwid/4, -0.25) L1: line -> up .25 "pipes" at L1.start below move to B4.se - (addrwid/4, 0.25) L2: line -> up .25 "files" at L2.start below move to B4.s + (0.0, -0.25) LL4: line -> up .25 "sockets" at LL4.start below move to A1.w - (0.35, 0.0) L3: line -> right .25 "signals " at L3.start rjust move to A1.w - (0.35, -0.2) L4: line -> right .25 "messages " at L4.start rjust move to A1.w - (0.35, 0.2) L5: line -> right .25 "semaphores " at L5.start rjust # # Draw some processes using the address space # eoff = 0.1 move to A1.e + (eoff, 0.0) move by (0.0, addrht/4) E1: ellipse with .w "\s-2state 0\s0" line -> left .25 from E1.w right move to A1.e + (eoff, 0.0) E2: ellipse with .w "\s-2state 1\s0" line -> left .25 from E2.w right move to A1.e + (eoff, 0.0) move by (0.0, -addrht/4) E3: ellipse with .w "\s-2state 2\s0" line -> left .25 from E3.w .PE .DE .DS C \fBFigure 4.\fP New SGI 4D Series Environment .DE .KE .NH 1 Share Groups .LP To address the failures and limitations of resource sharing when applied to multiprocessors, we developed the concept of the .I share group. .R Such a group is a collection of processes which share a common ancestor and have not executed the .I exec(2) system call since being created. The parent-child relationship between processes is maintained, and forms the basis for a hierarchical resource sharing facility, where a parent can indicate which resources each child should share. .LP In general, the main resource shared is the virtual address space associated with the share group, although it need not be shared. Processes in the share group are free to pass pointers to data, and to use high-performance synchronization and data passing techniques (mainly shared memory and spinlocks). .LP A small number of other resources may be shared in the initial implementation. Chief among these are file descriptors. When one of the processes in a group opens a file, the others will see the file as immediately available to them. The descriptor number\** may be passed between processes or simply assumed as part of the algorithm.
.FS The descriptor number is an index into the file table for a process, which holds pointers to open file table entries. For example, .I standard input .R is by convention descriptor number 0, while .I standard output .R indicates descriptor number 1. .FE Other resources that may be shared are the .I ulimit values, the .I umask value, the current directory and the root directory. .NH 1 System Call Interface .LP The share group interface is defined through two new system calls, .I sproc() and .I prctl(). The .I sproc() call is used to create a new process within the share group, and controls the resources which will be shared with the new process. The .I prctl() call is used to obtain information about the share group and to control certain features of the group and new processes. .NH 2 The \fIsproc\fP System Call .LP The syntax of this call is:
.DS
\fC
int sproc(entry, shmask, arg)
void (*entry)();
unsigned long shmask;
long arg;

returns -
    -1 - OS error in call
    >0 - new process ID
\fP
.DE
This call is similar to the .I fork(2) system call, in that a new process is created. The .I shmask parameter specifies which resources the child will share with the share group, each particular resource being identified with a bit in the parameter. A new stack is automatically created for the child process, and the new process is entered at the address given by .I entry. This new stack is visible to all other processes in the share group, and will automatically grow. The single argument .I arg is passed to the newly created process as the only parameter to the entry point, and can be used to pass a pointer to a data block or perhaps an index into a table. .LP The first use of the .I sproc() call creates a .I share group. .R Whenever a process in the share group issues the .I sproc() call, the new process is made part of the parent's share group. A new process may be created outside the share group through the .I fork(2) system call, and use of the .I exec(2) system call also removes the process from the share group before overlaying the new process image. .LP All resource sharing is controlled through the .I shmask ( .B "share mask" ) .R parameter. Currently, the following elements can be shared:
.DS
\fC
PR_SADDR   - share virtual address space
PR_SULIMIT - ulimit values
PR_SUMASK  - umask values
PR_SDIR    - current/root directory
PR_FDS     - open file descriptors
PR_SID     - uid/gid
PR_SALL    - all of the above and any future resources
\fP
.DE
When the child is created, the share mask is masked against the share mask used when creating the parent. This means that a process can only cause a child to share those resources that the parent can share as well, thus forming a hierarchy. The original process in a share group is given a mask indicating that all resources are shared. .LP Whenever a process modifies one of the shared resources, and its share mask indicates that it is sharing the resource, all other processes in the share group which are also sharing the resource will be updated. .LP When the virtual address (VM) space is shared, a process may modify the VM space and all other processes will see that same change. If new regions are added to the VM space, the kernel will check to ensure that no address range overlaps any other. If the virtual address space is not shared, the new process (child) gets a .I copy-on-write image of the share group's (parent's) virtual address space. In this case, the child's stack is not visible in the parent's virtual address space.
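.LP
As an illustration of the interface, the fragment below creates one share group member that shares the address space and open file descriptors of its parent. This is a minimal sketch: the header name \fC<sys/prctl.h>\fP for the PR_* constants and the trivial worker body are assumptions for illustration, not details specified above.
.DS
\fC
#include <sys/prctl.h>
#include <sys/wait.h>

int counter;            /* data segment: visible to every group member */

void
worker(arg)
long arg;
{
    /*
     * Entered here on a private, automatically growing stack;
     * the VM image and file descriptors are the parent's.
     * Real code would guard the shared counter with a lock.
     */
    counter += (int)arg;
}

main()
{
    if (sproc(worker, PR_SADDR | PR_FDS, 1L) < 0)
        exit(1);        /* OS error in call */
    wait((int *)0);     /* reap the group member */
    return counter - 1; /* zero: the child's update is visible */
}
\fP
.DE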
.LP Certain small parts of a process's VM space are not shared. The most critical of these is the process data area, or PRDA. This is a small amount of memory (less than a page in size) which records data that must remain private to the process, and is always at the same fixed virtual location in every process. As an example, consider the .I errno variable provided by the C library. Since the data space is shared, if this variable lived only in the data space it would be difficult for independent processes to make reliable system calls. Thus, the C library locates a copy of .I errno in the PRDA for a process. The format and use of the PRDA is totally within the scope of the user program, and can be managed in any way appropriate. Libraries (such as the C library) which need some private data may use a portion of the PRDA, leaving a portion for direct use by the programmer. .NH 2 The \fIprctl\fP System Call .LP The second system call is the .I prctl() call, which allows certain aspects of a share group to be controlled. Its syntax is:
.DS
\fC
int prctl(option [, value [, value2 ]])
unsigned option;
char *value;
char *value2;

returns -
    -1  - error in OS call
    <>0 - request result
\fP
.DE
The following .I option values may be used:
.DS
\fC
PR_MAXPROCS     - return limit on processes per user
PR_MAXPPROCS    - return number of processes that
                  the system can run in parallel
PR_SETSTACKSIZE - set the maximum stack size for
                  the current process
PR_GETSTACKSIZE - retrieve the maximum stack size for
                  the current process
\fP
.DE
The PR_SETSTACKSIZE option allows a program to set the maximum stack size its processes may have. This value is inherited across .I sproc() and .I fork() system calls, and indirectly controls the layout of the shared VM image. .NH 1 Implementation .LP The implementation of share groups centered on four goals: .RS .IP 1. The implementation must work correctly in both multi-processor and uni-processor environments. .IP 2. Synchronization between share group processes must be able to proceed even though one member is not available for execution. .IP 3. The overall structure of the kernel must not be modified. .IP 4. The performance for non-share group processes must not be reduced. .RE .LP The multiprocessor requirement proved to be the major difficulty in the design, since there are no formal interfaces for synchronization between processes executing in the kernel. This is not usually a problem on uni-processor systems since a process executing in the kernel cannot be preempted. Thus it may examine and modify the state of any other process, and know that the state will not change (unless the executing process goes to sleep). This isn't true in a multi-processor environment. The process being examined may be running on another processor, sleeping on a semaphore waiting for a resource (it could even be waiting for a resource that the examining process controls), or it may be waiting for a slow operation (e.g., a read on a pty or the .I wait(2) system call).
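.LP
As a summary of the user-visible interface before turning to the kernel internals, the sketch below sizes the group's stacks and then creates one member per available processor. Treating the PR_MAXPPROCS value as the number of members to create is an illustrative policy rather than anything the interface requires, and \fCworker()\fP is assumed to be defined elsewhere.
.DS
\fC
#include <sys/prctl.h>

extern void worker();

main()
{
    int i, nprocs;

    /* Size every member's stack before the first sproc() call. */
    if (prctl(PR_SETSTACKSIZE, (char *)(64 * 1024)) < 0)
        exit(1);

    /* One member per processor the system can run in parallel. */
    nprocs = prctl(PR_MAXPPROCS);
    for (i = 1; i < nprocs; i++)
        if (sproc(worker, PR_SALL, (long)i) < 0)
            exit(1);
    worker(0L);         /* the parent takes a share of the work too */
    return 0;
}
\fP
.DE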
.NH 2 Data Structures .LP For each share group, there is a single data structure (the shared address block) that is referenced by all members of the group:
.DS
\fC
typedef struct shaddr_s {
    /* the following are all to handle pregions */
    preg_t *s_region;       /* processes' shared pregions */
    lock_t s_acclck;        /* lock on access to shared block */
    sema_t s_updwait;       /* wait for update lock */
    short s_acccnt;         /* count of readers */
    ushort s_waitcnt;       /* count of waiting processes */
    /* generic fields for handling shared processes */
    struct proc *s_plink;   /* link to shared processes */
    ushort s_refcnt;        /* # processes in list */
    ushort s_flag;          /* flags */
    lock_t s_listlock;      /* protects s_plink */
    /* semaphore for single threading open file updating */
    sema_t s_fupdsema;      /* wait for opening file */
    struct file **s_ofile;  /* copy of open file descriptors */
    char *s_pofile;         /* copy of open file flags */
    struct inode *s_cdir;   /* current directory */
    struct inode *s_rdir;   /* root directory */
    /* lock for updating misc things that don't need a semaphore */
    lock_t s_rupdlock;      /* update lock */
    /* hold values for other sharing options */
    short s_cmask;          /* mask for file creation */
    daddr_t s_limit;        /* maximum write address */
    ushort s_uid;           /* effective user id */
    ushort s_gid;           /* effective group id */
} shaddr_t;
\fP
.DE
This structure is dynamically allocated the first time that a process invokes the .I sproc(2) system call. A pointer in the .I proc structure points to it, and the .I s_plink field links all processes via a link field in the .I proc structure. To protect the linked list during searching, the lock .I s_listlock is used. A reference count is kept in .I s_refcnt, and the structure is thrown away once the last member exits (Figure 5). The rest of the fields are used to implement the resource sharing between group members and are explained in the following sections. .KS .DS B .PS linewid = .25 lineht = .25 # # Draw a process slot box # define procent X box ht .55 wid .4 "\s-2proc\s0" $1 X # # Draw a pregion box # define pregion X box ht .25 wid .5 "\s-2pregion\s0" X define pregionmove X box ht .25 wid .5 invisible X # # Build three example entries # right move by (.75, 0.0) move by (linewid, 0.0) R1: pregion() line <- from R1.e P1: procent("\s-20\s0") line down -> from P1.s P2: procent("\s-21\s0") line down -> P3: procent("\s-22\s0") left L1: line -> from P3.w B0: box ht .25 wid .5 with .e at L1.end "\s-2pregion\s0" L2: line -> from B0.w box ht .25 wid .5 with .e at L2.end "\s-2pregion\s0" right # # Draw share block # move to P2.e B1: line -> right .5 B2: box ht .75 wid .5 with .sw at B1.end "\s-2File\s0" "\s-2Sync\s0" B3: box ht .375 wid .5 with .nw at B2.sw "\s-2Other\s0" B4: box ht .375 wid .5 with .nw at B3.sw "\s-2Address\s0" \ "\s-2Space\s0" # # Connect up the lines # line <-> from P1.e to B2.w line -> from P3.e to B4.nw # # Add the global pregions # line -> down from B4.s down R3: pregion() right line -> from R3.e R2: pregion() line ->; pregion() # # Add additional Legends # move to B2.n "Share Block" above move to R1.nw + (0.0, 0.05) "Private Pregions" above move to R2.n + (0.1, 0.05) "Shared Pregions" above .PE .DE .DS C \fBFigure 5.\fP Share Block Data Structure .DE .KE .NH 2 Virtual Space Sharing .LP The kernel in which this was implemented is based on System V.3, and therefore uses the \fBregion\fP[Bach86] model of virtual memory. This model consists of two main data structures - .I regions, which describe contiguous virtual spaces (and contain all the page table information etc.)
and .I pregions, which are linked per-process and describe the virtual address at which a region is attached to the process, and certain other per-process information about the region of memory. This model is designed to allow for full orthogonality between regions that grow (up or down), and those that can be shared. .LP The most interesting (and difficult) part of resource sharing to implement is sharing of the VM image, although the job is simplified by using .I regions as the basis for managing the VM. The .I sproc system call shares code with the standard .I fork call, the only difference being the handling of regions. If address space sharing is not indicated when the .I sproc() call is made, then the standard .I fork() operations are performed, generating an image that provides .I copy-on-write access to the VM image. If sharing of the VM image is specified, the private regions of the parent process are marked as .I copy-on-write in the child, while all other regions are shared. A point to remember is that a .I fork() or non-VM sharing .I sproc() call leaves any visible stack or other regions from the share group as .I copy-on-write elements of the new process. .LP Algorithmically, when demand paging or loading a new process, the private regions of a process are examined before the shared regions. This provides the .I copy-on-write abilities of a non-VM sharing share group member. It also provides a basis for future enhancements to the manner in which the VM is shared. For instance, it would be possible to share part of the VM image and have .I copy-on-write access to other parts of the image. .LP .I Sproc() allocates a new stack segment in a non-overlapping region of the parent's virtual address space. The new process starts on this stack and therefore does not inherit the stack context of the parent (though the child can reference the parent's stack). The child process starts executing at the address given in the .I sproc() call. .LP Unfortunately, the stock (V.3) region implementation does not support growing or shrinking shared regions (e.g., the data and BSS portion of a process). The main problem is that although regions have a well-defined locking protocol, a basic assumption is made that only the process owning a private region (one that has only a single reference) will ever change it. Using this assumption, various parts of the kernel hold implicit pointers into the region (e.g., pointers to pages) even though they no longer hold a lock on the region. If a process comes in and shrinks such a region, then any implicit pointers are left dangling. Lack of adequate region locking also comes into play when scanning a pregion list attempting to match a virtual address reference with a region of actual memory (during a page fault, for example). This list is not usually protected via a lock since only the calling process can change it. With growable sharable regions, information that is used during the scan could change. The former problem is inherent in uni-processor as well as multi-processor systems, and the latter is only a problem for multi-processor systems. .LP The obvious solution to this is to use a lock on the pregion list. Since all share group processes must obtain this lock each time they page fault, the lock was made a shared read lock - any number of processes can scan the list, but if a process needs to update or change the list or what it points to, it must wait until all others are done scanning.
Unfortunately, to guard against the 'implicit' references mentioned above, the lock could not be hidden in the existing region lock routines, but had to be placed around the entire sections of code that reference the virtual address. .LP The shared read lock consists of a spin lock .I s_acclck which guards the counters: .I s_acccnt and .I s_waitcnt. .I s_acccnt is the number of processes reading the list (or -1 if someone is updating the list); .I s_waitcnt counts the number of processes waiting for the shared lock. .I s_updwait is a semaphore on which waiting processes sleep. Since operations that require the update lock are relatively rare (fork, exec, mmap, sbrk, etc.) compared to the operations that scan (page fault, pager), the shared lock is almost always available and multiple processes do not collide. .LP When a process first creates a share group (by calling .I sproc(2)), all of its sharable pregions are moved to the list of pregions in the shared address block. Some types of regions are not currently shareable. For instance, private text regions may be created for debugging - that way breakpoints may (but are not required to) be set in an individual share group member. The shared pregion list is protected via the shared lock in all places that the pregion list is accessed. In most cases the only code change was to put calls to the shared lock routines around the large sections of code that reference the pregion or region. Since there is only one list, if one process adds a pregion (say through an mmap(2) call) all other share group members will immediately see that new virtual region. .LP An interesting problem occurs when a process wishes to delete some section of its virtual space, either by removing or shrinking a region. In this case it is important that the actual physical pages not be freed until .B all share group members have agreed not to reference those pages. Also important is that the initiating process not have to wait for each group member to run before completing its system call (some members may not run for a long time). To solve this problem, we use the fact that the TLB (translation lookaside buffer) is managed by software for the target processor (a MIPS R2000[MIPS86]). Before shrinking or detaching a region, we synchronously flush the TLBs of ALL processors, while holding the update lock for the share group's pregions. If any group members are running, they will immediately trap on a TLB miss exception, come into the kernel, attempt to grab the shared read lock, and block. Since the lock is held for update, each such process will sleep until the lock is released. The invoking process may then continue to remove the region, free the lock and leave the kernel. .NH 2 Other Attribute Sharing .LP Unlike virtual space, other process resources are not visible outside of the kernel, thus it is only important that they be synchronized whenever a group member enters the kernel. As with virtual space synchronization, we don't want to force the calling process to wait until all other group members have synchronized. To implement this, we keep a copy of each resource in the shared address block: current directory inode pointer, open file descriptors, user ids, etc. This also allows immediate access to the data, as most of it is kept in the user area and therefore inaccessible to any process other than its owner.
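.LP
To make the locking protocol concrete, the sketch below shows one way the shared read lock of the previous subsection could be built from the \fIs_acclck\fP, \fIs_acccnt\fP, \fIs_waitcnt\fP and \fIs_updwait\fP fields. The primitives \fCsplock()\fP, \fCspunlock()\fP, \fCpsema()\fP and \fCvsema()\fP are assumed stand-ins for the kernel's actual spin lock and semaphore routines; this is a sketch of the idea, not the implementation itself.
.DS
\fC
void
shared_rdlock(sa)               /* taken around scans (page fault, pager) */
shaddr_t *sa;
{
    splock(&sa->s_acclck);
    while (sa->s_acccnt < 0) {  /* an update is in progress */
        sa->s_waitcnt++;
        spunlock(&sa->s_acclck);
        psema(&sa->s_updwait);  /* sleep until it completes */
        splock(&sa->s_acclck);
    }
    sa->s_acccnt++;             /* join the current readers */
    spunlock(&sa->s_acclck);
}

void
shared_updlock(sa)              /* taken around fork, exec, mmap, sbrk */
shaddr_t *sa;
{
    splock(&sa->s_acclck);
    while (sa->s_acccnt != 0) { /* wait for the readers to drain */
        sa->s_waitcnt++;
        spunlock(&sa->s_acclck);
        psema(&sa->s_updwait);
        splock(&sa->s_acclck);
    }
    sa->s_acccnt = -1;          /* mark the list as being updated */
    spunlock(&sa->s_acclck);
}

void
shared_unlock(sa)
shaddr_t *sa;
{
    splock(&sa->s_acclck);
    if (sa->s_acccnt < 0)
        sa->s_acccnt = 0;       /* the updater is finished */
    else
        sa->s_acccnt--;         /* one reader is finished */
    while (sa->s_acccnt == 0 && sa->s_waitcnt > 0) {
        sa->s_waitcnt--;
        vsema(&sa->s_updwait);  /* wake a blocked process */
    }
    spunlock(&sa->s_acclck);
}
\fP
.DE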
.LP Those resources which have reference counts (file descriptors and inodes) have the count bumped by one for the shared address block. This avoids any races whereby the process that changed the resource exits before all other group members have had a chance to synchronize. Since a reliable copy of the data is thus always available, all that remains is to synchronize the data on each member's entry to the kernel. To make this efficient, multiple bits in the proc structure .I p_flag word are used. When a group member changes a sharable resource, it first checks its .I p_shmask mask (the kernel version of the share mask) to see if it is sharing this particular resource. If so, .I s_fupdsema or .I s_rupdlock is locked to avoid any races between multiple updaters. The resource is then modified (open the file, set the user id, etc.), a copy is made in the shared address block, each sharing group member's .I p_flag word is updated, and the locks are released. When a shared process enters the system via a system call, the collection of bits in .I p_flag is checked in a single test; if any are set then a routine to handle the synchronization is called. Other events that must be checked on entry to the system were also changed to this scheme, thus lowering the system call overhead for most system calls. .LP The above scheme works well except when two processes attempt to update a resource at the same time. The lock will stop the second process, but it is important that the second process be synchronized prior to being allowed to update the resource. This is handled by also checking the synchronization bits after acquiring the lock. .NH 1 Analysis .LP The resource sharing mechanism described above was implemented and tested on a multi-processor MIPS R2300 testbed. As expected, the time for a .I sproc() system call is slightly less than that of a regular .I fork(). The overhead for synchronizing the virtual spaces is negligible except when detaching or shrinking regions. In practice this only happens if a process shrinks its data space (fairly rare) or if a process does considerable VM management activity (e.g. mapping or unmapping files). A possible performance gain would be to flush the TLBs of only those processors that might be running share group members. .NH 1 Future Directions .LP Although the set of features included in the first release is adequate for many tasks, there are many other interesting capabilities that could be added. .LP For example, the ability to selectively share regions when calling .I sproc() could be a useful facility, somewhat along the lines of Mach shared memory, thus allowing some parts of the address space to be accessed .I copy-on-write while others are simply shared. This ability is a simple extension to the current scheme, as it only requires proper management of the private pregion list and the shared pregion list. .LP There are also other resources that could be usefully shared. For instance, the scheduling parameters of a process could be shared among the members of the share group. Since the shared address block is always resident, it provides a convenient handle for making scheduling decisions about the process group as a whole. In a multi-processor example, the programmer could specify that at least two of the processes in the share group must run in parallel, or the group should not be allowed to execute at all. The priority of the whole group could be raised or lowered, or a whole process group could be conveniently blocked or unblocked.
.LP Currently, calling .I exec(2) breaks the association with the share group. By modifying the concept of pregion sharing to handle a unique address space, it would be possible to have a group of unrelated programs managed as a whole for file sharing or scheduling purposes. .NH 1 Conclusion .LP This paper has presented a new and unique interface for the System V kernel, .I share groups, .R that allows an extremely high level of resource sharing between UNIX processes. The programmer has control of what is being shared. Normal UNIX semantics are maintained for all processes within a share group, thus maintaining compatibility and expected behaviour. .LP This interface has been implemented within a production kernel, and meets the promise of its design. Processes not associated with a share group experience no penalty for the inclusion of share group support, while processes within a share group may take advantage of a common address space and automatic sharing of certain fundamental resources. .LP This interface also allows considerable room for future extension, and shows how an additional layer of process management may be added to the UNIX kernel without penalizing standard programs. .bp .SH References .XP Accetta, Mike, et al., "Mach: A New Kernel Foundation For UNIX Development", in .I USENIX Conference Proceedings, .R Summer, 1986. .XP Tevanian, Avadis, et al., "Mach Threads and the UNIX Kernel: The Battle for Control", in .I USENIX Conference Proceedings, .R Summer, 1987. .XP Wagner, J. C., and Barton, J. M., "Threads in System V: Letting UNIX Parallel Process", a work-in-progress paper presented in .I ;login:, .R USENIX Association, Vol. 12, No. 5, September/October 1987. .XP Barton, J. M., "Multiprocessors: Can UNIX Take the Plunge?", in .I UNIX Review, .R October, 1987. .XP Beck, Bob, and Olien, Dave, "A Parallel Programming Process Model", in .I USENIX Conference Proceedings, .R Winter, 1987. .XP Bach, Maurice J., .I The Design of the UNIX Operating System, .R Prentice Hall, New Jersey, 1986. .XP .I MIPS R2000 Processor Architecture, .R Mips Computer Systems, Inc., Mountain View, CA., 1986. ---------- cut here ----------------