Path: utzoo!mnetor!uunet!husc6!cmcl2!nrl-cmf!mailrus!ames!sgi!jmb From: j...@patton.SGI.COM (Jim Barton) Newsgroups: comp.sys.sgi Subject: USENIX Paper on SGI's Threads Message-ID: <12192@sgi.SGI.COM> Date: 4 Mar 88 20:12:54 GMT Sender: dae...@sgi.SGI.COM Organization: Silicon Graphics Inc, Mountain View, CA Lines: 1203 Keywords: threads multiprocessing As I have had quite a few requests for these papers, I'll post them here for all to see. This paper concerns our work extending the UNIX process model to accommodate parallel programming. -- Jim Barton From the UNIX asylum at SiliconGraphics, Inc. j...@sgi.sgi.com, sgi!...@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb ---------- cut here ---------------- .\" .\" Converted to the 'ms' macros (blehhhh...). .\" Run it through 'pic' first ... .\" .TL Beyond Threads: Resource Sharing in UNIX .AU J. M. Barton J. C. Wagner .AI Silicon Graphics, Incorporated 2011 Stirlin Road Mountain View, California 94039 j...@sgi.com, j...@sgi.com .AB UNIX provides a programming model for the user which gives an illusion of multiprocessing. On uniprocessors, this illusion works well, providing communication paths, firewalls between processes and a simple programming environment. Unfortunately, the normal UNIX model does not provide the capabilities to take full advantage of modern multiprocessor hardware. This results from a design which relies on data queueing and preemption to provide the multiprocessing illusion, mechanisms that are unnecessary on a true multiprocessor. .LP The recently proposed addition of .I threads to CMU's MACH kernel shows that there are other effective programming models that may be supported in UNIX as well. This model provides for sharing the virtual address space of a process by breaking the process into several lightweight contexts that may be manipulated more quickly than a normal process. Unfortunately, this model raises questions about support for normal UNIX semantics, as well as thread scheduling performance and kernel overhead. .LP This paper describes a programming model which goes beyond the simple concept of threads, providing a much broader range of resource sharing while retaining the key parts of the UNIX process model. Instead of .I threads, the concept of a process is retained along with most of its semantics. As usual, the fundamental resource shared is the virtual address space. In addition, open file descriptors, user and group ID's, the current directory, and certain other elements of the process environment may be shared as well. This makes for easy construction of useful servers and simple concurrent applications, while still providing high-performance support for more computationally intensive applications. .LP This implementation, labelled .I process share groups, .R allows normal process actions to take place easily, such as system calls, page faulting, signalling, pausing, and other actions which are ill-defined within the .I threads model. The paper describes the philosophy which drove the design, as well as details of the implementation, which is based on, and upwardly compatible with, AT&T 5.3 UNIX. Performance of normal UNIX processes is maintained while providing high-performance share group support. .AE .NH 1 Introduction .LP The success of the UNIX\** kernel is owed in part to the .FS Standard Declaration: "UNIX is a trademark of AT&T Bell Laboratories". .FE effective way in which it hides the resource sharing necessary in a multiprogrammed environment.
A UNIX process is a highly independent entity, having its own address space and environment. Communication paths are restricted to low-bandwidth mechanisms, such as pipes, sockets and messages. .LP When moved to a multiprocessor, this model is overly restrictive. We would like to allow several processors to work on a problem in parallel using high-bandwidth communications (shared memory, for instance). Even though this explicit sharing of data is thought of most often when working with parallel programming, there are other resources that can be shared usefully. For instance, the file descriptors associated with a process could be shared among several processes. .LP In order to provide a conceptual framework for resource sharing among concurrent processes, we have introduced an additional level of functionality into the normal UNIX process model. A process may gain access to this additional level of functionality through an interface called .I process share groups. .R Members of a .I share group .R can potentially share many resources previously thought of as private to a single process. .LP This interface was first implemented in an AT&T System V.3 kernel, and relies on modified region handling to share a virtual address space. New file opens can be propagated to all processes, as well as modifications to the environment of a process (such as the .I ulimit(2) values). By extending the semantics of the UNIX process in an upwardly compatible way, a powerful new programming model has been developed. .NH 2 The Road to Resource Sharing .LP In order to provide a simple model for multi-programming, the designers of UNIX chose a model of independent processes, shared access to a file-system, and limited communications using low-bandwidth paths, such as signals or pipes (Figure 1). Such a decision was appropriate, since the inherent queueing of data implied at many levels of the UNIX interface simplifies multi-programming support. .KS .DS B .PS # # Some globals # addrht = 1.5 addrwid = .75 elht = addrht / 8 ellipsewid = addrwid * (2/3) ellipseht = (addrht/4) * (2/3) # # Address space # A1: box wid addrwid ht addrht box wid addrwid ht elht with .nw at A1.nw "\s-2stack\s0" B1: box wid addrwid ht elht with .sw at A1.sw "\s-2text\s0" box wid addrwid ht elht with .sw at B1.nw "\s-2data/bss\s0" # # Environment # B2: box wid addrwid * 2 ht elht with .c at B1.s - (0.0, 0.25) " File I/O" at B2.e ljust move to B2.sw + (addrwid/2, -0.25) L1: line -> up .25 "pipes" at L1.start below move to B2.se - (addrwid/2, 0.25) L2: line -> up .25 "files" at L2.start below move to A1.w - (0.35, 0.0) L3: line -> right .25 "signals " at L3.start above rjust # # State # move to A1.e + (0.1, 0.0) E1: ellipse with .w "\s-2state\s0" line -> left .25 from E1.w .PE .DE .DS C \fBFigure 1.\fP Version 7 Process Environment .DE .KE To meet the growing need for high-performance and concurrent programming support, some new mechanisms were introduced. Berkeley UNIX introduced the .I socket interface, while System V introduced a local inter-process communication interface providing shared memory support (Figure 2). No new interfaces for managing files were proposed, nor were changes to improve concurrent programming abilities put forward. .KS .DS B .PS # # Draw the address space again.
# A1: box wid addrwid ht addrht box wid addrwid ht elht with .nw at A1.nw "\s-2stack\s0" B1: box wid addrwid ht elht with .sw at A1.sw "\s-2text\s0" B2: box wid addrwid ht elht with .sw at B1.nw "\s-2data/bss\s0" B3: box wid addrwid ht elht with .sw at B2.nw + (0.0, 0.4) \ "\s-2shared memory\s0" # # Draw the environment again. # B4: box wid addrwid * 2 ht elht with .c at B1.s - (0.0, 0.25) " File I/O" at B4.e ljust move to B4.sw + (addrwid/2, -0.25) L1: line -> up .25 "pipes" at L1.start below move to B4.se - (addrwid/2, 0.25) L2: line -> up .25 "files" at L2.start below move to A1.w - (0.35, 0.0) L3: line -> right .25 "signals " at L3.start rjust move to A1.w - (0.35, -0.2) L4: line -> right .25 "messages " at L4.start rjust move to A1.w - (0.35, 0.2) L5: line -> right .25 "semaphores " at L5.start rjust # # Some state # move to A1.e + (0.1, 0.0) E1: ellipse with .w "\s-2state\s0" line -> left .25 from E1.w # # Add the legend # "\fBSystem V Environment\fP" at A1.n + (0.0, 0.1) # # Now draw the BSD layout. # right move to A1.e + (2.5, 0.0) A2: box wid addrwid ht addrht box wid addrwid ht elht with .nw at A2.nw "\s-2stack\s0" BB1: box wid addrwid ht elht with .sw at A2.sw "\s-2text\s0" box wid addrwid ht elht with .sw at BB1.nw "\s-2data/bss\s0" # # Draw its environment # BB2: box wid addrwid * 2 ht elht with .c at BB1.s - (0.0, 0.25) " File I/O" at BB2.e ljust move to BB2.sw + (addrwid/4, -0.25) LL1: line -> up .25 "pipes" at LL1.start below move to BB2.se - (addrwid/4, 0.25) LL2: line -> up .25 "files" at LL2.start below move to BB2.s + (0.0, -0.25) LL4: line -> up .25 "sockets" at LL4.start below move to A2.w - (0.35, 0.0) LL3: line -> right .25 "signals " at LL3.start rjust # # Some state # move to A2.e + (0.1, 0.0) E2: ellipse with .w "\s-2state\s0" line -> left .25 from E2.w # # Add the legend # "\fBBSD 4.2 Environment\fP" at A2.n + (0.0, 0.1) .PE .DE .DS C \fBFigure 2.\fP System V and BSD Process Environments .DE .KE Most recently, the Mach[Acce86] kernel introduced an implementation of tasking labeled \fIthreads\fP[Teva87]. This model allows an address space to be shared among several processes. Each process can execute independently in both kernel and user mode. Although independent execution adds cost to a threaded process, such as kernel context (the user area) and a kernel stack for each thread, a useful concurrent programming environment is provided (Figure 3). .KS .DS B .PS # # Now draw the Mach layout. # right A2: box wid addrwid ht addrht BB0: box wid addrwid ht elht with .nw at A2.nw "\s-2stack\s0" BB1: box wid addrwid ht elht with .sw at A2.sw "\s-2text\s0" BB9: box wid addrwid ht elht with .sw at BB1.nw "\s-2data/bss\s0" B3: box wid addrwid ht elht with .sw at BB9.nw + (0.0, 0.2) \ "\s-2shared memory\s0" box wid addrwid ht elht with .nw at BB0.sw - (0.0, 0.2) \ "\s-2mapped file\s0" # # Draw its environment # BB2: box wid addrwid * 2 ht elht with .c at BB1.s - (0.0, 0.25) " File I/O" at BB2.e ljust "\s-2BSD 4.3 kernel\s0" at BB2.c move to BB2.sw + (addrwid/4, -0.25) LL1: line -> up .25 "pipes" at LL1.start below move to BB2.se - (addrwid/4, 0.25) LL2: line -> up .25 "files" at LL2.start below move to BB2.s + (0.0, -0.25) LL4: line -> up .25 "sockets" at LL4.start below move to A2.w - (0.35, 0.1) LL3: line -> right .25 "signals " at LL3.start rjust move to A2.w - (0.35, -0.1) LL5: line -> right .25 "messages " at LL5.start rjust # # Draw the Mach performance killer.
# mht = (addrht * 0.75) move to A2.e + (0.15, 0.0) BB4: box wid elht ht mht with .w "\s-2Mach kernel\s0" at BB4.n above # # Some state # eoff = 0.1 move to BB4.e + (eoff, 0.0) move by (0.0, mht/4) E1: ellipse with .w "\s-2state 0\s0" line -> left .25 from E1.w right move to BB4.e + (eoff, 0.0) E2: ellipse with .w "\s-2state 1\s0" line -> left .25 from E2.w right move to BB4.e + (eoff, 0.0) move by (0.0, -mht/4) E3: ellipse with .w "\s-2state 2\s0" line -> left .25 from E3.w # # The message line # line <-> down .35 from BB4.s "\s-2 messages\s0" ljust .PE .DE .DS C \fBFigure 3.\fP Mach Process Environment .DE .KE .LP As part of the research for share groups, a true threaded process implementation on System V.3 was constructed[Wagn87]. This implementation of threaded execution went beyond that used in the Mach kernel. A thread was realized as an actual entity within a process, manifested only as the minimum machine context which could be scheduled. This version of threads provided a good environment for numerical or other computationally intensive algorithms, but its range of applications proved limited. Recognizing the limitations of this approach, we looked for a more powerful model. .NH 1 Parallel Programming .LP To take maximum advantage of multi-processors, it must be possible to create and execute parallel algorithms, and to get truly parallel execution of application code. Because of the data-sharing bandwidth necessary for applications to gain performance from parallelism, shared memory is usually the main data path. Sharing memory isn't enough, however. The sharing mechanism and other aspects of the environment must all work together to provide a useful programming model[Bart87]. .LP Shared memory is not useful without a mechanism for synchronizing access to it. For instance, pipes, System V messages or semaphores, sockets or signals can be used to synchronize memory access between processes, but all are much lower bandwidth mechanisms than the memory itself, which can be a performance liability. With hardware supported locking, synchronization speeds can approach memory access speeds. .LP Two equally important aspects of the environment are scheduling and context switching. Even if a multi-process application limits itself to the number of processors available, the best performance will occur only if all the processes are actually running in parallel. The kernel needs to take steps to ensure that these processes run in parallel whenever possible. Since this isn't always possible in UNIX, fast context switching is also important. For high-performance semaphores, it is necessary that blocking or unblocking a process be extremely quick. Fast context switching is one element of this; the kernel must provide facilities for the actual blocking operations to be executed quickly as well. .LP Dynamic creation and destruction of processes must be possible, and getting an existing process started on a new task must have little overhead. In a threaded model, creation of a new thread is often an order of magnitude faster than creation of a process; in practice, however, this advantage is less important than it seems. If a process model is used instead (such as Sequent's[Beck87]), then the speed penalties of process creation are eliminated by creating a pool of processes before entering parallel sections of code, each of which then waits to be dispatched quickly as needed. If enough processes are not available, a new one can be dynamically created, but this problem can be tuned out of the application.
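.LP
To make the pool approach concrete, the following minimal sketch pre-creates worker processes and dispatches them through a System V semaphore. It uses only standard interfaces; the names \fCNWORKERS\fP and \fCdo_chunk()\fP are illustrative, and real code would check the return values of the IPC calls.
.DS
\fC
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

#define NWORKERS 4

int semid;                      /* dispatch semaphore, initially 0 */

void
sem_add(n)                      /* wait when n < 0, post when n > 0 */
int n;
{
    struct sembuf op;

    op.sem_num = 0;
    op.sem_op = n;
    op.sem_flg = 0;
    semop(semid, &op, 1);
}

void
do_chunk(id)                    /* stand-in for the real parallel work */
int id;
{
}

main()
{
    int i;

    semid = semget(IPC_PRIVATE, 1, 0600);
    semctl(semid, 0, SETVAL, 0);

    /* Create the pool before any parallel section is reached. */
    for (i = 0; i < NWORKERS; i++)
        if (fork() == 0) {
            sem_add(-1);        /* block until dispatched */
            do_chunk(i);
            exit(0);
        }

    /* Dispatching the pool is now a single cheap semaphore post. */
    sem_add(NWORKERS);
    for (i = 0; i < NWORKERS; i++)
        wait((int *)0);
    semctl(semid, 0, IPC_RMID, 0);
    return 0;
}
\fP
.DE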
.LP If we add to this model by introducing sharing of other process resources, we approach a useful superset of normal UNIX process semantics. Signals, system calls, traps and other process events happen in an expected way, since they deal with actual processes. By sharing other resources, such as file descriptors, user ID's, and the like, we can develop a powerful model that gives the performance of threads with greater functionality (Figure 4). .KS .DS B .PS addrht = 2.0 addrwid = .75 ellipsewid = addrwid * (2/3) ellipseht = (addrht/4) * (2/3) elht = addrht / 16 # # Draw the address space one more time # A1: box wid addrwid ht addrht C1: box wid addrwid ht elht with .nw at A1.nw "\s-2stack\s0" C2: box wid addrwid ht elht with .nw at C1.sw - (0.0, elht/2) "\s-2stack\s0" box wid addrwid ht elht with .nw at C2.sw - (0.0, elht/2) "\s-2stack\s0" B1: box wid addrwid ht elht with .sw at A1.sw "\s-2text\s0" C3: box wid addrwid ht elht with .sw at B1.nw + (0.0, elht/2) \ "\s-2shared lib text\s0" C4: box wid addrwid ht elht with .sw at C3.nw + (0.0, elht/2) \ "\s-2shared lib data\s0" C5: box wid addrwid ht elht with .sw at C4.nw "\s-2data/bss\s0" C6: box wid addrwid ht elht with .sw at C5.nw + (0.0, elht) \ "\s-2mapped file\s0" C7: box wid addrwid ht elht with .sw at C6.nw + (0.0, elht) \ "\s-2shared memory\s0" # # Draw the environment again. # B4: box wid addrwid * 2 ht elht with .c at B1.s - (0.0, 0.25) " File I/O" at B4.e ljust move to B4.sw + (addrwid/4, -0.25) L1: line -> up .25 "pipes" at L1.start below move to B4.se - (addrwid/4, 0.25) L2: line -> up .25 "files" at L2.start below move to B4.s + (0.0, -0.25) LL4: line -> up .25 "sockets" at LL4.start below move to A1.w - (0.35, 0.0) L3: line -> right .25 "signals " at L3.start rjust move to A1.w - (0.35, -0.2) L4: line -> right .25 "messages " at L4.start rjust move to A1.w - (0.35, 0.2) L5: line -> right .25 "semaphores " at L5.start rjust # # Draw some processes using the address space # eoff = 0.1 move to A1.e + (eoff, 0.0) move by (0.0, addrht/4) E1: ellipse with .w "\s-2state 0\s0" line -> left .25 from E1.w right move to A1.e + (eoff, 0.0) E2: ellipse with .w "\s-2state 1\s0" line -> left .25 from E2.w right move to A1.e + (eoff, 0.0) move by (0.0, -addrht/4) E3: ellipse with .w "\s-2state 2\s0" line -> left .25 from E3.w .PE .DE .DS C \fBFigure 4.\fP New SGI 4D Series Environment .DE .KE .NH 1 Share Groups .LP To address the failures and limitations of resource sharing when applied to multiprocessors, we developed the concept of the .I share group. .R Such a group is a collection of processes which share a common ancestor and have not executed the .I exec(2) system call since being created. The parent-child relationship between processes is maintained, and forms the basis for a hierarchical resource sharing facility, where a parent can indicate which resources each child should share. .LP In general, the main resource shared is the virtual address space associated with the share group, although it need not be shared. Processes in the share group are free to pass pointers to data, and to use high-performance synchronization and data passing techniques (mainly shared memory and spinlocks). .LP A small number of other resources may be shared in the initial implementation. Chief among these are file descriptors. When one of the processes in a group opens a file, the others will see the file as immediately available to them. The descriptor number\** may be passed between processes or simply assumed as part of the algorithm.
.FS The descriptor number is an index into the file table for a process, which holds pointers to open file table entries. For example, .I standard input .R is by convention descriptor number 0, while .I standard output .R indicates descriptor number 1. .FE Other resources that may be shared are the .I ulimit values, the .I umask value, the current directory and the root directory. .NH 1 System Call Interface .LP The share group interface is defined through two new system calls, .I sproc() and .I prctl(). The .I sproc() call is used to create a new process within the share group, and controls the resources which will be shared with the new process. The .I prctl() call is used to obtain information about the share group and to control certain features of the group and new processes. .NH 2 The \fIsproc\fP System Call .LP The syntax of this call is:
.DS
\fC
int sproc(entry, shmask, arg)
void (*entry)();
unsigned long shmask;
long arg;

returns -
    -1 - OS error in call
    >0 - new process ID
\fP
.DE
This call is similar to the .I fork(2) system call, in that a new process is created. The .I shmask parameter specifies which resources the child will share with the share group, each particular resource being identified with a bit in the parameter. A new stack is automatically created for the child process, and the new process is entered at the address given by .I entry. This new stack is visible to all other processes in the share group, and will automatically grow. The single argument .I arg is passed to the newly created process as the only parameter to the entry point, and can be used to pass a pointer to a data block or perhaps an index into a table. .LP The first use of the .I sproc() call creates a .I share group. .R Whenever a process in the share group issues the .I sproc() call, the new process is made part of the parent's share group. A new process may be created outside the share group through the .I fork(2) system call, and use of the .I exec(2) system call also removes the process from the share group before overlaying the new process image. .LP All resource sharing is controlled through the .I shmask ( .B "share mask" ) .R parameter. Currently, the following elements can be shared:
.DS
\fC
PR_SADDR   - share virtual address space
PR_SULIMIT - ulimit values
PR_SUMASK  - umask values
PR_SDIR    - current/root directory
PR_FDS     - open file descriptors
PR_SID     - uid/gid
PR_SALL    - all of the above and any future resources
\fP
.DE
When the child is created, the share mask is masked against the share mask used when creating the parent. This means that a process can only cause a child to share those resources that the parent can share as well, thus forming a hierarchy. The original process in a share group is given a mask indicating that all resources are shared. .LP Whenever a process modifies one of the shared resources, and its share mask indicates that it is sharing the resource, all other processes in the share group which are also sharing the resource will be updated. .LP When the virtual address (VM) space is shared, a process may modify the VM space and all other processes will see that same change. If new regions are added to the VM space, the kernel will check to ensure that no address range overlaps any other. If the virtual address space is not shared, the new process (child) gets a .I copy-on-write image of the share group's (parent's) virtual address space. In this case, the child's stack is not visible in the parent's virtual address space.
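.LP
As an illustration of the interface, the fragment below creates one share group member that shares the address space and open file descriptors of its parent. This is a minimal sketch: the header name \fC<sys/prctl.h>\fP for the PR_* constants and the trivial worker body are assumptions for illustration, not details specified above.
.DS
\fC
#include <sys/prctl.h>
#include <sys/wait.h>

int counter;            /* data segment: visible to every group member */

void
worker(arg)
long arg;
{
    /*
     * Entered here on a private, automatically growing stack;
     * the VM image and file descriptors are the parent's.
     * Real code would guard the shared counter with a lock.
     */
    counter += (int)arg;
}

main()
{
    if (sproc(worker, PR_SADDR | PR_FDS, 1L) < 0)
        exit(1);        /* OS error in call */
    wait((int *)0);     /* reap the group member */
    return counter - 1; /* zero: the child's update is visible */
}
\fP
.DE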
.LP Certain small parts of a process's VM space are not shared. The most critical of these is the process data area, or PRDA. This is a small amount of memory (less than a page in size) which records data that must remain private to the process, and is always at the same fixed virtual location in every process. As an example, consider the .I errno variable provided by the C library. Since the data space is shared, if this variable lived only in the data space it would be difficult for independent processes to make reliable system calls. Thus, the C library locates a copy of .I errno in the PRDA for a process. The format and use of the PRDA is totally within the scope of the user program, and can be managed in any way appropriate. Libraries (such as the C library) which need some private data may use a portion of the PRDA, leaving a portion for direct use by the programmer. .NH 2 The \fIprctl\fP System Call .LP The second system call is the .I prctl() call, which allows certain aspects of a share group to be controlled. Its syntax is:
.DS
\fC
int prctl(option [, value [, value2 ]])
unsigned option;
char *value;
char *value2;

returns -
    -1  - error in OS call
    <>0 - request result
\fP
.DE
The following .I option values may be used:
.DS
\fC
PR_MAXPROCS     - return limit on processes per user
PR_MAXPPROCS    - return number of processes that
                  the system can run in parallel
PR_SETSTACKSIZE - set the maximum stack size for
                  the current process
PR_GETSTACKSIZE - retrieve the maximum stack size for
                  the current process
\fP
.DE
The PR_SETSTACKSIZE option allows a program to set the maximum stack size its processes may have. This value is inherited across .I sproc() and .I fork() system calls, and indirectly controls the layout of the shared VM image. .NH 1 Implementation .LP The implementation of share groups centered on four goals: .RS .IP 1. The implementation must work correctly in both multi-processor and uni-processor environments. .IP 2. Synchronization between share group processes must be able to proceed even though one member is not available for execution. .IP 3. The overall structure of the kernel must not be modified. .IP 4. The performance for non-share group processes must not be reduced. .RE .LP The multiprocessor requirement proved to be the major difficulty in the design, since there are no formal interfaces for synchronization between processes executing in the kernel. This is not usually a problem on uni-processor systems since a process executing in the kernel cannot be preempted. Thus it may examine and modify the state of any other process, and know that the state will not change (unless the executing process goes to sleep). This isn't true in a multi-processor environment. The process being examined may be running on another processor, sleeping on a semaphore waiting for a resource (it could even be waiting for a resource that the examining process controls), or it may be waiting for a slow operation (e.g., a read on a pty or the .I wait(2) system call).
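.LP
As a summary of the user-visible interface before turning to the kernel internals, the sketch below sizes the group's stacks and then creates one member per available processor. Treating the PR_MAXPPROCS value as the number of members to create is an illustrative policy rather than anything the interface requires, and \fCworker()\fP is assumed to be defined elsewhere.
.DS
\fC
#include <sys/prctl.h>

extern void worker();

main()
{
    int i, nprocs;

    /* Size every member's stack before the first sproc() call. */
    if (prctl(PR_SETSTACKSIZE, (char *)(64 * 1024)) < 0)
        exit(1);

    /* One member per processor the system can run in parallel. */
    nprocs = prctl(PR_MAXPPROCS);
    for (i = 1; i < nprocs; i++)
        if (sproc(worker, PR_SALL, (long)i) < 0)
            exit(1);
    worker(0L);         /* the parent takes a share of the work too */
    return 0;
}
\fP
.DE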
.NH 2 Data Structures .LP For each share group, there is a single data structure (the shared address block) that is referenced by all members of the group:
.DS
\fC
typedef struct shaddr_s {
    /* the following are all to handle pregions */
    preg_t *s_region;       /* processes' shared pregions */
    lock_t s_acclck;        /* lock on access to shared block */
    sema_t s_updwait;       /* wait for update lock */
    short s_acccnt;         /* count of readers */
    ushort s_waitcnt;       /* count of waiting processes */
    /* generic fields for handling shared processes */
    struct proc *s_plink;   /* link to shared processes */
    ushort s_refcnt;        /* # processes in list */
    ushort s_flag;          /* flags */
    lock_t s_listlock;      /* protects s_plink */
    /* semaphore for single threading open file updating */
    sema_t s_fupdsema;      /* wait for opening file */
    struct file **s_ofile;  /* copy of open file descriptors */
    char *s_pofile;         /* copy of open file flags */
    struct inode *s_cdir;   /* current directory */
    struct inode *s_rdir;   /* root directory */
    /* lock for updating misc things that don't need a semaphore */
    lock_t s_rupdlock;      /* update lock */
    /* hold values for other sharing options */
    short s_cmask;          /* mask for file creation */
    daddr_t s_limit;        /* maximum write address */
    ushort s_uid;           /* effective user id */
    ushort s_gid;           /* effective group id */
} shaddr_t;
\fP
.DE
This structure is dynamically allocated the first time that a process invokes the .I sproc(2) system call. A pointer in the .I proc structure points to it, and the .I s_plink field links all processes via a link field in the .I proc structure. To protect the linked list during searching, the lock .I s_listlock is used. A reference count is kept in .I s_refcnt, and the structure is thrown away once the last member exits (Figure 5). The rest of the fields are used to implement the resource sharing between group members and are explained in the following sections. .KS .DS B .PS linewid = .25 lineht = .25 # # Draw a process slot box # define procent X box ht .55 wid .4 "\s-2proc\s0" $1 X # # Draw a pregion box # define pregion X box ht .25 wid .5 "\s-2pregion\s0" X define pregionmove X box ht .25 wid .5 invisible X # # Build three example entries # right move by (.75, 0.0) move by (linewid, 0.0) R1: pregion() line <- from R1.e P1: procent("\s-20\s0") line down -> from P1.s P2: procent("\s-21\s0") line down -> P3: procent("\s-22\s0") left L1: line -> from P3.w B0: box ht .25 wid .5 with .e at L1.end "\s-2pregion\s0" L2: line -> from B0.w box ht .25 wid .5 with .e at L2.end "\s-2pregion\s0" right # # Draw share block # move to P2.e B1: line -> right .5 B2: box ht .75 wid .5 with .sw at B1.end "\s-2File\s0" "\s-2Sync\s0" B3: box ht .375 wid .5 with .nw at B2.sw "\s-2Other\s0" B4: box ht .375 wid .5 with .nw at B3.sw "\s-2Address\s0" \ "\s-2Space\s0" # # Connect up the lines # line <-> from P1.e to B2.w line -> from P3.e to B4.nw # # Add the global pregions # line -> down from B4.s down R3: pregion() right line -> from R3.e R2: pregion() line ->; pregion() # # Add additional Legends # move to B2.n "Share Block" above move to R1.nw + (0.0, 0.05) "Private Pregions" above move to R2.n + (0.1, 0.05) "Shared Pregions" above .PE .DE .DS C \fBFigure 5.\fP Share Block Data Structure .DE .KE .NH 2 Virtual Space Sharing .LP The kernel in which this was implemented is based on System V.3, and therefore uses the \fBregion\fP[Bach86] model of virtual memory. This model consists of two main data structures - .I regions, which describe contiguous virtual spaces (and contain all the page table information etc.)
and .I pregions, which are linked per-process and describe the virtual address at which a region is attached to the process, and certain other per-process information about the region of memory. This model is designed to allow for full orthogonality between regions that grow (up or down), and those that can be shared. .LP The most interesting (and difficult) part of resource sharing to implement is sharing of the VM image, although the job is simplified by using .I regions as the basis for managing the VM. The .I sproc system call shares code with the standard .I fork call, the only difference being the handling of regions. If address space sharing is not indicated when the .I sproc() call is made, then the standard .I fork() operations are performed, generating an image that provides .I copy-on-write access to the VM image. If sharing of the VM image is specified, the private regions of the parent process are marked as .I copy-on-write in the child, while all other regions are shared. A point to remember is that a .I fork() or non-VM sharing .I sproc() call leaves any visible stack or other regions from the share group as .I copy-on-write elements of the new process. .LP Algorithmically, when demand paging or loading a new process, the private regions of a process are examined before the shared regions. This provides the .I copy-on-write abilities of a non-VM sharing share group member. It also provides a basis for future enhancements to the manner in which the VM is shared. For instance, it would be possible to share part of the VM image and have .I copy-on-write access to other parts of the image. .LP .I Sproc() allocates a new stack segment in a non-overlapping region of the parent's virtual address space. The new process starts on this stack and therefore does not inherit the stack context of the parent (though the child can reference the parent's stack). The child process starts executing at the address given in the .I sproc() call. .LP Unfortunately, the stock (V.3) region implementation does not support growing or shrinking shared regions (e.g., the data and BSS portion of a process). The main problem is that although regions have a well-defined locking protocol, a basic assumption is made that only the process owning a private region (one that has only a single reference) will ever change it. Using this assumption, various parts of the kernel hold implicit pointers into the region (e.g., pointers to pages) even though they no longer hold a lock on the region. If a process comes in and shrinks such a region, then any implicit pointers are left dangling. Lack of adequate region locking also comes into play when scanning a pregion list attempting to match a virtual address reference with a region of actual memory (during a page fault, for example). This list is not usually protected via a lock since only the calling process can change it. With growable sharable regions, information that is used during the scan could change. The former problem is inherent in uni-processor as well as multi-processor systems, and the latter is only a problem for multi-processor systems. .LP The obvious solution to this is to use a lock on the pregion list. Since all share group processes must obtain this lock each time they page fault, the lock was made a shared read lock - any number of processes can scan the list, but if a process needs to update or change the list or what it points to, it must wait until all others are done scanning.
Unfortunately, to guard against the 'implicit' references mentioned above, the lock could not be hidden in the existing region lock routines, but had to be placed around the entire sections of code that reference the virtual address. .LP The shared read lock consists of a spin lock .I s_acclck which guards the counters: .I s_acccnt and .I s_waitcnt. .I s_acccnt is the number of processes reading the list (or -1 if someone is updating the list); .I s_waitcnt counts the number of processes waiting for the shared lock. .I s_updwait is a semaphore on which waiting processes sleep. Since operations that require the update lock are relatively rare (fork, exec, mmap, sbrk, etc.) compared to the operations that scan (page fault, pager), the shared lock is almost always available and multiple processes do not collide. .LP When a process first creates a share group (by calling .I sproc(2)), all of its sharable pregions are moved to the list of pregions in the shared address block. Some types of regions are not currently shareable. For instance, private text regions may be created for debugging - that way breakpoints may (but are not required to) be set in an individual share group member. The shared pregion list is protected via the shared lock in all places that the pregion list is accessed. In most cases the only code change was to put calls to the shared lock routines around the large sections of code that reference the pregion or region. Since there is only one list, if one process adds a pregion (say through an mmap(2) call) all other share group members will immediately see that new virtual region. .LP An interesting problem occurs when a process wishes to delete some section of its virtual space, either by removing or shrinking a region. In this case it is important that the actual physical pages not be freed until .B all share group members have agreed not to reference those pages. Also important is that the initiating process not have to wait for each group member to run before completing its system call (some members may not run for a long time). To solve this problem, we use the fact that the TLB (translation lookaside buffer) is managed by software for the target processor (a MIPS R2000[MIPS86]). Before shrinking or detaching a region, we synchronously flush the TLBs of ALL processors, while holding the update lock for the share group's pregions. If any group members are running, they will immediately trap on a TLB miss exception, come into the kernel, attempt to grab the shared read lock, and block. Since the lock is held for update, each such process will sleep until the lock is released. The invoking process may then continue to remove the region, free the lock and leave the kernel. .NH 2 Other Attribute Sharing .LP Unlike virtual space, other process resources are not visible outside of the kernel, thus it is only important that they be synchronized whenever a group member enters the kernel. As with virtual space synchronization, we don't want to force the calling process to wait until all other group members have synchronized. To implement this, we keep a copy of each resource in the shared address block: current directory inode pointer, open file descriptors, user ids, etc. This also allows immediate access to the data, as most of it is kept in the user area and therefore inaccessible to any process other than its owner.
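.LP
To make the locking protocol concrete, the sketch below shows one way the shared read lock of the previous subsection could be built from the \fIs_acclck\fP, \fIs_acccnt\fP, \fIs_waitcnt\fP and \fIs_updwait\fP fields. The primitives \fCsplock()\fP, \fCspunlock()\fP, \fCpsema()\fP and \fCvsema()\fP are assumed stand-ins for the kernel's actual spin lock and semaphore routines; this is a sketch of the idea, not the implementation itself.
.DS
\fC
void
shared_rdlock(sa)               /* taken around scans (page fault, pager) */
shaddr_t *sa;
{
    splock(&sa->s_acclck);
    while (sa->s_acccnt < 0) {  /* an update is in progress */
        sa->s_waitcnt++;
        spunlock(&sa->s_acclck);
        psema(&sa->s_updwait);  /* sleep until it completes */
        splock(&sa->s_acclck);
    }
    sa->s_acccnt++;             /* join the current readers */
    spunlock(&sa->s_acclck);
}

void
shared_updlock(sa)              /* taken around fork, exec, mmap, sbrk */
shaddr_t *sa;
{
    splock(&sa->s_acclck);
    while (sa->s_acccnt != 0) { /* wait for the readers to drain */
        sa->s_waitcnt++;
        spunlock(&sa->s_acclck);
        psema(&sa->s_updwait);
        splock(&sa->s_acclck);
    }
    sa->s_acccnt = -1;          /* mark the list as being updated */
    spunlock(&sa->s_acclck);
}

void
shared_unlock(sa)
shaddr_t *sa;
{
    splock(&sa->s_acclck);
    if (sa->s_acccnt < 0)
        sa->s_acccnt = 0;       /* the updater is finished */
    else
        sa->s_acccnt--;         /* one reader is finished */
    while (sa->s_acccnt == 0 && sa->s_waitcnt > 0) {
        sa->s_waitcnt--;
        vsema(&sa->s_updwait);  /* wake a blocked process */
    }
    spunlock(&sa->s_acclck);
}
\fP
.DE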
.LP Those resources which have reference counts (file descriptors and inodes) have the count bumped by one for the shared address block. This avoids any races whereby the process that changed the resource exits before all other group members have had a chance to synchronize. Since a reliable copy of the data is thus always available, all that remains is to synchronize the data on each member's entry to the kernel. To make this efficient, multiple bits in the proc structure .I p_flag word are used. When a group member changes a sharable resource, it first checks its .I p_shmask mask (the kernel version of the share mask) to see if it is sharing this particular resource. If so, .I s_fupdsema or .I s_rupdlock is locked to avoid any races between multiple updaters. The resource is then modified (open the file, set the user id, etc.), a copy is made in the shared address block, each sharing group member's .I p_flag word is updated, and the locks are released. When a shared process enters the system via a system call, the collection of bits in .I p_flag is checked in a single test; if any are set then a routine to handle the synchronization is called. Other events that must be checked on entry to the system were also changed to this scheme, thus lowering the system call overhead for most system calls. .LP The above scheme works well except when two processes attempt to update a resource at the same time. The lock will stop the second process, but it is important that the second process be synchronized prior to being allowed to update the resource. This is handled by also checking the synchronization bits after acquiring the lock. .NH 1 Analysis .LP The resource sharing mechanism described above was implemented and tested on a multi-processor MIPS R2300 testbed. As expected, the time for a .I sproc() system call is slightly less than that of a regular .I fork(). The overhead for synchronizing the virtual spaces is negligible except when detaching or shrinking regions. In practice this only happens if a process shrinks its data space (fairly rare) or if a process does considerable VM management activity (e.g. mapping or unmapping files). A possible performance gain would be to flush the TLBs of only those processors that might be running share group members. .NH 1 Future Directions .LP Although the set of features included in the first release is adequate for many tasks, there are many other interesting capabilities that could be added. .LP For example, the ability to selectively share regions when calling .I sproc() could be a useful facility, somewhat along the lines of Mach shared memory, thus allowing some parts of the address space to be accessed .I copy-on-write while others are simply shared. This ability is a simple extension to the current scheme, as it only requires proper management of the private pregion list and the shared pregion list. .LP There are also other resources that could be usefully shared. For instance, the scheduling parameters of a process could be shared among the members of the share group. Since the shared address block is always resident, it provides a convenient handle for making scheduling decisions about the process group as a whole. In a multi-processor example, the programmer could specify that at least two of the processes in the share group must run in parallel, or the group should not be allowed to execute at all. The priority of the whole group could be raised or lowered, or a whole process group could be conveniently blocked or unblocked.
.LP Currently, calling .I exec(2) breaks the association with the share group. By modifying the concept of pregion sharing to handle a unique address space, it would be possible to have a group of unrelated programs managed as a whole for file sharing or scheduling purposes. .NH 1 Conclusion .LP This paper has presented a new and unique interface for the System V kernel, .I share groups, .R that allows an extremely high level of resource sharing between UNIX processes. The programmer has control of what is being shared. Normal UNIX semantics are maintained for all processes within a share group, thus maintaining compatibility and expected behaviour. .LP This interface has been implemented within a production kernel, and meets the promise of its design. Processes not associated with a share group experience no penalty for the inclusion of share group support, while processes within a share group may take advantage of a common address space and automatic sharing of certain fundamental resources. .LP This interface also allows considerable room for future extension, and shows how an additional layer of process management may be added to the UNIX kernel without penalizing standard programs. .bp .SH References .XP Accetta, Mike, et al., "Mach: A New Kernel Foundation For UNIX Development", in .I USENIX Conference Proceedings, .R Summer, 1986. .XP Tevanian, Avadis, et al., "Mach Threads and the UNIX Kernel: The Battle for Control", in .I USENIX Conference Proceedings, .R Summer, 1987. .XP Wagner, J. C., and Barton, J. M., "Threads in System V: Letting UNIX Parallel Process", a work-in-progress paper presented in .I ;login:, .R USENIX Association, Vol. 12, No. 5, September/October 1987. .XP Barton, J. M., "Multiprocessors: Can UNIX Take the Plunge?", in .I UNIX Review, .R October, 1987. .XP Beck, Bob, and Olien, Dave, "A Parallel Programming Process Model", in .I USENIX Conference Proceedings, .R Winter, 1987. .XP Bach, Maurice J., .I The Design of the UNIX Operating System, .R Prentice Hall, New Jersey, 1986. .XP .I MIPS R2000 Processor Architecture, .R Mips Computer Systems, Inc., Mountain View, CA., 1986. ---------- cut here ----------------