Design of a Fully Preemptable Linux Kernel
George Anzinger and Nigel Gamble
September 6, 2000
MontaVista has developed a hard real-time fully preemptable Linux kernel, based
on Linux kernel 2.4. The preemptable kernel has the potential to dramatically improve
application responsiveness of the Linux kernel, while fully preserving the standard
Linux programming model. The current prototype of the preemptable Linux kernel (for
IA32/X86 platforms) is available for ftp download, at ftp://ftp.mvista.com.
The preemption model used is to allow the kernel to be preempted at any time when
it is not locked. This is very different from the model where preemption is actively
requested by the currently executing code. Using this model, when an event occurs
that causes a higher priority task to be executable, the system will preempt the
current task and run the higher priority task. Of course, there are times when this
should not be done. These include:
At all other times the MontaVista algorithm allows preemption. Also, whenever the
system exits from one of the above states, a test is made to see if preemption is
called for. If so, the current task is preempted. (Please see the final paragraph
below for important comments on this attribute.)
- While handling interrupts
- While doing "bottom half" processing. Bottom half processing is work that
an interrupt routine needs to do, but which can be done at a more relaxed pace.
- While holding a spinlock, writelock, or readlock. These locks were put in
the kernel to protect it from other processors in Symmetric Multiprocessing
(SMP) Systems. While these locks are held, the kernel is not preemptable for
reentrancy or data protection reasons (just as in the SMP case).
- While the kernel is executing the scheduler itself. The scheduler is charged
with executing the "best" task and if it is engaged in making that decision,
it should not be confused by asking it to give up the processor.
MontaVista's effort to bring preemption to Linux focused on two changes.
First the basic interrupt entry and completion code "entry.S" and its helpers was
modified to ensure that it never returns to a user with a pending soft interrupt,
context switch, or signal delivery. It was also modified to never return to system
code with a pending soft interrupt, or allowed context switch pending. Here "allowed"
means that the preemption lock count is zero. The preemption lock count is incremented
whenever a spinlock, writelock, readlock, interrupt or trap is taken and decremented
when ever these conditions clear.
The second change centered on modifying the header files for spinlock, writelock,
readlock, and some SMP code to produce code that modified the preemption count.
The scheduler was changed only slightly to directly prevent its own preemption and
to honor an additional "state" TASK_PREEMPTING flag. This flag is set whenever preemption
is taken and tells the scheduler that the task is to be treated as running, even
though its actual state may be other than running. This allows preemption to occur
during wake_up set up times when the kernel sets the current tasks state to something
other than running and then does other set up work on the way to calling the scheduler.
By using the TASK_PREEMPTING flag on the state (the underlying state is preserved)
the scheduler can distinguish between the preemption call and the completion of
the wake_up set up. Without this flag, the task could be put to sleep prior to completion
of the wake_up setup, and thus would never wake_up.
MontaVista is continuing to work on preemption and is currently looking at the following:
Over the long term, MontaVista is investigating whether preemption locks can be
eliminated (or at least greatly reduced in number) by protecting all the short-duration
critical regions with spinlocks that also disable interrupts on the local CPU, and
the long-duration critical regions with mutex locks. ("Long duration" means much
greater than the time taken by two context switches.) This will reduce the overhead
of the preemptable kernel, since there will no longer be any need to test for preemption
("polling for preemption") at the end of a preemption-locked region (which could
happen tens of thousands of times per second on an average system). Instead, preemption
would happen automatically as part of the interrupt servicing that causes a higher-priority
process to become runnable ("event-driven preemption"). Typically, this only happens
a few times to a few tens of times per second with an average system workload, making
the "event-driven preemption" model much more efficient than the "polling for preemption"
model. This method also has an added efficiency in that the system will take advantage
of the cache disruption caused by the interrupt (which is unavoidable) to continue
with the preemption.
- Change the spinlock, writelock, and readlock macros that also disable interrupts
to not bother with the preemption counter. The interrupt system being off prevents
preemption, so the flag is not needed.
- Set up the spinlock, writelock, and readlock macros so that both SMP and
preemption can be turned on at the same time. There is no reason an SMP system
cannot be preemptable.
- Measure the impact on throughput of a variety of workloads. Note that under
combined I/O and CPU intensive workloads, throughput may actually improve.
- Do timing analysis on the system locks.
- Using the above analysis for short locks, consider using the interrupt system
as a lock, instead of the preemption counter. This will reduce system overhead.
- Again, using the above analysis, consider ways to reduce the longer times.
This would most likely entail using mutex locks to protect the regions, thus
blocking only code that needed the same function, rather than the whole processor.
These mutex locks would be based on priority-inheriting binary semaphores. (Priority-inheriting
semaphores are designed such that a task holding such a semaphore runs at, or
above, the priority of the highest priority task waiting for the semaphore.
This prevents a low priority task from being blocked by priority while it is
holding a semaphore that a higher priority task wants.)