The emulation illusion breaks...

Dan

Dec 9, 2002

...with respect to timing assumptions.

I've discovered a class of issues related to how Hercules violates
relative timing assumptions of the OS running under it, especially
when emulating multiprocessor systems.

Consider the case of spin locking. In a single-CPU system, a critical
region of code can be protected by disabling interrupts while it is
running to protect the integrity of data that it accesses. In a
multiple-CPU system, that doesn't work because other CPUs can access
the data even with interrupts disabled. Spin locks are a classic
solution to this problem. To acquire the lock, a CPU disables its own
interrupts, then spins on an atomic compare/exchange operation until
it finds the lock bit zero and sets it. It then accesses the data,
clears the bit, and re-enables interrupts. If another CPU is
accessing the data, the CPU will spin until the other CPU is done.
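
For anyone who wants to see the shape of it, here is a minimal sketch
of a spin lock in Python. It is an illustration only, not how MVS or
Hercules does it: the names SpinLock and run_demo are made up, a
non-blocking acquire on threading.Lock stands in for the atomic
compare/exchange on the lock bit, and disabling interrupts is not
modeled at all.

```python
import threading


class SpinLock:
    """Toy spin lock: busy-waits instead of sleeping when the lock is taken."""

    def __init__(self):
        # Assumption: a non-blocking acquire on threading.Lock plays the role
        # of the atomic compare/exchange on the lock bit.
        self._bit = threading.Lock()

    def acquire(self):
        # Spin until the test-and-set succeeds (the bit was clear, now set).
        while not self._bit.acquire(blocking=False):
            pass

    def release(self):
        # Clear the bit so that a spinning thread can get in.
        self._bit.release()


def run_demo(threads=2, increments=10_000):
    """Have several threads bump a shared counter under the spin lock."""
    lock = SpinLock()
    total = 0

    def worker():
        nonlocal total
        for _ in range(increments):
            lock.acquire()
            try:
                total += 1  # the critical region
            finally:
                lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total
```

Calling run_demo(2, 10_000) should return 20000, since no increment
is lost while the lock is held.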

Spin locking is used quite commonly. It is done when the cost of two
context switches in CPU cycles is greater than the cost of spinning.
If all CPUs are running at the same rate, it might only take a few
instructions for the other CPU to be done accessing the data, so it's
an easy tradeoff to make.

Systems like MVS also reportedly spin waiting for synchronous I/O
operations to complete. Again, this is done when the cost of
switching contexts exceeds the cost of spin waiting.

Now the problem:

There is a fundamental assumption with spin waits that the spinning
itself can never affect the performance of whatever it is we are
waiting for. By spinning on one CPU, we can never cause the other CPU
to take longer to release the lock. By spinning on a CPU, we can
never cause an I/O operation to take longer. The two agents operate
independently of one another.

In an emulated system, this assumption is obviously violated. The
spinning CPU, emulated, takes cycles needed for the other CPU to
complete the thing it is waiting for. Additionally, the emulation
software creates other interdependencies, such as locks. In order to
examine the state of a processor, it is necessary to acquire a lock
to that state. Thus, while one CPU is examining the state of another,
the other is prevented from changing its state. When a CPU is
spinning waiting for the other's state to be something specific, the
lock contention can seriously hinder progress.
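
A toy model makes the effect concrete. The sketch below is not
Hercules code. It assumes a perfectly fair round-robin host scheduler,
exactly two emulated CPUs (one holding a lock, one spinning on it),
and no emulator-internal locks. The function name
ticks_until_release and the notion of a "tick" are made up for the
illustration.

```python
def ticks_until_release(work_units, host_cpus):
    """Host ticks until the lock holder finishes `work_units` of work
    while one other emulated CPU spins on the lock.

    Assumption: each host CPU runs one emulated CPU per tick, and the
    emulated CPUs are picked in strict round-robin order.
    """
    emulated = ["holder", "spinner"]
    remaining = work_units
    ticks = 0
    next_cpu = 0  # round-robin pointer into `emulated`
    while remaining > 0:
        ticks += 1
        # Hand out this tick's host CPUs to emulated CPUs in turn.
        for _ in range(min(host_cpus, len(emulated))):
            if emulated[next_cpu] == "holder":
                remaining -= 1  # only the holder makes useful progress
            # A slot given to the spinner is simply burned.
            next_cpu = (next_cpu + 1) % len(emulated)
    return ticks


# Two host CPUs: the spinner has its own CPU and costs the holder nothing.
print(ticks_until_release(10, host_cpus=2))  # 10
# One host CPU: every other tick is burned by the spinner.
print(ticks_until_release(10, host_cpus=1))  # 19
```

With a real CPU per emulated CPU, the spinner is harmless, just as on
real hardware. With one real CPU shared between them, the holder takes
nearly twice as long to release the lock, purely because of the
spinning.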

Hercules solves the problem of a CPU spinning waiting for an I/O
operation by running the I/O at a higher priority than the CPU. This
creates a scenario where the CPU's spinning cannot generally hold up
progress of the I/O it is waiting for (except for the priority
inversion case where the CPU is holding a lock to the state of the
device, but we hope the host system's scheduler deals with priority
inversion in one of the classic ways).

There is a real problem when multiple CPUs are emulated, because they
are at the same priority. There are many cases in the operation of an
OS like MVS where one CPU spins waiting for something to happen on
the other. Even in a system with two real CPUs emulating two
mainframe CPUs, we can run into this problem because one of the real
CPUs is occupied doing I/O while one of the emulated CPUs is spinning
waiting for the I/O while the other emulated CPU is spinning waiting
for a response from the first emulated CPU. There are many other
complexities, like locking or the design of spin waits on the target
system, that complicate matters.

As a result, I recently found that switching my dual Athlon machine
to emulating only one CPU had a very serious positive impact on
performance. That's because when the CPU is waiting for I/O, a CPU is
always available to do the I/O, and, in a single-emulated-CPU system,
one CPU never waits for another.

How to solve this problem?

It's difficult. I wonder how FLEX does it.

The only easy solution I can think of is to always emulate fewer CPUs
than you have real CPUs, and to set the CPUPRIO to a high number
(which means a low priority, since CPUPRIO is a nice value). Leave
one or more real CPUs free to run the host OS and
emulate I/O channels. In the case of a dual-proc system, emulate only
a single CPU, and leave CPUPRIO at 15 so the CPU doesn't slow down
any I/O processing, even when multiple channel programs are running
simultaneously.
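
As a rule of thumb, the sizing advice above can be written down as a
one-line calculation. This is only a sketch of the heuristic, not
anything Hercules provides: the function name max_emulated_cpus and
the reserved_for_io parameter are made up, and the default of one
reserved real CPU is an assumption that a heavy I/O load may well
exceed.

```python
def max_emulated_cpus(host_cpus, reserved_for_io=1):
    """Suggested upper bound on emulated CPUs for a given host.

    Assumption: `reserved_for_io` real CPUs are kept free for the host OS
    and I/O channel emulation. At least one CPU is always emulated.
    """
    if host_cpus < 1:
        raise ValueError("host_cpus must be at least 1")
    return max(1, host_cpus - reserved_for_io)


print(max_emulated_cpus(2))  # 1: dual-processor host, emulate one CPU
print(max_emulated_cpus(4))  # 3
```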

It's a tough problem, though, because your physical CPUs must be
split between emulated CPUs and emulated I/O such that CPUs don't
hold up I/O while spinning waiting for I/O, and also such that when
one CPU is spinning waiting for another, there is always a real CPU
available to emulate the one being waited for. The latter requires a
high CPUPRIO. The former requires enough extra real CPUs to handle
the full CPU load of emulating I/O, whatever that may be, so that the
I/O channel emulation is never held up waiting for CPU resources that
are being wasted by spinning emulated CPUs.

In small-scale systems with one or two processors, I would say the
answer is simply to stick to the default of emulating only one CPU at
a low priority (CPUPRIO 15).

--Dan

8:37 pm


Re: The emulation illusion breaks...

Jay Maynard

Dec 9, 2002

On Mon, Dec 09, 2002 at 08:37:33PM -0000, Dan <slimeybidet@...> wrote:
> I've discovered a class of issues related to how Hercules violates
> relative timing assumptions of the OS running under it, especially
> when emulating multiprocessor systems.

Thanks for a very well-thought-out discussion of the issue, Dan.

I've recommended for quite a while that, on an SMP host, the number of
emulated CPUs be at most one less than the number of physical CPUs. This was
based on my own experimentation, when I had ready access to 4- and 8-way
hosts, although that was necessarily limited in scope. I hadn't given it
that much thought, and I'm glad someone did.

10:35 pm

