Performance observations please

Joseph M. DeAngelo

Nov 19, 2002

I'd like to know if anyone has done performance analyses on typical
Hercules environments, i.e. running under an Intel-based host. Are
there performance-related how-tos located somewhere? I'm new to
this and would like to share my observations.

First allow me to congratulate the Hercules team on a job very well
done. As a veteran of running DOS and VS1 under VM from the early
70's, I can see right away that this is a big step forward from that
sort of thing. When I could hit the Enter key on a telnet session to
the S/390 Linux running under Hercules and see it come back with an
immediate prompt, I was impressed. CMS running under VM running
under VM was nowhere near as fast.

I could also see that any command which involved I/O to DASD was
suffering. As someone who wrote operating systems for mainframes
long ago, I can testify to the enormous number (thousands) of
instructions needed to translate any I/O request into a channel
program and run that program to a successful conclusion. I saw
someone else estimate that it takes an average of 200 native
instructions to emulate a single S/390 instruction. It would seem
that the greatest gain in performance would come from cutting out
that logjam.

I am running Hercules on a 256 MB K6-2/400 AMD system running SuSE 8.0
Linux. I chose that environment because I feel that the Linux/Intel
kernel is more efficient than Win2K's. My Hercules environment
consists of a 128 MB machine with a 1 GB DASD to house the / filesystem,
a 1.8 GB drive to house /home, and a 200 MB DASD for swap. I realize
that this is a very unimpressive initial hardware situation.

I ran a port of our uni-Rexx product. Normally, under a similar
Linux/Intel system, the port and the corresponding QA backend would
take less than 15 minutes wall time. My run last night consumed
about 10 hours running SuSE 7.0 S/390 within my Hercules
environment. This is a big difference, although I fully recognize
that I have managed to avoid buying an expensive mainframe with a
system worth about $100 and a copy of SuSE 8.0 that cost me about
$40. To a large degree I got what I paid for.

I would appreciate any performance tips. I know I should buy a real
computer for starters.

I'd also like to expand on my observation concerning I/O performance.
I've been a VM person since CP/67 was offered by Boeing Systems as a
timesharing environment. As early as VM/370, this attempt to emulate
I/O was recognized as a significant bottleneck. IBM came out with an
enhancement for running OS/VS1 under VM that required OS/VS1 to
recognize that it was running in a virtual machine (via a quirk in
the Store CPU ID instruction). OS/VS1 would then send page I/O
directly to the VM hypervisor via a Diagnose instruction. This
eliminated the construction of CCW programs and their subsequent
decoding by VM.

I wonder if Hercules does the same or a similar trick with the Store
CPU ID instruction that would permit its guests to know that they are
running under Hercules, which would allow the DIAG instruction to be
used for a similar purpose.

When my SuSE 7.0 S/390 wants to read data from DASD, if it knows that
it is running on an Intel Linux-based host, the request could be more
efficiently translated, i.e. an fread() in the S/390 system could
conceivably be translated into a Diagnose instruction to the
Hercules hypervisor which, in turn, would have the mapping data
needed to satisfy the I/O request with its own direct fread() call.

Anyway, I recognize that my simplistic concept would require a lot
more work than my words might imply, but I think that it is the
truest path to enhanced Hercules performance.
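
To make the idea a bit more concrete, here is a rough sketch in C of
what the host side of such a paravirtualized read might look like.
This is not based on actual Hercules internals; every name and
structure below, and the 4K block size, is invented purely for
illustration.

#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>

/* hypothetical request block the guest would pass via the DIAG
   parameter registers */
struct guest_blk_req {
    uint32_t devnum;        /* emulated device number               */
    uint64_t blkno;         /* starting block on the emulated DASD  */
    uint32_t blkcount;      /* number of 4K blocks to read          */
    void    *guest_buf;     /* guest buffer, already mapped to host */
};

/* device-to-host-file-descriptor lookup, omitted in this sketch */
static int host_fd_for_device(uint32_t devnum)
{
    (void)devnum;
    return -1;
}

int diag_block_read(const struct guest_blk_req *req)
{
    int    fd  = host_fd_for_device(req->devnum);
    off_t  off = (off_t)req->blkno * 4096;
    size_t len = (size_t)req->blkcount * 4096;

    /* one host system call replaces the whole CCW build/decode cycle */
    ssize_t n = pread(fd, req->guest_buf, len, off);
    return (n == (ssize_t)len) ? 0 : -1;
}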

5:59 pm


Re: Performance observations please

Adam Thornton

Nov 19, 2002

On Tue, Nov 19, 2002 at 05:59:27PM +0000, Joseph M. DeAngelo wrote:
> I ran a port of our uni-Rexx product. Normally, under a similar
> Linux/Intel system, the port and the corresponding QA backend would
> take less than 15 minutes wall time. My run last night consumed
> about 10 hours running SuSE 7.0 S/390 within my Hercules
> environment. This is a big difference, although I fully recognize
> that I have managed to avoid buying an expensive mainframe with a
> system worth about $100 and a copy of SuSE 8.0 that cost me about
> $40. To a large degree I got what I paid for.

I find that Hercules emulation tends to cost me about two orders of
magnitude of speed compared to a similar task running "on the metal".
That is, given task X in Linux/390 under Hercules, or under Linux/x86 on
the Hercules host, there's roughly a factor of 100 in terms of
performance difference. So it seems like you're in my ballpark.

Adam
--
adam@...
"My eyes say their prayers to her / Sailors ring her bell / Like a moth
mistakes a light bulb / For the moon and goes to hell." -- Tom Waits

6:05 pm


Re: Performance observations please

Dan

Nov 19, 2002

Probably the most dramatic difference between the mainframe hardware
architecture and, say, a typical "big-iron" UNIX machine architecture
is the I/O. To be more specific, mainframes tend to have a much
larger number of I/O devices connected to them, and the architecture
supports this large number of connections in a topology that doesn't
have the bottlenecks that smaller machines would have.

To the extent you are running I/O bound programs, the mainframe can
support a vastly greater number of them running simultaneously due to
the fact that it has an enormous ability to parallelize I/O. If you
run 1000 simultaneous jobs, each of which uses a dedicated DASD unit,
assuming they are all fairly-well I/O bound, each will complete in
about the time it would have if nothing else were happening on the
system. This is due to the highly parallel I/O architecture and the
highly efficient task switch.

That profile suggests a few things about emulating mainframes for
maximum benefit.

Firstly, the emulated mainframe is a lot like the real mainframe in
the sense that CPU cycles can be considered expensive. If we are
averaging 200 native cycles per mainframe instruction, mainframe
instructions are 200 times as expensive as Intel instructions. That's
probably just about in line with the hardware side. A mainframe
processor that could do as many instructions per second as, say, an
AMD processor, would cost about 200 times as much (by rough order
anyway).

So, in terms of CPU, you have the equivalent of a smallish mainframe
(say a P/390). Due to the highly efficient system architecture, and
I/O capabilities, a P/390 might easily be shared by 100 people if
they were all doing traditional MVS stuff (COBOL compiles, assembly
and link edit, using TSO, running batch jobs). Not bad for only 7
MIPS.

But the I/O picture is a lot more bleak. You might have 100 emulated
DASD units in your Hercules configuration, but chances are you are
only using a handful of real disk drives. That makes the disk drive a
serious bottleneck, especially considering that mainframe workloads
tend to utilize DASD a lot (since that's one of the best ways to get
scalability on a REAL mainframe).

We can take a lesson from how, say, a Multiprise 3000 handles this
situation. It has a real mainframe processor for the CPU, another
real mainframe processor for the channels, and an I/O system emulated
under OS/2. Between the P/390 and the MP/3000, IBM realized that the
OS/2 drivers and bus were a serious I/O bottleneck for DASD, so they
created a direct connection between the main DASD array and the S/390
side which is still controlled from OS/2, but the data itself doesn't
go through the PC bus, etc.

In order to get the kind of I/O performance needed to scale to the
level of the CPU capacity of that box, a high-performance, hardware
RAID disk subsystem is employed with a large, direct pipe to the
mainframe memory. It's an interesting approach really. You might have
50 emulated DASD units all sitting in storage on a big disk array
that has maybe 20 physical units. If the RAID is implemented well
(and I'm sure it probably is in this box), you should see something
approaching half of the performance of real mainframe DASD. Since the
control instructions are run offline of the actual mainframe CPU (by
a combination of the channel CPU and the Intel "driver" CPU), I/O can
still be parallelized a lot (but still less than in a "true"
mainframe).

So, my observations are:

A dual CPU box is a good idea. In fact, it would be a little better
to have two slower CPUs than one really fast one.

A high-performance disk system (with multiple disk arms, hardware
striping and caching, etc.) is also a good idea.

A ton of RAM isn't really necessary. Put your money into the disk
system and CPUs.

I am building a system like this:

Dual Athlon 2000
512 MB registered ECC RAM
Promise SX4 ATA RAID card
4 x 80 GB, 7200 RPM, ATA-133 Maxtor drives
Case with a big power supply and lots of cooling fans
UPS

The Promise card is very cost-effective, and has extremely good read
performance, but suffers a bit in write performance for RAID-5 due to
its slow processor for XOR calculation. This could be remedied by
going to a RAID-0 configuration, or 0+1, which would require 2 * (n -
1), or in my case 6, drives to get the same level of read
performance. I'm also putting 256 MB of ECC RAM on the Promise card for
hardware disk caching.

With respect to how to configure the system, Linux seems like a
better way than Windows, because you have the extra layer of the
Cygwin stuff on Windows. My plan is to create several different Linux
file systems (in partitions/slices of the RAID storage), and divide
mainframe DASD files between them. That will have the effect of
reducing fragmentation and giving a finer granularity of locking in
cases where it is necessary to lock the volume to perform some
operation.
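
Just to illustrate (all of the paths, file names, and device numbers
here are made up), the DASD statements in my Hercules configuration
might then end up looking something like this, with /herc/vols1,
/herc/vols2, and /herc/vols3 each being a separate filesystem carved
out of the RAID storage:

0120    3380    /herc/vols1/mvsres.cckd
0121    3380    /herc/vols2/mvswk1.cckd
0122    3380    /herc/vols3/spool1.cckd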

One more thing:

While running Linux/390 on an emulated mainframe is an interesting
exercise, I personally doubt it would have much practical value over
just running Linux directly on the hardware. Linux is designed for a
price/performance world where CPU cycles are 200 times cheaper than
they are in the mainframe world. The tradeoffs are all different. I
would check out MVS 3.8j, which is a public domain version that runs
great under Hercules, and actually performs quite acceptably as long
as you don't tax the I/O system too much (i.e. with only a few
concurrent users/jobs).

Regards,
--Dan

--- In hercules-390@y..., "Joseph M. DeAngelo" <j_m_deangelo@y...>
wrote:
> I'd like to know if anyone has done performance analyses on typical
> Hercules environments, i.e. running under an Intel-based host. Are
> there performance-related how-tos located somewhere? I'm new to
> this and would like to share my observations.
>
(snip)

7:18 pm


Re: Performance observations please

John Summerfield

Nov 20, 2002

On Wed, 20 Nov 2002 03:18, Dan wrote:
> Probably the most dramatic difference between the mainframe hardware
> architecture and, say, a typical "big-iron" UNIX machine architecture
> is the I/O. To be more specific, mainframes tend to have a much
> larger number of I/O devices connected to them, and the architecture
> supports this large number of connections in a topology that doesn't
> have the bottlenecks that smaller machines would have.

Years ago we spent $19,000,000 on a pair of 3168 machines. CPU was about one
MIPS; I/O was 1.5 Mbytes/sec on one BMX channel, aggregate 3.

In contrast, when I ran Herc/MVS a while ago on a PII 233, it was a few MIPS
and _much_ more I/O - equivalent to a machine about two generations later.

Nobody in their right mind will try to run the number of TSO users we did back
then. Today, licences permitting, you'd give everyone their own 10 MIPS or
more. If you want to synchronise stuff across a network, as I recall, JES3
could manage up to 32 systems.

Hercules on even cheap hardware (I recently bought a PII/266 system, 64 Mbyte
RAM) for $60 is _much_ more capable than the hardware people used to run MVS
3.8 on.

There are uses for Hercules, but if you're running a heavy production
workload, a real computer is a better bet.
--
Cheers
John Summerfield


Microsoft's most solid OS: http://www.geocities.com/rcwoolley/
Join the "Linux Support by Small Businesses" list at
http://mail.computerdatasafe.com.au/mailman/listinfo/lssb

2:27 am


Re: Performance observations please

John Summerfield

Nov 20, 2002

On Wed, 20 Nov 2002 01:59, Joseph M. DeAngelo wrote:
> First allow me to congratulate the Hercules team on a job very well
> done. As a veteran of running DOS and VS1 under VM from the early
> 70's, I can see right away that this is a big step forward from that
> sort of thing. When I could hit the Enter key on a telnet session to
> the S/390 Linux running under Hercules and see it come back with an
> immediate prompt, I was impressed. CMS running under VM running
> under VM was nowhere near as fast.
>
>
> I could also see that any command which involved I/O to DASD was
> suffering. As someone who wrote operating systems for mainframes
> long ago, I can testify to the enormous number (thousands) of
> instructions needed to translate any I/O request into a channel
> program and run that program to a successful conclusion. I saw
> someone else estimate that it takes an average of 200 native
> instructions to emulate a single S/390 instruction. It would seem
> that the greatest gain in performance would come from cutting out
> that logjam.

I first ran Hercules on a Pentium 133. I estimated its CPU performance as
roughly equivalent to a 370/148, though I don't think you could get one with 16
Mbytes of RAM.

I/O performance far exceeds what was available then.

If you have a licence to run z/OS, then that's different, of course.



--
Cheers
John Summerfield


Microsoft's most solid OS: http://www.geocities.com/rcwoolley/
Join the "Linux Support by Small Businesses" list at
http://mail.computerdatasafe.com.au/mailman/listinfo/lssb

2:31 am


Re: Performance observations please

Dan

Nov 20, 2002

--- In hercules-390@y..., John Summerfield <summer@c...> wrote:
> In contrast, when I ran Herc/MVS a while ago on a PII 233, it was a
> few MIPS and _much_ more I/O - equivalent to a machine about two
> generations later.
>

(snip)

> Hercules on even cheap hardware (I recently bought a PII/266
> system, 64 Mbyte RAM) for $60 is _much_ more capable than the
> hardware people used to run MVS 3.8 on.

(snip)

That's interesting.

It would seem to me that there would still be a really big difference
in the level of I/O parallelization.

For example, if you had 10 batch jobs that were I/O bound (like most
batch jobs), and they could easily complete within your batch window,
it would be acceptable to run them using tape data sets. Supposing
you had enough tape drives (maybe 20-30), these jobs could run almost
without any impact to the rest of the system's performance. They
would be using very few CPU cycles per second, and all of their I/O
would be going across different channels than DASD or terminals, and
they would be employing highly inexpensive storage on devices that
ran totally independently of the rest of the system.

Same thing if you had lots of DASD units. With 3380/3980, two
independent data paths per string. If you had 5 strings, you could
have 10 jobs moving data along independent paths to independent units
at the same moment. Since this tends to be highly I/O bound work,
each has a very small resource consumption in main storage or CPU.

All of that I/O parallelization means doing a whole lot of work at
once, without the jobs really impacting one another very much in
terms of performance.

On a PC with, say, 4 fast hard drives on a RAID 0 controller with
caching, the highest transfer speed is much greater, but each disk
arm still has to seek and search for each request. With only 4 arms
across the data shared by a single controller that looks to the
operating system like a single storage device, requests to DASD are
largely serialized, and there can never be more than 4 physical disk
operations at a time.

The sheer number of I/O devices that can be connected to the
mainframe through independent paths greatly enhances its scalability.
To the extent it is I/O bound work, a few MIPS go a long way if
there's enough I/O capacity.

--Dan

3:28 am


Re: Performance observations please

John Summerfield

Nov 20, 2002

On Wed, 20 Nov 2002 11:28, Dan wrote:
> It would seem to me that there would still be a really big difference
> in the level of I/O parallelization.

No matter how you look at it, Herc on a PC isn't going to do a lot of parallel
I/O.

If you want real mainframe performance there's no substitute for a mainframe.

OTOH, Herc on PCs would make a pretty handy platform for programmers to do
their coding, compiling and some testing (licences permitting). If they need
serious I/O then use the big iron. Probably nothing technical would prevent
your personal mainframe from connecting to the corporate DB2 systems.

Similarly, it would be fine for learning about the latest version of DB2 UDB
for OS/390, but for performance evaluation, you still need a real computer.

BTW, don't think you're going to cure I/O problems with a bunch of IDE drives.
You can only drive one at a time on each IDE port. And, if you have a mess of
ribbon in the box, ventilation is going to be a problem. Figure the Athlons
using 70+W each for starters.

--
Cheers
John Summerfield


Microsoft's most solid OS: http://www.geocities.com/rcwoolley/
Join the "Linux Support by Small Businesses" list at
http://mail.computerdatasafe.com.au/mailman/listinfo/lssb

4:24 am


Re: Performance observations please

Dan

Nov 20, 2002

--- In hercules-390@y..., John Summerfield <summer@c...> wrote:
> BTW, don't think you're going to cure I/O problems with a bunch of
> IDE drives. You can only drive one at a time on each IDE port. And,
> if you have a mess of ribbon in the box, ventilation is going to be a
> problem. Figure the Athlons using 70+W each for starters.

Well, there are some options. A bunch of IDE drives in a RAID
configuration will improve things somewhat. Even better (though more
expensive) would be a bunch of SCSI drives with different Linux file
systems, each containing some number of DASD volumes. You could get
up to 15 of them on a single HBA, but there would be more advantage
to using a couple of HBAs and splitting the drives between them.

The PCI bus itself can move 500 megabytes per second (which is the
same as over 100 parallel channels), and twice that if you use a 64
bit PCI HBA. Not bad.

The limitations of the physical package, power supply, cooling
system, etc. catch up with you really quickly in the PC world. I
fully agree that a real computer is necessary if you want to do real
work. I only mean to point out that there are various options, and
that the I/O parallelization consideration is a key hardware
optimization when emulating a mainframe.

The other concern would be having multiple processors, since, in the
emulated mainframe, running the channel program requires CPU cycles.

Regards,
--Dan

8:40 pm


Re: Performance observations please

mvt

Nov 21, 2002

On Wed, Nov 20, 2002 at 08:40:23PM -0000, Dan wrote:
(snip)
> Well, there are some options. A bunch of IDE drives in a RAID
> configuration will improve things somewhat. Even better (though more
> expensive) would be a bunch of SCSI drives with different Linux file
> systems, each containing some number of DASD volumes. You could get
> up to 15 of them on a single HBA, but there would be more advantage
> to using a couple of HBAs and splitting the drives between them.
>

Hi Dan,

My experience over the years leads me to think that one of the most
effective ways to improve I/O performance is to avoid the I/O entirely.

This thread (except Greg's comment) seems somewhat to overlook the
effects of the caching services provided by Linux (and other Unices)
at the filesystem level, and of the caches implemented by database
products hosted by the guest operating system.

Linux strives to avoid I/O by allocating available memory to the buffer
cache. The cache can be quite large (hundreds of megabytes or more) on
machines which are not memory constrained. The effect is similar to that
of the large 3880/3990 caching controllers of days gone by.

My time spent playing with MVS under Hercules leads me to believe that
the majority of the working set of active data becomes resident in the
buffer pool (except when MVS is hosting a database product which is
managing its own large buffer pools). Activity against active data
tends to result in no physical i/o at all (other than sync'ing write
activity to the brown media). The effectiveness of the cache is
even more profound when running with Greg's DASD compression code.
Even with a single large IDE drive, the apparent (from the MVS point
of view) I/O rates can be astonishingly good.

Unless a database which provides its own buffering is in the mix, I like
to limit the MVS memory to a few hundred megabytes less than physical
memory on the machine. If a database is running under MVS which
manages its own buffer pools, and if that internal database pool is
larger than the operating system provided buffer pool, then cache misses
are guaranteed to happen (thus polluting the cache, wasting cycles, and
degrading performance for other tasks, which is why Oracle and others
lobbied heavily for "raw device" functionality) and all available memory
should be allocated to MVS.
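
As a made-up example of the sizing I mean: on a host with 512 MB of
physical memory I might code something like

    MAINSIZE 256

in the Hercules configuration, and leave the remaining couple of
hundred megabytes to Linux, where most of it ends up in the buffer
cache sitting in front of the emulated DASD files.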

In the real world, running blue hardware, we find that 5 to 10 GB
of DB2 buffer pool is necessary to keep a pair of z900s (14 engines)
busy regardless of the number of paths, etc, etc, that we throw at the
problem. The mileage of others may vary.

So... when considering I/O performance, perhaps it would be wise to
consider throwing a couple of gigs of memory into the mix before
going too far down the SCSI, RAID, etc, road.

--
Reed H. Petty
rhp@...

3:06 pm


Re: Performance observations please

Dan

Nov 22, 2002

--- In hercules-390@y..., mvt <mvt@d...> wrote:

> Hi Dan,
>
> My experience over the years leads me to think that one of the most
> effective ways to improve I/O performance is to avoid the I/O entirely.
>

That seems to be a very common perception. Please don't misunderstand
me. I'm not saying it's untrue. I just feel that it's looking at the
problem from the "wrong" (subjectively) angle.

If the goal of optimizing I/O performance is to have every I/O
complete as quickly as possible (i.e. minimizing the wait time for
each I/O), then it's absolutely correct to say having the data cached
in RAM makes matters better.

It seems to me that the misconception lies there. Most people
generally believe that the way to do a lot of work with their
computers is to figure out how to make their computers complete each
task as quickly as possible (i.e. they think performance equals
scalability). In my experience, it's much more complicated than that.

Scalability means the system can handle 'n' tasks simultaneously, and
each of those tasks will complete in an acceptable amount of time
from the point of view of the user waiting for it. It's not generally
true that reducing the turnaround time for each task will result in
your system being able to do the most possible work at one time.

In order to maximize the amount of work your system can do at one
time, it's necessary to try to minimize each task's impact on your
system's resources. The fewer resources each task uses, the more
tasks can be going at once with acceptable performance. If you think
about it, you'll see that the resources a task uses do not
necessarily go down as you improve its turnaround time. The more
we "optimize" a task for fast turnaround (beyond simple gains in
efficiency), the more expensive it becomes.

For example, memory is a faster storage medium than DASD, and it's
also a lot more expensive. A program that keeps its data in DASD data
sets is cheaper in system resources than one that keeps it all in
memory, even though that program's turnaround time might be longer.
But, if that turnaround time is still acceptable to the user, it's
better to use the cheaper resources for that program, leaving more
memory free for other things. If every program that runs on your
system uses 10K of memory and keeps all of its data on disk, you can
run 100 programs in one meg of main storage as long as you have
enough disk units and data bandwidth (say 20 drives with 5 programs
sharing each drive).

On the other hand, if your performance is based on having a lot of
cache memory instead of a good I/O system, it's much more expensive
to scale it up. If you need to run a batch job that processes a
million records sequentially, unless you have the main storage to
cache them all, that program will still have to access every single
bit of that data. Making that program go fast on a system with poor
I/O and lots of cache memory is going to be extremely expensive.
Consider the case of having to run 20 programs like that at once.
With 20 disk drives and some decent I/O hardware, they might only
take 100K each in main storage and their I/O will be completely
parallelized, meaning the system can likely do all 20 in about the
same time it would have taken to do one. Without the I/O
parallelization, the more data you start moving, with less locality
of reference, as with large batch runs, the more the poor performance
of the I/O shines through. In short, the system doesn't scale upward.

I definitely don't mean to disdain your observations. I have heard a
lot of people talk about how caching reduces the need for a good I/O
system. I haven't really seen it in practice. It seems to me that a
lot of these kinds of ideas spring from the fact that people tend to
buy a machine and dedicate it to one thing and then throw as much
hardware as they can at doing that one thing as fast as they can do
it. A lot of focus is placed on turnaround times with relatively
little thought to the cost per unit of work, and scalability. It's
counterintuitive to think a slower program is more efficient than a
faster one, but I have found it to be true in many cases. People
trade efficiency for raw speed, which seems to me like robbing Peter
to pay Paul.

--Dan

6:01 am


Re: Performance observations please

Greg Smith

Nov 22, 2002

Dan wrote:

>--- In hercules-390@y..., mvt <mvt@d...> wrote:
>
>
>
>>Hi Dan,
>>
>>My experience over the years leads me to think that one of the most
>>effective ways to improve I/O performance is to avoid the I/O entirely.
>>
>>
>
>That seems to be a very common perception.
>
<snip>

>If the goal of optimizing I/O performance is to have every I/O
>complete as quickly as possible (i.e. minimizing the wait time for
>each I/O), then it's absolutely correct to say having the data cached
>in RAM makes matters better.
>
>It seems to me that the misconception lies there. Most people
>generally believe that the way to do a lot of work with their
>computers is to figure out how to make their computers complete each
>task as quickly as possible
>
<nother snip>

>In order to maximize the amount of work your system can do at one
>time, it's necessary to try to minimize each task's impact on your
>system's resources.
>
<more snip>

>For example, memory is a faster storage medium than DASD, and it's
>also a lot more expensive.
>
<etc>

>On the other hand, if your performance is based on having a lot of
>cache memory instead of a good I/O system, it's much more expensive
>to scale it up.
>
<last one>

> People
>trade efficiency for raw speed, which seems to me like robbing Peter
>to pay Paul.
>
>
>
Don't get me wrong, but are you seriously arguing that having a cache
between disparate storage media is due to lazy programming? Why
do processors have an L1 and L2 cache? And disk control units a cache?
And applications, such as Hercules? I think even z/OS tries to page
stuff using an LRU-type algorithm.

I'm a bottom-up type of programmer. If my choice is coding read()/
write() or performing a search on some in-storage array that might
already have my data, then I'll burn the cpu to search the array as long
as the ratio of disk access time vs cpu time is great enough. I bought
1G of memory for my 3-yr-old dual PIII 850MHz machine a while ago for
130 USD. I don't consider that *that* expensive.
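
To show the kind of tradeoff I mean, here's a trivial sketch in C.
This is not the actual Hercules cache code; the names, the linear
search, and the 4K block size are all made up for illustration.

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NSLOTS 256
#define BLKSZ  4096

/* tiny illustrative lookaside cache: search an in-storage array
   before falling back to a disk read */
struct slot { int valid; off_t blkno; unsigned char data[BLKSZ]; };
static struct slot cache[NSLOTS];

int read_block(int fd, off_t blkno, unsigned char *out)
{
    /* burn some cpu searching the array... */
    for (int i = 0; i < NSLOTS; i++) {
        if (cache[i].valid && cache[i].blkno == blkno) {
            memcpy(out, cache[i].data, BLKSZ);
            return 0;               /* ...and avoid the i/o entirely */
        }
    }
    /* miss: pay for the disk access, then remember the block */
    if (pread(fd, out, BLKSZ, blkno * BLKSZ) != BLKSZ)
        return -1;
    struct slot *s = &cache[blkno % NSLOTS];
    s->valid = 1;
    s->blkno = blkno;
    memcpy(s->data, out, BLKSZ);
    return 0;
}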

You are right in the sense that cache shouldn't be blindly applied to
solve a problem. But, it seems, you are making judgement calls against
code that you admittedly haven't even looked at. I don't blindly make
coding decisions. I take measurements, I trace the code, I examine the
assembler. In some complicated tasks, like garbage collection, my
intuition as to what should work best is shown wrong.

If you are serious that caching may be misapplied in Hercules code
then please cite some examples.

Remember, Hercules can run on, e.g., Linux/390. I can define my emulated
disks to be on a raid0 filesystem that spans multiple volumes across
multiple controllers and CHPIDs. Or everything can be on a `lousy' IDE
controller on my PC, which gets, btw, about 20MB/s.

Thanks,

Greg

7:26 am


Re: Performance observations please

Dan

Nov 22, 2002

--- In hercules-390@y..., Greg Smith <gsmith@n...> wrote:

> Don't get me wrong, but are you seriously arguing that having a cache
> between disparate storage media is due to lazy programming? Why
> do processors have an L1 and L2 cache? And disk control units a cache?

No, that's not what I'm arguing. I think we are looking at this from
different angles.

It's a well-known system design principle that caching improves the
parallelism achievable in a system by helping to decouple
asynchronous subsystems with different price/performance tradeoffs.
You're talking about something that is fundamental to the design of
virtually all nontrivial computing systems.

The system gets probabilistic efficiency gains due to the fact that
references tend to be localized, and also due to the fact that
hardware accesses often involve multiple physical actions that must
be performed, the inertia of the actual mechanisms involved, and so
forth. These are all system design considerations. In fact, the
mainframe seems to be about the only system left that still tries to
optimize the physical hardware work (by reducing actuator motions
that must be carried out, etc.) Most other systems have "evolved" to
see the hardware as an abstract concept.

But I digress. My points are:

1. Caching does not obviate the need for I/O performance. It doesn't
even really reduce the need for I/O performance. The gains made by
caching can't generally be made by gains in I/O performance, and vice
versa. Caching is an important system consideration that is more or
less orthogonal to I/O performance.

2. Caching is a price/performance tradeoff that has a point of
diminishing returns. At some point, adding more cache costs more than
it provides. That suggests that there is a "right" amount of cache in
a system, beyond which it is a waste of resources to add more.
Finding the "right" point is a very complex and difficult task, which
is why there are system programmers who specialize in performance
management.

3. Caching is a system problem, not an application problem.
Applications should not do elaborate caching (beyond basic
buffering). Mainframe applications should minimize their impact to
system resources by using as little main storage as possible,
generally for their state data which have a tight locality of
reference, and keep the data they operate on in data sets. System
programmers or other users can decide whether those data sets should
reside on tape, DASD, hiperspace, main storage (VIO), etc.

I believe that this discussion began with somebody saying that a good
way to improve I/O performance is to have a lot of cache RAM so as to
avoid doing any I/O at all. My response was meant to be something to
the effect of this: That's not really a good way to improve I/O
performance. It is simply a way to use more expensive hardware that
is faster instead of using cheaper hardware that is slower.

In other words, given a choice between a PC system with 256 MB of RAM
and fifteen 4 GB SCSI drives, and a PC system with 8 GB RAM and a single
60 GB IDE drive, I think the former would be capable of much greater
mainframe workloads with Hercules, considering that most commercial
workloads have a comparatively low reference locality and tend to be
I/O bound. I think the 15-fold increase in I/O parallelism buys more
scalability than the 32-fold increase in RAM.

When I say "greater workloads", I am talking about in a scenario
where the machine is doing many things at once.

To illustrate:

Imagine a job that processes 1 GB of data stored in a dataset, and
stores the 1 GB result in another dataset. Both datasets are
permanent. The job processes the records sequentially.

No matter how much caching the system can do, it is still necessary
to read 1 GB of data from DASD and ultimately to write 1 GB of data
back to DASD. The latter might happen in a "lazy writeback" system,
but it must still be done in order to ensure the data's consistency
if the system should suddenly crash or lose power. If you only ever
ran this one job on the system, the second time you ran it you would
avoid the need to read the data, but not the need to write the data.

Run the job cold on both of those systems. It may complete sooner on
the one with lots of memory, but it's not really complete because the
system still must write back all of the cached data to disk.
Ultimately, the systems perform somewhere pretty close to equally
with that job.

Now imagine you have 15 jobs like that. Each of them reads a
*different* 1 GB of data, and each of them produces a separate
dataset with 1 GB of data. Running in isolation, no job runs at
greater than 5% CPU utilization.

Run all 15 jobs at once on the system with 1 disk drive. The system
must allocate 30 datasets on the same drive (maybe different volumes,
but the same physical drive). Running 15 jobs at once has reduced the
amount of memory available for caching as well. The system must still
read in 15 GB of data, but it must all come from the one drive. The
single drive bottleneck means the CPU cannot be utilized to its full
potential (which should be 75% in this case). Caching cannot improve
this, because all of that data must be brought in from DASD before it
is in the cache, and we are only going to read it once. Once all of
the jobs finish generating their result sets, 15 GB of data must be
written back to the single drive. Caching can't improve that either,
it can only delay it. Even though the system has 16 times as much
memory available, it will still take at least 15 times as long to
complete all 15 jobs as it would have to complete one due to the fact
that they are all sharing a single drive. Matters are likely made
worse by the fact that the drive actuator is thrashing more.

Run them on the system with 15 drives, with a different drive per
job. Each drive has two datasets allocated to it. On this system, CPU
utilization goes right up to 75%, and each job utilizes a single
drive to the same extent that it would have if it were the only job
running on the system. All 15 jobs complete in the same amount of
time that it would have taken to complete only one of them.

But what if you preloaded all of the data and locked it in memory
(e.g. in hiperspaces or some such)? What if you used VIO for the
output data sets instead of DASD? Well, then you're just using more
expensive storage to do the job. Yes, any single job will run faster
if you throw more money at it. But if your goal is to put together a
multi-purpose system with the idea that it should be able to do as
much work as possible for your money (i.e. "bang for the buck"), a
high-performance DASD subsystem is a whole lot more cost effective
than a bunch of RAM. Also consider that a regular PC can't really
address 8 GB of RAM. It runs into a limitation at 4 GB that requires
special hardware to surpass, costing even more.

I tend to view it like this: If I have a single job that is I/O
bound, and it completes in an acceptable amount of time using DASD
I/O, then I can run some large number of those jobs in parallel on
the same system as long as I have enough drives. Each will still
complete in about the same time it would have if nothing else were
happening on the system. As long as that number is acceptable, the
system is scalable in a way that is much more deterministic (i.e.
guaranteed) than trying to throw a lot of RAM at the problem. It's
not a question of trying to get one job to run as quickly as
possible. It's a question of trying to get the most possible work out
of the system.

> I'm a bottom-up type of programmer. If my choice is coding read()/
> write() or performing a search on some in-storage array that might
> already have my data, then I'll burn the cpu to search the array as
> long as the ratio of disk access time vs cpu time is great enough.

So would I. But the choice of whether to have all of the data in an
array in the first place is a higher level design decision. When
processing data of some arbitrary size, do I dynamically allocate a
big buffer, pull it in from disk, and then do a bunch of work on it
in memory, or do I seek around the data on disk and do the work on
small chunks of it brought into fixed sized buffers? The former
trades machine resources to get speed. It will execute faster, and it
will take more memory. If the data set is truly arbitrarily sized,
then that makes it much worse because the memory usage of the program
is open-ended, meaning its worst case usage cannot be predicted at
design time. Neither choice is always right, but it should be
considered seriously at design time with an eye to the tradeoffs
involved. If the latter approach allows the program to complete in an
acceptable period of time, then it is probably a much better approach
since it makes more efficient use of machine resources.
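
To make the contrast concrete, here is a sketch of the second approach
in C; the 64K buffer size and the process_chunk() routine are just
placeholders, not anything from a real program.

#include <stdio.h>
#include <stdlib.h>

#define BUFSZ (64 * 1024)

/* stand-in for whatever work is actually done on each chunk */
static void process_chunk(const char *buf, size_t len)
{
    (void)buf;
    (void)len;
}

int process_file(const char *path)
{
    FILE *f   = fopen(path, "rb");
    char *buf = malloc(BUFSZ);
    size_t n;

    if (!f || !buf) {
        if (f) fclose(f);
        free(buf);
        return -1;
    }
    /* memory use stays at 64K no matter how large the data set is;
       the cost is more read calls and a longer elapsed time */
    while ((n = fread(buf, 1, BUFSZ, f)) > 0)
        process_chunk(buf, n);

    free(buf);
    fclose(f);
    return 0;
}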

> I bought
> 1G of memory for my 3-yr-old dual PIII 850MHz machine a while ago for
> 130 USD. I don't consider that *that* expensive.

Memory is expensive in many ways. Its cost per byte is still much
greater than disk space. Then there is the fact that the system can
only address a small, finite quantity of it (4 GB). If you want to
use the machine to process more than 4 GB of data at once, some of
that data will have to be in some other storage medium. At that
point, it is a good idea to keep the more important things in memory
and the less important things on disk. If you already have that
discipline in your application programming, then you already have a
system that scales up much bigger. Then there is the locality of
reference issue. Accessing a larger amount of memory at once results
in more cache misses, which dramatically slows the processor's
instruction rate. Cache misses are synchronous hits to CPU execution
(meaning they have to be considered a cost in terms of CPU cycles),
while I/O is always asynchronous. There is the allocator overhead.
Since RAM costs more per byte, it is desirable to use sophisticated
schemes to reduce or eliminate slack space. Those allocation
algorithms tend to have a much greater cost in CPU cycles than DASD
storage management, since it is acceptable to waste more of the
latter in order to reduce CPU usage.

There is also the important point that using a lot of memory does not
change the fact that the data must end up on disk anyway in order to
be in a permanent form, so there is some I/O involved even if you
wanted to have it all in memory all the time.

In general, system designs have evolved along a line of counting main
storage as a relatively small, finite, temporary, relatively
expensive storage medium, with a hierarchy of cheaper and more
permanent, but slower, storage media beyond it.

I think in many places there has been a trend toward programs (and
even system designs) that consider memory to be cheap and disdain I/O
as being expensive, and I think this trend has had a negative impact
on the overall efficiency, cost, and scalability of our systems.

>
> You are right in the sense that cache shouldn't be blindly applied to
> solve a problem. But, it seems, you are making judgement calls against
> code that you admittedly haven't even looked at.

Fair enough. But I didn't think we were talking about a specific
piece of code. This discussion began when I observed that I/O system
performance is important to the performance of an emulated mainframe,
and somebody suggested that perhaps having a lot of RAM would be a
better use of your money (when putting together a Hercules system)
than SCSI drives, etc. I only meant to say I disagree with that
statement.

> I don't blindly make
> coding decisions. I take measurements, I trace the code, I examine
> the assembler. In some complicated tasks, like garbage collection, my
> intuition as to what should work best is shown wrong.

I don't know you very well, but just from talking with you I would
tend to assume you are careful and astute. I never meant to suggest
otherwise.

I've seen a lot of people put long hours into profiling and tweaking
something so that its execution time in a vacuum is as short as
possible. I think that's the wrong thing to be profiling and
optimizing in the first place. There is a tradeoff between turnaround
time and resource usage that should be worked until the code in
question uses the fewest system resources it can for an acceptable
turnaround time in real-world usage. That's a much more complicated
problem than getting it to go as fast as possible in isolation, but I
suggest it is "the stuff" of performance management.

>
> If you are serious that caching may be misapplied in Hercules code
> then please cite some examples.
>

I never meant to suggest that. I was trying to say that if I were to
advise where to put your money into an emulated mainframe system to
get good performance, I'd spend more on the I/O and disk drives than
on the memory. That's for any emulated mainframe, whether Hercules or
FLEX-ES. I haven't looked at the Hercules code, but it seems to work
quite well for the limited amount of stuff I've done with it so far.

> Remember, Hercules can run on, e.g., Linux/390. I can define my emulated
> disks to be on a raid0 filesystem that spans multiple volumes across
> multiple controllers and CHPIDs. Or everything can be on a `lousy' IDE
> controller on my PC, which gets, btw, about 20MB/s.

True. Even better, you can define them to be on individual disks. I
am going with IDE RAID for my Hercules box, but I think you'd get
better performance going SCSI with a bunch of smaller drives (say 4-8
GB), and splitting your DASD between them. RAID is a case of taking a
bunch of slow, parallel things and converting them to a single fast,
serial thing. I think they are more advantageous as slow, parallel
things. For example, every drive in the array must seek on every
access in a RAID system. If you split DASD between the drives, a
single program can process data sequentially on a single drive
without a seek between each read or write, and without affecting the
performance of other programs at all. Also, disk units nowadays have
caches and read-ahead logic that work much better when each disk is
dedicated to a small number of tasks.

One of my favorite analogies is the laundry. If you were designing a
public laundry facility that could handle 6 customers per hour, would
it be better to have one washer that completes a load in 10 minutes,
or 6 washers, each of which can complete a load in an hour, assuming
the cost is the same either way?

--Dan

10:06 pm


Re: Performance observations please

Jay Maynard

Nov 23, 2002

On Fri, Nov 22, 2002 at 10:06:39PM -0000, Dan wrote:
> In other words, given a choice between a PC system with 256 MB of RAM
> and fifteen 4 GB SCSI drives, and a PC system with 8 GB RAM and a single
> 60 GB IDE drive, I think the former would be capable of much greater
> mainframe workloads with Hercules, considering that most commercial
> workloads have a comparatively low reference locality and tend to be
> I/O bound. I think the 15-fold increase in I/O parallelism buys more
> scalability than the 32-fold increase in RAM.

This would be my opinion as well, though I haven't benchmarked it formally.
I will note that my dual PIII-550 with four 18 GB SCSI drives and hardware RAID
flies compared to my other systems.

RAID also allows greater parallelization even when only one task is running.

> I think in many places there has been a trend toward programs (and
> even system designs) that consider memory to be cheap and disdain I/O
> as being expensive, and I think this trend has had a negative impact
> on the overall efficiency, cost, and scalability of our systems.

Yes. That way lies Windows.

> Fair enough. But I didn't think we were talking about a specific
> piece of code. This discussion began when I observed that I/O system
> performance is important to the performance of an emulated mainframe,
> and somebody suggested that perhaps having a lot of RAM would be a
> better use of your money (when putting together a Hercules system)
> than SCSI drives, etc. I only meant to say I disagree with that
> statement.

I will disagree with it as well. For my money, you only need enough RAM to
provide real storage for every byte defined to the Hercules system, both
main and expanded storage, plus enough additional to cover the needs of the
host OS and Hercules itself. A 370-mode user will notice no difference in
performance between 256 MB and 1 GB of RAM in the host system, if Hercules
is all he's running. (It would give him the ability to run more than just
Hercules while seeing no performance degradation.)
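
To put made-up numbers on that rule of thumb: a guest defined with
MAINSIZE 128 and XPNDSIZE 64 needs roughly 128 + 64 = 192 MB of real
storage pinned for the emulated machine, plus a few tens of megabytes
for the host OS and Hercules itself, so a 256 MB host is already
comfortable, and going to 1 GB buys that particular guest nothing.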

That said, memory is cheap enough these days that getting what seems a truly
huge amount is worth doing just to improve everything else, and to provide a
cushion against further growth in resource usage. You can never have too
much real storage in a virtual memory OS.

(Aside: One thing the FSF has in common with Microsoft is an explicit
disdain for making programs efficient; they both hold that machines are
getting bigger and faster quickly enough that they don't need to waste
effort on making programs run well on smaller, slower systems.)

> True. Even better, you can define them to be on individual disks. I
> am going with IDE RAID for my Hercules box, but I think you'd get
> better performance going SCSI with a bunch of smaller drives (say 4-8
> GB), and splitting your DASD between them. RAID is a case of taking a
> bunch of slow, parallel things and converting them to a single fast,
> serial thing. I think they are more advantageous as slow, parallel
> things. For example, every drive in the array must seek on every
> access in a RAID system. If you split DASD between the drives, a
> single program can process data sequentially on a single drive
> without a seek between each read or write, and without affecting the
> performance of other programs at all. Also, disk units nowadays have
> caches and read-ahead logic that work much better when each disk is
> dedicated to a small number of tasks.

This is quite true in the specific case of Hercules, or other applications
where you can deterministically split data access across different devices.
Hercules DASD emulation is one such case, of course. (At that point, you
wind up doing the same kind of performance tuning that you do on mainframe
systems with real (non-emulated, such as 3390 as opposed to a Shark) DASD.)
However, when you cannot determine exactly what I/Os will be directed to
what device, or where, you lose the ability to do this kind of tuning.

Not all RAID setups require every drive to seek to perform an I/O. If the
I/O is of a size smaller than the RAID stripe size on a RAID 0 (or RAID 0+1)
array, for example, only the specific drive where the data resides needs to
seek. On my 4-drive RAID 0 array, that means that more I/Os can be
overlapped, as another I/O can be issued while one drive is seeking. (Sound
familiar?) RAID 5 generally does require all drives to seek, however; this
is part of the performance tradeoff when selecting how to set up one's RAID
array.

The overhead of RAID is also affected by what's doing the array management.
Hardware RAID offloads all of that to the controller, which can do the
overhead tasks in the background, thus freeing up the host CPU even more
than SCSI does. (I have no experience with hardware IDE RAID, only SCSI, but
I would expect that the IDE RAID controller would provide the same
decoupling of host CPU activity from disk I/O that SCSI does intrinsically.)
The controller can also do predictive reading and caching transparently to
the host, thus providing the benefits with none (or little) of the cost.

2:40 pm


Copyright 2002