Newsgroups: comp.arch.storage
Path: sparky!uunet!van-bc!ubc-cs!newsserver.sfu.ca!sfu.ca!vanepp
From: van...@fraser.sfu.ca (Peter Van Epp)
Subject: Mainframe as high transaction rate database engine (longish)
Message-ID: <vanepp.704949857@sfu.ca>
Keywords: TPF Mainframes
Sender: ne...@sfu.ca
Organization: Simon Fraser University, Burnaby, B.C., Canada
Date: Mon, 4 May 1992 03:24:17 GMT
Lines: 143

Since a couple of people have asked to hear more about how the program that
airlines and banks use for reservations (in the case of airlines) and
check clearing (in the case of banks) works, I'll give a fast (and possibly not
dead accurate, I haven't done it in 4 years -:)) description of the
database/disk farm (as I said, I wouldn't call it a file system -:)), and
if you don't want to hear any more I'll shut up -:).
	The following description applies to TPF1, I expect the general
architecture (and almost certainly the database layout) is still the same
in whatever the current version is (TPF 2.4 is the last one I know of).
	The reason that I say there is no file system is that everything
has to fit into a disk block; one of three (or four) sizes fits all: 128 bytes,
381 bytes, 1055 bytes, and in later systems 4096 bytes. Other than the
basic kernel (some 128k bytes as I recall), all programs and data exist in
one of these blocks. In order to allow you to change the makeup of the
underlying disk farm, there is actually a translation mechanism that accepts
a "block number" (a 4 byte hex value) and the system translates that (one
to one) to a disk, cylinder, track and sector. All blocks have a chain word,
so when you run out of space in the current block you can request a new one
from one of two pools of free blocks, chain it to the current block and
continue doing whatever you were doing (in the case of a program, when
you run out of a 1055 byte block, you have to call a new program to get
more space, and split the source code so it maps between the two blocks
by hand!).
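
To make the translation idea concrete, here is a rough sketch in C of how such
a fixed-size-block addressing scheme might look; the structures, the extent
table mapping and the field names are my own invention for illustration, not
the real TPF data structures:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical physical address: which disk module and where on it. */
struct phys_addr {
    uint16_t module;    /* disk spindle / volume                    */
    uint16_t cylinder;
    uint8_t  track;
    uint8_t  record;    /* record (sector) within the track         */
};

/* Every block, whatever its size, carries a chain word pointing to the
 * next block of the chain (0 == end of chain). */
struct block_header {
    uint32_t chain;     /* 4-byte "block number" of the next block  */
    uint8_t  data[];    /* 127, 380, 1054 or 4095 bytes of payload  */
};

/* One-to-one mapping from block number ranges to the physical layout.
 * In a real system this table is generated when the disk farm is
 * (re)configured, which is what lets you change the farm without
 * touching the applications. */
struct extent {
    uint32_t first_block, last_block;   /* contiguous range of block numbers */
    uint16_t module;                    /* where that range lives            */
    uint16_t start_cylinder;
    uint8_t  records_per_track, tracks_per_cylinder;
};

/* Translate a 4-byte block number to a physical address by walking the
 * extent table.  Returns 0 on success, -1 if the number is unmapped. */
int translate(const struct extent *tab, size_t n, uint32_t blk,
              struct phys_addr *out)
{
    for (size_t i = 0; i < n; i++) {
        if (blk >= tab[i].first_block && blk <= tab[i].last_block) {
            uint32_t off = blk - tab[i].first_block;
            uint32_t per_cyl = (uint32_t)tab[i].records_per_track *
                               tab[i].tracks_per_cylinder;
            out->module   = tab[i].module;
            out->cylinder = tab[i].start_cylinder + off / per_cyl;
            out->track    = (off % per_cyl) / tab[i].records_per_track;
            out->record   = off % tab[i].records_per_track;
            return 0;
        }
    }
    return -1;
}

The point of the indirection is that reconfiguring the disk farm only means
regenerating the extent table; no application ever sees a physical address.
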
	As I said above, there are two pools of free blocks, short term
and long term; each pool contains all types of blocks (maybe not 128's,
I'm not sure anymore -:)). The differences between the two are how the
block directories work and how long you may keep the block: short term
is only good for three to four hours, while long term is yours for
as long as you don't release it. The short term pool area is intended for
use on a single transaction (or booking) basis: use it for up to an hour
or so and then release it. There is a series of bitmap directories for short
term pool, and a system call that will return you the address of a block of the
requested size, attached to your task control block and backed by a block
sized piece of main storage. When a directory is full, the system steps into
the next directory (setting all of the blocks in that directory to "not in
use" whether they are or not!); the directories are sized such that this takes
between 4 and 12 hours depending on how busy the system is. If an application
is still using a pool block that has been deallocated by the directory rolling
over, then a system error occurs: the contents of main storage that are
considered "interesting or useful" are dumped to a channel connected
mag tape (that you want to go fast, believe me -:)) while the system "freezes"
for the time needed for the dump to get a consistent snapshot. Obviously you
want to dump the minimum amount of memory you can to the fastest tape device
you can -:). Once the dump finishes, an error message is sent to the user
(typically "system error - redo transaction"; this is called user initiated
error correction -:)).
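
Here is a rough sketch in C of how a rolling short term pool directory could
work (again my own simplified construction, not TPF's actual code): each
directory is a bitmap over a slice of the pool, and when the current one fills
up the allocator steps on and summarily reclaims the next slice, live blocks
or not:

#include <stdint.h>
#include <string.h>

#define NDIR         16         /* directories rolled through in sequence */
#define BLKS_PER_DIR 4096       /* blocks covered by one bitmap directory */

struct st_pool {
    uint8_t  bitmap[NDIR][BLKS_PER_DIR / 8]; /* 1 = in use                     */
    unsigned cur;                            /* directory being allocated from */
    unsigned next_free;                      /* next bit to try in cur         */
};

/* Allocate one short term block.  When the current directory is exhausted,
 * roll into the next one and mark every block in it free -- whether the
 * owners have finished with them or not, exactly the hazard described above. */
uint32_t st_alloc(struct st_pool *p)
{
    if (p->next_free == BLKS_PER_DIR) {              /* directory is full   */
        p->cur = (p->cur + 1) % NDIR;                /* step to the next    */
        memset(p->bitmap[p->cur], 0, sizeof p->bitmap[p->cur]); /* reclaim all */
        p->next_free = 0;
    }
    unsigned i = p->next_free++;
    p->bitmap[p->cur][i / 8] |= (uint8_t)(1u << (i % 8));
    return (uint32_t)p->cur * BLKS_PER_DIR + i;      /* pool block number   */
}

/* A task still holding a block from a directory that has since been recycled
 * will find someone else's data in it; on the real system that is detected
 * and turned into a system error plus the dreaded dump. */
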
	This is all very fine for short term data, but since most airline
customers wish to book their flight somewhat more than 4 hours before flight
time, we need something with a little longer lifetime -:). This comes in the
form of "fixed file" and long term pool. There is an area of the disk that
is like a preallocated file system, where you can request a block by a
sort of file name (a hex number again -:)); this block serves as the
anchor for a chain of long term pool blocks where the data about a flight,
passengers, and whatever other data you have is stored. One item to note
(to do with performance) is that successive records (of, for instance,
passengers booked on a flight) are spread across the disks in the disk
farm (and there are usually lots of disk channels -:)). This is so that
when flight departure is near and lots of agents are making queries against
the same flight records, the single threadedness of things is minimized
(ie. an agent changing Mr Aaaa's reservation doesn't block access to another
agent changing Mr Gggg's reservation, although it would briefly block Mr Abaa's
reservation from being changed). Another part of the performance secret is that
a single transaction should not execute for more than a few 10's to 100's
of milliseconds, so a booking is broken up into a long series of "transactions",
each being very short, all of them modifying a block in short term pool
until all the data is in, the transaction is "ended", and the data is written
out to the correct place in long term pool (again in a short transaction).
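
A guess at how the record spreading might look in C follows; treat the
interleave-by-ordinal placement as my illustration of the idea (consecutive
records of one chain land on different modules and channels), not as the
actual TPF placement algorithm:

#include <stdint.h>

#define N_MODULES 160     /* disk spindles in the farm, each on its own path */

/* A "fixed file" record is found by a small hex file number; it anchors a
 * chain of long term pool blocks holding the real data (flight, passengers). */
struct fixed_record {
    uint32_t file_number;     /* the "sort of file name"                   */
    uint32_t first_pool_blk;  /* head of the long term pool chain (0=none) */
};

/* Choose the disk module for the i-th block hung off a given anchor.
 * Interleaving by ordinal means passenger record i and record i+1 land on
 * different spindles, so two agents working the same flight rarely queue
 * behind one another on the same disk arm. */
static inline unsigned place_module(uint32_t anchor_file, uint32_t ordinal)
{
    return (anchor_file + ordinal) % N_MODULES;
}

The payoff is exactly the Mr Aaaa / Mr Gggg example above: two agents updating
different passenger records on the same flight usually hit different spindles.
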
	Like short term pool, there is a series of directories (bitmaps) for
long term pool blocks; unlike short term pool, these directories are
never recycled. When a long term pool block is released, the directory
entry is not set to "free", it is left at "in use"; all that happens is that
a log record is written to the log tape saying "block xyz has been released".
Understandably, over time more and more of the blocks in long term storage
become "in use", and if left long enough you would run out and everything
would grind to a halt (very Unix like: well, I set the return code on this
write that just failed, but you ignored it, so that must be what you wanted
to do! -:)), and it does, to the great consternation of all. The proper
course of action (before or after running out of pool -;)) is to run something
called "recoup". This is effectively a garbage collection pass, which starts
at the fixed records (one at a time) and chases all the chains of blocks that
are connected to the fixed records (marking them as "in use" as it goes).
Anything that isn't found connected to a fixed record is considered released
and will be reused. Offline (on another system, typically MVS), the release
records written to the log tape are compared to the directories generated
from the online garbage collection, and differences are printed out in a
report that the systems programmer then sweats over to attempt to make sure
that things have worked correctly and the directories generated really
reflect the state of the online system.
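
The mark phase of recoup is basically the classic reachability walk; a toy
version in C (my own sketch, with an in-core chain table standing in for the
real disk reads and all names invented) might look like this:

#include <stdint.h>
#include <string.h>

#define POOL_BLOCKS (1u << 20)

/* chain[b] is the block chained after block b, or 0 for end of chain.
 * In the real thing each hop is a read from the disk farm; here it is
 * just an in-core table so the walk itself is visible. */
static uint32_t chain[POOL_BLOCKS];

/* Bitmap built by recoup: bit set == block reachable, i.e. "in use". */
static uint8_t in_use[POOL_BLOCKS / 8];

static void mark(uint32_t blk)
{
    while (blk != 0 && !(in_use[blk / 8] & (1u << (blk % 8)))) {
        in_use[blk / 8] |= (uint8_t)(1u << (blk % 8));
        blk = chain[blk];                  /* follow the chain word */
    }
}

/* Chase every fixed record's chain; anything never marked is treated as
 * free and will be reused once the new directories are rolled in. */
void recoup_mark(const uint32_t *fixed_anchor, uint32_t n_fixed)
{
    memset(in_use, 0, sizeof in_use);
    for (uint32_t i = 0; i < n_fixed; i++)
        mark(fixed_anchor[i]);
}

The dangerous part is not this walk, but trusting its output enough to
overwrite the live directories, which is the rollin described next.
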
	Once all looks reasonable, then comes the exciting part: recoup
rollin (or toss the bones and bet your job -:)). From the point that the
online garbage collection finished, the system keeps track of which fixed
directories it has modified; at the point of rollin, those changes are added
to the directories generated offline and the live directories are overwritten.
If all is ok, this releases for use all of the long term pool blocks released
since the last recoup, and everybody is happy. The reason for the "toss the
bones and bet your job" comment above is the downside of this procedure: if
something went wrong somewhere in recoup (ie. there was a system crash at a
bad time and a bunch of directories got released that shouldn't have been),
then when you roll in the new directories, instead of freeing the unused
blocks like it should, it releases all the live database data (can you say
"excuse me, but does this 23rd story window open?" in 18 languages? Most
people running recoup can -:)). After the white faces, the denial, and the
screaming, one gets out the backup tapes and starts a restore (the entire
database is backed up every other night, since even several day old data
represents a huge loss in bookings). This is an exciting (and time consuming)
procedure, one that I have luckily never participated in; I heard a report
of an American carrier that lost their reservation system for some 14 hours
to a mistake that sounded from afar to be a bad recoup being rolled in. This
is a disaster, since our uptime target was something like 99.5% (ie. 2 or 3
hours of downtime per month, scheduled and unscheduled), and system downtime
was unofficially estimated to cost around $3000 per minute on a Monday
morning ... Exciting enough to justify having a spare $4,000,000 mainframe in
case the live one broke (it was of course used to run other systems on, but
its size was always selected to be large enough to run the res system on).
	Given the consequences of data loss and downtime, the backup schemes
are also interesting. As a normal event, the input transaction is written
to a log tape, and the old data from any record that the transaction
changes also gets written out to the log tape (given that we were doing
80 transactions per second, you again want fast tape drives!). This means
that during a restore you can roll forward to any particular transaction
since the last backup if something goes wrong (like a software error
destroying the database after some particular time, for instance).
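
In modern terms this is input-message logging plus before-image logging; a
compressed sketch of the idea in C (the record format and field names are
mine, invented for illustration):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

enum log_kind { LOG_INPUT_MSG = 1,   /* the agent's input transaction      */
                LOG_BEFORE_IMAGE,    /* old contents of a record we change */
                LOG_POOL_RELEASE };  /* "block xyz has been released"      */

struct log_rec {
    uint8_t  kind;
    uint32_t block;                  /* which database block, if relevant  */
    uint16_t len;
    uint8_t  data[1055];             /* at most one large block of payload */
};

/* Append a record to the log "tape" (a file stands in for the channel
 * attached drive here).  During a restore you reload the last full backup
 * and then replay these records in order to roll forward to any point. */
int log_append(FILE *tape, uint8_t kind, uint32_t block,
               const void *data, uint16_t len)
{
    struct log_rec r = { .kind = kind, .block = block, .len = len };
    if (len > sizeof r.data) return -1;
    memcpy(r.data, data, len);
    return fwrite(&r, sizeof r, 1, tape) == 1 ? 0 : -1;
}
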
	The backup itself is a block by block (track by track actually)
copy of all the data on the disks while the system still runs. In order
for the backup to be consistent, anything written to a disk that has
already been backed up (other than short term pool records, as I recall)
is also written to the log tape (which for your sake had better be on a
different channel and control unit than the dump drives, which are
themselves preferably each on their own channel and control unit).
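
The consistency trick can be pictured as a cursor sweeping the disk farm:
writes that land behind the cursor (ie. on tracks already dumped) get an extra
copy on the log tape. A minimal illustration in C, with the cursor kept per
module (my own simplification, not the real backup code):

#include <stdint.h>
#include <stdbool.h>

#define N_MODULES 160

/* Highest track already copied to the backup tapes, per disk module;
 * the online dump advances these as it sweeps track by track. */
static uint32_t backup_cursor[N_MODULES];

/* Stubs standing in for the real disk write and log-tape write. */
static void write_track(unsigned module, uint32_t track, const void *buf)
{ (void)module; (void)track; (void)buf; }
static void log_changed_track(unsigned module, uint32_t track, const void *buf)
{ (void)module; (void)track; (void)buf; }

/* Normal database write while the online backup is running: if the track
 * has already been dumped, the new contents must also go to the log tape,
 * so that backup plus log replay reproduces a consistent database. */
void db_write(unsigned module, uint32_t track, const void *buf, bool short_term)
{
    write_track(module, track, buf);
    if (!short_term && track <= backup_cursor[module])
        log_changed_track(module, track, buf);
}
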
	A restore involves dumping all of the backup tapes back onto the
disks, then running the dump log tapes in sequence until the database
is at a consistent state, then running in the transaction log tapes one at a
time until the database has been rolled forward to the desired point; then
you get to run a recoup and roll it in (hopefully right this time, if that's
what you screwed up last time!) to get back a checked and consistent system.
This is something that you shouldn't try at home (nor at work, if there is
any way to avoid it -:)).
	That's the basics; feel free to ask questions (and any of the IBM
guys out there that know TPF, correct my mistakes -:)), or say "that's all
very nice but I'm not interested, so shut up" -:)

Peter Van Epp / van...@sfu.ca

Newsgroups: comp.arch.storage
Path: sparky!uunet!usc!rpi!batcomputer!cornell!uw-beaver!ubc-cs!
newsserver.sfu.ca!sfu.ca!vanepp
From: van...@fraser.sfu.ca (Peter Van Epp)
Subject: Re: Mainframe as high transaction rate database engine (longish)
Message-ID: <vanepp.705127031@sfu.ca>
Keywords: TPF Mainframes
Sender: ne...@sfu.ca
Organization: Simon Fraser University, Burnaby, B.C., Canada
References: <vanepp.704949857@sfu.ca>
Date: Wed, 6 May 1992 04:37:11 GMT
Lines: 64

A couple of points were raised via e-mail that may be of more general
interest; one is: what is the configuration that this thing runs on?

Here are the general specs (again from memory after 4 or 5 years)

CPU (main and backup) Amdahl 5860 (11 MIPS?) IBM3081KX (17 mips?)
 	(TPF could only use one of the 2 processors on the 3081)
16 channels in use in both cases (32 and 64 on the machine as I recall for
	VM and/or MVS)
   12 channels to 12 disk controllers to 160 Amdahl 6880 disks running in 
			3350 mode, (~500 megs/disk around 300 megs in use
			due to seek time limitations).
   2 channels to two controllers for tapes
   2 byte mux channels supporting ~ 140 sync lines and some 5000 terminals
		across five continents.

32 Megs of main memory, of which TPF only used the bottom 16 megs,
	further subdivided into the bottom ~2 megs being used for
	the various core blocks, the kernel, and the pool directory
	structures, and the other 14 megs as a write through cache
	into the disk farm to further reduce the real I/O rate.
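
As an aside on that last item: write through means a write always goes to the
disk farm as well as updating the in-core copy, so a crash loses nothing from
the cache and the win is purely on reads. A toy sketch of the policy in C
(sizes and names invented):

#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 8192   /* order of magnitude for ~14 megs of 1055 byte blocks */
#define BLK_SIZE    1055

struct cache_slot { uint32_t block; int valid; uint8_t data[BLK_SIZE]; };
static struct cache_slot cache[CACHE_SLOTS];

/* Stubs standing in for the real channel I/O to the disk farm. */
static void disk_write(uint32_t block, const void *buf) { (void)block; (void)buf; }
static void disk_read (uint32_t block, void *buf)       { (void)block; (void)buf; }

/* Write through: the disk farm is always up to date, so losing the cache
 * loses nothing; the benefit is purely on the read side. */
void cached_write(uint32_t block, const void *buf)
{
    struct cache_slot *s = &cache[block % CACHE_SLOTS];
    disk_write(block, buf);                        /* real I/O, always     */
    s->block = block; s->valid = 1;
    memcpy(s->data, buf, BLK_SIZE);                /* refresh in-core copy */
}

void cached_read(uint32_t block, void *buf)
{
    struct cache_slot *s = &cache[block % CACHE_SLOTS];
    if (s->valid && s->block == block) { memcpy(buf, s->data, BLK_SIZE); return; }
    disk_read(block, buf);                         /* miss: go to the disk */
    s->block = block; s->valid = 1;
    memcpy(s->data, buf, BLK_SIZE);
}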

This configuration supported 80 transactions per second with a response
time guarantee of 90% of the transactions completing in under 3 seconds 
(including round trip comms times on 2400 to 9600 baud leased lines).

As a size comparison (this is a small tpf system!) the large American
carriers were reported to have transaction rates in the 2000 transaction
per second range on the largest CPUs IBM made (3090-600j at that point).

Several incidental points of interest: the later IBM (and Amdahl) CPUs
were being tuned for running MVS and VM, and a TPF system typically achieved
30% fewer MIPS than the system was rated for on MVS or VM (you could
buy a special machine called a 9090-xxxx that had unspecified changes
to get those MIPS back under TPF). The other point is that Amdahl != IBM
if you are running TPF; we tripped over several differences that didn't
bother MVS or VM. Lest this be taken as criticism of Amdahl, they
were quickly overcome in all cases with Amdahl's assistance and, of course,
with having the source to the operating system and not being afraid to change
it.

	The point was made by someone else that he was sure the airlines
would like to scrap it and get something more modern. While probably
true, there are a few barriers. Performance: you need enough power to
maintain the response rate (a suggested change that would have doubled the
response time prompted one of the bosses to calculate that we would need
a 2/3 increase in the number of reservations agents for the same booking
load in the main reservations area; note that this ignored the effect
on checking passengers in to planes at airports ....); a back-of-envelope
version of that arithmetic is sketched below. There is also a huge
investment in applications software (I added up many of the object files
on the system one day and as I recall it was in the 2 to 3 megabyte range),
heavily modified for the work habits of the airline involved, plus a huge
base of trained people at remote sites (some 8,000 to 10,000 employees,
many of whom were trained to use some portion of the res system). Remember
that all these applications are in 370 assembler, contrived to fit within
1055 byte blocks and dealing with data in 1055 byte chunks; moving it
to C, for instance, would be a huge job even before considering the
performance issues. This, however, is getting fairly far away from storage ...
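
(To show the shape of the agent calculation mentioned above: all the numbers
below are mine, invented for illustration, and the agent keying time is simply
picked so the answer lands near the 2/3 figure; only that 2/3 result comes
from the real calculation.)

#include <stdio.h>

/* Toy model: a booking is N short interactions, each costing the agent some
 * keying/talking time plus one system response.  If response time doubles,
 * the time per booking grows, and so does the number of agents needed to
 * handle the same booking load. */
int main(void)
{
    const double interactions = 20.0;  /* assumed entries per booking          */
    const double agent_time   = 1.5;   /* assumed seconds of agent keying each */
    const double resp_now     = 3.0;   /* seconds (the 90th percentile target) */
    const double resp_doubled = 6.0;

    double t_now  = interactions * (agent_time + resp_now);
    double t_slow = interactions * (agent_time + resp_doubled);

    printf("time per booking: %.0f s -> %.0f s\n", t_now, t_slow);
    printf("extra agents needed for the same load: ~%.0f%%\n",
           100.0 * (t_slow / t_now - 1.0));
    return 0;
}

With these made-up inputs the model prints roughly a 67% increase, ie. the 2/3
quoted above.
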
	I will close this by pointing out that there is much the same (if
on a slightly more reasonable scale) problem with many other mainframe 
shops. I expect to be long dead before that last MVS shop shuts down. 

Peter Van Epp / van...@sfu.ca

Newsgroups: comp.arch.storage
Path: sparky!uunet!haven.umd.edu!darwin.sura.net!uvaarpa!murdoch!
fermi.clas.Virginia.EDU!gl8f
From: gl...@fermi.clas.Virginia.EDU (Greg Lindahl)
Subject: Re: Mainframe as high transaction rate database engine (longish)
Message-ID: <1992May7.165119.15764@murdoch.acc.Virginia.EDU>
Keywords: TPF Mainframes
Sender: use...@murdoch.acc.Virginia.EDU
Organization: Department of Astronomy, University of Virginia
References: <vanepp.704949857@sfu.ca> <vanepp.705127031@sfu.ca>
Date: Thu, 7 May 1992 16:51:19 GMT
Lines: 51

In this article, I will probably show a near-complete lack of facts.
Accept this as hand-waving.

In article <vanepp.7...@sfu.ca> van...@fraser.sfu.ca (Peter Van Epp) writes:

>Here are the general specs (again from memory after 4 or 5 years)
>
>CPU (main and backup) Amdahl 5860 (11 MIPS?) IBM3081KX (17 mips?)
> 	(TPF could only use one of the 2 processors on the 3081)
>16 channels in use in both cases (32 and 64 on the machine as I recall for
>	VM and/or MVS)
>   12 channels to 12 disk controllers to 160 Amdahl 6880 disks running in 
>			3350 mode, (~500 megs/disk around 300 megs in use
>			due to seek time limitations).
[...]
>This configuration supported 80 transactions per second with a response
>time guarantee of 90% of the transactions completing in under 3 seconds 
>(including round trip comms times on 2400 to 9600 baud lease lines).

That's a total of 48 gigabytes of disk storage. My guess is that this
whole setup is limited by the seek time of the disks; it's easy to buy
a CPU that's 3 times faster, but you can't get disks that have 1/3 the
seek time.

The total transfer rate through the channels seems to be fairly small.
80 transactions per second -- how much I/O per transaction? Well,
let's assume we can buy much more RAM than 16 megabytes these days. If
so, we might be able to get disk I/O down to 10kbyte per transaction,
or under a megabyte per second. An older mainframe channel runs at
around 3 megabytes per second, and there are 12 of them, so they're using
only about 1/36 of their possible transfer rate (under 1 against 36 megabytes
per second). Again, seek times on their disks are killing them.

If you were building such a system from scratch today, you might be
tempted to buy 48 gigabytes of ECC ramdisk (!), since that should cost
you around $50/megabyte, or a mere $2.4 million. Back it up with
$100,000 of cheap SCSI disks, and log transactions to cheap SCSI
disks. Relatively inexpensive RISC CPU's could provide sufficient CPU
horsepower, and since the total transfer rate is low, the poor design
of most RISC memory subsystems for this kind of transfer rate would
not cause a problem. The system ought to have a more consistent
response since there is no seeking. It should also scale up fairly
easily.
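
For the curious, the back-of-envelope numbers above are easy to rerun; here is
a throwaway C calculation using only the guesses from this thread (nothing in
it is authoritative):

#include <stdio.h>

int main(void)
{
    /* Guesses from the discussion above, not measured figures. */
    const double disks        = 160;
    const double mb_per_disk  = 300;        /* usable, seek limited            */
    const double tps          = 80;
    const double kb_per_txn   = 10;         /* assumed, with a big RAM cache   */
    const double channels     = 12;
    const double chan_mb_s    = 3;          /* older mainframe channel         */
    const double ram_per_mb   = 50;         /* $/MB for ECC ramdisk, 1992-ish  */

    double total_gb    = disks * mb_per_disk / 1024;             /* ~47 GB     */
    double io_mb_s     = tps * kb_per_txn / 1024;                /* ~0.8 MB/s  */
    double chan_mb_tot = channels * chan_mb_s;                   /* 36 MB/s    */
    double ramdisk_m   = disks * mb_per_disk * ram_per_mb / 1e6; /* ~$2.4M     */

    printf("storage: %.0f GB, I/O: %.2f of %.0f MB/s (%.0f%% used), "
           "ramdisk: $%.1fM\n",
           total_gb, io_mb_s, chan_mb_tot, 100 * io_mb_s / chan_mb_tot,
           ramdisk_m);
    return 0;
}

Which comes out in the same ballpark as the 1/36 above, and the $2.4 million
checks out.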

Of course, this is mostly a pipe-dream, since I don't have very many
details about the original system, and since there are probably
aspects about the setup I proposed that would cause problems. However,
it is interesting to note that the rapidly falling price of DRAM can
provide a revolution in transaction processing -- what happens when
DRAM is $5/megabyte? Sure, disks will be even cheaper, but seek times will
still be large.

Newsgroups: comp.arch.storage
Path: sparky!uunet!van-bc!ubc-cs!newsserver.sfu.ca!sfu.ca!vanepp
From: van...@fraser.sfu.ca (Peter Van Epp)
Subject: Re: Mainframe as high transaction rate database engine (longish)
Message-ID: <vanepp.705456278@sfu.ca>
Keywords: TPF Mainframes
Sender: ne...@sfu.ca
Organization: Simon Fraser University, Burnaby, B.C., Canada
References: <vanepp.704949857@sfu.ca> <vanepp.705127031@sfu.ca> 
<1992May7.165119.15764@murdoch.acc.Virginia.EDU>
Date: Sun, 10 May 1992 00:04:38 GMT
Lines: 125

gl...@fermi.clas.Virginia.EDU (Greg Lindahl) writes:

>In this article, I will probably show a near-complete lack of facts.
>Accept this as hand-waving.

>In article <vanepp.7...@sfu.ca> van...@fraser.sfu.ca (Peter Van Epp) writes:

>[...]

>That's a total of 48 gigabytes of disk storage. My guess is that this
>whole setup is limited by the seek time of the disks; it's easy to buy
>a CPU that's 3 times faster, but you can't get disks that have 1/3 the
>seek time.

I should point out that this is only 24 gigs of unique storage, since each
disk is mirrored to another disk down a separate channel and disk controller
(no single point of failure). A write from the cache goes to both disks; a
read into the cache comes from whichever disk has the shortest queue, to
increase performance slightly.
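
A sketch of that mirrored read/write policy in C (the queue length bookkeeping
and routine names are invented for illustration; the real logic lives in the
I/O supervisor):

#include <stdint.h>

#define N_MODULES 160

struct mirror_pair { unsigned primary, secondary; }; /* separate channels/CUs */

static unsigned queue_len[N_MODULES];   /* outstanding I/Os per disk module */

/* Stubs standing in for the real channel-program I/O routines. */
static void start_read (unsigned module, uint32_t block, void *buf)
{ (void)module; (void)block; (void)buf; }
static void start_write(unsigned module, uint32_t block, const void *buf)
{ (void)module; (void)block; (void)buf; }

/* Writes go to both copies; reads go to whichever copy currently has the
 * shorter device queue, which spreads read load across the pair. */
void mirrored_write(struct mirror_pair m, uint32_t block, const void *buf)
{
    start_write(m.primary,   block, buf);
    start_write(m.secondary, block, buf);
}

void mirrored_read(struct mirror_pair m, uint32_t block, void *buf)
{
    unsigned pick = (queue_len[m.primary] <= queue_len[m.secondary])
                    ? m.primary : m.secondary;
    start_read(pick, block, buf);
}
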
	You are also correct that this is seek limited; that is why only
2/3 of the disk is usable. As I recall, 3380s are worse for this than 3350s,
as the 3380s tended to have many shortish tracks compared to a 3350 and
you could only use about 20% of the disk. While you can't get shorter seek
time disks, you can spread the data over more quite quick (even by today's
standards) disks. 3350s have a published seek time up in the 16 msec region
and for limited distances would do in the 10 to 12 msec range (remembering this
is a disk that had been obsoleted by the 3380 for 10 years or so!); as I recall
the 6880's were down in the 10 to 12 msec range published, and over short
distances could get down to the 6 to 8 msec range (these being 3380A models,
which are now also obsolete).

>The total transfer rate through the channels seems to be fairly small.
>80 transactions per second -- how much I/O per transaction? Well,

As I recall it was 7 I/Os per transaction, but I don't remember if this was
physical I/O to the disk (it likely was) or I/O to the cache.

>let's assume we can buy much more RAM than 16 megabytes these days. If
>so, we might be able to get disk I/O down to 10kbyte per transaction,

You could indeed; as I said, there was more RAM than we could use on each
machine. The problem at that point was (and likely still is) the architecture.
Since ACP (the predecessor of TPF) started out on 360 machines, and all the
application code is written in assembler, neither XA mode (more than 24 bits
of address) nor DAT (virtual memory) was usable by TPF. The latest version I
know of had by that point gone to XA, and current ones are probably ESA, but
application code had to be copied below the 16 meg line to run (unless it
was a new or rewritten application), taking a performance hit.
	The other point of interest here is, as I pointed out, that this is a
small to medium sized airline; the largest carriers had an order of magnitude
larger transaction rate (80 transactions/sec -> ~2000 transactions a second)
running the same applications and presumably the same number of I/Os per
transaction. The TPF machine was typically the smallest of the machines in
house; the MVS/IMS machine was typically 4 times as large/fast for around
(I think!) 10 transactions a second on IMS (a hierarchical database running
under MVS).


>or under a megabyte per second. An older mainframe channel runs at
>around 3 megabytes per second. So they're using only 1/36 of their
>possible transfer rate. Again, seek times on their disks is killing
>them.

Correct as far as the disk I/O channels go. I think the actual transfer
rate is closer to 600k/sec per channel; response time (due to I/O queuing)
starts to die when a channel is around 30% busy (ie. your 1 Mbyte/sec), and we
were running between 15% and 20% busy. However, this 15 to 20% is per
channel, and the I/O is very evenly spread across the channels, so (unlike
an MVS or VM system) all 12 of those channels were busy at the same time
(during a peak period, ie. daytime, of course). Now our 600k I/O rate
on a single channel has climbed to a total into-main-memory I/O rate
of around 7 megs/sec (and this excludes the I/O on the byte mux channels and
tape channels, which add another couple of megs per second to the
I/O rate); still not a killer, but up there.
	Now consider this: assume a server with a 17 meg/second VME bus
for I/O, a fast internal CPU, and typical DRAM SIMM main memory. I would
suggest that you are going to start seeing instruction starvation
on the CPU as its cache fights for main memory bandwidth with the I/O
subsystem. I will point out that this is how the situation seems to me,
and that none of the Unix vendors we talked to could give me a convincing
reason as to why I might be wrong (this being in the context of an NFS
file server), with the exception of Auspex (which, as I have said before,
is why we are a happy Auspex customer). At this point, having experience with
Auspex, Unix I/O and backup issues, and NFS, given the decision to make again
I'd recommend the Auspex even more strongly, for reasons other than the
apparent I/O performance (although that is still a good reason in itself
in my mind!).
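
To put rough numbers on the bus worry (my arithmetic, using only the figures
quoted in this thread, so treat it as a back-of-envelope check rather than a
measurement):

#include <stdio.h>

int main(void)
{
    /* Rough figures from the discussion above. */
    const double disk_channels  = 12;
    const double per_channel_mb = 0.6;   /* ~600k/sec each, all busy at peak    */
    const double tape_mux_mb    = 2.0;   /* "another couple of megs per second" */
    const double vme_bus_mb     = 17.0;  /* single VME bus on the Unix server   */

    double mainframe_io = disk_channels * per_channel_mb + tape_mux_mb;

    printf("peak I/O moved by the mainframe: ~%.1f MB/s\n", mainframe_io);
    printf("fraction of a 17 MB/s VME bus:   ~%.0f%%\n",
           100.0 * mainframe_io / vme_bus_mb);
    /* And on the single-bus server that same bus also carries the CPU's
     * cache refills and the ethernet traffic, which is the starvation worry. */
    return 0;
}

Call it roughly half of the VME bus before the CPU's own cache refills and the
ethernet traffic get a look in.
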
	However, I am no more sure now than when I talked to the various Unix
vendors whether my insistence that there could be an I/O problem in using a
standard Unix server as an NFS server, where a single CPU and I/O bus are
fielding all the ethernet and disk interrupts and request processing over a
(for instance) 17 meg/second VME bus, was going to fly. We could obviously do
it on a set of such servers, but that isn't what was being quoted (presumably
for cost reasons). This being the case, I would welcome some more discussion
along the lines of "you ignorant mainframe bigot, of course it would work and
here's why" (I got a lot of the first part, but nobody ever came up with what I
thought were reasonable numbers and a working site for the second!), or
responses of the form "hey, he may just be right, what can we do about this?"
(one assumes that telling your customer that your NFS server won't do the
job and that they had better buy an Auspex wouldn't go over so well ...). As I
say, our decision is made, and I for one am happy with it, but there are other
mainframe shops that are considering a move (sometimes, like us, not
voluntarily!) to Unix who would like to know the answer. We thought that
we were going down a well trodden path from a mainframe to Unix, but either
we didn't ask the right people (and the vendors wouldn't admit to knowing the
right people, or we were talking to the wrong vendors, which is very possible!),
or we are doing something quite strange, because answers to these types of
questions seem hard to come by.

>If you were building such a system from scratch today, you might be
>tempted to buy 48 gigabytes of ECC ramdisk (!), since that should cost
>you around $50/megabyte, or a mere $2.4 million. Back it up with
>$100,000 of cheap SCSI disks, and log transactions to cheap SCSI

The killer here is the "from scratch today", and I agree: something along
these lines would work, except for the installed base of non portable
applications and the cost and risk of making such a change (since you
really are "betting the business" on the change). The real killer (even
for IBM) is that this is a small market; I would guess that you are not
talking more than around 200 sites total, and the return is just not
worth it when there is a solution (even though an expensive one) already in
place.

Peter Van Epp / van...@sfu.ca