Path: utzoo!attcan!uunet!husc6!uwvax!oddjob!gargoyle!att!alberta!calgary!dave
From: d...@calgary.UUCP (Dave Mason)
Newsgroups: comp.unix.wizards,comp.unix.questions
Subject: Vax 11/780 performance vs Sun 4/280 performance
Keywords: vax sun
Message-ID: <1631@vaxb.calgary.UUCP>
Date: 25 May 88 22:28:05 GMT
Organization: U. of Calgary, Calgary, Ab.
Lines: 31

We are planning to replace 2 of our Vax 11/780s with 2 Sun 4/280s. Each vax has 6 Mbytes of memory, 2 RA80s and 1 RA81, and 40 terminals. The vaxes are currently running 4.3 BSD + NFS (from Mt Xinu). Each Sun is planned to have 32 Mbytes of memory and 2 of the new NEC disk drives, and will run the same 40 terminals.

The vaxes are being used by undergrads doing Pascal, f77 and C programming (compile and bomb). Most students use umacs (micro-emacs) as their text editor.

What I was wondering is: has anyone done a similar switchover? Is there a horrendous degradation of response when the load average gets sufficiently high, or does it degrade linearly with respect to load average? Is the overall performance of a Sun 4/280 better/worse/the same as a similarly loaded vax 11/780 (as configured above)? Were there any surprises when you did the switchover?

My personal feeling is that we will win big, but the local DEC salesman is making noises about Sun 4/280 performance, especially with > 15 users. I just want to confirm that my opinion of the local DEC sales office is well founded :-).

Please mail your responses. If there is sufficient interest I'll post a summary to the net. Thanks in advance for any comments.

Dave Mason
University of Calgary
{ubc-cs,alberta,utai}!calgary!dave
Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!umd5!brl-adm!adm!weiser...@xerox.com
From: weiser...@xerox.com
Newsgroups: comp.unix.wizards
Subject: Vax 11/780 performance vs Sun 4/280 performance
Message-ID: <14968@brl-adm.ARPA>
Date: 27 May 88 17:08:54 GMT
Sender: n...@brl-adm.ARPA
Lines: 13

What your DEC salesperson may have heard, undoubtedly very indirectly, is that there is a knee in the performance curve of the Sun-4/280 at > 15 processes ready to run. This has nothing to do with > 15 users: it is more like a load average of > 15. Do your vaxes ever run with a load average of > 15? If not, OK. But if they EVER hit 16 or 17, watch out on the Sun-4s: I can trivially get my Sun-4 so completely wedged that I have to reboot with L1-A, just by starting 19 little processes which sleep for 100ms, wake up, and sleep again. This doesn't even raise the load average (but it amounts to a load average of 19 to the context-switching mechanism, although not to the cpu).

And the Sun-3s are no better: the knee there is > 7 processes.

-mark
Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!ncar!noao!arizona!modular!olson
From: ol...@modular.UUCP (Jon Olson)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Summary: Re: Vax 11/780 performance vs Sun 3/Sun 4 performance
Message-ID: <601@modular.UUCP>
Date: 29 May 88 00:29:06 GMT
References: <14968@brl-adm.ARPA>
Organization: Modular Mining Systems, Tucson
Lines: 48

> What your DEC salesperson may have heard, undoubtedly very indirectly, is
> that there is a knee in the performance curve of the Sun-4/280 at > 15
> processes ready to run. This has nothing to do with > 15 users: it is more
> like a load average of > 15. Do your vaxes ever run with a load average of
> > 15? If not, OK. But if they EVER hit 16 or 17, watch out on the Sun-4s:
> I can trivially get my Sun-4 so completely wedged that I have to reboot
> with L1-A, just by starting 19 little processes which sleep for 100ms,
> wake up, and sleep again. This doesn't even raise the load average (but it
> amounts to a load average of 19 to the context-switching mechanism,
> although not to the cpu).
>
> And the Sun-3s are no better: the knee there is > 7 processes.
>
> -mark

Nonsense. I just tried forking 32 copies of the following program on my Sun 3/60 workstation. Each one sleeps for 100 milliseconds, wakes up, and sleeps again. With 32 copies of it running, I could notice no difference in response time, and a `ps aux' showed none of them using a significant amount of CPU time. Maybe you are just running out of memory and doing a lot of swapping?

What I have noticed on our Vax 11/780, running VMS, is that it is often equally slow with 1 user or 20 users. Possibly VMS avoids the `knee' by raising the priority of the NULL task when there aren't many people on the machine???

---------------------------------------------------
#include <sys/time.h>

main()
{
	struct timeval tv;

	tv.tv_sec = 0;
	tv.tv_usec = 100000;
	for( ;; )
		select( 0, 0, 0, 0, &tv );
}
---------------------------------------------------
--
Jon Olson, Modular Mining Systems
USENET: {ihnp4,allegra,cmcl2,hao!noao}!arizona!modular!olson
INTERNET: modular!ol...@arizona.edu
Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!ncar!noao!arizona!modular!olson
From: ol...@modular.UUCP (Jon Olson)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Summary: More Re: Sun 3/Sun performance
Message-ID: <602@modular.UUCP>
Date: 29 May 88 00:46:18 GMT
References: <14968@brl-adm.ARPA>
Organization: Modular Mining Systems, Tucson
Lines: 9

I also tried forking 32 `for(;;) ;' loops on a 3/60 with 8 MB. Each process got about 3 percent of the CPU and the response was still quite good for interactive work. This stuff about a `knee' at 7 processes just isn't real...
--
Jon Olson, Modular Mining Systems
USENET: {ihnp4,allegra,cmcl2,hao!noao}!arizona!modular!olson
INTERNET: modular!ol...@arizona.edu
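For readers who want to reproduce these experiments, here is a minimal sketch of the kind of load generator Weiser and Olson describe: fork N children that either busy-loop or sleep 100 ms at a time. The child count, the command-line switch, and the structure are illustrative assumptions, not code from the posts (Olson's actual sleeper appears above).

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/select.h>     /* the 1988 original needed only <sys/time.h> */

#define NPROCS 32           /* assumed child count, matching Olson's test */

int main(int argc, char **argv)
{
    int i;
    int busy = (argc > 1);          /* any argument: CPU-bound children */

    (void)argv;
    for (i = 0; i < NPROCS; i++) {
        if (fork() == 0) {          /* child */
            if (busy) {
                for (;;)            /* Olson's `for(;;) ;' CPU load */
                    ;
            } else {
                for (;;) {          /* sleep 100 ms, wake, repeat */
                    struct timeval tv;
                    tv.tv_sec  = 0;
                    tv.tv_usec = 100000;
                    select(0, 0, 0, 0, &tv);
                }
            }
        }
    }
    pause();                        /* parent idles; kill the job to stop */
    return 0;
}

Run it with no arguments for the sleepy load, or with any argument for the CPU-bound load, then try editing a file; killing the process group ends the test.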
Path: utzoo!attcan!uunet!husc6!mailrus!ames!umd5!brl-adm!adm!weiser...@xerox.com
From: weiser...@xerox.com
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Message-ID: <15464@brl-adm.ARPA>
Date: 31 May 88 19:36:40 GMT
Sender: n...@brl-adm.ARPA
Lines: 41

--------------------
Nonsense. I just tried forking 32 copies of the following program on my
Sun 3/60 workstation. Each one sleeps for 100 milliseconds, wakes up, and
sleeps again. With 32 copies of it running, I could notice no difference
in response time, and a `ps aux' showed none of them using a significant
amount of CPU time. Maybe you are just running out of memory and doing a
lot of swapping?

What I have noticed on our Vax 11/780, running VMS, is that it is often
equally slow with 1 user or 20 users. Possibly VMS avoids the `knee' by
raising the priority of the NULL task when there aren't many people on
the machine???

#include <sys/time.h>

main()
{
	struct timeval tv;

	tv.tv_sec = 0;
	tv.tv_usec = 100000;
	for( ;; )
		select( 0, 0, 0, 0, &tv );
}
--------------------

No, not nonsense. I changed 100000 to 25000 and ran 18 of these on my Sun-4/260 with 120MB swap and 24MB ram, with very little else going on. Perfmeter shows no disk activity; ps aux shows each of the 18 using almost no cpu. (And each of the 18 has more than a millisecond to get in and out of select, which is certainly enough.) And the system is on its knees! (If it doesn't work for you, try 19 or 20 or 21.) Window refreshes take tens of seconds. If I kill off 3 of these, all is back to normal.

I don't have a 60C to try this on. But try reducing that delay factor and see if you don't also see a knee in the performance curve well before the cpu should be swamped. (And in any case, a swamped cpu doesn't need to imply a knee in the curve...)

-mark
Path: utzoo!attcan!uunet!husc6!bloom-beacon!bu-cs!bzs
From: b...@bu-cs.BU.EDU (Barry Shein)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Message-ID: <23027@bu-cs.BU.EDU>
Date: 31 May 88 23:48:10 GMT
References: <14968@brl-adm.ARPA> <264@sdba.UUCP>
Organization: Boston U. Comp. Sci.
Lines: 64
In-reply-to: stan@sdba.UUCP's message of 31 May 88 17:32:33 GMT

Although I don't disagree with the original claim of Suns having knees (related to NeXT being pronounced Knee-zit? never mind), the discussion can lose sight of reality here.

A 780 cost around $400K* and supported around 20 logins; a Sun 4 or even a Sun 3/280 probably comes close to that in support for around 1/5 the price or less, and the CPU is much faster when a job gets it. If your Vax was horribly overloaded and had 32 users, just buy more than one system and split the community. You'll also double the I/O paths that way, and probably have at least one system up almost all the time. (We NFS'd everything between our Suns in Math/Computer Science and Information Technology here so people can log into any of them, although that does mean that if your home dir is on a down system you lose.)

Also, the cost of things like memory is so much lower that you can cheat like hell on getting performance. Who ever had a 32MB 780? That's practically a minimum config for a Sun 4 server.

The best use for a Sun server as a time-sharer is if a) you don't expect rapid growth in the number of logins (eg. doubling in a year) that will outgrow the machine, and b) you expect a lot of the community using the system to migrate from dumb terminals to workstations in the reasonably near future. That way, voila, you have the server, especially since each new workstation means one less time-sharer, and it converges fairly rapidly. It's a nice way to give people time to get their financial act together to buy workstations. For example, for our CS and Math faculty here, having 3 servers worked out very well; many of the users have now grown into workstations and the server facilities were "just there".

Another rationale, of course, is that you're looking for just a little system for perhaps a dozen or so peak-load people. I don't know any system off-hand that can do that as nicely, for the money, as a system like the above.

If your needs are much more in the domain of traditional time-sharing (eg. hordes of students that never cease growing term to term, dumb terminals and staying that way for the next few years [typically, if you ever get them workstations you'll put an appropriate, separate server in *that* budget]), then you probably want to look at something more expandable/upgradeable. I find Encores and (no direct experience, but I hear good things) Sequents pretty close to perfect for that kind of usage. I'm sure there are others that will suffice, but we don't use them so I can't comment (we have 7 Encores and over 100 Suns here.)

Anyhow, seat-of-the-pants systems analysis on the net is probably a precarious thing at best. I hope I've pointed out that the issues are several, and that small differences in two groups' needs can make any recommendation inapplicable. All I can say is we have quite a few Sun 3 servers here doing something resembling traditional time-sharing and everyone seems very happy with it. Given the right conditions it works out well; given the wrong ones, no doubt it would be a nightmare. So what else is new?

-Barry Shein, Boston University

P.S. I have no vested interest in any of the above-mentioned companies, although I am on the Board of Directors of the Sun Users Group; I doubt that would be considered "vested".

* Yes, I realize that it's been almost 10 years since the 780 came out, but that was the original question.
Path: utzoo!dciem!nrcaer!scs!spl1!laidbak!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!oliveb!pyramid!voder!lynx!m5
From: m...@lynx.UUCP (Mike McNally)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Message-ID: <3859@lynx.UUCP>
Date: 3 Jun 88 23:59:15 GMT
Article-I.D.: lynx.3859
References: <14968@brl-adm.ARPA> <601@modular.UUCP> <7331@swan.ulowell.edu> <2282@rpp386.UUCP>
Reply-To: m...@lynx.UUCP (Mike McNally)
Organization: Lynx Real-Time Systems Inc, Campbell CA
Lines: 29
Summary: My $.02

Re: small processes that sleep-wakeup-sleep-wakeup...

I tried this on my Integrated Solutions 68020 thing and got results similar to those of the Sun; that is, up to about 6 or 7 of them the system works fine, but after that everything gets real slow. (I can't test it too much because everybody gets mad here when the machine freezes up.)

I tried the same thing under LynxOS, our own BSD-compatible real-time OS, and didn't notice very much degradation at all. A major difference between our machine and the Integrated Solutions is the MMU: even though our platform is a 68010, our MMU is 16K of static RAM that holds all the page tables all the time. Context switch time is thus real small. Also, I think it's possible that the mechanism for dealing with the timeout in select() is different internally under LynxOS as opposed to Unix. Of course, under the real-time OS, a high-priority CPU-bound task gets the whole CPU, no questions asked. That's a great way of degrading editor response :-).

As a somewhat related side question, what does the Sun 4/SPARC MMU look like? Are lookaside buffer reloads done in software like on the MIPS R[23]000? (Is that really true about the R[23]000 anyhow?)
--
Mike McNally of Lynx Real-Time Systems
uucp: lynx!m5 (maybe pyramid!voder!lynx!m5 if lynx is unknown)
Path: utzoo!utgpu!water!watmath!clyde!bellcore!rutgers!ucsd!ucbvax!decwrl!pyramid!prls!mips!mash
From: m...@mips.COM (John Mashey)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Message-ID: <2298@winchester.mips.COM>
Date: 5 Jun 88 16:41:10 GMT
References: <14968@brl-adm.ARPA> <601@modular.UUCP> <7331@swan.ulowell.edu> <2282@rpp386.UUCP> <3859@lynx.UUCP>
Reply-To: m...@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 21

In article <3...@lynx.UUCP> m...@lynx.UUCP (Mike McNally) writes:
...
>As a somewhat related side question, what does the Sun 4/SPARC MMU look
>like? Are lookaside buffer reloads done in software like on the MIPS
>R[23]000? (Is that really true about the R[23]000 anyhow?)

The Sun-4 MMU, like earlier Suns, doesn't use a TLB, but has SRAMs for memory maps (16 contexts' worth, compared to 8 in the Sun-3/200, for example).

The R[23]000 do indeed handle TLB-miss refill in software; this is not unusual in RISC machines: HP Precision and the AMD 29K (at least) do this also. The overall cost is typically 1% or less of CPU time, which is fairly competitive with hardware refill, especially since one of the larger costs on faster machines is the accumulated cache-miss penalty for fetching PTEs from memory.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR m...@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
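Mashey's point about software refill can be illustrated with a toy simulation. This is not MIPS's actual handler (that is a short assembly routine in a trap vector); the table sizes, the direct-mapped placement, and the identity page table below are invented. The idea is only that the miss path is ordinary software indexing an in-memory page table:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define TLB_SIZE   64       /* assumed entry count, for illustration */
#define PT_PAGES   1024

struct tlbe { uint32_t vpn, pfn; int valid; };

static struct tlbe tlb[TLB_SIZE];
static uint32_t page_table[PT_PAGES];   /* the in-memory vpn -> pfn map */
static long refills;

/* On a miss, real hardware would trap; here the "trap handler" is the
 * if-branch, which refills the entry from the page table in software. */
uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    struct tlbe *e = &tlb[vpn % TLB_SIZE];  /* direct-mapped, for simplicity */

    if (!e->valid || e->vpn != vpn) {       /* TLB miss: software refill */
        refills++;
        e->vpn   = vpn;
        e->pfn   = page_table[vpn % PT_PAGES];
        e->valid = 1;
    }
    return (e->pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
}

int main(void)
{
    uint32_t i;
    for (i = 0; i < PT_PAGES; i++)
        page_table[i] = i;              /* a trivial identity mapping */
    for (i = 0; i < (1u << 20); i += 64)
        (void)translate(i);             /* sweep 1 MB in 64-byte steps */
    printf("%ld refills for %u references\n", refills, (1u << 20) / 64);
    return 0;
}

Counting refills against total references gives a feel for why the software path costs so little when misses are rare, which is the substance of Mashey's 1%-or-less figure.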
Path: utzoo!attcan!uunet!seismo!rick
From: r...@seismo.CSS.GOV (Rick Adams)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Summary: Sun 3/160 has a real knee at about 7 active processes
Message-ID: <44365@beno.seismo.CSS.GOV>
Date: 6 Jun 88 17:54:20 GMT
References: <15875@brl-adm.ARPA>
Organization: Center for Seismic Studies, Arlington, VA
Lines: 9

Last year when seismo (a Sun 3/160) was still passing mail around, there was a VERY obvious performance degradation when the 8th or 9th sendmail became active. (No, we didn't run out of memory; that happened at about 14 sendmails.)

I have always attributed it to the 7 user contexts.

---rick
Path: utzoo!attcan!uunet!husc6!mailrus!ames!elroy!cit-vax!mangler
From: mang...@cit-vax.Caltech.Edu (Don Speck)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Summary: I/O throughput
Message-ID: <6926@cit-vax.Caltech.Edu>
Date: 13 Jun 88 08:58:03 GMT
References: <22957@bu-cs.BU.EDU> <14968@brl-adm.ARPA> <601@modular.UUCP> <7331@swan.ulowell.edu> <2282@rpp386.UUCP>
Organization: California Institute of Technology
Lines: 25

I am reminded of this article from comp.arch:

In article <44...@beno.seismo.CSS.GOV>, r...@seismo.CSS.GOV (Rick Adams) writes:
> Well, to start with I've got a Vax 11/780 with 7 6250 bpi 125 ips
> tape drives on it. It performs adequately when they are all running.
> I STILL haven't found anything to replace it with for a reasonable amount
> of money. Nothing in the Sun price range can handle that I/O volume.

I've seen a PDP-11/70 with eight tape drives, too. And as Barry Shein said, "An IBM mainframe is an awesome thing...". One weekend, noticing the 4341 spinning a pair of GCR drives at over half their rated 275 ips, I was shocked to learn that it was reading the disk file-by-file, not track at a time. BSD filesystems just can't compare to what this 2-MIPS machine could do with apparent ease.

How do they get that kind of throughput? I refuse to believe that it's all hardware. Mainframe disks rotate at 3600 RPM like everybody else's, and their 3 MB/s transfer rate is only slightly higher than a SuperEagle's. A 2-MIPS CPU would be inadequate to run a BSD filesystem at those speeds, so obviously their software overhead is a lot lower, while at the same time wasting no disk time. What is VM doing efficiently that Unix does inefficiently?

Don Speck   sp...@vlsi.caltech.edu   {amdahl,ames!elroy}!cit-vax!speck
Path: utzoo!attcan!uunet!yale!husc6!bu-cs!bzs
From: b...@bu-cs.BU.EDU (Barry Shein)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Message-ID: <23288@bu-cs.BU.EDU>
Date: 13 Jun 88 15:56:30 GMT
References: <22957@bu-cs.BU.EDU> <14968@brl-adm.ARPA> <601@modular.UUCP> <7331@swan.ulowell.edu> <2282@rpp386.UUCP> <6926@cit-vax.Caltech.Edu>
Organization: Boston U. Comp. Sci.
Lines: 78
In-reply-to: mangler@cit-vax.Caltech.Edu's message of 13 Jun 88 08:58:03 GMT

>How do they get that kind of throughput? I refuse to believe that it's
>all hardware. Mainframe disks rotate at 3600 RPM like everybody else's
>and their 3 MB/s transfer rate is only slightly higher than a SuperEagle.
>A 2-MIPS CPU would be inadequate to run a BSD filesystem at those speeds,
>so obviously their software overhead is a lot lower, while at the same
>time wasting no disk time. What is VM doing efficiently that Unix does
>inefficiently?
>
>Don Speck sp...@vlsi.caltech.edu {amdahl,ames!elroy}!cit-vax!speck

I think a lot of it *is* hardware. I know the big mainframes better than the small ones.

I/O devices are attached indirectly thru channel controllers. Channels have their own paths to/from memory (that's critical: multiple DMAs simultaneously.) Also, channels are intelligent; I remember people saying the channels for the 370/168 had roughly the same computing power as the 370/158 (ie. one model down, sort of like saying that Sun 3/280s use Sun 3/180s as disk controllers; actually the compute power is very similar in that comparison.)

Channels execute channel commands directly out of memory, sort of linked-list structs in C lingo, with commands, offsets, etc. embedded in them (this has become more common in the mini market also; the UDA is similar, tho I don't know if it's quite as general; see the sketch after this article). Channels can also do things like search disks for particular keys, hi/lo/equal, without involving the central processor. I don't know how much this is used in the various filesystems; obviously it's a general data base thing.

The channels themselves aren't all that fast, around 3MB/sec, but 16 of them pumping simultaneously to/from different blocks of memory can certainly make it feel fast. I heard IBM recently announced a new addition to the 3381 disk series (these are multi-GB disks) with 256MB (1/4 GB) of cache in the disk. Rich or poor, it's better to be rich.

The file systems tend to be much simpler (they avoid indirection at the lower levels), at least in OS, which I'm sure contributes to the performance. I/O is very asynchronous from a software perspective, so starting multiple I/Os and sitting back waiting for completions is a natural way to program. Note that RMS in VMS tries to mimic this kind of architecture, but no one ever accused a Vax of having fast I/O.

A lot of what we would consider application code is in the OS I/O code, known as "access methods", so reading various file formats (zillions, actually: VSAM, ISAM, BDAM, BSAM...) and I/O disciplines (VTAM etc) can be optimized at the "kernel" level (there's also microcode assist on various machines for various operations). It also tends to push applications programmers towards "being kind" to the OS; things like pre-allocation of resources are pretty much enforced, so a lot of dynamic resource management is just not done during execution.
There is little doubt that to get a lot of this speedup on Unix systems you'd have to give up niceties like tree'd directories, extending files whenever you feel like, and dynamic file opening during run-time (OS tends to do deadlock avoidance rather than detection or recovery, so it needs to know what files you plan to use before your job starts; that explains a *lot* of what JCL is all about, pre-allocation of resources), etc. You probably wouldn't like it; it would look just like MVS :-)

You'd also have to give up what we call "terminals" in most cases. IBM terminals (327x's) on big systems are much more like disks: half-duplex, fill in a screen locally and then blast entire screens to/from memory in one block I/O operation, no per-char I/O. Emacs would die. It helps, especially when you have a lot of terminals. I read about an IBM transaction system with 15,000 terminals logged in. I said a lot of terminals.

But don't underestimate raw, frothing, manic hardware. It's a big trade-off: large IBM mainframes are to I/O what Crays are to floating point, but you really have to have the problem to want the cure. For most folks it's unnecessary, MasterCard etc excepted.

-Barry Shein, Boston University
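The "linked-list structs in C lingo" that Shein mentions can be sketched concretely. The layout and names below are invented for illustration (real IBM channel command words are packed differently); the point is only that the channel walks a chain of commands out of memory on its own while the CPU does other work:

#include <stdio.h>
#include <string.h>

enum { OP_READ = 1, OP_WRITE = 2 };
#define F_CHAIN 0x01            /* command chaining: go on to next word */

/* One "channel command word": what to do, where the data goes, how much. */
struct chan_cmd {
    int          op;
    char        *data;          /* memory address for the transfer */
    unsigned int count;         /* bytes to move */
    unsigned int flags;
};

/* A toy channel: executes a chain of commands out of memory, the way a
 * real channel runs a channel program without involving the CPU.  The
 * "device" here is just a byte array read sequentially. */
static void channel_run(struct chan_cmd *ccw, char *device)
{
    unsigned int off = 0;
    for (;;) {
        if (ccw->op == OP_READ)
            memcpy(ccw->data, device + off, ccw->count);
        else if (ccw->op == OP_WRITE)
            memcpy(device + off, ccw->data, ccw->count);
        off += ccw->count;
        if (!(ccw->flags & F_CHAIN))
            break;              /* end of the channel program */
        ccw++;                  /* chain to the next command word */
    }
}

int main(void)
{
    char disk[64] = "one block|two block|";
    char a[11], b[11];
    struct chan_cmd prog[] = {  /* two chained 10-byte reads */
        { OP_READ, a, 10, F_CHAIN },
        { OP_READ, b, 10, 0 },
    };
    channel_run(prog, disk);
    a[10] = b[10] = '\0';
    printf("%s / %s\n", a, b);
    return 0;
}

Command chaining is what lets a single start-I/O cover a whole sequence of transfers, which is much of why the per-request overhead on these systems is so low.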
Path: utzoo!attcan!uunet!husc6!uwvax!rutgers!bellcore!faline!thumper!ulysses!andante!alice!dmr
From: d...@alice.UUCP
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Message-ID: <7980@alice.UUCP>
Date: 14 Jun 88 04:21:17 GMT
Organization: AT&T Bell Laboratories, Murray Hill NJ
Lines: 35

After describing a lot of the grot you have to go through to get 3MB/s out of an MVS system, Barry Shein wrote,

> But don't underestimate raw, frothing, manic hardware.
> It's a big trade-off: large IBM mainframes are to I/O what Crays are
> to floating point...

Crays are better at I/O, too. For example, I made a 9947252-byte file by catting 4 copies of the dictionary and read it:

3K$ time dd bs=172032 </tmp/big >/dev/null
57+1 blocks in
57+1 blocks out
seconds
 elapsed 1.251356
 user    0.000639
 sys     0.300725

which is a cool 8MB/s read from an ordinary Unix file, in competition with other processes on the machine. (OK, I gave it a big buffer.) The big guys would complain that they didn't get the full 10 or 12 MB/s that the disks give. They would really be annoyed that I could get only 50 MB/s when I read the file from the SSD, which runs at 1000MB/s, but to get it to go at full speed you need to resort to non-standard Unix things.

The disk format on Unicos (Cray's version of SVr2) is an extent-based scheme supporting the full Unix semantics, except that they don't handle files with holes (that is, the holes get filled in). In an early version, a naive allocation algorithm sometimes created files ungrowable past a certain point, but I think they've worked on the problem since then.

Dennis Ritchie
Path: utzoo!attcan!uunet!husc6!bu-cs!bzs
From: b...@bu-cs.BU.EDU (Barry Shein)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Message-ID: <23326@bu-cs.BU.EDU>
Date: 14 Jun 88 16:39:38 GMT
References: <7980@alice.UUCP>
Organization: Boston U. Comp. Sci.
Lines: 23
In-reply-to: dmr@alice.UUCP's message of 14 Jun 88 04:21:17 GMT

Dennis Ritchie points out that his Cray observes disk I/O speeds that compare favorably to those claimed for large IBM mainframes; thus, contrary to my claim, Crays may indeed be the "Crays" of I/O.

I think the proper question is sort/merging a disk farm and doing 1000 transactions/sec or more while keeping 8 or 12 tapes turning at or near their rated 200 ips, not pushing bits thru a single channel (if we're talking Crays then we're talking 3090s.) If the Cray can keep pumping the I/O under those conditions (a typical job stream for a JC Penney's or MasterCard), then we'd all better short IBM. Software or price would be no object if the Cray could do it better (and more reliably; I guess that *is* an issue, but let's skip it for now.)

Then again, who knows? Old beliefs die hard, and far be it from me to defend the Itsy Bitsy Machine company. Mayhaps the Amdahl crew can provide some appropriate viciousness at this point :-) Oh, please do!

-Barry Shein, Boston University
Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!bloom-beacon!oberon!cit-vax!mangler
From: mang...@cit-vax.Caltech.Edu (Don Speck)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Keywords: readahead, striping, file mapping
Message-ID: <6963@cit-vax.Caltech.Edu>
Date: 16 Jun 88 06:32:08 GMT
References: <22957@bu-cs.BU.EDU> <14968@brl-adm.ARPA> <601@modular.UUCP> <23288@bu-cs.BU.EDU> <7980@alice.UUCP> <23326@bu-cs.BU.EDU>
Organization: California Institute of Technology
Lines: 71

In article <23...@bu-cs.BU.EDU>, b...@bu-cs.BU.EDU (Barry Shein) writes:
> I think the proper question is sort/merging a disk farm and doing 1000
> transactions/sec or more while keeping 8 or 12 tapes turning at or
> near their rated 200 ips, not pushing bits thru a single channel

The hard part of this is getting enough disk throughput to feed even one of those 200-ips tape drives. The rest is replication.

Channels sound like essentially moving the disk driver into an I/O processor, with lists of channel control blocks being analogous to lists of struct buf's. This makes it feasible to do more optimizations, even real-time stuff like scatter-gather, chaining, and rotational scheduling.

Barry mentions the UDA-50 as being similar. But its processor is an 8085, and its DMA speed is only 0.8 MB/s, making it much slower than a dumb controller. And the driver ends up spending as much time constructing the channel control blocks as it would spend tending a dumb controller like the Emulex SC7003. The Xylogics 450, Xylogics 472, and DEC TS11 are like this too. I find them all disappointingly slow. I suspect the real reason for channel processors is to reduce interrupts, which are so costly on big CPUs. It makes sense for terminals; people have made I/O processors that talk to Unix in clists (KMC-11s, etc), which cuts the total interrupt rate by a large fraction. But I don't think it's necessary, or necessarily desirable, to inflict this on disks & tapes, and certainly not unless the channel processor can talk in struct buf's.

For all the optimizations that these I/O processors are supposed to do, Unix rarely gives them the chance. Unless there's more than two requests outstanding at once, once they finish one, there's only one request to choose from. Unix has minimal readahead, so that's as many requests as a single process can generate. Raw I/O is even worse. Asynchronous reads would be the obvious way to get enough requests in the queue to optimize, but that seems unlikely to happen. Rather, explicit read commands are giving way to memory-mapped files (in Mach and SunOS 4.0), where readahead becomes synonymous with prepaging. It remains to be seen whether much attention is put into this.

Barry credits the asynchronous nature of I/O on mainframe OS's to the access methods, like RMS on VMS. People avoid those when they want speed (imagine using dbm to do sequential reads). For instance, the VMS "copy" command bypasses RMS when copying disk-to-disk, with the curious result that it's faster to copy to a disk than to the null device, because the null device is record-oriented, requiring RMS.

As DMR demonstrates, parallel-transfer disks are great for big files. They're horrendously expensive though, and it's hard enough to find controllers that keep up with even 3 MB/s, much less 10 MB/s. But they can be simulated with ordinary disks by striping across multiple controllers, *if* the disks rotate as one.
Does anyone know of a cost-effective disk that can phase-lock its spindle motor to that of a second disk, or perhaps to the AC line? With direct-drive, electronically-controlled motors becoming common, this should be possible. The Eagle has such a motor, but no provision for external sync. I recall stories of Crays using phase-locked disks to advantage.

Of course, to get the most from high transfer rates, you need large blocksizes; DMR's example looked like about one revolution. Hence the extent-based file allocation of mainframe OS's, etc. Perhaps it's time to pester Berkeley to double MAXBSIZE to 16384 bytes? It would use 0.3% of memory for additional kernel page tables on a VAX, but proportionately less on machines with larger page sizes. 8192 is practically the *minimum* blocksize on Suns these days.

The one point that nobody mentioned is that you don't want the CPU copying the data around between kernel and user address spaces when there's a lot of it! (Maybe it was just too obvious.)

Don Speck   sp...@vlsi.caltech.edu   {amdahl,ames!elroy}!cit-vax!speck
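Speck's striping idea reduces to a simple address mapping. A minimal sketch follows; the two-way interleave, the image file names, and the block size are assumptions for illustration, not from the post. Logical block i of the striped volume lives at block i/2 of drive i%2, so a sequential scan alternates drives, and with synchronized spindles the two transfers can overlap:

#include <stdio.h>

#define BSIZE  8192        /* one filesystem block */
#define NDRIVE 2           /* two-way stripe, purely illustrative */

/* Map a logical block of the striped volume to (drive, physical block). */
static void stripe_map(long lblk, int *drive, long *pblk)
{
    *drive = (int)(lblk % NDRIVE);  /* alternate drives block by block */
    *pblk  = lblk / NDRIVE;
}

/* Read one logical block from whichever drive the map names. */
static size_t stripe_read(FILE *drives[], long lblk, char *buf)
{
    int  d;
    long p;
    stripe_map(lblk, &d, &p);
    fseek(drives[d], p * (long)BSIZE, SEEK_SET);
    return fread(buf, 1, BSIZE, drives[d]);
}

int main(void)
{
    FILE *drives[NDRIVE];
    char  buf[BSIZE];
    long  i;

    drives[0] = fopen("drive0.img", "rb");   /* hypothetical disk images */
    drives[1] = fopen("drive1.img", "rb");
    if (!drives[0] || !drives[1]) {
        perror("fopen");
        return 1;
    }
    for (i = 0; i < 16; i++)                 /* sequential scan */
        if (stripe_read(drives, i, buf) != BSIZE)
            break;
    printf("read %ld blocks of %d bytes\n", i, BSIZE);
    return 0;
}

A real driver would issue the two controllers' reads concurrently; stdio here only demonstrates the addressing arithmetic.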
Path: utzoo!attcan!uunet!husc6!uwvax!umn-d-ub!umn-cs!bungia!mn-at1!alan
From: a...@mn-at1.k.mn.org (Alan Klietz)
Newsgroups: comp.unix.wizards
Subject: Why UNIX I/O is so slow (was VAX vs SUN 4 performance)
Keywords: readahead, striping, file mapping
Message-ID: <441@mn-at1.k.mn.org>
Date: 17 Jun 88 19:16:32 GMT
References: <22957@bu-cs.BU.EDU> <14968@brl-adm.ARPA> <601@modular.UUCP> <23288@bu-cs.BU.EDU> <7980@alice.UUCP> <23326@bu-cs.BU.EDU> <6963@cit-vax.Caltech.Edu>
Reply-To: a...@mn-at1.UUCP (0000-Alan Klietz)
Organization: Minnesota Supercomputer Center
Lines: 125

In article <6...@cit-vax.Caltech.Edu> mang...@cit-vax.Caltech.Edu (Don Speck) writes:
<In article <23...@bu-cs.BU.EDU>, b...@bu-cs.BU.EDU (Barry Shein) writes:
[why UNIX I/O is so slow compared to big mainframe OS]

A useful model is to partition the time spent by every I/O request into fixed and variable portions. tf is the fixed overhead to reset the interface hardware, queue the I/O request, wait for the data to rotate under the head (for networks, the time to process all of the headers), etc. td is the marginal cost of transferring one unit of data (byte, block, whatever). The total I/O utilization of a channel in this case is characterized by

	         n td
	D = ------------
	     tf + n td

for n units of data, with lim D = 1.0 as n -> inf. td is typically very small (microseconds); tf is typically orders of magnitude higher (milliseconds). The curve usually has a knee; UNIX I/O is often on the left side of the knee, while most mainframe OS's are on the right side.

<For all the optimizations that these I/O processors are supposed to do,
<Unix rarely gives them the chance. Unless there's more than two requests
<outstanding at once, once they finish one, there's only one request to
<choose from. Unix has minimal readahead, so that's as many requests as
<a single process can generate. Raw I/O is even worse.

Yep, Unix needs to do larger I/O transfers. Case in point: the Cray-2 has a 16 Gbyte/sec I/O throughput capability with incredibly expensive 80+ Mbit/s parallel-head disks (often striped). And yet, typing cp bigfile bigfile2 measures a transfer performance of only 18 Mbit/s, because BUFSIZ is 4K.

<Asynchronous reads would be the obvious way to get enough requests in
<the queue to optimize, but that seems unlikely to happen. Rather,
<explicit read commands are giving way to memory-mapped files (in Mach
<and SunOS 4.0) where readahead becomes synonymous with prepaging. It
<remains to be seen whether much attention is put into this.

There have been comments that SunOS 4.0 I/O overhead is 2 or 3 times greater than under 3.0. Demand-paged I/O introduces all of the Turing divination problems of trying to predict which pages (I/O blocks) the program will use next. IMHO, this is a step backward.

<Barry credits the asynchronous nature of I/O on mainframe OS's to the
<access methods, like RMS on VMS. People avoid those when they want
<speed (imagine using dbm to do sequential reads). For instance, the
<VMS "copy" command bypasses RMS when copying disk-to-disk, with the
<curious result that it's faster to copy to a disk than to the null
<device, because the null device is record-oriented, requiring RMS.

RMS systems developed through evolution ("survival of the fastest?") to their current state of being I/O marvels. Hence MVS preallocation requirements, VMS asynch channel I/O, etc.

<As DMR demonstrates, parallel-transfer disks are great for big files.
<They're horrendously expensive though, and it's hard enough to find
<controllers that keep up with even 3 MB/s, much less 10 MB/s.

Disk prices are dropping fast. 8" 1 Gb dual-head disks (6 MB/s) will be common in about a year for $5000-$9000 qty 1. The ANSI X3T9 IPI (Intelligent Peripheral Interface) is now a full standard. It starts at 10 Mb/s and goes up to 25 Mb/s in the current configurations. N.B. the vendors pushing this standard are IBM, CDC, Unisys, Fujitsu, NEC, Hitachi (big mainframe manufacturers). Unix in its current incarnation is unable to take advantage of this new disk technology.

<they can be simulated with ordinary disks by striping across multiple
<controllers, *if* the disks rotate as one. Does anyone know of a cost-
<effective disk that can phase-lock its spindle motor to that of a second
<disk, or perhaps with the AC line? With direct-drive electronically-
<controlled motors becoming common, this should be possible. The Eagle
<has such a motor, but no provision for external sync. I recall stories
<of Cray's using phase-locked disks to advantage.

The thesis of my paper "Turbo NFS" (*) shows how you can get good I/O performance without phase-locked disks by reorganizing the file system contiguously. Cylinders of data are prefetched from selected disks at a rate commensurate with the rate at which the data is consumed by the program. Extents are allocated contiguously by powers of 2. The organization is called a "fractal file system". Phillip Koch did the original work in this area (**).

<Of course, to get the most from high transfer rates, you need large
<blocksizes; DMR's example looked like about one revolution. Hence
<the extent-based file allocation of mainframe OS's, etc. Perhaps
<it's time to pester Berkeley to double MAXBSIZE to 16384 bytes?

Berkeley should start over. The whole business with "cylinder groups" tries to keep sets of blocks relatively near each other. With the new disks today, the average SEEK TIME IS OFTEN FASTER THAN THE ROTATIONAL DELAY. You don't want to keep blocks "near" each other; instead you want to make each extent as large as possible. Sorry, but cylinder groups are archaic.

<The one point that nobody mentioned is that you don't want the CPU
<copying the data around between kernel and user address spaces when
<there's a lot! (Maybe it was just too obvious).

Here is an area where paged I/O has an advantage. The first UNIX vendor to do contiguous file systems + paged I/O + prefetching will win big in the disk I/O race.

<Don Speck sp...@vlsi.caltech.edu {amdahl,ames!elroy}!cit-vax!speck

(*) "Turbo NFS: Fast Shared Access for Cray Disk Storage", A. Klietz (MN Supercomputer Center), Proceedings of the Cray User Group, Spring 1988.
(**) "Disk File Allocation Based on the Buddy System", P. D. L. Koch (Dartmouth), ACM TOCS, Vol 5, No 3, November 1987.
--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN 55415
UUCP: a...@mn-at1.k.mn.org   Ph: +1 612 626 1836
ARPA: a...@uc.msc.umn.edu (was umn-rei-uc.arpa)
(*) An affiliate of the University of Minnesota
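Klietz's utilization model is easy to evaluate numerically. In the sketch below, the tf and td values are invented, chosen only to match the orders of magnitude he cites (milliseconds of fixed overhead, a per-byte cost corresponding to a few MB/s):

#include <stdio.h>

int main(void)
{
    double tf = 20e-3;      /* fixed overhead per request: 20 ms (assumed) */
    double td = 0.33e-6;    /* per-byte transfer time: ~3 MB/s (assumed) */
    long   n;

    /* Utilization D = n*td / (tf + n*td) for increasing transfer sizes. */
    for (n = 4096; n <= 16L * 1024 * 1024; n *= 4) {
        double d = (n * td) / (tf + n * td);
        printf("%9ld bytes: D = %.3f\n", n, d);
    }
    return 0;
}

With these numbers a 4 KB transfer (a typical BUFSIZ-sized Unix request) keeps the channel about 6% utilized, while a multi-megabyte extent approaches 100%: the left and right sides of the knee he describes.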
Path: utzoo!attcan!uunet!husc6!cmcl2!brl-adm!brl-smoke!gwyn
From: g...@brl-smoke.ARPA (Doug Gwyn)
Newsgroups: comp.unix.wizards
Subject: Re: Why UNIX I/O is so slow (was VAX vs SUN 4 performance)
Message-ID: <8124@brl-smoke.ARPA>
Date: 18 Jun 88 02:22:43 GMT
References: <22957@bu-cs.BU.EDU> <14968@brl-adm.ARPA> <601@modular.UUCP> <23288@bu-cs.BU.EDU> <7980@alice.UUCP> <23326@bu-cs.BU.EDU> <6963@cit-vax.Caltech.Edu> <441@mn-at1.k.mn.org>
Reply-To: g...@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>)
Organization: Ballistic Research Lab (BRL), APG, MD.
Lines: 11

In article <4...@mn-at1.k.mn.org> a...@mn-at1.UUCP (0000-Alan Klietz) writes:
-Berkeley should start over. The whole business with "cylinder groups"
-tries to keep sets of blocks relatively near each other. With the new
-disks today, the average SEEK TIME IS OFTEN FASTER THAN THE ROTATIONAL
-DELAY. You don't want to keep blocks "near" each other; instead you want
-to make each extent as large as possible. Sorry, but cylinder groups are
-archaic.

Such considerations should lead to the conclusion that each type of filesystem may need its own access algorithms (perhaps in an I/O processor). This is easy to arrange via the File System Switch.
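Mechanically, a file system switch is a table of per-filesystem operations consulted on each call. The sketch below is a generic function-pointer table with invented names; it is not the actual System V FSS (nor Sun's vnode interface), only the shape the two approaches share:

#include <stdio.h>

/* Per-filesystem operations: each filesystem type supplies its own row. */
struct fs_ops {
    const char *name;
    int (*open)(const char *path);
    int (*read)(int fd, char *buf, int n);
};

/* Invented stand-ins for two filesystem implementations. */
static int ffs_open(const char *p) { printf("ffs: open %s\n", p); return 3; }
static int ffs_read(int fd, char *b, int n) { (void)fd; (void)b; return n; }
static int ext_open(const char *p) { printf("extfs: open %s\n", p); return 4; }
static int ext_read(int fd, char *b, int n) { (void)fd; (void)b; return n; }

/* The "switch" itself: one row per filesystem type. */
static struct fs_ops fstab[] = {
    { "ffs",   ffs_open, ffs_read },
    { "extfs", ext_open, ext_read },
};

int main(void)
{
    char buf[512];
    int  t;

    /* All callers dispatch through the table, so an extent-based
     * filesystem could install read routines tuned for large
     * contiguous transfers without touching any caller. */
    for (t = 0; t < 2; t++) {
        int fd = fstab[t].open("/tmp/x");
        int n  = fstab[t].read(fd, buf, (int)sizeof buf);
        printf("%s: read %d bytes\n", fstab[t].name, n);
    }
    return 0;
}

This indirection is exactly what lets each filesystem type carry its own access algorithms, as Gwyn suggests.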
Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!ncar!noao!arizona!lm
From: l...@arizona.edu (Larry McVoy)
Newsgroups: comp.unix.wizards
Subject: Re: Why UNIX I/O is so slow (was VAX vs SUN 4 performance)
Keywords: actually FSS vs VNODE
Message-ID: <6032@megaron.arizona.edu>
Date: 29 Jun 88 01:12:28 GMT
References: <22957@bu-cs.BU.EDU> <14968@brl-adm.ARPA> <601@modular.UUCP> <23288@bu-cs.BU.EDU> <7980@alice.UUCP> <23326@bu-cs.BU.EDU> <6963@cit-vax.Caltech.Edu> <441@mn-at1.k.mn.org> <8124@brl-smoke.ARPA>
Reply-To: l...@megaron.arizona.edu (Larry McVoy)
Organization: U of Arizona CS Dept, Tucson
Lines: 9

In article <8...@brl-smoke.ARPA> g...@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>Such considerations should lead to the conclusion that each type of
>filesystem may need its own access algorithms (perhaps in an I/O
>processor). This is easy to arrange via the File System Switch.

Do the wizards have a preference (based on logic, not religion, one presumes) between the file system switch and the vnode method of virtualizing file systems? Anyone looked into both?
--
Larry McVoy   laidbak...@sun.com   1-800-LAI-UNIX x286