Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!sri-spam!ames!lll-tis!lll-lcc!pyramid!prls!mips!mash
From: m...@mips.UUCP
Newsgroups: comp.arch
Subject: MIPS Performance Brief, zillions of numbers, very long
Message-ID: <861@winchester.UUCP>
Date: Thu, 29-Oct-87 16:09:02 EST
Article-I.D.: winchest.861
Posted: Thu Oct 29 16:09:02 1987
Date-Received: Mon, 2-Nov-87 06:32:17 EST
Lines: 1232
Keywords: benchmarks

MIPS Performance Brief, PART 1 : CPU Benchmarks, Issue 3.0, October 1987.

This is a condensed version, with MAC Charts, some explanatory details, and
most advertisements deleted.  For a full version on paper, with MAC charts and
tables that aren't squished together, send mail, NOT TO ME, but to:
	....{ucbvax, decvax, ihnp4}!decwrl!mips!eleanor   [Eleanor Bishop]
(Please be patient: it will take a little while to get them sent.)

As usual, I tried pretty hard to get the numbers right, but let us know if we
goofed anywhere.  We'll update the next time.  The posting is 1200+ lines long,
with a thousand or so benchmark numbers & performance ratios for dozens of
machines and benchmarks.  If you're not a glutton for data, "n" now or after
the main summary!
------------
1. Introduction

New Features of This Issue

More benchmarks are normalized to the VAX-11/780 under VAX/VMS, rather than
UNIX.  Livermore Loops and Digital Review Magazine benchmarks have been added,
and the Spice section uses new public domain inputs.

The Brief has been divided into two parts: user and system.  The system
benchmark part is being greatly expanded (beyond Byte), and has been moved to a
new document "MIPS Performance Brief - Part 2".  User-level performance is
mostly driven by compiler and library tuning, whereas system performance also
depends on operating system and hardware configuration issues.  The two Briefs
synchronize with different release cycles: Part 2 will appear 1-2 months after
Part 1.

Benchmarking - Caveats and Comments

While no one benchmark can fully characterize overall system performance, the
results of a variety of benchmarks can give some insight into expected real
performance.  A more important benchmarking methodology is a side-by-side
comparison of two systems running the same real application.

We don't believe in characterizing a processor with just a single number, but
we follow (what seems to be) standard industry practice of using a mips-rating
that essentially describes overall integer performance.  Thus, we label a
5-mips machine to be one that is about 5X (i.e., anywhere from 4X to 6X!)
faster than a VAX 11/780 (UNIX 4.3BSD, unless we can get Ultrix or VAX/VMS
numbers) on integer performance, since this seems to be how most people
intuitively compute mips-ratings.  Even within the same computer family,
performance ratios between processors vary widely.  For example, [McInnis 87]
characterizes a ``6 mips'' VAX 8700 as anywhere from 3X to 7X faster than the
11/780.  Floating point speed often varies more than, and scales up more slowly
than, integer speed versus the 11/780.

This paper analyzes one important aspect of overall computer system
performance - user-level CPU performance.  MIPS Computer Systems does not
warrant or represent that the performance data stated in this document will be
achieved by any particular application.  (We have to say that, sorry.)

2. Benchmark Summary

2.1. Choice of Benchmarks

This brief offers both public-domain and MIPS-created benchmarks.
We prefer public domain ones, but some of the most popular ones are inadequate
for accurately characterizing performance.  In this section, we give an
overview of the importance we attach to the various benchmarks, whose results
are summarized on the next page.

Dhrystone [DHRY 1.1] and Stanford [STAN INT] are two popular small integer
benchmarks.  Compared with the fastest VAX 11/780 systems, the M/1000 is
13-14X faster than the VAX on these tests, and yet, we rate the M/1000 as a
10-vax-mips machine.

While we present Dhrystone and Stanford, we feel that the performance of large
UNIX utilities, such as grep, yacc, diff, and nroff is a better (but not
perfect!) guide to the performance customers will receive.  These four, which
make up our [MIPS UNIX] benchmark, demonstrate that performance ratios are not
single numbers, but range here from 8.6X to 13.7X faster than the VAX.

Even these UNIX utilities tend to overstate performance relative to large
applications, such as CAD applications.  Our own vax-mips ratings are based on
a proprietary set of larger and more stressful real programs, such as our
compiler, assembler, debugger, and various CAD programs.

For floating point, the public domain benchmarks are much better.  We're still
careful not to use a single benchmark to characterize all floating point
applications.

The Livermore Fortran kernels [LLNL DP] give insight into both vector and
non-vector performance for scientific applications.  Linpack [LNPK DP and LNPK
SP] tests vector performance on a single scientific application, and stresses
cache performance.  Spice [SPCE 2G6] and Doduc [DDUC] test a different part of
the floating point application spectrum.  The codes are large and thus test
both instruction fetch bandwidth and scalar floating point.  Digital Review
Magazine's benchmark [DIG REV] is a compendium of FORTRAN tests that measure a
wide variety of behavior, and seems to correlate well with some classes of real
programs.

2.2. Benchmark Summary Data

This section summarizes the most important benchmark results described in more
detail throughout this document.  The numbers show performance relative to the
VAX 11/780, i.e., larger numbers are better/faster.

o A few numbers have been estimated by interpolations from closely-related
  benchmarks and/or closely-related machines.  The methods are given in great
  detail in the individual sections.

o Several of the columns represent summaries of multiple benchmarks.  For
  example, the MIPS UNIX column represents 4 benchmarks, the SPICE 2G6 column
  3, and LLNL DP represents 24.  (A small sketch of how these relative-
  performance summaries are computed appears right after this list.)

o In the Integer section, MIPS UNIX is the most indicative of real
  performance.

o For Floating Point, we especially like LLNL DP (Livermore FORTRAN kernels),
  but all of these are useful, non-toy benchmarks.

o In the following table, "Pub mips" gives the manufacturer-published
  mips-ratings.  As in all tables in this document, the machines are listed in
  increasing order of performance according to the benchmarks, in this case,
  by Integer performance.

o The summary includes only those machines for which we could get measured
  results on almost all the benchmarks and good estimates on the results for
  the few missing data items.
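
To make the arithmetic behind the summary columns concrete, here is a minimal
C sketch of how a relative-performance ratio and a geometric-mean summary can
be computed from raw times.  It is purely illustrative (it is not our actual
reduction script, and the function names are ours); the sample times are the
11/780 and M/1000 figures from the MIPS UNIX table in section 4.1.

    #include <stdio.h>
    #include <math.h>

    /* Relative performance: VAX 11/780 time divided by the machine's time,
       so bigger numbers are faster. */
    double relative(double vax_secs, double machine_secs)
    {
        return vax_secs / machine_secs;
    }

    /* Geometric mean of n performance ratios: the nth root of their product,
       computed via logs to avoid overflow. */
    double geometric_mean(double *ratio, int n)
    {
        double log_sum = 0.0;
        int i;
        for (i = 0; i < n; i++)
            log_sum += log(ratio[i]);
        return exp(log_sum / n);
    }

    int main(void)
    {
        /* grep, diff, yacc, nroff user times (seconds): 11/780 vs M/1000 */
        double vax[4]   = { 11.2, 246.4, 101.1, 18.8 };
        double m1000[4] = {  1.3,  18.0,   9.3,  1.5 };
        double rel[4];
        int i;

        for (i = 0; i < 4; i++)
            rel[i] = relative(vax[i], m1000[i]);
        printf("summary = %.1fX the 11/780\n", geometric_mean(rel, 4));
        return 0;
    }

Run through by hand, the four ratios are 8.6X, 13.7X, 10.9X, and 12.5X, and
their geometric mean is about 11.3X - the MIPS UNIX summary number shown for
the M/1000 below.
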
               Summary of Benchmark Results
          (VAX 11/780 = 1.0, Bigger is Faster)

           Integer (C)      Floating Point (FORTRAN)
        ----------------  -------------------------------------
  MIPS  DHRY  STAN  LLNL  LNPK  LNPK  SPCE  DIG   DDUC  Publ
  UNIX  1.1   INT   DP    DP    SP    2G6   REV         mips  System
   1     1     1     1     1     1     1     1     1     1    VAX 11/780#
   2.1   1.9   1.8   1.9   2.9   2.5   1.6  *2    *1.3   2    Sun3/160 FPA
  *4     4.1   4.7   2.8   3.3   3.4   2.4  *3     1.7   4    Sun3/260 FPA
   5.5   7.4   7.2   2.5   4.3   3.7   3.4   4.9   3.8   5    MIPS M/500
  *6     5.9   6.5   5.9   6.9   5.6   5.3   6.2   5.2   6    VAX 8700
   8.0  10.8   7.3   4.5   7.9   6.4   4.1   4.4   3.5  10    Sun4/260
   9.2  11.3  11.8   8.1   7.1   7.6   6.6   7.6   7.3   8    MIPS M/800
  11.3  13.5  14.1   9.7   8.6   9.2   8.0   9.3   8.8  10    MIPS M/1000

# VAX 11/780 runs 4.3BSD for MIPS UNIX, Ultrix 2.0 (vcc) for Stanford, VAX/VMS
  for all others.  Use of 4.3BSD (no global optimizer) probably inflates the
  MIPS UNIX column by about 10%.
* Although it is nontrivial to gather a full set of numbers, it is important
  to avoid holes in benchmark tables, as it is too easy to be misleading.
  Thus, we had to make reasoned guesses at these numbers.  The MIPS UNIX
  values for VAX 8700 and Sun-3/260 were taken from the published
  mips-ratings, which are consistent (+/- 10%) with experience with these
  machines.  DIG REV and DDUC were guessed by noting that most machines do
  somewhat better on DIG REV than on SPCE, and that a Sun-3/260 is usually
  1.5X faster than a Sun-3/160 on floating-point benchmarks.

Benchmark Descriptions:

MIPS UNIX  MIPS UNIX benchmarks: grep, diff, yacc, nroff, same 4.2BSD C source
           compiled and run on all machines.  The summary number is the
           geometric mean of the 4 relative performance numbers.
DHRY 1.1   Dhrystone 1.1, any optimization except inlining.
STAN INT   Stanford Integer.
LLNL DP    Lawrence Livermore Fortran Kernels, 64-bit.  The summary number is
           given as the relative performance based on the geometric mean,
           i.e., the "middle" of the 3 means.
LNPK DP    Linpack Double Precision, FORTRAN.
LNPK SP    Linpack Single Precision, FORTRAN.
SPCE 2G6   Spice 2G6, 3 public-domain circuits, for which the geometric mean
           is shown.
DIG REV    Digital Review magazine, combination of 33 benchmarks.
DDUC       Doduc Monte Carlo benchmark.

3. Methodology

Tested Configurations

When we report measured results, rather than numbers published elsewhere, the
configurations were as shown below.  These system configurations do not
necessarily reflect optimal configurations, but rather the in-house systems to
which we had repeatable access.  When faster results have been available,
we've quoted them in place of our own systems' numbers.

DEC VAX-11/780
    Main Memory: 8 Mbytes
    Floating Point: Configured with FPA board.
    Operating System: 4.3 BSD UNIX.

DEC VAX 8600
    Main Memory: 20 Mbytes
    Floating Point: Configured without FPA board.
    Operating System: Ultrix V1.2 (4.2BSD with many 4.3BSD tunings).

Sun-3/160M
    CPU: 16.67 MHz MC68020
    Main Memory: 8 Mbytes
    Floating Point: 12.5 MHz MC68881 coprocessor (compiled -f68881).
    Operating System: SunOS 3.2 (4.2BSD)

MIPS M/500
    CPU: 8MHz R2000, in R2300 CPU board, 16K I-cache, 8K D-cache
    Floating Point: R2010 FPA chip (8MHz)
    Main Memory: 8 Mbytes (2 R2350 memory boards)
    Operating System: UMIPS-BSD 2.1 (4.3BSD UNIX with NFS)

MIPS M/800
    CPU: 12.5 MHz R2000, in R2600 CPU board, 64K I-cache, 64K D-cache
    Floating Point: R2010 FPA chip (12.5MHz)
    Main Memory: 8 Mbytes (2 R2350 memory boards)
    Operating System: UMIPS-BSD 2.1

MIPS M/1000
    CPU: 15 MHz R2000, in R2600 CPU board, 64K I-cache, 64K D-cache
    Floating Point: R2010 FPA chip (15 MHz)
    Main Memory: 16 Mbytes (4 R2350 memory boards)
    Operating System: UMIPS-BSD 2.1

Test Conditions

All programs were compiled with -O (optimize), unless otherwise noted.  C is
used for all benchmarks except Whetstone, LINPACK, Doduc, Spice 2g.6, Hspice,
and the Livermore Fortran Kernels, which use FORTRAN.  When possible, we've
obtained numbers for VAX/VMS, and use them in place of UNIX numbers.  The MIPS
compilers are version 1.21.

User time was measured for all benchmarks using the /bin/time command.
Systems were tested in a normal multi-user development environment, with load
factor <0.2 (as measured by the uptime command).  Note that this occasionally
makes them run longer, due to slight interference from background daemons and
clock handling, even on an otherwise empty system.  Benchmarks were run at
least 3 times and averaged.  The intent is to show numbers that can be
reproduced on live systems.

Times (or rates, such as for Dhrystones, Whetstones, and LINPACK KFlops) are
shown for the VAX 11/780.  Other machines' times or rates are shown, along
with their relative performance ("Rel." column), normalized to the 11/780
treated as 1.0.  VAX/VMS is used whenever possible as the base.

Compilers and Operating Systems

Unless otherwise specified, the M-series benchmark numbers use Release 1.21 of
the MIPS compilers and UMIPS-BSD 2.1.

Optimization Levels

Unless otherwise specified, all benchmarks were compiled -O, i.e., with
optimization.  UMIPS compilers call this level -O2, and it includes global
intra-procedural optimization.  In a few cases, we show numbers for -O3 and
-O4 optimization levels, which do inter-procedural register allocation and
procedure merging.  -O3 is now generally available.

Now, let's look at the benchmarks.  Each section title includes the (CODE
NAME) that relates it back to the earlier Summary, if it is included there.

4. Integer Benchmarks

4.1. MIPS UNIX Benchmarks (MIPS UNIX)

The MIPS UNIX Benchmarks described below are fairly typical of nontrivial UNIX
programs.  This benchmark suite provides the opportunity to execute the same
code across several different machines, in contrast to the compilers and
linkers for each machine, which have substantially different code.  User time
is shown; kernel time is typically 10-15% of the user time (on the 780), so
these are good indications of integer/character compute-intensive programs.

The first 3 benchmarks were running too fast to be meaningful on our faster
machines, so we modified the input files to get larger times.  The VAX 8600
ran consistently around 3.8X faster than the 11/780 on these tests, but we
sold it, so it's started to drop out as we've changed benchmarks.  These
benchmarks contain UNIX source code, and are thus not generally distributable.

For better statistical properties, we now report the Geometric Mean of the
Relative performance numbers, because it does not ignore the performance
contributions of the shorter benchmarks.
(In this case, the grep ratios drag the Geometric Mean down.)  Expect real
performance to lie between the Geometric Mean and the Total Relative number.

Note: the Geometric Mean of N numbers is the Nth root of the product of those
numbers.  It is necessarily used in place of the arithmetic mean when
computing the mean of performance ratios, or of benchmarks whose runtimes are
quite different.  See [Fleming 86] for a detailed discussion.

                     MIPS UNIX Benchmarks Results

   grep          diff          yacc         nroff         Total        Geom
  Secs  Rel.   Secs   Rel.   Secs  Rel.   Secs  Rel.    Secs+  Rel.   Mean  System
  11.2  1.0   246.4   1.0   101.1  1.0    18.8  1.0    377.5   1.0    1.0   11/780 4.3BSD
   5.6  2.0   105.3   2.3    48.1  2.1     9.0  2.1    168.0   2.2    2.1   Sun-3/160M
    -    -       -     -       -    -      5.0  3.8       -    3.8    3.8   DEC VAX 8600
   2.4  4.7    35.8   6.9    19.5  5.2     3.3  5.7     61.0   6.2    5.5   MIPS M/500
    -   7        -    8.5      -   9        -   7.5       -     -     8.0   Sun-4 *
   1.6  7.0    21.6  11.4    11.2  9.0     1.9  9.9     36.3  10.4    9.2   MIPS M/800
   1.3  8.6    18.0  13.7     9.3 10.9     1.5 12.5     30.1  12.5   11.3   MIPS M/1000

+ Simple summation of the time for all benchmarks.  "Total Rel." is the ratio
  of the totals.
* These numbers derived as shown on the next page.

Note: in order to assure "apples-to-apples" comparisons, we moved the same
copies of the (4.2BSD) sources for these to the various machines, compiled
them there, and ran them, to avoid surprises from different binary versions of
commands resident on these machines.  Note that the granularity here is at the
edge of UNIX timing, i.e., tenths of seconds make differences, especially on
the faster machines.  The performance ratios seen here seem typical of large
UNIX commands on MIPS systems.

Estimation of Sun-4 Numbers

The 4 ratios cited in the table above were pieced together from somewhat
fragmentary information.  Earlier Briefs used shorter benchmarks for the grep,
diff, and yacc tests, for which we were able to get numbers from a Sun-4, with
SunOS 3.2L, at 16.7MHz.  By comparing with M/500 and M/800 numbers on the same
tests, we can interpolate and at least estimate bounds on performance.

On grep, both Sun-4 and M/800 used .6 seconds user time, so we assumed the
same relative performance (7.0).  On yacc, both Sun-4 and M/800 used .4
seconds, so we assumed the same relative performance (9.0).

On diff, 3 runs on the Sun-4 yielded .3, .4, and .4 seconds.  A MIPS M/500 is
consistently .4, an M/800 .2, and an M/1000 usually .2, but occasionally .1.
This test was clearly too small, and it is difficult to make strong
assertions.  However, it does seem that the Sun-4 is faster than the M/500
(6.9X VAX) but noticeably closer to it than to an M/800 (11.4X).  We thus
estimate around 8.5X, a little lower than halfway between the two MIPS
systems.

On nroff, a setup problem of ours aborted the Sun-4 run.  At a seminar at
Stanford this summer, the following number was given by Sun: 18.5 seconds for:
	troff -man csh.1
This is not sufficient information to allow exact duplication, but we tried
running it two different ways:
	troff -a -man csh.1 >/dev/null
	troff -man ..... csh.1
The second case used numerous arguments to actually run it for a typesetter,
whereas the first emits a representation to the standard output.  The M/500
required 23.4 and 27.9 seconds user time, respectively, while the M/800 gave
user times of 11.2 and 14.1 seconds.  Assuming that the troff results are
similar to those of nroff, and using the worst of the M/800 times, we get a
VAX-relative estimate of:
	(9.9X for M/800) X 14.1 (M/800 secs) / 18.5 (Sun-4 secs)
which yields 7.5X for the Sun-4.
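
For anyone who wants to reproduce that arithmetic, here is a minimal C sketch
of the interpolation.  The variable names are ours and purely illustrative;
the inputs are just the figures quoted above.

    #include <stdio.h>

    int main(void)
    {
        /* Interpolating a VAX-relative nroff/troff estimate for the Sun-4
           from the M/800's rating and the two machines' troff times. */
        double m800_rel  = 9.9;   /* M/800 nroff rating, X the 11/780      */
        double m800_secs = 14.1;  /* worst M/800 troff user time, seconds  */
        double sun4_secs = 18.5;  /* troff time reported by Sun, seconds   */

        double sun4_rel = m800_rel * m800_secs / sun4_secs;
        printf("estimated Sun-4 rating: %.1fX the 11/780\n", sun4_rel);
        return 0;                 /* prints about 7.5X                     */
    }
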
All of this is obviously a gross approximation with numerous items of missing
information.  Timing granularity is inadequate.  Results are being generalized
from small samples.  The source code may well differ for these programs.  Our
csh.1 manual pages may not be identical.  Sun compilers will improve, etc,
etc.  We apologize for the low quality of this data, as Sun-4 access is not
something we have in good supply.  We'll run the newest forms of the
benchmarks as soon as possible.  However, the end result does seem quite
consistent with other benchmarks that various people in the industry have
tried.

Finally, note that this benchmark set is running versus 4.3BSD, not versus
Ultrix 2.0 with vcc.  Hence, the relative performance numbers are inflated
somewhat relative to VAX/VMS or VAX-Ultrix numbers.  From other experience,
we'd guess that subtracting 10% from most of the computed mips-ratings would
give a good estimate of the Ultrix 2.0 (vcc)-relative mips-ratings.

4.2. Dhrystone (DHRY 1.1)

Dhrystone is a synthetic programming benchmark that measures processor and
compiler efficiency in executing a ``typical'' program.  The Dhrystone results
shown below are measured in Dhrystones / second, using the 1.1 version of the
benchmark.

We include Dhrystone because it is popular.  MIPS systems do extremely well on
it.  However, comparisons of systems based on Dhrystone, and especially only
on Dhrystone, are unreliable and should be avoided.  More details are given at
the end of this section.  According to [Richardson 87], 1.1 cleans up a bug,
and is the correct version to use, even though results for a given machine are
typically about 15% less for 1.1 than with 1.0.

Advice for running Dhrystone has changed over time with regard to
optimization.  It used to ask that people turn off optimizers that were more
than peephole optimizers, because the benchmark contained a modest amount of
"dead" code that optimizers were eliminating.  However, it turned out that
many people were submitting optimized results, often unlabeled, confusing
everyone.  Currently, any numbers can be submitted, as long as they're
appropriately labeled, except that procedure inlining (done by only a few very
advanced compilers) must be avoided.

We continue to include a range of numbers to show the difference optimization
technology makes on this particular benchmark, and to provide a range for
comparison when others' cited Dhrystone figures are not clearly defined by
optimization levels.  For example, -O3 does interprocedural register
allocation, and -O4 does procedure inlining, and we know -O4 is beyond the
spirit of the benchmark.  Hence, we now cite the -O3 numbers.  We're not sure
what the Sun-4's -O3 level does, but we do not believe that it does inlining
either.

In the table below, it is interesting to compare the performance of the two
Ultrix compilers.  Also, examination of the MIPS and Sun-4 numbers shows the
performance gained by the high-powered optimizers available on these machines.
The numbers are ordered by what we think is the overall integer performance of
the processors.
Dhrystone Benchmark Results - Optimization Effects No Opt -O -O3 -O4 NoReg Regs NoReg Regs Regs Regs Dhry's Dhry's Dhry's Dhry's Dhry's Dhry's /Sec /Sec /Sec /Sec /Sec /Sec System 1,442 1,474 1,559 1,571 DEC VAX 11/780, 4.3BSD 2,800 3,025 3,030 3,325 Sun-3/160M 4,896 5,130 5,154 5,235 DEC VAX 8600, Ultrix 1.2 8,800 10,200 12,300 12,300 13,000 14,200 MIPS M/500 8,000 8,000 8,700 8,700 DEC VAX 8550, Ultrix 2.0 cc 9,600 9,600 9,600 9,700 DEC VAX 8550, Ultrix 2.0 vcc 10,550 12,750 17,700 17,700 19,000 Sun-4, SunOS 3.2L 12,800 15,300 18,500 18,500 19,800 21,300 MIPS M/800 15,100 18,300 22,000 22,000 23,700 25,000 MIPS M/1000 Some other published numbers of interest include the following, all of which are taken from [Richardson 87], unless otherwise noted. Items marked * are those that we know (or have good reason to believe) use optimizing compilers. These are the "register" versions of the numbers, i.e., the highest ones reported by people. Dhrystone Benchmark Results Dhry's /Sec Rel. System 1571 0.9 VAX 11/780, 4.3BSD [in-house] 1757 1.0 VAX 11/780, VAX/VMS 4.2 [Intergraph 86]* 3325 1.9 Sun3/160, SunOS 3.2 [in-house] 3856 2.2 Pyramid 98X, OSx 3.1, CLE 3.2.0 4433 2.5 MASSCOMP MC-5700, 16.7MHz 68020, RTU 3.1* 4716 2.7 Celerity 1230, 4.2BSD, v3.2 6240 3.6 Ridge 3200, ROS 3.4 6374 3.6 Sun3/260, 25MHz 68020, SunOS 3.2 6423 3.7 VAX 8600, 4.3BSD 6440 3.7 IBM 4381-2, UTS V, cc 1.11 6896 3.9 Intergraph InterPro 32C, SYSV R3 3.0.0, Greenhills, -O* 7109 4.0 Apollo DN4000 -O 7142 4.1 Sun-3/200 [Sun 87] * 7249 4.2 Convex C-1 XP 6.0, vc 1.1 7409 4.2 VAX 8600, VAX/VMS in [Intergraph 86]* 7655 4.4 Alliant FX/8 [Multiflow] 8300 4.7 DG MV20000-I and MV15000-20 [Stahlman 87] 8309 4.7 InterPro-32C,30MHz Clipper,Green Hills[Intergraph 86]* 9436 5.4 Convergent Server PC, 20MHz 80386, GreenHills* 9920 5.6 HP 9000/840S [HP 87] 10416 5.9 VAX 8550, VAX/VMS 4.5, cc 2.2* 10787 6.1 VAX 8650, VAX/VMS, [Intergraph 86]* 11215 6.4 HP 9000/840, HP-UX, full optimization* 12639 7.2 HP 9000/825S [HP 87]* 13000 7.4 MIPS M/500, 8MHz R2000, -O3* 13157 7.5 HP 825SRX [Sun 87]* 14195 8.1 Multiflow Trace 7/200 [Multiflow] 14820 8.4 CRAY 1S 15007 8.5 IBM 3081, UTS SVR2.5, cc 1.5 15576 8.9 HP 9000/850S [HP 87] 18530 10.5 CRAY X-MP 19000 10.8 Sun-4/200, -O3* [Sun 87] 19800 11.3 MIPS M/800, 12.5MHz R2000, -O3* 23700 13.5 MIPS M/1000, 15MHz R2000, -O3* 28846 16.4 Amdahl 5860, UTS-V, cc1.22 31250 17.8 IBM 3090/200 43668 24.9 Amdahl 5890/300E, cc -O Unusual Dhrystone Attributes We've calibrated this benchmark against many more realistic ones, and we believe that its results must be treated with care, because the detailed pro- gram statistics are unusual in some ways. It has an unusually low number of instructions per function call (35-40 on our machines), where most C programs fall in in the 50-60 range or higher. Stated another way, Dhrystone does more function calls than usual, which especially penalizes the DEC VAX, making this a favored benchmark for inflating one's "VAX-mips" rating. Any machine with a lean function call sequence looks a little better on Dhrystone that it does on others. The dynamic nesting depth of function calls inside the timed part of Dhrystone is low (3-4). This means that most register-window RISC machines would never even once overflow/underflow their register windows and be required to save/restore registers. This is not to say fast function calls or register windows are bad (they're not!), merely that this benchmark overstates their performance effects. 
Dhrystone can spend 30-40% of the time in the strcpy function, copying
atypically long (30-character) strings, which happen to be alignable on word
boundaries.  More realistic programs don't spend this much time in this sort
of code, and when they do, they handle more, shorter strings: 6 characters
would be much more typical.

On our machines, Dhrystone uses 0-offset addressing for 50% of memory data
references (dynamic).  Most real programs use 0-offsets 10-15% of the time.

Of course, Dhrystone is a fairly small benchmark, and thus fits into almost
any reasonable instruction cache.

In conclusion, Dhrystone gives some indication of user-level integer
performance, but is susceptible to surprises when comparing amongst
architectures that differ strongly.  Unfortunately, the industry seems to lack
a good set of widely-available integer benchmarks that are as representative
as are some of the popular floating point ones.

4.3. Stanford Small Integer Benchmarks (STAN INT)

The Computer Systems Laboratory at Stanford University has collected a set of
programs to compare the performance of various systems.  These benchmarks are
popular in some circles, as they are small enough to simulate, and are
responsive to compiler optimizations.  It is well known that small benchmarks
can be misleading.  If you see claims that machine X is up to N times a VAX on
some (unspecified) benchmarks, these benchmarks are probably the sort they're
talking about.

               Stanford Small Integer Benchmark Results

 Perm  Tower Queen Intmm Puzzle Quick Bubble Tree  Aggr   Rel.
 Secs  Secs  Secs  Secs  Secs   Secs  Secs   Secs  Secs*  Perf+  System
 2.34  2.30   .94  1.67  11.23  1.12  1.51   2.72  3.08    .84   VAX 11/780 4.3BSD
                                                   2.60   1.0    VAX 11/780@
  .72  1.07   .50   .93   5.53   .58   .97   1.05  1.42   1.8    Sun-3/160M [ours]
  .63   .63   .27   .73   2.96   .31   .44    .69   .86   3.0    VAX 8600 Ultrix1.2
  .28   .35   .17   .42   2.22   .18   .25    .35   .50   5.2    VAX 8550#
  .28   .35   .13   .15    .88   .13   .17    .50   .40   6.5    VAX 8550##
                                                    .65   4.7    Sun-3/200 [Sun 87]
  .18   .24   .15   .23   1.15   .17   .19    .34   .36   7.2    MIPS M/500
                                                    .36   7.3    Sun-4/200 [Sun 87]
  .12   .16   .11   .13    .61   .10   .12    .22   .22  11.8    MIPS M/800
  .10   .13   .10   .11    .51   .08   .10    .17   .18  14.1    MIPS M/1000

*  As weighted by the Stanford Benchmark Suite
+  Ratios of the Aggregate times
@  Estimated VAX 11/780 Ultrix 2.0 vcc -O time.  We get this by
   3.08 * (.40+.02)/.50 = 2.60, i.e., using the VAX 8550 numbers to estimate
   the effect of optimization.  The ".02" is a guess that optimization helps
   the 8550 a little more than it does the 11/780, because the former's cache
   is big enough to hold the whole program and data, whereas the latter's is
   not.  Another way to put it is that the 8550 is not cache-missing very
   much, and so optimization pays off more in removing what's left, whereas
   the 11/780 will cache-miss more, and the nature of these particular tests
   is that the optimizations won't fix cache-misses.  (None of this is very
   scientific, but it's probably within 10%!)
#  Ultrix 2.0 cc -O
## Ultrix 2.0 vcc -O.  The quick and bubble tests actually had errors;
   however, the times were in line with expectations (these two optimize
   well), so we used them.

All 8550 numbers thanks to Greg Pavlov (ames!harvard!hscvax!pavlov, of
Amherst, NY).  The Sun numbers are from [Sun 87].  The published Sun-4 number
is .356, for SunOS 3.2L software, i.e., it is slightly faster than the M/500.

5. Floating Point Benchmarks

5.1. Livermore Fortran Kernels (LLNL DP)

Lawrence Livermore National Labs' workload is dominated by large scientific
calculations that are largely vectorizable.
The workload is primarily served by expensive supercomputers.  This benchmark
was designed for evaluation of such machines, although it has been run on a
wide variety of hardware, including workstations and PCs [McMahon 86].

The Livermore Fortran Kernels are 24 pieces of code abstracted from the
applications at Lawrence Livermore Labs.  These kernels are embedded in a
large, carefully engineered benchmark driver.  The driver runs the kernels
multiple times on different data sets, checks for correct results, verifies
timing accuracy, reports execution rates for all 24 kernels, and summarizes
the results with several statistics.

Unlike many other benchmarks, there is no attempt to distill the benchmark
results down to a single number.  Instead, all 24 kernel rates, measured in
mflops (million floating point operations per second), are presented
individually for three different vector lengths (a total of 72 results).  The
minimum and maximum rates define the performance range of the hardware.
Various statistics of the 24 or 72 rates, such as the harmonic, geometric, and
arithmetic means, give insight into general behavior.  Any one of these
statistics might suffice for comparisons of scalar machines, but multiple
statistics are necessary for comparisons involving machines with vector or
parallel features.  These machines have unbalanced, bimodal performance, and a
single statistic is insufficient characterization.

McMahon asserts: ``When the computer performance range is very large the net
Mflops rate of many Fortran programs and workloads will be in the sub-range
between the equi-weighted harmonic and arithmetic means depending on the
degree of code parallelism and optimization.  More accurate estimates of cpu
workload rates depend on assigning appropriate weights for each kernel.''

McMahon's analysis goes on to suggest that the harmonic mean corresponds to
approximately 40% vectorization, the geometric mean to approximately 70%
vectorization, and the arithmetic mean to 90% vectorization.  These three
statistics can be interpreted as different benchmarks that each characterize
certain applications.  For example, there is fair agreement between the
kernels' harmonic mean and Spice performance.  LINPACK, on the other hand, is
better characterized by the geometric mean.

On the next two pages are shown a summary of results from McMahon's report,
followed by the complete M/1000 results.  (Given the volume of data, we've
only done this on M/1000s; M/800s scale directly by .833X.)  The complete
M/1000 data shows that MIPS performance is insensitive to vector length.  The
minimum to maximum variation is also small for this benchmark.  Both
characteristics are typical of scalar machines with mature compilers.
Performance of vector and parallel machines, on the other hand, may span two
orders of magnitude on this benchmark, or more, depending on the kernel and
the vector length.

              64-Bit Livermore FORTRAN Kernels
        MegaFlops, L = 167, Sorted by Geometric Mean
          Harm.   Geom.   Arith.          Rel.*
   Min    Mean    Mean    Mean     Max    Geom.
System .05 .12 .12 .13 .24 .7 VAX 780 w/FPA 4.3BSD f77 [ours] .06 .16 .17 .18 .28 1.0 VAX 780 w/FPA VMS 4.1 .11 .30 .33 .37 .87 1.9 SUN 3/160 w/FPA .20 .42 .46 .50 1.42 2.5 MIPS M/500 .17 .43 .48 .53 1.13 2.8 SUN 3/260 w/FPA [our numbers] .29 .58 .64 .70 1.21 3.8 Alliant FX/1 FX 2.0.2 Scalar .38 .72 .77 .83 1.57 4.5 SUN 4/260 w/FPA [Hough 87] .39 .94 1.00 1.04 1.64 5.9 VAX 8700 w/FPA VMS 4.1 .10 .76 1.06 1.50 5.23 6.2 Alliant FX/1 FX 2.0.2 Vector .33 .92 1.06 1.20 2.88 6.2 Convex C-1 F77 V2.1 Scalar .52 1.09 1.19 1.30 2.74 7.0 ELXSI 6420 EMBOS F77 MP=1 .51 1.26 1.37 1.48 2.70 8.1 MIPS M/800 .61 1.51 1.65 1.78 3.24 9.7 MIPS M/1000 1.01 1.06 1.94 3.33 12.79 11.4 Convex C-1 F77 V2.1 Vector .28 1.24 2.32 5.11 29.20 13.7 Alliant FX/8 FX 2.0.2 MP=8*Vec 1.51 4.93 5.86 7.00 17.43 34.5 Cray-1S CFT 1.4 scalar 1.23 4.74 6.09 7.67 21.64 35.8 FPS 264 SJE APFTN64 3.43 9.29 10.68 12.15 25.89 62.8 Cray-XMP/1 COS CFT77.12 scalar 0.97 6.47 11.94 22.20 82.05 70.2 Cray-1S CFT 1.4 vector 4.47 11.35 13.08 15.20 45.07 76.9 NEC SX-2 SXOS1.21 F77/SX24 scalar 1.47 12.33 24.84 50.18 188 146 Cray-XMP/1 COS CFT77.12 vector 4.47 19.07 43.94 140 1042 258 NEC SX-2 SXOS1.21 F77/SX24 vector 32-Bit Livermore FORTRAN Kernels MegaFlops, L = 167, Sorted by Geometric Mean Harm. Geom. Arith. Rel.* Min Mean Mean Mean Max Geom. System .05 .18 .20 .23 .48 .7 VAX 780 4.3BSD f77 [ours] .10 .28 .30 .32 .58 1.0 VAX 780 w/FPA VMS 4.1 .19 .46 .50 .56 1.26 1.7 SUN 3/160 w/FPA .30 .65 .71 .77 1.55 2.4 SUN 3/260 w/FPA [ours] .30 .66 .74 .83 1.60 2.5 Alliant FX/1 FX 2.0.2 Scalar .10 .60 .90 1.31 4.23 3.0 Alliant FX/1 FX 2.0.2 Vector .40 .97 1.05 1.14 2.08 3.5 MIPS M/500 .55 1.04 1.12 1.20 2.21 3.7 SUN 4/260 w/FPA [Hough 87] .36 1.11 1.27 1.42 3.61 4.2 Convex C-1 F77 V2.1 Scalar .46 1.26 1.36 1.45 2.41 4.5 VAX 8700 w/FPA VMS 4.1 .68 1.31 1.46 1.61 3.19 4.9 ELXSI 6420 EMBOS F77 MP=1 .93 2.02 2.19 2.36 3.96 7.3 MIPS M/1000 .28 1.30 2.47 5.59 33.52 8.2 Alliant FX/8 FX 2.0.2 MP=8*Vec .12 1.27 2.73 5.44 23.60 9.1 Convex C-1 F77 V2.1 Vector * Relative Performance, as ratio of the Geometric Mean numbers. This is a simplistic attempt to extract a single figure-of-merit. We admit this goes against the grain of this benchmark. The next table gives the complete M/1000 output, in the form used by McMahon. 
Livermore FORTRAN Kernels - Complete MIPS M/1000 Output Vendor MIPS MIPS MIPS MIPS | MIPS MIPS MIPS MIPS Model M/1000 M/1000 M/1000 M/1000 | M/1000 M/1000 M/1000 M/1000 OSystem BSD2.1 BSD2.1 BSD2.1 BSD2.1 | BSD2.1 BSD2.1 BSD2.1 BSD2.1 Compiler 1.21 1.21 1.21 1.21 | 1.21 1.21 1.21 1.21 OptLevel O2 O2 O2 O2 | O2 O2 O2 O2 Samples 72 24 24 24 | 72 24 24 24 WordSize 64 64 64 64 | 32 32 32 32 DO Span 167 19 90 471 | 167 19 90 471 Year 1987 1987 1987 1987 | 1987 1987 1987 1987 Kernel ------ ------ ------ ------ | ------ ------ ------ ------ 1 2.2946 2.2946 2.3180 2.3136 | 2.9536 2.9536 2.9613 2.9727 2 1.6427 1.6427 1.8531 1.8381 | 2.1218 2.1218 2.4691 2.4758 3 2.0625 2.0625 2.1260 2.1021 | 2.8935 2.8935 2.9389 2.9853 4 1.3440 1.3440 1.7954 1.9600 | 1.6836 1.6836 2.2084 2.4248 5 1.4652 1.4652 1.4879 1.4776 | 2.0924 2.0924 2.1096 2.1374 6 1.0453 1.0453 1.3734 1.4183 | 1.3076 1.3076 1.7920 1.8517 7 3.1165 3.1165 3.1304 3.1281 | 3.9336 3.9336 3.9581 3.9623 8 2.4829 2.4829 2.5725 2.5686 | 3.1625 3.1625 3.2853 3.2612 9 3.2215 3.2215 3.2359 3.2290 | 3.8708 3.8708 3.8831 3.8632 10 1.2293 1.2293 1.2336 1.2327 | 2.4413 2.4413 2.3419 2.3263 11 1.1907 1.1907 1.2274 1.2320 | 1.5789 1.5789 1.6365 1.6559 12 1.2102 1.2102 1.2404 1.2308 | 1.6004 1.6004 1.6414 1.6471 13 0.6095 0.6095 0.6272 0.6378 | 0.9288 0.9288 0.9334 0.9428 14 0.9712 0.9712 0.9455 0.6695 | 1.2133 1.2133 1.2175 1.0092 15 0.9894 0.9894 0.9605 0.9585 | 1.3701 1.3701 1.3314 1.3314 16 1.6427 1.6427 1.6159 1.6272 | 1.7904 1.7904 1.7567 1.7500 17 2.3898 2.3898 2.2674 2.2667 | 3.7320 3.7320 3.4866 3.5210 18 2.1462 2.1462 2.3321 2.3276 | 2.5642 2.5642 2.8431 2.8365 19 1.8268 1.8268 1.8540 1.8536 | 2.2883 2.2883 2.3369 2.3466 20 2.7821 2.7821 2.7800 1.9158 | 3.7104 3.7104 3.7214 3.6583 21 1.5372 1.5372 1.6004 1.6201 | 1.9644 1.9644 2.0564 2.0855 22 1.4507 1.4507 1.4506 1.4489 | 2.0802 2.0802 2.0646 2.0658 23 2.1395 2.1395 2.3897 2.3729 | 2.5972 2.5972 2.9489 2.9532 24 1.0148 1.0148 1.0458 1.0448 | 1.1995 1.1995 1.2226 1.2281 -------------- ------ ------ ------ ------ | ------ ------ ------ ------ Standard Dev. 0.6878 0.6938 0.6902 0.6742 | 0.8702 0.8870 0.8616 0.8669 Median Dev. 0.6869 0.6445 0.7023 0.6546 | 0.9837 0.8980 1.0239 1.0360 Maximum Rate 3.2359* 3.2215 3.2359 3.2290 | 3.9623* 3.9336 3.9581 3.9623 Average Rate 1.7834* 1.7419 1.8110 1.7698 | 2.3611* 2.2949 2.3811 2.3872 Geometric Mean 1.6469* 1.6053 1.6752 1.6330 | 2.1935* 2.1238 2.2175 2.2168 Median Rate 1.6272* 1.5899 1.7056 1.7326 | 2.2084* 2.1071 2.2727 2.3365 Harmonic Mean 1.5078* 1.4724 1.5368 1.4874 | 2.0235* 1.9590 2.0503 2.0372 Minimum Rate 0.6095* 0.6095 0.6272 0.6378 | 0.9288* 0.9288 0.9334 0.9428 Maximum Ratio 1.0000 0.9955 1.0000 0.9978 | 1.0000 0.9927 0.9989 1.0000 Average Ratio 1.0000 0.9767 1.0154 0.9923 | 1.0000 0.9719 1.0084 1.0110 Geometric Ratio 1.0000 0.9747 1.0171 0.9915 | 1.0000 0.9682 1.0109 1.0106 Harmonic Mean 1.0000 0.9765 1.0192 0.9864 | 1.0000 0.9681 1.0132 1.0067 Minimum Rate 1.0000 1.0000 1.0290 1.0464 | 1.0000 1.0000 1.0049 1.0150 * These are the numbers brought forward into the summary section. 5.2. LINPACK (LNPK DP and LNPK SP) The LINPACK benchmark has become one of the most widely used single benchmarks to predict relative performance in scientific and engineering environments. The usual LINPACK benchmark measures the time required to solve a 100x100 sys- tem of linear equations using the LINPACK package. LINPACK results are meas- ured in MFlops, millions of floating point operations per second. All numbers are from [Dongarra 87], unless otherwise noted. 
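
For readers who want to check a LINPACK number, here is a minimal C sketch of
the usual conversion from solve time to MFlops.  The operation count,
2/3*n^3 + 2*n^2 for an n-by-n system, is the one conventionally used for this
benchmark; the 0.57-second time below is a hypothetical input, not a quoted
result, and the function name is ours.

    #include <stdio.h>

    /* Conventional LINPACK operation count for factoring and solving an
       n x n dense system: 2/3*n^3 + 2*n^2 floating point operations. */
    double linpack_flops(int n)
    {
        return (2.0 * n * n * n) / 3.0 + 2.0 * n * n;
    }

    int main(void)
    {
        int    n    = 100;    /* the standard 100x100 LINPACK case        */
        double secs = 0.57;   /* hypothetical solve time, in seconds      */
        double mflops = linpack_flops(n) / (secs * 1.0e6);

        printf("%dx%d LINPACK: %.2f MFlops\n", n, n, mflops);
        return 0;             /* about 687,000 flops / 0.57 s = 1.20 MFlops */
    }
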
The LINPACK package calls on a set of general-purpose utility routines called BLAS -- Basic Linear Algebra Subroutines -- to do most of the actual computa- tion. A FORTRAN version of the BLAS is available, and the appropriate routines are included in the benchmark. However, vendors often provide hand-coded ver- sions of the BLAS as a library package. Thus LINPACK results are usually cited in two forms: FORTRAN BLAS and Coded BLAS. The FORTRAN BLAS actually come in two forms as well, depending on whether the loops are 4X unrolled in the FOR- TRAN source (the usual) or whether the unrolling is undone to facilitate recog- nition of the loop as a vector instruction. According to the ground rules of the benchmark, either may be used when citing FORTRAN BLAS results, although it is typical to note rolled loops with the annotation ``(Rolled BLAS).'' For our own numbers, we've corrected a few to follow Dongarra more closely than we have in the past. LINPACK output produces quite a few MFlops numbers, and we've tended to use the fourth one in each group, which uses more iterations, and thus is more immune to clock randomness. Dongarra uses the highest MFlops number that appears, then rounds to two digits. Note that relative ordering even within families is not particularly con- sistent, illustrating the extreme sensitivity of these benchmarks to memory system design. 100x100 LINPACK Results - FORTRAN and Coded BLAS From [Dongarra 87], Unless Noted Otherwise DP DP SP SP Fortran Coded Fortran Coded System .10 .10 .11 .11 Sun-3/160, 16.7MHz (Rolled BLAS)o .11 .11 .13 .11 Sun-3/260,25MHz 68020o20MHz 68881 (Rolled BLAS)+ .14 - - - Apollo DN4000, 25MHz (68020 o 68881) [ENEWS 87] .14 - .24 - VAX 11/780, 4.3BSD, LLL Fortran [ours] .14 .17 .25 .34 VAX 11/780, VAX/VMS .20 - .24 - 80386+80387, 20MHz, 64K cache, GreenHills .20 .23 .40 .51 VAX 11/785, VAX/VMS .29 .49 .45 .69 Intergraph IP-32C,30Mz Clipper[Intergraph 86] .30 - - - IBM RT PC, optional FPA [IBM 87] .33 - .57 - OPUS 300PM, Greenhills, 30MHz Clipper .36 .59 .51 .72 Celerity C1230, 4.2BSD f77 .38 - .67 - 80386oWeitek 1167,20MHz,64K cache, GreenHills .39 .50 .66 .81 Ridge 3200/90 .41 .41 .62 .62 Sun-3/160, Weitek FPA (Rolled BLAS)o .45 .54 .60 .74 HP9000 Model 840S [HP 87] .46 .46 .86 .86 Sun-3/260, Weitek FPA (Rolled BLAS)o .47 .81 .69 1.30 Gould PN9000 .49 .66 .84 1.20 VAX 8600, VAX/VMS 4.5 .49 .54 .62 .68 HP 9000/825S [HP 87] .57 .72 .86 .87 HP9000 Model 850S [HP 87] .60 .72 .93 1.2 MIPS M/500 .61 - .84 - DG MV20000-I, MV15000-20 [Stahlman, 87] .65 .76 .80 .96 VAX 8500, VAX/VMS .70 .96 1.3 1.9 VAX 8650, VAX/VMS .78 - 1.1 - IBM 9370-90, VS FORT 1.3.0 .97 1.1 1.4 1.7 VAX 8550/8700/8800, VAX/VMS 1.0 1.3 1.9 3.6 MIPS M/800 1.1 1.1 1.6 1.6 SUN 4/260 (Rolled BLAS)+ 1.2 1.7 1.3 1.6 ELXSI 6420 1.2 1.6* 2.3* 4.3 MIPS M/1000 1.6 2.0 1.6 2.0 Alliant FX-1 (1 CE) 2.1 - 2.4 - IBM 3081K H enhanced opt=3 3.0 3.3 4.3 4.9 CONVEX C-1/XP, Fort 2.0 (Rolled BLAS) 6.0 - - - Multiflow Trace 7/200 Fortran 1.4 (Rolled BLAS) 7.0 11.0 7.6 9.8 Alliant FX-8, 8 CEs, FX Fortran, v2.0.1.9 12 23 n.a. n.a. CRAY 1S CFT (Rolled BLAS) 39 57 n.a. n.a. CRAY X-MP CFT (Rolled BLAS) 43 - 44 - NEC SX-2, Fortran 77/SX (Rolled BLAS) + The Sun FORTRAN Rolled BLAS code appears to be optimal, so we used the same numbers for Coded BLAS. The 4X unrolled numbers for Sun-4 are .86 (DP) and 1.25 (SP) [Hough 87]. * These numbers are as reported by Dongarra. We prefer the typical results, which are slightly lower, viz. 1.2, 1.5, 2.2, and 4.3. 
On the next page, we take a subset of these numbers, and normalize them to the
VAX/VMS 11/780.

          100x100 LINPACK Results - FORTRAN and Coded BLAS
      VAX/VMS Relative Performance For A Subset of the Systems

  Rel.     Rel.     Rel.     Rel.
  DP       DP       SP       SP
  Fortran  Coded    Fortran  Coded    System
    .8       .6       .5       .3     Sun-3/260, 25MHz 68020 + 20MHz 68881 (Rolled)
   1.0      1.0      1.0      1.0     VAX 11/780, VAX/VMS
   1.4       -       1.0       -      80386+80387, 20MHz, 64K cache, GreenHills
   2.0      2.9      1.8      2.0     Intergraph IP-32C, 30MHz Clipper [Intergraph 86]
   2.7       -       2.7       -      80386+Weitek 1167, 20MHz, 64K cache, GreenHills
   2.9      2.4      2.5      1.8     Sun-3/160, Weitek FPA (Rolled BLAS)
   3.3      2.7      3.4      2.5     Sun-3/260, Weitek FPA (Rolled BLAS)
   3.5      3.9      3.4      3.5     VAX 8600, VAX/VMS 4.5
   4.1      4.2      3.4      2.6     HP9000 Model 850S [HP 87]
   4.3      4.2      3.7      3.5     MIPS M/500
   6.9      6.6      5.6      5.0     VAX 8550/8700/8800, VAX/VMS
   7.1      7.6      7.6     10.6     MIPS M/800
   7.9      6.5      6.4      4.7     SUN 4/260 (Rolled BLAS)
   8.6      9.4      9.2     12.6     MIPS M/1000
  11.4     11.8      6.4      5.9     Alliant FX-1 (1 CE)
  21.4     19.4     17.2     14.4     CONVEX C-1/XP, Fort 2.0 (Rolled BLAS)
  50       65       30       28.8     Alliant FX-8, 8 CEs, FX Fortran, v2.0.1.9
 307        -      176        -       NEC SX-2, Fortran 77/SX (Rolled BLAS)

5.3. Spice Benchmarks (SPCE 2G6)

Spice [UCB 87] is a general-purpose circuit simulator written at U.C.
Berkeley.  Spice and its derivatives are widely used in the semiconductor
industry.  It is a valuable benchmark because it shares many characteristics
with other real-world programs that are not represented in popular small
benchmarks.  It uses both integer and floating-point computation heavily.  The
floating-point calculations are not vector oriented, as in LINPACK.  Also, the
program itself is very large and therefore tests both instruction and data
cache performance.

We have chosen to benchmark Spice version 2g.6 because of its general
availability.  This is one of the later and more popular Fortran versions of
Spice distributed by Berkeley.  We felt that the circuits distributed with the
Berkeley distribution for testing and benchmarking were not sufficiently large
and modern to serve as benchmarks.  In previous versions of this Brief, we
presented results on circuits we felt were representative, but which contained
proprietary data.  This time, we gathered and produced appropriate benchmark
circuits that can be distributed, and they have since been posted as public
domain on Usenet.  The Spice group at Berkeley found these circuits to be
up-to-date and good candidates for Spice benchmarking.  By distributing the
circuits we obtained results for many other machines.

In the table below, "Geom Mean" is the geometric mean of the 3 "Rel." columns.

                      Spice2G6 Benchmarks Results
       digsr            bipole          comparator     Geom
   Secs    Rel.     Secs    Rel.     Secs    Rel.
Mean System 1354.0 0.60 439.6 0.68 460.3 0.63 .6 VAX 11/780 4.3BSD, f77 V2.0 993.5 0.81 394.3 0.76 366.9 0.80 .8 Microvax-II Ultrix 1.1, fortrel 901.9 0.90 285.1 1.0 328.6 0.89 .9 SUN 3/160 SunOS 3.2 f77 -O -f68881 848.0 0.95 312.6 0.96 302.9 0.96 1.0 VAX 11/780 4.3BSD, fortrel -opt 808.1 1.0 299.1 1.0 291.7 1.0 1.0 VAX 11/780 VMS 4.4 /optimize 744.8 1.1 221.7 1.3 266.0 1.1 1.2 SUN 3/260 SunOS 3.2 f77 -O -f68881 506.5 1.6 170.0 1.8 189.1 1.5 1.6 SUN 3/160 SunOS 3.2 f77 -O -ffpa 361.2 2.2 112.0 2.7 129.4 2.3 2.4 SUN 3/260 SunOS 3.2 f77 -O -ffpa 296.5 2.7 73.4 4.1 83.0 3.5 3.4 MIPS M/500 225.9 3.6 63.7 4.7 73.4 4.0 4.1 SUN 4/260 f77 -O3 -Qoption as -Ff0+ - - - - - - 5.3 VAX 8700 (estimate) 136.5 5.9 42.6 7.0 41.4 7.0 6.6 MIPS M/800 125.5 6.4 39.5 7.6 39.3 7.4 7.1 AMDAHL 470V7 VMSP FORTVS4.1 114.3 7.1 35.4 8.4 34.5 8.5 8.0 MIPS M/1000 48.0 16.8 12.5 23.9 17.5 16.7 18.9 FPS 20/64 VSPICE (2g6 derivative) + Sun numbers are from [Hough 87], who notes that the Sun-4 number was beta software, and that a few modules did not optimize. Benchmark descriptions: digsr CMOS 9 bit Dynamic shift register with parallel load capability, i.e., SISO (Serial Input Serial Output) and PISO (Parallel Input Serial Out- put), widely used in microprocessors. Clock period is 10 ns. Channel length = 2 um, Gate Oxide = 400 Angstrom. Uses MOS LEVEL=2. bipole Schottky TTL edge-triggered register. Supplied with nearly- coin- cident inputs (synchronizer application). comparator Analog CMOS auto-zeroed comparator, composed of Input, Differential Amplifier and Latch. Input signal is 10 microvolts. Channel Length = 3 um, Gate Oxide = 500 Angstrom. Uses MOS LEVEL=3. Each part is con- nected by the capacitance coupling, which is often used for the offset cancellation. (Sometimes called Toronto, in honor of its source). Hspice is a commercial version of Spice offered by Meta-Software, which recently published benchmark results for a variety of machines [Meta-software 87]. (Note that the M/800 number cited there was before the UMIPS-BSD 2.1 and f77 1.21 releases, and the numbers have improved). The VAX 8700 Spice number (5.3X) was estimated by using the Hspice numbers below for 8700 and M/800, and the M/800 Spice number: (5.5: 8700 Hspice) / (6.9: M/800 Hspice) X (6.6: M/800 Spice) yields 5.3X. This section indicates that the performance ratios seem to hold for at least one important commercial version as well. Hspice Benchmarks Results HSPICE-8601K ST230 Secs Rel. System 166.5 .6 VAX 11/780, 4.2BSD 92.2 1.0 VAX 11/780 VMS 91.5 1.0 Microvax-II VMS 29.2 3.2 ELXSI 6400 29.1 3.2 Alliant FX/1 25.3 3.6 HyperSPICE (EDGE) 16.8 5.5 VAX 8700 VMS 16.3 5.7 IBM 4381-12 13.4 6.9 MIPS M/800 [ours] 11.3 8.2 MIPS M/1000 [ours] 3.27 28.2 IBM 3090 2.71 34.0 CRAY-1S 5.4. Digital Review (DIG REV) The Digital Review magazine benchmark [DR 87] is a 3300-line FORTRAN program that includes 33 separate tests, mostly floating-point, some integer. The magazine reports the times for all tests, and summarizes them with the geometric mean seconds shown below. All numbers below are from [DR 87], except the M/500 and M/800 figures. Digital Review Benchmarks Results Secs Rel. System 9.17 1.0 VAXstation II/GPX, VMS 4.5 2.90 3.2 VAXstation 3200 2.32 4.0 VAX 8600, VMS 4.5 2.09 4.4 Sun-4, SunOS 3.2L 1.86 4.9 MIPS M/500 [ours] 1.584 5.8 VAX 8650 1.480 6.2 Alliant FX/8, 1 CE 1.469 6.2 VAX 8700 1.200 7.6 MIPS M/800 [ours] 1.193 7.7 ELXSI 6420 .990 9.3 MIPS M/1000* .487 18.8 Convex C-1 XP * The actual run number was .99, which [DR 87] reported as 1.00. 5.5. 
Doduc Benchmark (DDUC) This benchmark [Doduc 86] is a 5300-line FORTRAN program that simulates aspects of nuclear reactors, has little vectorizable code, and is thought to be representative of Monte-Carlo simulations. It uses mostly double precision floating point, and is often viewed as a ``nasty'' benchmark, i.e., it breaks things, and makes machines underperform their usual VAX-mips ratings. Perfor- mance is given as a number R normalized to 100 for an IBM 370/168-3 or 170 for an IBM 3033-U, [ R = 48671/(cpu time in seconds) ], so that larger R's are better. In order of increasing performance, following are numbers for various machines. All are from [Doduc 87] unless otherwise specified. Double Precision Doduc Benchmark Results DoDuc R Relative Factor Perf. System 17 0.7 Sun3/110, 16.7MHz 19 0.7 Intel 80386o80387, 16MHz, iRMX 22 0.8 Sun-3/260, 25MHz 68020, 20MHz 68881 26 1.0 VAX 11/780, VMS 33 1.3 Fairchild Clipper, 30MHz, Green Hills 43 1.7 Sun-3/260, 25MHz, Weitek FPA 48 1.8 Celerity C1260 50 1.9 CCI Power 6/32 53 2.0 Edge 1 64 2.5 Harris HCX-7 85 3.3 Alliant FX/1 88 3.4 MIPS M/500, f77 -O2, runs 553 seconds 90 3.5 IBM 4381-2 90 3.5 Sun-4/200 [Hough 1987], SunOS 3.2L, runs 540 seconds 91 3.5 DEC VAX 8600, VAX/VMS 97 3.7 ELXSI 6400 99 3.8 DG MV/20000 100 3.8 MIPS M/500, f77 -O3, runs 488 seconds 101 3.9 Alliant FX/8 113 4.3 FPSystems 164 119 4.6 Gould 32/8750 129 5.0 DEC VAX 8650 136 5.2 DEC VAX 8700, VAX/VMS 150 5.7 Amdahl 470 V8, VM/UTS 178 6.8 MIPS M/800, f77 -O2, runs 273 secs 181 7.0 IBM 3081-G, F4H ext, opt=2 190 7.3 MIPS M/800, f77 -O3, runs 256 secs 218 8.4 MIPS M/1000, f77 -O2, runs 223 secs 229 8.8 MIPS M/1000, f77 -O3, runs 213 secs 236 9.1 IBM 3081-K 475 18.3 Amdahl 5860 714 27.5 IBM 3090-200, scalar mode 1080 41.6 Cray X/MP [for perspective:we have a lonnng way to go yet!] 5.6. Whetstone Whetstone is a synthetic mix of floating point and integer arithmetic, function calls, array indexing, conditional jumps, and transcendental functions [Curnow 76]. Whetstone results are measured in KWips, thousands of Whetstone interpreter instructions per second. In this case, some of our numbers actually went down, although compiled code has generally improved. First, the accuracy of several library routines was improved, at a slight cost in performance. Second, on machines this fast, relatively few clock ticks are actually counted, and UNIX timing includes some variance. We've been running many runs and averaging. We've now increased the loop counts from 10 to 1000 to increase the total run- ning time to the point where the variance is reduced. This changed the bench- mark slightly. Our experiences show some general uncertainty about the numbers reported by anybody: we've heard that various different source programs are being used. Whetstone Benchmark Results DP DP SP SP KWips Rel. Kwips Rel. 
System 410 0.5 500 0.4 VAX 11/780, 4.3BSD, f77 [ours] 715 0.9 1,083 0.9 VAX 11/780, LLL compiler [ours] 830 1.0 1,250 1.0 VAX 11/780 VAX/VMS [Intergraph 86] 960 1.2 1,040 0.8 Sun3/160, 68881 [Intergraph 86] 1,110 1.3 1,670 1.3 VAX 11/785, VAX/VMS [Intergraph 86] 1,230 1.5 1,250 1.0 Sun3/260, 25MHz 68020, 20MHz 68881 1,400 1.7 1,600 1.3 IBM RT PC, optional FPA [IBM 87] 1,730 2.1 1,860 1.5 Intel 80386o80387, 20MHz, 64K cache, GreenHills 1,740 2.1 2,980 2.4 Intergraph InterPro-32C,30MHz Clipper[Intergraph86] 1,744 2.1 2,170 1.7 Apollo DN4000, 25MHz 68020, 25MHz 68881 [ENEWS 87] 1,860 2.2 2,400 1.9 Sun3/160, FPA 2,092 2.5 3,115 2.5 HP 9000/840S [HP 87] 2,433 2.9 3,521 2.8 HP 9000/825S [HP 87] 2,590 3.1 4,170 3.3 Intel 80386oWeitek 1167, 20MHz, Green Hills 2,600 3.1 3,400 2.7 Sun3/260, Weitek FPA [measured elsewhere] 2,670 3.2 4,590 3.7 VAX 8600, VAX/VMS [Intergraph 86] 2,907 3.5 4,202 3.4 HP 9000 Model 850S [HP 87] 3,540 4.3 5,290 4.2 Sun-4 (reported secondhand, not confirmed) - - 6,400 5.1 DG MV/15000-12 3,950 4.8 6,670 5.3 VAX 8700, VAX/VMS, Pascal(?) [McInnis, 1987] 4,000 4.8 6,900 5.5 VAX 8650, VAX/VMS [Intergraph 86] 4,120 5.0 4,930 3.9 Alliant FX/8 (1 CE) [Alliant 86] 4,200 5.1 - - Convex C-1 XP [Multiflow] 4,220 5.1 5,430 4.3 MIPS M/500 6,930 8.0 8,570 6.9 MIPS M/800 7,960 9.6 10,280 8.2 MIPS M/1000 12,605 15 - - Multiflow Trace 7/200 [Multiflow] 25,000 30 - - IBM 3090-200 [Multiflow] 35,000 42 - - Cray X-MP/12 6. Acknowledgements Some people have noted that they seldom believe the numbers that come from cor- porations unless accompanied by names of people who take responsibility for the numbers. Many people at MIPS have contributed to this document, which was ori- ginally created by Web Augustine. Particular contributors to this issue include Mark Johnson (much Spice work, including creation of public-domain Spice benchmarks), and especially Earl Killian (a great deal of work in various areas, particularly floating-point). Final responsibility for the numbers in this Brief is taken by the editor, John Mashey. We thank David Hough of Sun Microsystems, who kindly supplied numbers for some of the Sun configurations, even fixing a few of our numbers that were incorrectly high, and who has also offered good comments on joint efforts look- ing for higher-quality benchmarks. We also thank Cliff Purkiser of Intel, who posted the Intel 80386 Whetstone and LINPACK numbers on Usenet. We also thank Greg Pavlov, who ran hordes of Stanford and Dhrystone benchmarks for us on a VAX 8550, Ultrix 2.0 system. 7. References [Alliant 86] Alliant Computer Systems Corp, "FX/Series Product Summary", October 1986. [Curnow 76] Curnow, H. J., and Wichman, B. A., ``A Synthetic Benchmark'', Computing Journal, Vol. 19, No. 1, February 1976, pp. 43-49. [Doduc 87] Doduc, N., FORTRAN Central Processor Time Benchmark, Framentec, June 1986, Version 13. Newer numbers were received 03/17/87, and we used them where different. E-mail: seismo!mcvax!ftcsun3!ndoduc [Dongarra 87] Dongarra, J., ``Performance of Various Computers Using Standard Linear Equa- tions in a Fortran Environment'', Argonne National Laboratory, August 10, 1987. [Dongarra 87b] Dongarra, J., Marin, J., Worlton, J., "Computer Benchmarking: paths and pit- falls", IEEE Spectrum, July 1987, 38-43. [DR 87] "A New Twist: Vectors in Parallel", June 29, 1987, "The M/1000: VAX 8800 Power for Price of a MicroVAX II", August 24, 1987, and "VAXstation 3200 Benchmarks: CVAX Eclipses MicroVAX II", September 14, 1987. Digital Review, One Park Ave., NY, NY 10016. 
[ENEWS 87] Electronic News, ``Apollo Cuts Prices on Low-End Stations'', July 6, 1987, p. 16. [Fleming 86] Fleming, P.J. and Wallace, J.J.,``How Not to Lie With Statistics: The Correct Way to Summarize Benchmark Results'', Communications of the ACM, Vol. 29, No. 3, March 1986, 218-221. [HP 87] Hewlett Packard, ``HP 9000 Series 800 Performance Brief'', 5954-9903, 5/87. (A comprehensive 40-page characterization of 825S, 840S, 850S). [Hough 86,1] Hough, D., ``Weitek 1164/5 Floating Point Accelerators'', Usenet, January 1986. [Hough 86,2] Hough, D., ``Benchmarking and the 68020 Cache'', Usenet, January 1986. [Hough 86,3] Hough, D., ``Floating-Point Programmer's Guide for the Sun Workstation'', Sun Microsystems, September 1986. [an excellent document, including a good set of references on IEEE floating point, especially on micros, and good notes on benchmarking hazards]. Sun3/260 Spice numbers are from later mail. [Hough 87] Hough, D., ``Sun-4 Floating-Point Performance'', Usenet, 08/04/87. [IBM 87] IBM, ``IBM RT Personal Computer (RT PC) New Models, Features, and Software Overview, February 17, 1987. [Intergraph 86] Intergraph Corporation, ``Benchmarks for the InterPro 32C'', December 1986. [Meta-Software 87] Meta-Software, ``HSPICE Performance Benchmarks'', June 1987. 50 Curtner Avenue, Suite 16, Campbell, CA 95008. [McInnis 87] McInnis, D., Kusik, R., Bhandarkar, D., ``VAX 8800 System Overview'', Proc. IEEE COMPCON, March 1987, San Francisco, 316-321. [McMahon 86] ``The Livermore Fortran Kernels: A Computer Test of the Numerical Perfor- mance Range'', December 1986, Lawrence Livermore National Labs. [MIPS 87] MIPS Computer Systems, "A Sun-4 Benchmark Analysis", and "RISC System Bench- mark Comparison: Sun-4 vs MIPS", July 23, 1987. [Purkiser 87] Purkiser, C., ``Whetstone and LINPACK Numbers'', Usenet, March 1987. [Richardson 87] Richardson, R., ``9/20/87 Dhrystone Benchmark Results'', Usenet, Sept. 1987. Rick publishes the source several times a year. E-mail address: ...!seismo!uunet!pcrat!rick [Serlin 87a] Serlin, O., ``MIPS, DHRYSTONES, AND OTHER TALES'', Reprinted with revisions from SUPERMICRO Newsletter, April 1986, ITOM International, P.O. Box 1450, Los Altos, CA 94023. Analyses on the perils of simplistic benchmark measures. [Serlin 87b] Serlin, O., SUPERMICRO #69, July 31, 1987. pp. 1-2. Offers good list of attributes customers should demand of vendor benchmark- ing. [Stahlman 87] Stahlman, M., "The Myth of Price/performance", Sanford C. Bernstein & Co, Inc, NY, NY, March 17, 1987. [Sun 86] SUN Microsystems, ``The SUN-3 Family: A Hardware Overview'', August 1986. [Sun 87] SUN Microsystems, SUN-4 Product Introduction Material, July 7, 1987. [UCB 87] U. C. Berkeley, CAD/IC group, ``SPICE2G.6'', March 1987. Contact: Cindy Manly, EECS/ERL Industrial Liason Program, 479 Cory Hall, University of Cal- ifornia, Berkeley, CA 94720. [Weicker 84] Weicker, R. P., ``Dhrystone: A Synthetic Systems Programming Benchmark'', Communications of the ACM, Vol. 27, No. 10, October 1984, pp. 1013-1030. ________ UNIX is a Registered Trademark of AT&T. DEC, VAX, Ultrix, and VAX/VMS are trademarks of Digital Equipment Corp. Sun-3, Sun-4 are Trademarks on Sun Microsystems. Many others are trademarks of their respective companies. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash OR m...@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086