Path: utzoo!yunexus!geac!syntron!jtsv16!uunet!ncrlnk!ncrcae!hubcap!gatech!ncar!ames!vsi1!wyse!mips!mash
From: m...@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: MIPS Performance Brief 3.5, October 1988 [more than long]
Keywords: benchmarks
Message-ID: <7000@winchester.mips.COM>
Date: 26 Oct 88 05:08:06 GMT
Article-I.D.: winchest.7000
Lines: 1380

People have been beating me up for this, so here it is.  As usual, let me
know if you have better numbers for things.  We're always trying to get
things up-to-date, but it's a Sisyphean task.  (Note that this is an extract
from the printed version, crunched to reduce the size for the net.)

------------READ NO FURTHER UNLESS YOU'RE A GLUTTON FOR NUMBERS

1.  Introduction

New Features of This Issue

Added to this issue are performance numbers for our new M/2000-8 systems, and
new 1.31 compiler numbers for a few benchmarks.  As a preview, a few numbers
are shown for the next version of the compilers.

Benchmarking - Caveats and Comments

While no one benchmark can fully characterize overall system performance, the
results of a variety of benchmarks can give some insight into expected real
performance.  A more important benchmarking methodology is a side-by-side
comparison of two systems running the same real application.

We don't believe in characterizing a processor with just a single number, but
we follow (what seems to be) standard industry practice of using a
mips-rating that essentially describes overall integer performance.  Thus, we
label a 20-mips machine to be one that is about 20X (i.e., anywhere from 15X
to 25X!) faster than a VAX 11/780 on integer performance, since this seems to
be how most people intuitively compute mips-ratings.  (We compare against
VAX/VMS compilers when possible, 4.3BSD otherwise.)  Even within the same
computer family, performance ratios between processors vary widely.  For
example, [McInnis 87] characterizes a ``6 mips'' VAX 8700 as anywhere from 3X
to 7X faster than the 11/780.
Floating point speed often varies more than, and scales up more slowly than,
integer speed versus the 11/780.  In practice, we find that MIPS RISComputer
mips-ratings are grossly similar to DEC's relative performance ratings, i.e.,
we've tried to make one "MIPS-mip" about equal to one DEC "VUP" (VAX Unit of
Performance).  We try to use a benchmark mix, and we compare against DEC's
best compilers when at all possible, just as DEC does.  Note that VUPs are
not MVUPs (MicroVAX II Units of Performance): a MicroVAX-II is not as fast as
a VAX 11/780 in many cases.

This paper analyzes one aspect of overall computer system performance -
user-level CPU performance.  MIPS Computer Systems does not warrant or
represent that the performance data stated in this document will be achieved
by any particular application.  (We have to say that, sorry.)

2.  Benchmark Summary

2.1.  Choice of Benchmarks

This brief offers both public-domain and MIPS-created benchmarks.  We prefer
public-domain ones, but some of the most popular ones are inadequate for
accurately characterizing performance.  In this section, we give an overview
of the importance we attach to the various benchmarks, whose results are
summarized on the next page.

Dhrystone [DHRY 1.1] and Stanford [STAN INT] are two popular small integer
benchmarks.  Compared with the fastest VAX 11/780 systems, the M/120-5 is
13-16X faster than the VAX on these tests, and yet we rate the M/120-5 as a
12-vax-mips machine.  In fact, if we chose different VAX software to compare
against, we could call the M/120-5 17-19 mips, right now.  However, our
mips-ratings are derived from the performance of real programs, and we
conclude the artificial benchmarks are not representative.  We observe that
many vendors claim mips-ratings based on the most favorable choice of
benchmarks ("Dhrystone mips", for example) or performance estimates for
machines not built.  If you're comparing an M/120-5 against such claims, it
is a 19-mips machine.
While we include Dhrystone and Stanford, we feel that the performance of
large UNIX utilities, such as grep, yacc, diff, and nroff, is a better (but
not perfect!) guide to the performance customers will receive.  These four,
which make up our [MIPS UNIX] benchmark, demonstrate that performance ratios
are not single numbers, but range here from 10X to 16X faster than the VAX,
and 17-24X on the M/2000.

Even these UNIX utilities tend to overstate performance relative to large
applications, such as CAD applications.  Our own vax-mips ratings are based
on a proprietary set of larger and more stressful real programs, such as our
compiler, assembler, debugger, and various CAD programs.

For floating point, the public domain benchmarks are much better.  We're
still careful not to use a single benchmark to characterize all floating
point applications.

The Livermore Fortran kernels [LLNL DP] give insight into both vector and
non-vector performance for scientific applications.  Linpack [LNPK DP and
LNPK SP] tests vector performance on a single scientific application, and
stresses cache performance.  Spice [SPCE 2G6] and Doduc [DDUC] test a
different part of the floating point application spectrum.  The codes are
large and thus test both instruction fetch bandwidth and scalar floating
point.

2.2.  Benchmark Summary Data

This section summarizes the most important benchmark results described in
more detail throughout this document.  The numbers show performance relative
to the VAX 11/780, i.e., larger numbers are better/faster.

  o A few numbers have been estimated by interpolations from closely-related
    benchmarks and/or closely-related machines.  The methods are given in
    great detail in the individual sections.

  o Several of the columns represent summaries of multiple benchmarks.  For
    example, the MIPS UNIX column represents 4 benchmarks, the SPICE 2G6
    column 3, and LLNL DP represents 24.

  o In the Integer section, MIPS UNIX is the most indicative of real
    performance.
  o For Floating Point, we especially like LLNL DP (Livermore FORTRAN
    kernels), but all of these are useful, non-toy benchmarks.

  o In the following table, "Publ mips" gives the manufacturer-published
    mips-ratings.  As in all tables in this document, the machines are
    listed in increasing order of performance according to the benchmarks,
    in this case, by Integer performance.

  o The summary includes only those machines for which we could get measured
    results on almost all the benchmarks and good estimates on the results
    for the few missing data items.

  o The next few pages contain a summary table and graph.

                     Summary of Benchmark Results
                (VAX 11/780 = 1.0, Larger is Faster)

      Integer (C)          Floating Point (FORTRAN)
 MIPS  DHRY  STAN   LLNL  LNPK  LNPK  SPCE  DDUC  Publ
 UNIX   1.1   INT     DP    DP    SP   2G6        mips  System
    1     1     1      1     1     1     1     1     1  VAX 11/780#
  2.1   1.9   3.0    1.9   2.9   2.5   1.6  *1.3     2  Sun-3/160 FPA
   *4   4.1   4.4    2.8   3.3   3.4   2.4   1.7     4  Sun-3/260 FPA
  6.4   7.4   6.2    2.5   4.3   3.7   3.4   3.8     5  MIPS M/500
   *6   5.9   6.2    5.9   7.1   5.6  *5.3   5.2     6  VAX 8700
  9.9  10.8  10.7    4.5   7.9   6.4   4.1   3.5    10  Sun-4/200
 10.9  11.3  10.0    8.1   8.6  11.2   6.6   7.3     8  MIPS M/800
 13.1  13.5  12.3   10.8  10.7  14.0   8.0   8.7    10  MIPS M/1000
 14.9  15.6  13.3   12.1  15.0  16.0   9.7  11.1    12  MIPS M/120-5
 21.9  24.1  21.3   18.2  25.7  23.6  16.0  17.0    20  MIPS M/2000-8

#  VAX 11/780 runs 4.3BSD for MIPS UNIX, Ultrix 2.0 (vcc) for Stanford, and
   VAX/VMS for all others.  Use of 4.3BSD (no global optimizer) probably
   inflates the MIPS UNIX column by about 10%.

*  Although it is nontrivial to gather a full set of numbers, it is
   important to avoid holes in benchmark tables, as it is too easy to be
   misleading.  Thus, we had to make reasoned guesses at these numbers.  The
   MIPS UNIX values for the VAX 8700 and Sun-3/260 were taken from the
   published mips-ratings, which are consistent (+/- 10%) with experience
   with these machines.  DDUC was guessed by noting that most machines do
   somewhat better on DDUC than on SPCE, and that a Sun-3/260 is usually
   1.5X faster than a Sun-3/160 on floating-point benchmarks.
Benchmark Descriptions:

MIPS UNIX  MIPS UNIX benchmarks: grep, diff, yacc, nroff; the same 4.2BSD C
           source compiled and run on all machines.  The summary number is
           the geometric mean of the 4 relative performance numbers.
DHRY 1.1   Dhrystone 1.1, any optimization except inlining.
STAN INT   Stanford Integer.
LLNL DP    Lawrence Livermore Fortran Kernels, 64-bit.  The summary number
           is given as the relative performance based on the geometric mean,
           i.e., the "middle" of the 3 means.
LNPK DP    Linpack Double Precision, FORTRAN.
LNPK SP    Linpack Single Precision, FORTRAN.
SPCE 2G6   Spice 2G6, 3 public-domain circuits, for which the geometric mean
           is shown.
DDUC       Doduc Monte Carlo benchmark.

3.  Methodology

Tested Configurations

When we report measured results, rather than numbers published elsewhere,
the configurations were as shown below.  These system configurations do not
necessarily reflect optimal configurations, but rather the in-house systems
to which we had repeatable access.  When we've had faster results available,
we've quoted them in place of our own systems' numbers.

DEC VAX-11/780
     Main Memory:       8 Mbytes
     Floating Point:    Configured with FPA board.
     Operating System:  4.3 BSD UNIX.
MIPS M/800
     CPU:               12.5 MHz R2000, in R2600 CPU board, 64K I-cache,
                        64K D-cache
     Floating Point:    R2010 FPA chip (12.5 MHz)
     Main Memory:       8 Mbytes (2 R2350 memory boards)
     Operating System:  UMIPS-BSD 2.1
MIPS M/1000
     CPU:               15 MHz R2000, in R2600 CPU board, 64K I-cache,
                        64K D-cache
     Floating Point:    R2010 FPA chip (15 MHz)
     Main Memory:       16 Mbytes (4 R2350 memory boards)
     Operating System:  UMIPS 3.0
MIPS M/120-5
     CPU:               16.7 MHz R2000, 64K I-cache, 64K D-cache
     Floating Point:    R2010 FPA chip (16.7 MHz)
     Main Memory:       16 Mbytes (2 memory boards)
     Operating System:  UMIPS 3.0
MIPS M/2000-8
     CPU:               25 MHz R3000, 64K I-cache, 64K D-cache
     Floating Point:    R3010 FPA chip (25 MHz)
     Main Memory:       32 Mbytes
     Operating System:  UMIPS 3.10

Test Conditions

All programs were compiled with -O (optimize), unless otherwise noted.
C is used for all benchmarks except Whetstone, LINPACK, Doduc, Spice 2G6,
Hspice, and the Livermore Fortran Kernels, which use FORTRAN.  When
possible, we've obtained numbers for VAX/VMS, and use them in place of UNIX
numbers.  The MIPS compilers are version 1.21 or 1.31.

User time was measured for all benchmarks using the /bin/time command.
Systems were tested in a normal multi-user development environment, with
load factor <0.2 (as measured by the uptime command).  Note that this
occasionally makes them run longer, due to slight interference from
background daemons and clock handling, even on an otherwise empty system.
Benchmarks were run at least 3 times and averaged.  The intent is to show
numbers that can be reproduced on live systems.

How to Interpret the Numbers

Times (or rates, such as for Dhrystones, Whetstones, and LINPACK KFlops) are
shown for the VAX 11/780.  Other machines' times or rates are shown, with
their relative performance ("Rel." column) normalized to the 11/780 treated
as 1.0.  VAX/VMS is used whenever possible as the base.

Compilers and Operating Systems

Unless specified otherwise, the M-series benchmark numbers use Release 1.31
of the MIPS compilers and UMIPS 3.0.  Compiler release 1.31 improved many of
the FORTRAN numbers, but changed integer performance relatively little.

UMIPS 3.0 (RISC/os) is a System V, Release 3.0 port, with TCP/IP, NFS, a
Berkeley Fast File System, and other Berkeley features.  UMIPS 3.10 added
support for the M/2000, but is otherwise similar, and both use compiler
release 1.31.  Most user-level programs run at about the same speed on
UMIPS-BSD 2.1 and UMIPS 3.0.

Optimization Levels

Unless otherwise specified, all benchmarks were compiled -O, i.e., with
optimization.  UMIPS compilers call this level -O2, and it includes global
intra-procedural optimization.  In a few cases, we show numbers for -O3 and
-O4 optimization levels, which do inter-procedural register allocation and
procedure merging.

Now, let's look at the benchmarks.
Each section title includes the (CODE NAME) that relates it back to the
earlier Summary, if it is included there.

4.  Integer Benchmarks

4.1.  MIPS UNIX Benchmarks (MIPS UNIX)

The MIPS UNIX Benchmarks are fairly typical of nontrivial UNIX commands.
This benchmark suite provides the opportunity to execute the same code
across several different machines, in contrast to the compilers and linkers
for each machine, which have substantially different code.  These benchmarks
contain UNIX source code, and so are not generally distributable.

User time is shown; kernel time is typically 10-15% of the user time, so
these are good indications of integer/character compute-intensive programs.
The old grep and nroff benchmarks ran too quickly on the faster machines to
be meaningful; we now use longer tests.  The old versions are still shown
for reference, but no longer summarized.  Temporary machine unavailability
forced us to estimate the speeds on the Suns for these two tests, shown in
italics.

Note: the Geometric Mean of N numbers is the Nth root of the product of
those numbers.  It is necessarily used in place of the arithmetic mean when
computing the mean of performance ratios, or of benchmarks whose runtimes
are quite different.  See [Fleming 86] for a detailed discussion.

                      MIPS UNIX Benchmarks Results
               (4.3BSD VAX 11/780 = 1.0, Larger is Faster)

   grep        diff        yacc        nroff      Geom                 old-grep   old-nroff
  Secs  Rel.  Secs  Rel.  Secs  Rel.  Secs  Rel.  Mean  System         Secs Rel.  Secs Rel.
  58.5   1.0 246.4   1.0 101.1   1.0 108.1   1.0   1.0  11/780 4.3BSD  11.2  1.0  18.8  1.0
  29.2   2.0 105.3   2.3  48.1   2.1  51.5   2.1   2.1  Sun-3/160M      5.6  2.0   9.0  2.1
   7.8   7.5  35.8   6.9  19.5   5.2  17.5   6.2   6.4  MIPS M/500      2.4  4.7   3.3  5.7
   5.1  11.5  25.1   9.7  11.8   8.6  10.6  10.2   9.9  Sun-4/200 -O3   1.6  7.0   2.2  8.6
   5.1  11.5  21.7  11.4  11.2   9.0   9.2  11.8  10.9  MIPS M/800      1.6  7.0   1.9  9.9
   4.2  13.9  18.0  13.7   9.3  10.9   7.6  14.2  13.1  MIPS M/1000     1.3  8.6   1.5 12.5
   3.7  15.8  15.7  15.7   8.1  12.5   6.8  15.9  14.9  MIPS M/120-5    1.1 10.2   1.3 14.5
   2.5  23.4  11.3  21.8   5.4  18.7   4.5  24.0  21.9  MIPS M/2000-8   0.7 16.0   0.9 20.9

Note: in order to assure "apples-to-apples" comparisons, we moved the same
copies of the (4.2BSD) sources for these to the various machines, compiled
them there, and ran them, to avoid surprises from different binary versions
of commands resident on these machines.  Note that the granularity here is
at the edge of UNIX timing, i.e., tenths of seconds make differences,
especially on the faster machines, although we've pushed this back by
beefing up the grep and nroff benchmarks.  Of course, by 1989, the smaller
benchmarks will be in trouble again, but by then, we hope to replace these
with a set of better ones anyway.

Sun estimates: we had to estimate the Sun performance on several benchmarks,
where numbers are shown in italics.  We assumed that the Sun-3/160M
performance ratio would stay about constant, and so kept the same ratios as
the old-grep and old-nroff columns.  For the Sun-4, we computed the grep
number by improving the performance ratio over old-grep by the largest
factor of improvement found among the MIPS machines (the M/800):
(11.5/7) * 7 = 11.5.  We did the same for nroff: (11.8/9.9) * 8.6 = 10.2.
The result seems appropriate: the Sun-4/2xx has usually shown slightly lower
integer performance than an M/800.

The new benchmarks seem better indicators of real performance than the old
ones, which exercised less code.
For example, the new nroff benchmark uses a macro package, which is a little
more realistic.  Still, the performance change means that one should take
care in running randomly-chosen UNIX commands: simple changes to improve
timing have added several apparent mips!  Fortunately, these are not the
actual benchmarks we use in our own relative-performance computations, so we
still think the M/120 is a 12-VUPs machine, not 15-VUPs.

Note this benchmark set is run versus 4.3BSD, not versus Ultrix 2.0 with
vcc.  From experience, we'd guess that subtracting 10%-15% from most of the
computed mips-ratings would give a good estimate of the Ultrix 2.0
(vcc)-relative mips-ratings, depending on the machine's performance on more
stressful benchmarks.

With this background, it is interesting to analyze [AMD 88], which supplies
diff, grep, and nroff benchmarks that are, of course, not exactly the same
benchmarks as we used.  It computed VAX-11/780 (4.3BSD)-relative performance
ratios, as we have, and included Sun-4 performance ratios from an earlier
issue of this Brief.  As seen already, there can easily be several VAX-mips'
difference in choosing the specific benchmark, so we can't recommend mixing
benchmarks together this way, especially as the AMD cases have run times
15-80X shorter than the current MIPS cases, and are simulations for the
29000s.  We don't know whether or not the AMD simulator does cache flushing
appropriate to running something in a UNIX environment, which makes the
comparison even more difficult.
But, to try to get even the most tenuous comparison with the AMD 29000,
we'll use the AMD technique, combining data from our last table with that of
AMD's:

                    MIPS UNIX + AMD UNIX Benchmarks
               (4.3BSD VAX 11/780 = 1.0, Larger is Faster)

          grep                diff          nroff           Geom
 oldMIPS newMIPS  AMD     MIPS   AMD   oldMIPS newMIPS  AMD Mean  System
     1       1     1        1     1       1       1      1     1  VAX 11/780#
     -       -    3.0       -    3.6      -       -     3.1  3.2  Sun 3/60
    7.0    11.5    -       9.7    -      8.6    10.2     -   9.3  Sun-4/200
    7.0    11.5    -      11.4    -      9.9    11.8     -  10.1  MIPS M/800
     -       -   10.7       -   13.4      -       -    10.8 11.4  29K VRAM, 25MHz
    8.6    13.9    -      13.7    -     12.5    14.2     -  12.4  MIPS M/1000
   10.2    15.8    -      15.7    -     14.5    15.9     -  14.2  MIPS M/120
     -       -   14.4       -   18.4      -       -    13.6 15.3  29K cache, 25MHz
   16.0    23.4    -      21.8    -     20.5    24.0     -  20.9  MIPS M/2000

This is haphazard data at best, but we think this says that the cached 29000
@ 25MHz acts somewhat like a MIPS M/120, or slightly faster.  (On grep, the
AMD sits between the two M/120 numbers, on diff it is faster, and on nroff
it is slower.)  We'd call it 13 MIPS-mips, maybe 14 as compilers improve, or
caches are made larger.  None of these benchmarks stress the caches, so it
is difficult to guess what would happen on larger benchmarks.  To be fair,
that may be irrelevant anyway, as the 29000 seems tuned more for controller
environments than large-system environments.

4.2.  Dhrystone (DHRY 1.1)

Dhrystone is a synthetic programming benchmark that measures processor and
compiler efficiency in executing a ``typical'' program.  The Dhrystone
results shown below are measured in Dhrystones / second, using the 1.1
version of the benchmark.

We include Dhrystone because it is popular.  MIPS systems do extremely well
on it.  However, comparisons of systems based on Dhrystone and, especially,
only on Dhrystone, are unreliable and should be avoided.  See details at the
end of this section.  Results for a given machine are typically about 15%
less for 1.1 than with 1.0, and another 10% less for 2.x.
We've found that most unlabeled Dhrystones offered by vendors use 1.1, so we
still summarize that version, but we now include Dhrystone 2.1.

Advice for running Dhrystone has changed over time.  It used to ask people
to turn off anything but peephole optimization, as the benchmark contained a
modest amount of "dead" code.  (This is one of the things that Dhrystone 2
attempts to fix.)  However, many people actually were submitting optimized
results, often unlabeled, confusing everyone.  Currently, any numbers can be
submitted, as long as they're appropriately labeled and they avoid procedure
inlining, which is done by only a few very advanced compilers.

We continue to include a range of numbers to show the difference
optimization technology makes on this particular benchmark, and to provide a
range for comparison when others' cited Dhrystone figures are not clearly
defined by optimization levels.  For example, -O3 does interprocedural
register allocation, and -O4 does procedure inlining; -O4 is beyond the
spirit of the benchmark.  Sun's -O3 and our -O3 do different things, but
neither does inlining, so we cite those numbers.

Compare the performance of the two Ultrix compilers.  Also, see the MIPS and
Sun-4 numbers for the performance gained by the high-powered optimizers
available on these machines.
     Dhrystone (1.1, some 2.1) Benchmark Results - Optimization Effects

    No Opt            -O            -O3     -O4
  NoReg    Regs    NoReg    Regs    Regs    Regs
 Dhry's  Dhry's  Dhry's  Dhry's  Dhry's  Dhry's
   /Sec    /Sec    /Sec    /Sec    /Sec    /Sec  System
  1,442   1,474   1,559   1,571                  DEC VAX 11/780, 4.3BSD
  2,800   3,025   3,030   3,325                  Sun-3/160M
  4,896   5,130   5,154   5,235                  DEC VAX 8600, Ultrix 1.2
  8,800  10,200  12,300  12,300  13,000  14,200  MIPS M/500
  8,000   8,000   8,700   8,700                  DEC VAX 8550, Ultrix 2.0 cc
  9,600   9,600   9,600   9,700                  DEC VAX 8550, Ultrix 2.0 vcc
 10,550  12,750  17,700  17,700  19,000          Sun-4/200, SunOS 3.2L
 12,800  15,300  18,500  18,500  19,800  21,300  MIPS M/800
 15,100  18,300  22,000  22,000  23,700  25,000  MIPS M/1000
 18,700  21,500  25,800  25,800  27,400  29,200  MIPS M/120-5
 30,700  32,400  39,700  39,700  42,300  45,300  MIPS M/2000-8

 DHRYSTONE 2.1
 19,000  20,400  23,200  23,200  24,700  27,000  MIPS M/120-5
 31,300  33,000  36,700  36,700  38,800  42,800  MIPS M/2000-8

Other published numbers include the following, which are taken from
[Richardson 87], unless otherwise noted.  Items marked * are those that we
know (or have good reason to believe) use optimizing compilers.  These are
the "register" versions of the numbers, i.e., the highest ones reported.

                  Dhrystone 1.1 Benchmark Results

  Dhry's
    /Sec   Rel.
                 System
   1,571    0.9  VAX 11/780, 4.3BSD [in-house]
   1,757    1.0  VAX 11/780, VAX/VMS 4.2 [Intergraph 86]*
   3,850    2.2  Sun-3/100 [Muchnick 88]
   6,374    3.6  Sun-3/260, 25MHz 68020, SunOS 3.2
   6,423    3.7  VAX 8600, 4.3BSD
   6,440    3.7  IBM 4381-2, UTS V, cc 1.11
   6,896    3.9  Intergraph InterPro 32C, SYSV R3 3.0.0, Greenhills, -O*
   7,109    4.0  Apollo DN4000 -O
   7,140    4.1  Sun-3/200 [Muchnick 88]*
   7,249    4.2  Convex C-1 XP 6.0, vc 1.1
   7,409    4.2  VAX 8600, VAX/VMS in [Intergraph 86]*
   7,655    4.4  Alliant FX/8 [Multiflow]
   8,300    4.7  DG MV20000-I and MV15000-20 [Stahlman 87]
   8,309    4.7  InterPro-32C, 30MHz Clipper, Green Hills [Intergraph 86]*
   9,436    5.4  Convergent Server PC, 20MHz 80386, GreenHills*
   9,920    5.6  HP 9000/840S [HP 87]
  10,416    5.9  VAX 8550, VAX/VMS 4.5, cc 2.2*
  10,787    6.1  VAX 8650, VAX/VMS, [Intergraph 86]*
  11,215    6.4  HP 9000/840, HP-UX, full optimization*
  12,639    7.2  HP 9000/825S [HP 87]*
  13,000    7.4  MIPS M/500, 8MHz R2000, -O3*
  13,157    7.5  HP 825SRX [Sun 87]*
  14,109    8.0  Sun-4/110 [Sun 88]*
  14,195    8.1  Multiflow Trace 7/200 [Multiflow]
  14,820    8.4  CRAY 1S
  15,007    8.5  IBM 3081, UTS SVR2.5, cc 1.5
  15,576    8.9  HP 9000/850S [HP 87]
  18,530   10.5  CRAY X-MP
  19,000   10.8  Sun-4/200 [Muchnick 88], -O3*
  19,800   11.3  MIPS M/800, 12.5MHz R2000, -O3*
  23,430   13.3  HP 835S [RISC Mgmt 88]
  23,700   13.5  MIPS M/1000, 15MHz R2000, -O3*
  27,400   15.6  MIPS M/120-5, 16.7MHz R2000, -O3*
  28,846   16.4  Amdahl 5860, UTS-V, cc 1.22
  31,250   17.8  IBM 3090/200
  34,000   19.4  Motorola 88000, unknown configuration [RISC Mgmt 88]
  35,653   20.3  AMD 29000, 25MHz, 2 8K caches (simulation) [AMD 88]
  42,300   24.1  MIPS M/2000, -O3*
  43,668   24.9  Amdahl 5890/300E, cc -O
  53,108   30.2  CCI Power 7/64 (simulation) [Simpson 88]

Unusual Dhrystone Attributes

We've calibrated this benchmark against many more realistic ones, and we
believe that its results must be treated with care, because the detailed
program statistics are unusual in some ways.
It has an unusually low number of instructions per function call (35-40 on
our machines), where most C programs fall in the 50-60 range or higher.
Stated another way, Dhrystone does more function calls than usual, which
especially penalizes the DEC VAX, making this a favored benchmark for
inflating one's "VAX-mips" rating.  Any machine with a lean function call
sequence looks a little better on Dhrystone than it does on others.

The dynamic nesting depth of function calls inside the timed part of
Dhrystone is low (3-4).  This means that most register-window RISC machines
would never even once overflow/underflow their register windows and be
required to save/restore registers.  This is not to say fast function calls
or register windows are inherently bad (they're not!), merely that this
benchmark overstates their performance effects.

Dhrystone can spend 30-40% of the time in the strcpy function, copying
atypically long (30-character) strings, which happen to be alignable on word
boundaries, unlike more typical uses.  More realistic programs don't spend
this much time in this sort of code, and when they do, they handle many
shorter strings: 6 characters would be much more typical.  Even odder, the
only serious use of partial-word operations in Dhrystone is in hand-coded
routines, not in compiler-generated code.

On our machines, Dhrystone uses 0-offset addressing for 50% of memory data
references (dynamic).  Most real programs use 0-offsets 10-15% of the time.
This, and the previous effect, make some machines look better on Dhrystone
than they would on more typical programs.
In particular, this benchmark is very kind to the AMD 29000, as it exercises
none of the architectural areas where we believe the 29000 would lose
performance on more realistic programs:

  o supports only 0-offsets [Dhrystone uses 0-offsets heavily]

  o expensive partial-word load/stores [Dhrystone doesn't use them]

  o supports byte-comparison for trailing-zero [useful], but especially
    helps Dhrystone due to atypical use of strings.

Of course, Dhrystone is a fairly small benchmark, and thus fits into almost
any reasonable instruction cache.

In conclusion, Dhrystone gives some indication of user-level integer
performance, but is susceptible to surprises when comparing amongst
architectures that differ strongly.  Unfortunately, the industry seems to
lack a good set of widely-available integer benchmarks that are as
representative as are some of the popular floating point ones.

4.3.  Stanford Small Integer Benchmarks (STAN INT)

The Computer Systems Laboratory at Stanford University has collected a set
of programs to compare the performance of various systems.  These benchmarks
are popular in some circles as they are small enough to simulate, and are
responsive to compiler optimizations.  It is well known that small
benchmarks can be misleading.  In particular, on the faster machines, the
resolution of the shorter benchmarks is really not very good, i.e., a time
shown as ".05" is about 3 clock-ticks.  We definitely think this benchmark
overstates performance.

              Stanford Small Integer Benchmark Results

  Perm  Tower  Queen  Intmm  Puzzle  Quick  Bubble   Tree    Geo   Rel.
  Secs   Secs   Secs   Secs    Secs   Secs    Secs   Secs   Mean   Perf  System
  2.34   2.30    .94   1.67   11.23   1.12    1.51   2.72   2.14    .7   VAX 11/780 4.3BSD
                                                            1.60   1.0   VAX 11/780@
   .63    .63    .27    .73    2.96    .31     .44    .69    .62   2.6   VAX 8600 Ultrix1.2
   .75    .95    .30    .40    1.82    .34     .39   1.24    .53   3.0   Sun-3/100
   .41    .48    .18    .25    1.09    .20     .23    .70    .36   4.4   Sun-3/200 -O3
   .28    .35    .17    .42    2.22    .18     .25    .35    .35   4.6   VAX 8550#
   .28    .35    .13    .15     .88    .13     .17    .50    .26   6.2   VAX 8550##
   .18    .24    .15    .23    1.15    .17     .19    .34    .26   6.2   MIPS M/500
                                                             .22   7.3   Sun-4/110 [Sun 88]
   .12    .16    .11    .13     .61    .10     .12    .22    .16  10.0   MIPS M/800
   .11    .17    .09    .15     .55    .10     .12    .20    .15  10.7   Sun-4/200 -O3
  .097   .124   .067   .135    .694   .089    .124   .142   .136  11.8   29K+VRAM [AMD 88]
   .10    .13    .10    .11     .51    .08     .10    .17    .13  12.3   MIPS M/1000
  .096   .118   .077   .089    .458   .072    .092   .164   .118  13.3   MIPS M/120-5
  .066   .096   .052   .120    .559   .077    .089   .130   .109  14.7   29K cache [AMD 88]
  .065   .078   .045   .059    .303   .048    .060   .108   .075  21.3   M/2000-8

*  Stanford's old Aggregate Weighting has been replaced by the Geometric
   Mean as a more understandable measure.  We thank people at Sun
   Microsystems for promoting this improvement.  Among other things, it
   brings the numbers closer (down) to relative performance numbers observed
   on more substantial benchmarks.

@  Estimated VAX 11/780 Ultrix 2.0 vcc -O time.  We get this in either of
   two ways:

      (11/780 BSD cc) * (VAX 8550 Ultrix vcc) / (VAX 8550 Ultrix cc)
      2.14 * .26 / .35 = 1.588 (use 1.60)

   The 8550 is rated as approximately a 6-VUP machine, and 6 * .26 = 1.56,
   so the guess is probably close, likely to be within the range 1.50-1.70.
   We estimate this number only because it's been hard for us to get, and we
   don't think this benchmark is crucially important.

#  Ultrix 2.0 cc -O

## Ultrix 2.0 vcc -O.  The quick and bubble tests actually had errors;
   however, the times were in line with expectations (these two optimize
   well), so we used them.  All 8550 numbers thanks to Greg Pavlov
   (ames!harvard!hscvax!pavlov, of Amherst, NY).
The Sun numbers are from [Muchnick 88], and reflect the latest (as of this
writing) Sun compiler technology, which has improved these numbers for all
Sun systems over the last year.

5.  Floating Point Benchmarks

5.1.  Livermore Fortran Kernels (LLNL DP)

Lawrence Livermore National Labs' workload is dominated by large scientific
calculations that are largely vectorizable.  The workload is primarily
served by expensive supercomputers.  This benchmark was designed for
evaluation of such machines, although it has been run on a wide variety of
hardware, including workstations and PCs [McMahon86].

The Livermore Fortran Kernels are 24 pieces of code abstracted from the
applications at Lawrence Livermore Labs.  These kernels are embedded in a
large, carefully engineered benchmark driver.  The driver runs the kernels
multiple times on different data sets, checks for correct results, verifies
timing accuracy, reports execution rates for all 24 kernels, and summarizes
the results with several statistics.

Unlike many other benchmarks, there is no attempt to distill the benchmark
results down to a single number.  Instead, all 24 kernel rates, measured in
mflops (million floating point operations per second), are presented
individually for three different vector lengths (a total of 72 results).
The minimum and maximum rates define the performance range of the hardware.
Various statistics of the 24 or 72 rates, such as the harmonic, geometric,
and arithmetic means, give insight into general behavior.  Any one of these
statistics might suffice for comparisons of scalar machines, but multiple
statistics are necessary for comparisons involving machines with vector or
parallel features.  These machines have unbalanced, bimodal performance, and
a single statistic is insufficient characterization.
McMahon asserts: ``When the computer performance range is very large the net
Mflops rate of many Fortran programs and workloads will be in the sub-range
between the equi-weighted harmonic and arithmetic means depending on the
degree of code parallelism and optimization.  More accurate estimates of cpu
workload rates depend on assigning appropriate weights for each kernel.''

McMahon's analysis goes on to suggest that the harmonic mean corresponds to
approximately 40% vectorization, the geometric mean to approximately 70%
vectorization, and the arithmetic mean to 90%+ vectorization.  These three
statistics can be interpreted as different benchmarks that each characterize
certain applications.  For example, there is fair agreement between the
kernels' harmonic mean and Spice performance.  LINPACK, on the other hand,
is better characterized by the geometric mean.

The complete M/120-5 data shows that MIPS performance is insensitive to
vector length.  The minimum to maximum variation is also small for this
benchmark.  Both characteristics are typical of scalar machines with mature
compilers.  Performance of vector and parallel machines, on the other hand,
may span two orders of magnitude on this benchmark, or more, depending on
the kernel and the vector length.

                   64-Bit Livermore FORTRAN Kernels
              MegaFlops, L = 167, Sorted by Geometric Mean

           Harm.   Geom.  Arith.          Rel.*
    Min     Mean    Mean    Mean     Max  Geom.
                                                System
    .05     .12     .12     .13     .24     .7  VAX 780 w/FPA 4.3BSD f77 [ours]
    .06     .16     .17     .18     .28    1.0  VAX 780 w/FPA VMS 4.1
    .11     .30     .33     .37     .87    1.9  SUN 3/160 w/FPA
    .20     .42     .46     .50    1.42    2.5  MIPS M/500, f77 1.21
    .17     .43     .48     .53    1.13    2.8  SUN 3/260 w/FPA [our numbers]
    .29     .58     .64     .70    1.21    3.8  Alliant FX/1 FX 2.0.2 Scalar
    .38     .72     .77     .83    1.57    4.5  SUN 4/200 w/FPA [Hough 87]
    .39     .94    1.00    1.04    1.64    5.9  VAX 8700 w/FPA VMS 4.1
    .10     .76    1.06    1.50    5.23    6.2  Alliant FX/1 FX 2.0.2 Vector
    .33     .92    1.06    1.20    2.88    6.2  Convex C-1 F77 V2.1 Scalar
    .52    1.09    1.19    1.30    2.74    7.0  ELXSI 6420 EMBOS F77 MP=1
    .51    1.26    1.37    1.48    2.70    8.1  MIPS M/800, f77 1.21
    .65    1.63    1.83    2.03    3.50   10.8  MIPS M/1000, f77 1.30
    .11    1.06    1.94    3.33   12.79   11.4  Convex C-1 F77 V2.1 Vector
    .80    1.85    2.06    2.27    3.89   12.1  MIPS M/120-5, f77 1.31
    .28    1.24    2.32    5.11   29.20   13.7  Alliant FX/8 FX 2.0.2 MP=8*Vec
    .95    2.75    3.10    3.42    5.82   18.2  MIPS M/2000, f77 1.31
   1.0     3.1     3.6     4.0     6.5    21.2  MIPS M/2000, f77 1.40#
   1.51    4.93    5.86    7.00   17.43   34.5  Cray-1S CFT 1.4 scalar
   1.23    4.74    6.09    7.67   21.64   35.8  FPS 264 SJE APFTN64
   3.43    9.29   10.68   12.15   25.89   62.8  Cray-XMP/1 COS CFT77.12 scalar
   0.97    6.47   11.94   22.20   82.05   70.2  Cray-1S CFT 1.4 vector
   4.47   11.35   13.08   15.20   45.07   76.9  NEC SX-2 SXOS1.21 F77/SX24 scalar
   1.47   12.33   24.84   50.18  188      146   Cray-XMP/1 COS CFT77.12 vector
   4.47   19.07   43.94  140    1042      258   NEC SX-2 SXOS1.21 F77/SX24 vector

*  Relative Performance, as the ratio of the Geometric Mean numbers.  This
   is a simplistic attempt to extract a single figure-of-merit.  We admit
   this goes against the intent of this benchmark suite, and apologize to
   Mr. McMahon, but we ran out of space in our summaries.

#  Next version of the compiler system, not yet released to production, and
   hence not carried into summaries.  However, this nicely illustrates a
   case where compiler tuning added almost a VAX 8600's worth of
   performance.
        Livermore FORTRAN Kernels - Complete MIPS M/120-5 Output

Vendor        MIPS    MIPS    MIPS    MIPS  |  MIPS    MIPS    MIPS    MIPS
Model        M/120-5 M/120-5 M/120-5 M/120-5|M/120-5 M/120-5 M/120-5 M/120-5
OSystem      V.3 3.0 V.3 3.0 V.3 3.0 V.3 3.0|V.3 3.0 V.3 3.0 V.3 3.0 V.3 3.0
Compiler      1.31    1.31    1.31    1.31  |  1.31    1.31    1.31    1.31
OptLevel       O2      O2      O2      O2   |   O2      O2      O2      O2
Samples        72      24      24      24   |   72      24      24      24
WordSize       64      64      64      64   |   32      32      32      32
DO Span       167      19      90     471   |  167      19      90     471
Year         1988    1988    1988    1988   | 1988    1988    1988    1988
Kernel       ------  ------  ------  ------ | ------  ------  ------  ------
   1         2.8800  2.8800  2.9535  2.9459 | 3.9122  3.9122  3.9142  3.8487
   2         2.2009  2.2009  2.5339  2.5451 | 3.6809  3.6809  3.6518  2.9828
   3         2.8506  2.8506  2.9680  2.9677 | 4.1781  4.1781  4.0510  3.8582
   4         1.8240  1.8240  2.6133  2.9772 | 3.5978  3.5978  3.1571  2.2205
   5         2.0083  2.0083  2.0533  2.0438 | 3.2797  3.2797  3.2766  3.1695
   6         1.3938  1.3938  1.9006  1.9267 | 3.0934  3.0934  2.8727  1.9807
   7         3.8400  3.8400  3.8885  3.8846 | 5.0121  5.0121  4.9773  4.9192
   8         3.5009  3.5009  3.5273  3.5325 | 4.6659  4.6659  4.6961  4.6136
   9         3.5529  3.5529  3.5801  3.5833 | 4.4751  4.4751  4.4476  4.3809
  10         1.4000  1.4000  1.4017  1.4071 | 2.8119  2.8119  2.8116  2.8000
  11         1.4250  1.4250  1.4749  1.4808 | 2.7811  2.7811  2.6947  2.4806
  12         1.4410  1.4410  1.4760  1.5000 | 2.7827  2.7827  2.7007  2.6127
  13         0.8034  0.8034  0.8515  0.8684 | 1.0682  1.0682  1.0604  1.0510
  14         1.3824  1.3824  1.3555  0.9176 | 1.5660  1.5660  1.9144  1.8807
  15         1.0735  1.0735  1.0452  1.0476 | 1.4139  1.4139  1.4085  1.4494
  16         1.5332  1.5332  1.4803  1.5143 | 1.6219  1.6219  1.6097  1.6585
  17         2.6562  2.6562  2.5389  2.5452 | 3.2781  3.2781  3.2841  3.4185
  18         3.2112  3.2112  3.1598  3.1598 | 4.5896  4.5896  4.6200  4.4800
  19         2.8224  2.8224  2.9124  2.8772 | 3.5246  3.5246  3.5479  3.4005
  20         3.3757  3.3757  3.3471  2.6667 | 4.5008  4.5008  4.5147  4.5298
  21         2.0880  2.0880  2.2436  2.2955 | 3.5501  3.5501  3.4517  3.1764
  22         1.5131  1.5131  1.5228  1.5196 | 2.1875  2.1875  2.1782  2.1454
  23         3.1670  3.1670  3.3730  3.3600 | 4.3913  4.3913  4.4064  4.1833
  24         0.9238  0.9238  0.9283  0.9396 | 1.0874  1.0874  1.1057  1.0631
-------------- ----  ------  ------  ------ | ------  ------  ------  ------
Standard Dev.  0.9189 0.9157 0.9180  0.9208 | 1.1609  1.1514  1.1537  1.1741
Median Dev.    1.1275 1.1205 1.0009  1.0429 | 1.3376  1.4080  1.3476  1.2294
Maximum Rate  3.8885* 3.8400 3.8885  3.8846 |5.0121*  4.9192  4.9773  5.0121
Average Rate  2.2670* 2.2028 2.2971  2.2711 |3.1465*  3.0127  3.1814  3.2104
Geometric Mean 2.0628* 2.0030 2.0943 2.0608 |2.8840*  2.7590  2.9219  2.9371
Median Rate    2.2222 2.0481 2.3887  2.4203 | 3.2766  3.0762  3.2804  3.4022
Harmonic Mean 1.8545* 1.8070 1.8856  1.8420 |2.5773*  2.4784  2.6135  2.6092
Minimum Rate  0.8034* 0.8034 0.8515  0.8684 |1.0510*  1.0510  1.0604  1.0682
Maximum Ratio  1.0000 0.9875 1.0000  0.9989 | 1.0000  0.9814  0.9930  1.0000
Average Ratio  1.0000 0.9716 1.0132  1.0018 | 1.0000  0.9574  1.0110  1.0203
Geometric Ratio 1.0000 0.9710 1.0152 0.9990 | 1.0000  0.9566  1.0131  1.0184
Harmonic Mean  1.0000 0.9743 1.0167  0.9932 | 1.0000  0.9616  1.0140  1.0123
Minimum Rate   1.0000 1.0000 1.0598  1.0809 | 1.0000  1.0000  1.0089  1.0163

 * These are the numbers brought forward into the summary section.

5.2. LINPACK (LNPK DP and LNPK SP)

The LINPACK benchmark has become one of the most widely used single
benchmarks to predict relative performance in scientific and engineering
environments. The usual LINPACK benchmark measures the time required to
solve a 100x100 system of linear equations using the LINPACK package.
LINPACK results are measured in MFlops, millions of floating point
operations per second. All numbers are from [Dongarra 88], unless otherwise
noted.

The LINPACK package calls on a set of general-purpose utility routines
called BLAS -- Basic Linear Algebra Subroutines -- to do most of the actual
computation. A FORTRAN version of the BLAS is available, and the appropriate
routines are included in the benchmark. However, vendors are encouraged to
provide hand-coded versions of the BLAS as a library package.
Thus LINPACK results are usually cited in two forms: FORTRAN BLAS and Coded
BLAS. The FORTRAN BLAS actually come in two forms as well, depending on
whether the loops are 4X unrolled in the FORTRAN source (the usual) or
whether the unrolling is undone to facilitate recognition of the loop as a
vector instruction. According to the ground rules of the benchmark, either
may be used when citing FORTRAN BLAS results, although it is typical to note
rolled loops with the annotation ``(Rolled BLAS).''

For our own numbers, we've corrected a few to follow Dongarra more closely
than we have in the past. LINPACK output produces quite a few MFlops
numbers, and we've tended to use the fourth one in each group, which uses
more iterations, and thus is more immune to clock randomness. Dongarra uses
the highest MFlops number that appears, then rounds to two digits.

Note that relative ordering even within families is not particularly
consistent, illustrating the extreme sensitivity of these benchmarks to
memory system design.
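For concreteness, MFlops figures like those below are derived from the solve
time using the operation count conventionally credited to the 100x100 LINPACK
run; a minimal Python sketch (the 0.33-second solve time is hypothetical):

```python
def linpack_mflops(n, seconds):
    # Conventional operation count for factoring and solving an n x n
    # dense system: 2/3*n^3 + 2*n^2 floating-point operations.
    ops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
    return ops / seconds / 1.0e6

# A hypothetical machine solving the 100x100 system in 0.33 seconds:
print(round(linpack_mflops(100, 0.33), 2))   # prints 2.08
```

Since the operation count is fixed, every MFlops entry in the tables below is
just this constant divided by the measured time, so ratios of MFlops are
ratios of solve times.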
          100x100 LINPACK Results - FORTRAN and Coded BLAS
             From [Dongarra 88], Unless Noted Otherwise

     DP      DP      SP      SP
  Fortran  Coded  Fortran  Coded   System
    .10     .10     .11     .11    Sun-3/160, 16.7MHz (Rolled BLAS)+
    .11     .11     .13     .11    Sun-3/260, 25MHz 68020+20MHz 68881 (Rolled BLAS)+
    .13     .16     .17     .22    DEC MicroVAX II, VAX/VMS
    .14      -       -       -     Apollo DN4000, 25MHz (68020 + 68881) [ENEWS 87]
    .14      -      .24      -     VAX 11/780, 4.3BSD, LLL Fortran [ours]
    .14     .17     .25     .34    VAX 11/780, VAX/VMS
    .20      -      .24      -     80386+80387, 20MHz, 64K cache, GreenHills
    .29     .49     .45     .69    Intergraph IP-32C, 30MHz Clipper [Intergraph 86]
    .38      -      .67      -     80386+Weitek 1167, 20MHz, 64K cache, GreenHills
    .41     .41     .62     .62    Sun-3/160, Weitek FPA (Rolled BLAS)+
    .41     .45     .66     .79    DEC MicroVAX 3200/3500/3600, VAX/VMS
    .45     .54     .60     .74    HP9000 Model 840S [HP 87]
    .46     .46     .86     .86    Sun-3/260, Weitek FPA (Rolled BLAS)+
    .49     .66     .84    1.20    VAX 8600, VAX/VMS 4.5
    .49     .54     .62     .68    HP 9000/825S [HP 87]
    .57     .72     .86     .87    HP9000 Model 850S [HP 87]
    .60     .72     .93    1.2     MIPS M/500, f77 1.21
    .65     .76     .80     .96    VAX 8500, VAX/VMS
    .70     .96    1.3     1.9     VAX 8650, VAX/VMS
    .78      -     1.1      -      IBM 9370-90, VS FORT 1.3.0
    .86      -     1.2      -      Sun-4/110 [Sun 88]
    .99    1.2     1.4     1.7     VAX 8550/8700/8800, VAX/VMS
   1.1     1.1     1.6     1.6     SUN 4/200 (Rolled BLAS)+
   1.2     1.3     2.8     3.6     MIPS M/800, f77 1.31
   1.5     1.7     1.8     2.0     ELXSI 6420
   1.5     1.6     3.5     4.3     MIPS M/1000, f77 1.31
   1.6     2.0     1.6     2.0     Alliant FX-1 (1 CE)
   2.1      -      2.4      -      IBM 3081K H enhanced opt=3
   2.1     2.2     4.0     4.8     MIPS M/120-5, f77 1.31
   2.5!    2.5!     -       -      CCI Power 7/64 (simulation) [Simpson 88]
   3.0     3.3     4.3     4.9     CONVEX C-1/XP, Fort 2.0 (Rolled BLAS)
   3.6     3.9     5.9     7.1     MIPS M/2000-8, f77 1.31
   3.8     4.0     6.6     7.1     MIPS M/2000-8, f77 1.40# (Rolled BLAS)
   6.0      -       -       -      Multiflow Trace 7/200 Fortran 1.4 (Rolled BLAS)
   7.6    11.0     7.6     9.8     Alliant FX-8, 8 CEs, FX Fortran, v2.0.1.9
  12      23       n.a.    n.a.    CRAY 1S CFT (Rolled BLAS)
  52      61       n.a.   67       ETA10-E (1 proc, 10.5ns)
  56      60       n.a.    n.a.    CRAY X-MP/4 CFT (Rolled BLAS)

 + The Sun FORTRAN Rolled BLAS code appears to be optimal, so we used the
   same numbers for Coded BLAS.
   The 4X unrolled numbers for the Sun-4/200 are .86 (DP) and 1.25 (SP)
   [Hough 87].
 ! These numbers were given without specifying FORTRAN or Coded.
 # Next version of compiler, not yet released.

          100x100 LINPACK Results - FORTRAN and Coded BLAS
            VAX 11/780, VAX/VMS Relative Performance
                  For A Subset of the Systems

   Rel.    Rel.    Rel.    Rel.
    DP      DP      SP      SP
  Fortran  Coded  Fortran  Coded   System
     .8     .6      .5      .3     Sun-3/260, 25MHz 68020+20MHz 68881 (Rolled)
    1.0    1.0     1.0     1.0     VAX 11/780, VAX/VMS
    2.0    2.9     1.8     2.0     Intergraph IP-32C, 30MHz Clipper [Intergraph 86]
    2.7     -      2.7      -      80386+Weitek 1167, 20MHz, 64K cache, GreenHills
    2.9    2.4     2.5     1.8     Sun-3/160, Weitek FPA (Rolled BLAS)
    3.3    2.7     3.4     2.5     Sun-3/260, Weitek FPA (Rolled BLAS)
    3.5    3.9     3.4     3.5     VAX 8600, VAX/VMS 4.5
    4.1    4.2     3.4     2.6     HP9000/850S [HP 87]
    4.3    4.2     3.7     3.5     MIPS M/500, f77 1.21
    6.1     -      4.8      -      Sun-4/110 [Sun 88]
    7.1    7.1     5.6     5.0     VAX 8550/8700/8800, VAX/VMS
    7.9    6.5     6.4     4.7     SUN 4/200 (Rolled BLAS)
    8.6    7.6    11.2    10.6     MIPS M/800, f77 1.31
   10.7    9.0    14.0    12.6     MIPS M/1000, f77 1.31
   11.4   11.8     6.4     5.9     Alliant FX-1 (1 CE)
   15.0   12.9    16.0    14.1     MIPS M/120-5, f77 1.31
   21.4   19.4    17.2    14.4     CONVEX C-1/XP, Fort 2.0 (Rolled BLAS)
   25.7   22.9    23.6    20.9     MIPS M/2000-8, f77 1.31
   54     65      30      28.8     Alliant FX-8, 8 CEs, FX Fortran, v2.0.1.9
  400    353       -       -       CRAY XMP/4

The following lists various M/2000 MFLOPS numbers. Note that the numbers
vary substantially, even on this scalar machine. Thus, if you're buying
unlabeled MFLOPS, caveat emptor.

                 Assorted 64-Bit MFLOPS Measures

        Livermore FORTRAN         Gaussian Elim.  Matrix Multiply
       Harm.  Geom.  Arith.      Linpk  Linpk   1000x  50x50   Peak Mult.
  Min  Mean   Mean   Mean   Max  FORT   Coded   1000   Coded   /Add   System
  1.0  3.1    3.6    4.0    6.5   3.8    4.0     7.0    9.1    10.0   MIPS M/2000#

                 Assorted 32-Bit MFLOPS Measures

        Livermore FORTRAN         Gaussian Elim.  Matrix Multiply
       Harm.  Geom.  Arith.
                                 Linpk  Linpk   1000x  50x50   Peak Mult.
  Min  Mean   Mean   Mean   Max  FORT   Coded   1000   Coded   /Add   System
  1.6  4.2    4.8    5.3    8.4   6.6    7.1     9.6   11.8    12.5   MIPS M/2000#

 # Next version of compiler (f77 1.40), not yet released.

5.3. Spice Benchmarks (SPCE 2G6)

Spice [UCB 87] is a general-purpose circuit simulator written at U.C.
Berkeley. Spice and its derivatives are widely used in the semiconductor
industry. It is a valuable benchmark because it shares many characteristics
with other real-world programs that are not represented in popular small
benchmarks. It uses both integer and floating-point computation heavily.
The floating-point calculations are not vector-oriented, as LINPACK's are.
Also, the program itself is very large and therefore tests both instruction
and data cache performance.

We have chosen to benchmark Spice version 2g.6 because of its general
availability. This is one of the later and more popular Fortran versions of
Spice distributed by Berkeley. We felt that the circuits distributed with
the Berkeley distribution for testing and benchmarking were not sufficiently
large and modern to serve as benchmarks. We gathered and produced
appropriate benchmark circuits that could be distributed; they have since
been posted to Usenet as public domain. The Spice group at Berkeley found
these circuits to be up-to-date and good candidates for Spice benchmarking.

In the table below, "Geom Mean" is the geometric mean of the 3 "Rel."
columns.

                     Spice2G6 Benchmarks Results

        digsr           bipole        comparator      Geom
    Secs    Rel.     Secs    Rel.    Secs    Rel.
                                                      Mean   System
  1354.0    0.60    439.6   0.68    460.3   0.63       .6    VAX 11/780 4.3BSD, f77 V2.0
   993.5    0.81    394.3   0.76    366.9   0.80       .8    Microvax-II Ultrix 1.1, fortrel
   901.9    0.90    285.1   1.0     328.6   0.89       .9    SUN 3/160 SunOS 3.2 f77 -O -f68881
   848.0    0.95    312.6   0.96    302.9   0.96      1.0    VAX 11/780 4.3BSD, fortrel -opt
   808.1    1.0     299.1   1.0     291.7   1.0       1.0    VAX 11/780 VMS 4.4 /optimize
   744.8    1.1     221.7   1.3     266.0   1.1       1.2    SUN 3/260 SunOS 3.2 f77 -O -f68881
   506.5    1.6     170.0   1.8     189.1   1.5       1.6    SUN 3/160 SunOS 3.2 f77 -O -ffpa
   361.2    2.2     112.0   2.7     129.4   2.3       2.4    SUN 3/260 SunOS 3.2 f77 -O -ffpa
   296.5    2.7      73.4   4.1      83.0   3.5       3.4    MIPS M/500
   225.9    3.6      63.7   4.7      73.4   4.0       4.1    SUN 4/200 f77 -O3 -Qoption as -Ff0+
     -       -        -      -        -      -        5.3    VAX 8700 (estimate)
   136.5    5.9      42.6   7.0      41.4   7.0       6.6    MIPS M/800
   125.5    6.4      39.5   7.6      39.3   7.4       7.1    AMDAHL 470V7 VMSP FORTVS4.1
   114.3    7.1      35.4   8.4      34.5   8.5       8.0    MIPS M/1000
    92.4    8.7      28.5  10.5      29.7   9.8       9.7    MIPS M/120-5
    53.7   15.1      18.5  16.2      17.6  16.6      16.0    MIPS M/2000-8, f77 1.31
    48.0   16.8      12.5  23.9      17.5  16.7      18.9    FPS 20/64 VSPICE (2G6 derivative)

 + Sun numbers are from [Hough 87], who notes that the Sun-4 number was beta
   software, and that a few modules did not optimize. Thus, these numbers
   should improve.

Benchmark descriptions:

digsr       CMOS 9-bit dynamic shift register with parallel load capability,
            i.e., SISO (Serial Input Serial Output) and PISO (Parallel Input
            Serial Output), widely used in microprocessors. Clock period is
            10 ns. Channel length = 2 um, gate oxide = 400 Angstrom. Uses
            MOS LEVEL=2.

bipole      Schottky TTL edge-triggered register used as a synchronizer.

comparator  Analog CMOS auto-zeroed comparator, composed of input,
            differential amplifier, and latch. Input signal is 10
            microvolts. Channel length = 3 um, gate oxide = 500 Angstrom.
            Uses MOS LEVEL=3. Each part is connected by capacitive coupling,
            which is often used for offset cancellation. (Sometimes called
            Toronto, in honor of its source.)
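The "Geom Mean" column is just the geometric mean of the three per-circuit
ratios; recomputing it from the rounded "Rel." values reproduces the table to
within the last digit. A small Python sketch using the M/1000 row:

```python
def geom_mean(ratios):
    # geometric mean: nth root of the product of n ratios
    product = 1.0
    for r in ratios:
        product *= r
    return product ** (1.0 / len(ratios))

# M/1000 row: digsr 7.1, bipole 8.4, comparator 8.5
print(round(geom_mean([7.1, 8.4, 8.5]), 1))   # prints 8.0, as in the table
```

Using the geometric mean here means no single circuit dominates the summary,
which is consistent with how the rest of this brief combines relative ratios.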
Hspice is a commercial version of Spice offered by Meta-Software, which
recently published benchmark results for a variety of machines
[Meta-Software 87]. (Note that the M/800 number cited there predates the
UMIPS-BSD 2.1 and f77 1.21 releases, and the numbers have improved.)

The VAX 8700 Spice number (5.3X) was estimated from the Hspice numbers below
for the 8700 and M/800, together with the M/800 Spice number:

    (5.5 [8700 Hspice] / 6.9 [M/800 Hspice]) x 6.6 [M/800 Spice] = 5.3

This section indicates that the performance ratios seem to hold for at least
one important commercial version as well.

                   Hspice Benchmarks Results
                      HSPICE-8601K S2T30

    Secs    Rel.    System
   166.5     .6     VAX 11/780, 4.2BSD
    92.2    1.0     VAX 11/780 VMS
    91.5    1.0     Microvax-II VMS
    29.2    3.2     ELXSI 6400
    29.1    3.2     Alliant FX/1
    25.3    3.6     HyperSPICE (EDGE)
    16.8    5.5     VAX 8700 VMS
    16.3    5.7     IBM 4381-12
    13.4    6.9     MIPS M/800 [ours]
    11.3    8.2     MIPS M/1000 [ours]
     8.7   10.6     MIPS M/120-5 [ours]
     5.3   17.4     MIPS M/2000 [ours]
     3.27  28.2     IBM 3090
     2.71  34.0     CRAY-1S

Again, as in the less-vectorizable Livermore Kernels, the M/120-5 performs
about 30% as fast as a CRAY-1S, and the M/2000 about 50%. Spice and Hspice
are examples of large programs where the M/2000 outperforms the M/120 by
more than the clock ratio of 1.5X, illustrating the effects of a more
block-oriented memory system.

5.4. Digital Review

The Digital Review magazine benchmark [DR 87] is a 3300-line FORTRAN program
that includes 33 separate tests, mostly floating-point, some integer. The
magazine reports the times for all tests, and summarizes them with the
geometric mean seconds shown below. Most numbers below are from [DR 87].
Note that Digital Review gives relative performance using the MicroVAX II as
a basis for comparison (MVUPS). For consistency with the rest of this
document, we use the VAX 11/780, which significantly affects the ratios.
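Rebasing relative performance is just division by a different reference time;
a small sketch using the 780's geometric-mean time from the table in this
section and a hypothetical 1.25-second result:

```python
def relative_perf(reference_secs, secs):
    # Smaller elapsed time means higher relative performance.
    return reference_secs / secs

VAX_780_SECS = 6.75    # VAX 11/780, VMS, from the Digital Review table

# A hypothetical machine with a 1.25-second geometric-mean time:
print(round(relative_perf(VAX_780_SECS, 1.25), 1))   # prints 5.4
```

Rebasing to MVUPS would simply substitute the MicroVAX II's time as
reference_secs, which is why the two rating scales differ by a constant
factor.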
Digital Review has substantially revised their benchmark to fix various odd
and unrepresentative behaviors, such as having many of its tests dominated
by the time to call a large initialization routine. Many of these conspired
to show lower VUPs ratings than are typical for machines running real
programs. See the October 10, 1988 Digital Review for details. We leave the
table below for the time being, but do not ascribe much weight to it, and
will shift to the new, more realistic one shortly.

We applaud DR for several reasons. First, they try to offer some useful
benchmarks in place of empty mips-ratings, which is more than many magazines
do. Second, they are willing to listen to input and improve the usefulness
of their benchmarks.

We believe that recent Sun compiler work has improved the Sun-4s'
performance, but we do not yet have those numbers for sure, although the
number shown is reasonably consistent.

          Digital Review Benchmarks Results (33 Tests)

    Secs    Rel.    System
    9.17    0.7     VAXstation II/GPX, VMS 4.5
    6.75    1.0     VAX 11/780, VMS [DEC], 6.80 [ours]
    2.90    2.3     VAXstation 3200
    2.32    2.9     VAX 8600, VMS 4.5
    2.32    2.9     Sun-4/110 [Sun 88]
    2.09    3.2     Sun-4/200, SunOS 3.2L [OLD]
    1.86    3.6     MIPS M/500, f77 1.21 [ours]
    1.72    3.9     Sun-4/200 3.2 Prod (secondhand, not confirmed)
    1.584   4.2     VAX 8650
    1.480   4.6     Alliant FX/8, 1 CE
    1.469   4.6     VAX 8700
    1.200   5.6     MIPS M/800, f77 1.21 [ours]
    1.193   5.7     ELXSI 6420
     .990   6.8     MIPS M/1000, f77 1.21*
     .940   7.2     MIPS M/1000, f77 1.31 [ours]
     .783   8.6     MIPS M/120-5 [ours]
     .553  12.2     MIPS M/2000 [ours]
     .487  18.8     Convex C-1 XP

 * The actual run number was .99, which [DR 87] reported as 1.00.

5.5. Doduc Benchmark (DDUC)

This benchmark [Doduc 86] is a 5300-line FORTRAN program that simulates
aspects of nuclear reactors. It has little vectorizable code, and is thought
to be representative of Monte-Carlo simulations. The program is offered in
both single and double precision.
The original goal of using this piece of code as a benchmark was to offer a
rapid check on the good behavior of the compiler and intrinsic functions. In
addition, it can be used as a pure CPU benchmark with an unusually high
floating point percentage.

Some caveats are necessary. This simulation iterates until certain
conditions are met. The number of bits in the floating point format, the
rounding algorithm, and the accuracy of math libraries on different machines
all affect the number of iterations required to converge. More ``accurate''
machines seem to require fewer iterations to converge, and double precision
seems to converge faster than single precision, although there is no
rigorous proof for either idea. As a consequence, one would have to scale
the timing results to a fixed number of iterations to compare the timing
between different machines. Fortunately, the time required for each
iteration is constant during the run, and the total number of iterations to
convergence varies very little (about 2 percent as measured on 10 different
machines). Refer to the author for more in-depth discussion.

Observed total number of iterations to converge:
  Single precision: 5881 (Sun_68881) to 6010 (CCI-6/32)  (M1000=5906)
  Double precision: 5408 (Edge1)     to 5492 (CCI-6/32)  (M1000=5479)

Performance is given as a number R, normalized to 100 (IBM 370/168-3) or 170
(IBM 3033-U):

    R = 48671 / (Cpu_time_in_seconds)

Larger R's are better, and Cpu_time_in_seconds is for the 64-bit version.

                64-Bit Doduc Benchmark Results

   DoDuc R   Relative
   Factor    Perf.
             System
     17        0.7   Sun-3/110, 16.7MHz
     19        0.7   Intel 80386+80387, 16MHz, iRMX
     22        0.8   Sun-3/260, 25MHz 68020, 20MHz 68881
     26        1.0   VAX 11/780, VMS
     33        1.3   Fairchild Clipper, 30MHz, Green Hills
     43        1.7   Sun-3/260, 25MHz, Weitek FPA
     48        1.8   Celerity C1260
     50        1.9   CCI Power 6/32
     53        2.0   Edge 1
     64        2.5   Harris HCX-7
     85        3.3   Alliant FX/1
     88        3.4   MIPS M/500, f77 1.21 -O2, runs 553 seconds
     90        3.5   IBM 4381-2
     90        3.5   Sun-4/200 [Hough 1987], SunOS 3.2L, runs 540 seconds
     91        3.5   DEC VAX 8600, VAX/VMS
     97        3.7   ELXSI 6400
     99        3.8   DG MV/20000
    100        3.8   MIPS M/500, f77 1.21 -O3, runs 488 seconds
    101        3.9   Alliant FX/8
    113        4.3   FPSystems 164
    119        4.6   Gould 32/8750
    129        5.0   DEC VAX 8650
    136        5.2   DEC VAX 8700, VAX/VMS
    150        5.7   Amdahl 470 V8, VM/UTS
    181        7.0   IBM 3081-G, F4H ext, opt=2
    190        7.3   MIPS M/800, f77 1.21 -O3, runs 256 secs
    201        7.7   HP 9000/850 [Nhuan Doduc, e-mail, 10/16/88]
    214        8.2   HP 9000/835 [Nhuan Doduc, e-mail, 10/16/88]
    227        8.7   MIPS M/1000, f77 1.31 -O3, runs 214(178) secs
    236        9.1   IBM 3081-K
    280       10.8   M120-5, f77 1.31 -O2, runs 173(148) secs
    289       11.1   M120-5, f77 1.31 -O3, runs 168(144) secs
    291       11.2   Apollo DN10000, runs 167 seconds
    438       16.8   M2000, f77 1.31 -O2, runs 111(94) secs
    443       17.0   M2000, f77 1.31 -O3, runs 109(93)
    475       18.3   Amdahl 5860
    586       22.5   CDC Cyber 990-E, Fortv2/opt=high/vector
    714       27.5   IBM 3090-200, scalar mode
    915       35.2   Fujitsu VP-200
   1080       41.6   Cray X/MP
                     [for perspective: ... long way to go yet!]
                     [Oct 88: well, it's not as long as it was last year]

5.6. Whetstone

Whetstone is a synthetic mix of floating point and integer arithmetic,
function calls, array indexing, conditional jumps, and transcendental
functions [Curnow 76]. Whetstone results are measured in KWips, thousands of
Whetstone interpreter instructions per second.

On machines this fast, relatively few clock ticks are actually counted, and
UNIX timing includes some variance. We increased the loop counts from 10 to
1000 to increase the total running time and reduce the variance.
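The loop-count increase addresses clock-tick quantization: a timer that
counts whole ticks resolves a run only to within about one tick, so the
relative uncertainty shrinks linearly with running time. A minimal sketch
(the 60 Hz tick rate is an assumption for illustration, not a measured
property of any system above):

```python
TICK = 1.0 / 60.0   # assumed 60 Hz scheduler-clock tick, in seconds

def tick_uncertainty(run_seconds):
    # Fraction of the measurement represented by a single clock tick.
    return TICK / run_seconds

print(tick_uncertainty(0.5))    # short run: roughly 3% uncertainty
print(tick_uncertainty(50.0))   # 100x longer run: roughly 0.03%
```

A 100x longer run thus cuts the worst-case tick error by 100x, which is the
reason for raising the loop counts rather than averaging many short runs.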
Our experience shows some general uncertainty about the numbers reported by
anybody, as source code versions differ.

                   Whetstone Benchmark Results

      DP          SP
    KWips  Rel. KWips  Rel.  System
      410  0.5    500  0.4   VAX 11/780, 4.3BSD, f77 [ours]
      715  0.9  1,083  0.9   VAX 11/780, LLL compiler [ours]
      830  1.0  1,250  1.0   VAX 11/780 VAX/VMS [Intergraph 86]
      924  1.1  1,039  0.8   Sun-3/160C, 68881 [Wilson 88]
    1,230  1.5  1,250  1.0   Sun-3/260, 25MHz 68020, 20MHz 68881
    1,581  1.9  1,886  1.5   Apollo DN4000, 25MHz 68020, 25MHz 68881 [Wilson 88]
    1,730  2.1  1,860  1.5   Intel 80386+80387, 20MHz, 64K cache, GreenHills
    1,740  2.1  2,980  2.4   Intergraph InterPro-32C, 30MHz Clipper [Intergraph 86]
    1,863  2.2  2,433  1.9   Sun-3/160, FPA [Wilson 88]
    2,092  2.5  3,115  2.5   HP 9000/840S [HP 87]
    2,433  2.9  3,521  2.8   HP 9000/825S [HP 87]
    2,590  3.1  4,170  3.3   Intel 80386+Weitek 1167, 20MHz, Green Hills
    2,673  3.2  3,569  2.9   Sun-3/260, Weitek FPA [Wilson 1988]
    2,670  3.2  4,590  3.7   VAX 8600, VAX/VMS [Intergraph 86]
    2,907  3.5  4,202  3.4   HP 9000/850S [HP 87]
    2,940  3.5  4,215  3.4   Sun-4/110 [Sun 88]
    3,885  4.7  5,663  4.5   Sun-4/200 [Wilson 1988]
    3,950  4.8  6,670  5.3   VAX 8700, VAX/VMS, Pascal(?) [McInnis, 1987]
    4,000  4.8  6,900  5.5   VAX 8650, VAX/VMS [Intergraph 86]
    4,120  5.0  4,930  3.9   Alliant FX/8 (1 CE) [Alliant 86]
    4,200  5.1      -    -   Convex C-1 XP [Multiflow]
    4,220  5.1  5,430  4.3   MIPS M/500
    4,400  5.3 15,000 12.0   Motorola 88000 [RISC Mgmt 88, Simpson 88]
    6,600  8.0      -    -   HP 835S [RISC Mgmt 88]
    6,930  8.0  8,570  6.9   MIPS M/800
    7,960  9.6 10,280  8.2   MIPS M/1000
    9,100 11.0 11,400  9.1   MIPS M/120-5
   12,605 15.2      -    -   Multiflow Trace 7/200 [Multiflow]
   13,600 16.4 17,300 13.8   MIPS M/2000-8
   14,069 17.0      -    -   CCI Power 7/64 (simulation) [Simpson 88]
   16,300 19.6 20,500 16.4   MIPS M/2000-8, f77 1.31, -O4 (inlining, not quite fair!)
   25,000 30        -    -   IBM 3090-200 [Multiflow]
   35,000 42        -    -   Cray X-MP/12

6. Acknowledgements

Some people have noted that they seldom believe the numbers that come from
corporations unless they are accompanied by the names of people who take
responsibility for them. Many people at MIPS have contributed to this
document. Particular contributors to this issue include Earl Killian, Mark
Johnson, Dr. James Mannos, and Pat LeFevre. As usual, the editor, John
Mashey, is finally responsible for all of the numbers.

We thank Cliff Purkiser of Intel, who posted the Intel 80386 Whetstone and
LINPACK numbers on Usenet. We also thank Greg Pavlov, who ran hordes of
Stanford and Dhrystone benchmarks for us on a VAX 8550, Ultrix 2.0 system.

7. References

[Alliant 86]
  Alliant Computer Systems Corp, "FX/Series Product Summary", October 1986.

[AMD 88]
  Advanced Micro Devices, "Am29000 Performance Analysis", May 1988.

[Curnow 76]
  Curnow, H. J., and Wichmann, B. A., ``A Synthetic Benchmark'', The
  Computer Journal, Vol. 19, No. 1, February 1976, pp. 43-49.

[Doduc 87]
  Doduc, N., FORTRAN Central Processor Time Benchmark, Framentec, June 1986,
  Version 13. Newer numbers were received 03/17/87, and we used them where
  different. E-mail: uunet!inria!ftc!ndoduc

[Dongarra 88]
  Dongarra, J., ``Performance of Various Computers Using Standard Linear
  Equations in a Fortran Environment'', Argonne National Laboratory,
  February 16, 1988.

[Dongarra 87b]
  Dongarra, J., Martin, J., Worlton, J., "Computer Benchmarking: paths and
  pitfalls", IEEE Spectrum, July 1987, 38-43.

[DR 87]
  "A New Twist: Vectors in Parallel", June 29, 1987; "The M/1000: VAX 8800
  Power for Price of a MicroVAX II", August 24, 1987; and "VAXstation 3200
  Benchmarks: CVAX Eclipses MicroVAX II", September 14, 1987. Digital
  Review, One Park Ave., NY, NY 10016.

[DR 88]
  "RISC-Based Systems Shatter the 10-MIPS Threshold", and "Widening the
  Lead", Digital Review, May 16, 1988.

[ENEWS 87]
  Electronic News, ``Apollo Cuts Prices on Low-End Stations'', July 6, 1987,
  p. 16.

[Fleming 86]
  Fleming, P.J., and Wallace, J.J., ``How Not to Lie With Statistics: The
  Correct Way to Summarize Benchmark Results'', Communications of the ACM,
  Vol. 29, No. 3, March 1986, 218-221.

[HP 87]
  Hewlett Packard, ``HP 9000 Series 800 Performance Brief'', 5954-9903,
  5/87. (A comprehensive 40-page characterization of the 825S, 840S, and
  850S.)

[Hough 86,1]
  Hough, D., ``Weitek 1164/5 Floating Point Accelerators'', Usenet, January
  1986.

[Hough 86,2]
  Hough, D., ``Benchmarking and the 68020 Cache'', Usenet, January 1986.

[Hough 86,3]
  Hough, D., ``Floating-Point Programmer's Guide for the Sun Workstation'',
  Sun Microsystems, September 1986. [An excellent document, including a good
  set of references on IEEE floating point, especially on micros, and good
  notes on benchmarking hazards.] Sun-3/260 Spice numbers are from later
  mail.

[Hough 87]
  Hough, D., ``Sun-4 Floating-Point Performance'', Usenet, 08/04/87.

[IBM 87]
  IBM, ``IBM RT Personal Computer (RT PC) New Models, Features, and Software
  Overview'', February 17, 1987.

[Intergraph 86]
  Intergraph Corporation, ``Benchmarks for the InterPro 32C'', December
  1986.

[Meta-Software 87]
  Meta-Software, ``HSPICE Performance Benchmarks'', June 1987. 50 Curtner
  Avenue, Suite 16, Campbell, CA 95008.

[McInnis 87]
  McInnis, D., Kusik, R., Bhandarkar, D., ``VAX 8800 System Overview'',
  Proc. IEEE COMPCON, March 1987, San Francisco, 316-321.

[McMahon 86]
  McMahon, F. H., ``The Livermore Fortran Kernels: A Computer Test of the
  Numerical Performance Range'', December 1986, Lawrence Livermore National
  Labs.

[MIPS 87]
  MIPS Computer Systems, "A Sun-4 Benchmark Analysis", and "RISC System
  Benchmark Comparison: Sun-4 vs MIPS", July 23, 1987.

[Muchnick 88]
  Muchnick, S.S., "Optimizing Compilers for SPARC", SunTechnology, Summer
  1988, Sun Microsystems.

[Purkiser 87]
  Purkiser, C., ``Whetstone and LINPACK Numbers'', Usenet, March 1987.

[Richardson 87]
  Richardson, R., ``9/20/87 Dhrystone Benchmark Results'', Usenet, Sept.
  1987. Rick publishes the source several times a year. E-mail address:
  ...!seismo!uunet!pcrat!rick

[Serlin 87a]
  Serlin, O., ``MIPS, DHRYSTONES, AND OTHER TALES'', reprinted with
  revisions from SUPERMICRO Newsletter, April 1986, ITOM International,
  P.O. Box 1450, Los Altos, CA 94023. Analyses of the perils of simplistic
  benchmark measures.

[Serlin 87b]
  Serlin, O., SUPERMICRO #69, July 31, 1987, pp. 1-2. Offers a good list of
  attributes customers should demand of vendor benchmarking.

[Simpson 88]
  Simpson, David, "OEMs Cheer Motorola's 88000", Mini-Micro Systems, August
  1988, 83-91. (Note that the LINPACK numbers were not specified as FORTRAN
  or Coded, and that no configuration information is given; LINPACK numbers
  can be heavily influenced by cache sizes, so the published numbers are
  difficult to calibrate. Also, no 64-bit Whetstone numbers are provided.)

[Stahlman 87]
  Stahlman, M., "The Myth of Price/Performance", Sanford C. Bernstein & Co,
  Inc, NY, NY, March 17, 1987.

[Sun 86]
  SUN Microsystems, ``The SUN-3 Family: A Hardware Overview'', August 1986.

[Sun 87]
  SUN Microsystems, SUN-4 Product Introduction Material, July 7, 1987.

[Sun 88]
  SUN Microsystems, ``Sun-4/110 Preliminary Benchmark Results'', WSD
  Performance Group, 01/28/88.

[UCB 87]
  U. C. Berkeley, CAD/IC group, ``SPICE2G.6'', March 1987. Contact: Cindy
  Manly, EECS/ERL Industrial Liaison Program, 479 Cory Hall, University of
  California, Berkeley, CA 94720.

[Weicker 84]
  Weicker, R. P., ``Dhrystone: A Synthetic Systems Programming Benchmark'',
  Communications of the ACM, Vol. 27, No. 10, October 1984, pp. 1013-1030.

[Wilson 88]
  Wilson, David, "The Sun 4/260 RISC-Based Technical Workstation", UNIX
  Review 6, 7 (July 1988), 91-101.

________
RISComputer is a trademark of MIPS Computer Systems. UNIX is a registered
trademark of AT&T. DEC, VAX, Ultrix, and VAX/VMS are trademarks of Digital
Equipment Corp. Sun-3 and Sun-4 are trademarks of Sun Microsystems. Many
others are trademarks of their respective companies.
-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR m...@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086