SGI's interesting idea of a "speedup"

Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ncar!ames!pasteur!ucbvax!adt.UUCP!madd
From: m...@adt.UUCP (jim frost)
Newsgroups: comp.sys.sgi
Subject: SGI's interesting idea of a "speedup"
Message-ID: <8812162201.AA24738@adt.uucp>
Date: 16 Dec 88 22:01:31 GMT
Sender: dae...@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 82

Quoted from "Porting Applications to the IRIS-4D Family":

-- begin quote --

5.3 New Drawing Subroutines

Software release 4D1-3.0 introduced several new Graphics Library
subroutines for drawing and pixel access.  Silicon Graphics recommends
converting old style routines to the new ones for three reasons:

    * Your code will be more portable.

    * On the GT and future products, the new subroutines will run up
      to 10 times faster than their old counterparts.

    * The new subroutines simplify the Graphics Library and allow for
      future expansion.

In most cases, the conversion is simple -- just substitute the new
subroutines for the old ones.  Unfortunately, the new subroutines do
not work in display lists, so if your code is based primarily on
display lists, the solution is not so simple.

This table gives a comparison of old and new subroutines.

----------------------------------------------------------------------
Technique		Old Subroutines		New Subroutines
----------------------------------------------------------------------
draw connected		move,draw,draw		bgnline,v3f,v3f,
line segments					endline

draw closed		move,draw,draw		bgnclosedline,v3f,v3f,
hollow polygons		or poly			endclosedline

draw filled		pmv,pdr,pdr,pclos	bgnpolygon,v3f,v3f,
polygons		polf or splf		endpolygon

draw points		pnt,pnt			bgnpoint,v3f,v3f,
						endpoint

read pixels		readpixels,readRGB	rectread,lrectread

write pixels		writepixels,writeRGB	rectwrite,lrectwrite

draw triangular		new			bgntmesh,v3f,v3f,
meshes						endtmesh

color(vector)		RGBcolor		cpack or c3i

surface normal		normal			n3f

clear screen,		clear,zclear		czclear
Z-buffer

create RGB		RGBwritemask		wmpack
writemask
----------------------------------------------------------------------

-- end quote --
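
As an illustration of the first table row (connected line segments), here
is a minimal sketch of the same polyline drawn both ways. The Graphics
Library calls are replaced by stand-in stubs so the fragment compiles
anywhere; the stubs, the call counters, the assumption that Coord is a
float, and the helper names polyline_old and polyline_new are all
hypothetical and are not taken from the quoted guide.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in stubs so this sketch compiles without the Graphics Library.
 * On an IRIS, remove them and include the real GL header instead.
 * The counters are hypothetical instrumentation, not part of GL. */
typedef float Coord;                 /* assumption: Coord is a float */
static int old_calls, new_calls;

static void move(Coord x, Coord y, Coord z) { (void)x; (void)y; (void)z; old_calls++; }
static void draw(Coord x, Coord y, Coord z) { (void)x; (void)y; (void)z; old_calls++; }
static void bgnline(void)  { new_calls++; }
static void v3f(float v[3]) { (void)v; new_calls++; }
static void endline(void)  { new_calls++; }

/* Old style: move to the first vertex, then draw to each later one.
 * Every coordinate is passed by value. */
static void polyline_old(float pts[][3], size_t n)
{
    size_t i;
    assert(pts != NULL);
    if (n == 0)
        return;
    move(pts[0][0], pts[0][1], pts[0][2]);
    for (i = 1; i < n; i++)
        draw(pts[i][0], pts[i][1], pts[i][2]);
}

/* New style: bracket the vertices with bgnline/endline and hand each
 * vertex over as a pointer to three floats. */
static void polyline_new(float pts[][3], size_t n)
{
    size_t i;
    assert(pts != NULL);
    bgnline();
    for (i = 0; i < n; i++)
        v3f(pts[i]);
    endline();
}
```

The practical difference for a port is that the new calls take a pointer
to the vertex rather than three separate coordinates, so vertex data has
to live in float[3] arrays before the substitution becomes mechanical.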

Interestingly, the 10x factor seems to be correct as one of our
customers reported that our product "ran ten times slower" on the GT.

We happily followed the SGI guide to speed it up.  At one point we
changed all our readpixels() calls to rectread() calls, a non-trivial
task because they don't have the same arguments at all.  To our great
surprise, the following was printed when the new call was made:

	<rectread> is not implemented.

We were impressed at just how fast their new function didn't work, as
I'm sure you can guess.

Curious, we investigated.  Making use of "strings", we found that
libgl_s.a contained the string "<%s> is not implemented.".  Just how
many functions might call whatever routine has that string is
something that scares me.

Jim Frost
Associative Design Technology
(508) 366-9166
m...@bu-it.bu.edu

Path: utzoo!attcan!uunet!lll-winken!lll-lcc!ames!pasteur!ucbvax!DDATHD21.BITNET!XBR2D96D
From: XBR2D...@DDATHD21.BITNET (Knobi der Rechnerschrat)
Newsgroups: comp.sys.sgi
Subject: Misc.
Message-ID: <8812200602.aa17057@SMOKE.BRL.MIL>
Date: 19 Dec 88 06:43:53 GMT
Sender: dae...@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 71

Hello Netlanders,

  I've a few questions about SGI's new GTX architecture. They are based
on the 3.1 release notes and a document called "IRIS GTX: A Technical
Report, Rev 2":

- Which type of CPU (16 MHz or 25 MHz), and how many of them, do I need
  to get the full graphics speed (100,000 Z-buffered, 4-sided, G-shaded,
  P-lighted, independent polygons)? I ask this question because one of
  SGI's competitors (they have a vector/parallel-oriented workstation
  with up to 4 CPUs, graphics computations done in the CPU) had to admit
  (after applying some Spanish Inquisition tools) that they need 4 CPUs
  to reach their maximum graphics performance, and that there may be
  situations where graphics can consume all the resources of the system.

- Chapter "8.2 Graphics Notes" in the 4D-3.1 release notes states that
  some of the graphics routines (c3*, c4*, n3f, v2*, v3*, v4*) should be
  called with quadword-aligned data to get full GTX performance.
  Does this mean all the variables have to be "double" (which I don't
  believe) or that the first byte of a "float x[3]" vector has to start
  on a quadword-address? In the latter case I only have to rearrange our
  data-structures.

- does shademodel(FLAT) work again under 3.1?

As a last point, I want to comment on Jim Frost, who wrote a note about

> Subject: SGI's interesting idea of a "speedup"
                        .
                        .
                        .
                        .
>Interestingly, the 10x factor seems to be correct as one of our
>customers reported that our product "ran ten times slower" on the GT.
>
>We happily followed the SGI guide to speed it up.  At one point we
>changed all our readpixels() calls to rectread() calls, a non-trivial
>task because they don't have the same arguments at all.  To our great
>surprise, the following was printed when the new call was made:
>
>    <rectread> is not implemented.
>
>We were impressed at just how fast their new function didn't work, as
>I'm sure you can guess.
>
>Curious, we investigated.  Making use of "strings", we found that
>libgl_s.a contained the string "<%s> is not implemented.".  Just how
>many functions might call whatever routine has that string is
>something that scares me.
>
>Jim Frost
>Associative Design Technology
>(508) 366-9166
>m...@bu-it.bu.edu

Did you get your "not implemented" on a G or a GT? If it was on a G (as I
suspect), how can you expect routines to be implemented that only make
sense on the GT architecture (another example is smoothline())? I think
it's a good idea to allow you to use the calls, but to tell you that they
don't work.


Have a merry Christmas and a happy new year 89
Martin Knoblauch

TH-Darmstadt
Physical Chemistry 1
Petersenstrasse 20
D-6100 Darmstadt
West-Germany

BITNET: <XBR2D96D@DDATHD21>

Path: utzoo!attcan!uunet!lll-winken!lll-lcc!ames!sgi!...@patton.SGI.COM
From: j...@patton.SGI.COM (Jim Barton)
Newsgroups: comp.sys.sgi
Subject: Re: Misc.
Summary: The Answer Man
Message-ID: <23835@sgi.SGI.COM>
Date: 21 Dec 88 17:44:35 GMT
References: <8812200602.aa17057@SMOKE.BRL.MIL>
Sender: dae...@sgi.SGI.COM
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 65

In article <8812200602.aa17...@SMOKE.BRL.MIL>, XBR2D...@DDATHD21.BITNET 
(Knobi der Rechnerschrat) writes:
> Hello Netlanders,
> 
>   I've a few questions about SGI's new GTX architecture. They are based
> on the 3.1 release notes and a document called "IRIS GTX: A Technical
> Report, Rev 2":
> 
> - Which type of CPU (16 MHz or 25 MHz), and how many of them, do I need
>   to get the full graphics speed (100,000 Z-buffered, 4-sided, G-shaded,
>   P-lighted, independent polygons)? I ask this question because one of
>   SGI's competitors (they have a vector/parallel-oriented workstation
>   with up to 4 CPUs, graphics computations done in the CPU) had to admit
>   (after applying some Spanish Inquisition tools) that they need 4 CPUs
>   to reach their maximum graphics performance, and that there may be
>   situations where graphics can consume all the resources of the system.

ALL GTX class machines can reach full graphics performance with a single
CPU driving the graphics.  In a 4-popper, this means you get >3 CPUs
of compute performance to use as you wish.  (Unlike the competition, a GTX
has 100 MFlops dedicated to graphics; the CPU performance is yours to use
or abuse as you wish).

Part of this is the result of a custom bus cycle and small block DMA facility
which the processor uses to send geometry to the pipeline.  We call this
feature the "3-way-transfer".  More below ...

> - Chapter "8.2 Graphics Notes" in the 4D-3.1 release notes states that
>   some of the graphics routines (c3*, c4*, n3f, v2*, v3*, v4*) should be
>   called with quadword-aligned data to get full GTX performance.
>   Does this mean all the variables have to be "double" (which I don't
>   believe) or that the first byte of a "float x[3]" vector has to start
>   on a quadword-address? In the latter case I only have to rearrange our
>   data-structures.

As you surmised, the quadword alignment applies only to the first byte of
the data structure you are sending.  The reason alignment is needed for
full performance is related to the 3-way-transfer and the MP backplane.

As in most multiprocessors, memory data is transferred in large blocks for
efficiency, and then cached at each CPU.  The POWERSeries uses a 4-word
(16-byte) cache line, which is also the basic unit of transfer to the
graphics pipeline.  The 3-way-transfer is designed to allow the programmer
to lay out his data in an arbitrary way without alignment restrictions.
Thus, if your vertex crosses a 4-word boundary, two bus cycles will be
necessary to send the data (thus the "3-way": the first part of the data
may come from cache or memory, and the second part may come from some other
cache or memory, or the initiating CPU may own none of the data, in which
case other cache(s) or memory will supply the data). [Sorry if this is
confusing; remember that the POWERSeries uses write-back caching, so the
"real" memory image is distributed between caches and memory.]

Quadword-aligning the vertex ensures that the transfer happens in a single
bus cycle, giving you the best performance (but remember, your code will
still work, no matter how the data is aligned).
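
The alignment arithmetic above can be sketched in portable C. This is only
an illustration under stated assumptions: the 16-byte line size is taken
from the explanation above, while the names LINE_BYTES, lines_touched and
aligned_vertex are hypothetical, and _Alignas is a C11 feature that
postdates the compilers discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 4 words of 4 bytes each: the transfer unit described above. */
#define LINE_BYTES 16u

/* Number of 16-byte lines covered by `size` bytes starting at `addr`.
 * Under the description above, this is also the number of bus cycles
 * needed to send that data. */
static unsigned lines_touched(uintptr_t addr, size_t size)
{
    uintptr_t first, last;
    assert(size > 0);
    first = addr / LINE_BYTES;
    last  = (addr + size - 1) / LINE_BYTES;
    return (unsigned)(last - first + 1);
}

/* A vertex forced onto a 16-byte boundary. The compiler pads the struct
 * to 16 bytes, so every element of an array of these starts a new line. */
typedef struct {
    _Alignas(16) float xyz[3];
} aligned_vertex;
```

A 12-byte float[3] vertex starting at offset 0 or 4 within a line fits in
one line, while one starting at offset 8 or 12 spills into the next and
needs a second transfer. Padding each vertex to 16 bytes trades 4 bytes
per vertex for the guarantee of a single transfer.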

> - does shademodel(FLAT) work again under 3.1?

I hope so.

-- Jim Barton
Silicon Graphics Computing Systems    "UNIX: Live Free Or Die!"
j...@sgi.sgi.com, sgi!...@decwrl.dec.com, ...{decwrl,sun}!sgi!jmb

  "I used to be disgusted, now I'm just amused."
			- Elvis Costello, 'Red Shoes'
--