From tony@wucmd.wustl.edu Tue Oct 20 05:28:48 1992
Date: Tue, 20 Oct 92 10:28:48 -0500
From: tony@wucmd.wustl.edu (Tony Dueben)
To: chemistry@oscsunb.ccl.net
Subject: 1993 Summer Computer Simulation Conf.

During the 1993 Summer Computer Simulation Conference, to be held in Boston
19-21 July 1993, there will be sessions on the application of computer
simulation to chemical, physical, and engineering problems. I have been asked
to solicit submissions in computational chemistry. If you are interested in
presenting a paper, please send me an abstract (NOT a full paper) by 15 Nov.
I am also interested in proposals for panel discussions on topics in
computational chemistry. Again, please send me a proposal by 15 Nov.

Anthony J. Duben
Center for Molecular Design
Washington University
Box 1099 -- 1 Brookings Drive
St. Louis MO 63130-4899
e-mail: tony@wucmd.wustl.edu

From U15327@UICVM.bitnet Tue Oct 20 12:50:02 1992
Date: 20 October 1992 10:49:33 CDT
From: "Steve Roy 6-2489"
To: , , , ,
Subject: Supercomputing Workshops Announcement

UIC Workshops on Scientific Supercomputing

Need to perform a large calculation, but your program is just too slow?
Modern supercomputers are a powerful addition to the researcher's arsenal,
but as with other tools, full benefit cannot be obtained without a good
working knowledge of the major techniques and limitations. The University of
Illinois at Chicago Workshops on Scientific Supercomputing form a hands-on,
immersion program, developed at UIC over a six-year period, that provides
participants with the expertise necessary for successful, state-of-the-art
computations.

Each participant attends seminars and then applies the techniques discussed
to his or her own program. Set-piece exercises are not used; instead, each
participant brings a real, substantial program to work on. Enrollment is
limited so that each participant can spend substantial time with our
consultants. This combination of formal presentations, open discussions, and
practical experience with real programs on several different supercomputers
is designed to give the participant the ability to deal with today's and
tomorrow's supercomputing challenges.

The workshops are divided into three modules, each four weeks long, covering
vectorization, parallelization, and graphics. In addition, we offer a
one-week course, principally for researchers who supervise students and
others doing large-scale calculations.

Participants in the Workshops are provided with office space at UIC, accounts
on supercomputers at the Cornell National Supercomputer Center and the
National Center for Supercomputing Applications, accounts on UIC's mainframe
and workstations, extensive consultation, and a library of documentation. The
UIC Workshops on Scientific Supercomputing program is supported by the Joint
Center for Scientific and Technical Computing.

For a syllabus or further details, contact Dr. Stephen Roy at 312-996-2489 or
u15327 at uicvm.bitnet or uicvm.uic.edu.

From fredvc@esvax.dnet.dupont.com Tue Oct 20 09:04:54 1992
Date: Tue, 20 Oct 92 13:04:54 -0400
From: fredvc@esvax.dnet.dupont.com
To: ,chem@chem
Subject: NONVECTOR CODE ON VECTOR MACHINES - II

Sunday (18-OCT-92) night I wrote:
-----------------------------------------------------------------------------
*Joe Leonard, in connection with some performance disappointments on
*Stardent Titan machines, writes:
*
*>>Since the Titan is a vector machine, taking non-vector code and telling
*>>all of it to run vectorized can result in slowing the code down. Assuming
*>>that you are running the latest software and compilers, and run a P3
*>>(R3000) version of the processor cards, one should expect to see
*>>performance equivalent to an IBM 530 (1 proc).
*
*I am having some problems understanding just what he is driving at. Perhaps
*he could expand upon this.
-----------------------------------------------------------------------------

I have since gotten several responses and had a chat with Joe Leonard. The
responses (see below) suggest that things I have come to take for granted are
best described as "well-known, but not widely known", i.e., users who are not
heavily into the "code porting" business may not appreciate the land mines
along the road. Since I was mostly told things that I already knew, it
follows that:

(1) I have led a sheltered life; all the vector machines I have worked on
    have been CRAYs (1A, X-MP, Y-MP). I find the tool set there for code
    optimization to be exceedingly rich. There is little justification for
    having poorly optimized code in that environment.

(2) Since I have been in the "code porting" business for a number of years,
    there are things I do as a matter of course that are unlikely to occur
    to the novice. For example, assuming the code compiles successfully, I
    make 3 runs before spending any appreciable time on the code:

    (a) Run with vectorization OFF (check output).
    (b) Run with vectorization ON (check output, check speedup).
    (c) Run with vectorization & flowtrace ON (find "hot" routines, check
        for excessive subroutine calling, etc.).

Only at this point do I feel ready to begin to think about how to optimize
the code. I know better than to "leave it up to the compiler".

The responses (unedited) are attached.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
FREDERIC A. VAN-CATLEDGE                ||
Scientific Computing Division           || Office: (302) 695-1187
Central Research & Development Dept.    || FAX:    (302) 695-9658
The Du Pont Company                     ||
P. O. Box 80320                         || Internet: fredvc@esvax.dnet.dupont.com
Wilmington DE 19880-0320                ||
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
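Before the responses, a made-up fragment may help frame them (the names x, y,
f, h, a, b, and n are invented for illustration; this is not from any code
under discussion). The first loop carries a dependence -- each x(i) needs the
x(i-1) computed on the previous trip -- so it cannot be vectorized safely,
and a compiler pushed to vectorize everything spends its effort, and your
cycles, on loops like it. The second loop is independent across iterations
and vectorizes cleanly.

      subroutine demo(x, y, f, n, h, a, b)
c     illustrative sketch only -- all names are invented.
      dimension x(n), y(n), f(n)
c     loop 10 has a loop-carried dependence (x(i) uses x(i-1)), so it
c     must run scalar; forcing vector mode gains nothing and can cost
c     extra setup overhead.
      do 10 i = 2, n
         x(i) = x(i-1) + h*f(i)
   10 continue
c     loop 20 is independent across i and vectorizes cleanly.
      do 20 i = 1, n
         y(i) = a*x(i) + b
   20 continue
      return
      end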
From: egaines@bermuda.triangle.sgi.com

He is referring to the fact that if one allows a vectorizing compiler to make
its own decisions about which loops to vectorize (or parallelize, for that
matter), it may choose loops for which the vectorization payback does not
exceed the vectorization overhead (i.e., breaking up the unit of work,
synchronizing temporary variables after the execution of the loop, etc.). In
that event, performance will actually degrade, sometimes markedly.

While it is true that vectorizing/parallelizing compilers have become
markedly better in the last 5 years, it is also true that the very best
results come from profiling the code to determine those segments which
account for the majority of the execution (wallclock) time, and then
hand-inserting compiler directives to effect the vectorization or
parallelization.

Please forgive me if I misunderstood your question.
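A minimal sketch of that last point about hand-inserted directives (the array
names here are invented, and the sentinel shown is Cray-style; Convex,
Stardent, and other vendors each use their own spelling). On a Cray, CDIR$
IVDEP tells the compiler to ignore an apparent vector dependence. Profiling
would first identify the loop as hot; the programmer, who knows something the
compiler cannot prove -- here, that the invented index list idx(*) contains
no repeated entries -- then asserts that vectorization is safe:

      subroutine scat(y, a, x, idx, n)
c     sketch only; y(*), a(*), x(*), idx(*) are invented names.
      dimension y(*), a(n), x(n)
      integer idx(n)
c     the compiler sees a possible dependence through y(idx(i)) and
c     will not vectorize this loop on its own.  the directive asserts
c     that idx(*) never repeats, so the gather/scatter is safe.
CDIR$ IVDEP
      do 10 i = 1, n
         y(idx(i)) = y(idx(i)) + a(i)*x(i)
   10 continue
      return
      end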
From: virtual@quantum.larc.nasa.gov "Don H. Phillips"

In reply to Frederic Van-Catledge's question on vectorization slowdowns:

Vectorizing compilers are capable of doing significant code rearrangement in
order to vectorize do loops. Advanced preprocessors, such as KAP, do similar
work in order to minimize memory accesses on scalar machines. The slowdowns
occur sometimes (frequently on old code) when a lot of work is done to
vectorize do loops with variable trip counts. If the loop limits are smaller
than the vector-length break-even point, the code will run slower as vector
operations than as scalar ones. Even at or above the break-even point, the
extra code inserted to enable vectorization (or efficient memory utilization)
may cause a net slowdown of the processing of the code. We have seen one
widely used preprocessor insert 49 consecutive gather statements in order to
vectorize a nested do loop for the Cyber 205. In that case the vectorized
code ran several (7, I believe) times slower than the scalar code for a test
case. It is advisable to check the speed of the most compute-intensive
routines with and without vectorization turned on.

From: padrick@gibbs.oit.unc.edu "Mike Padrick"

Although I have never used a Stardent, I have managed a Convex vector
supercomputer for several years, and I think I know what Joe Leonard was
talking about. All vector architectures have a "break-even" point below which
it is not worth bothering to set up a vector load/store and loop sequence; it
is quicker just to do a scalar loop. On a Cray Y-MP this can be as little as
2 or 3 iterations; on a Convex C240 it is around 5-7. Smart compilers won't
vectorize an otherwise vectorizable loop if they can tell that the number of
iterations wouldn't make it worthwhile (e.g., DO I=1,4). But sometimes they
can't tell (DO I=1,N), and if the loop is vectorized and N is always 2 or 3,
execution time can really suffer.

Sometimes these are inner loops, and the next level out is a big loop that
SHOULD be vectorized. If the inner loop is vectorized, then the outer loop
can't be. But if the inner loop is forced to be scalar, then vectorizing the
outer loop can be a big win.

(I'm not a chemist, nor do I portray one on TV. I just monitor this list to
help my users.)
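A made-up example in the spirit of Mike's inner/outer point (natom, x, and r2
are invented names). The natural inner loop, DO K=1,3 over coordinate
components, has a trip count far below any break-even point; if the compiler
vectorizes it, the long outer loop over atoms stays scalar. Writing the
three-trip loop out by hand (or marking it scalar with a directive along the
lines of Cray's CDIR$ NEXTSCALAR) leaves the outer loop free to vectorize:

      subroutine dist2(x, r2, natom)
c     invented example.  x(3,natom) holds atomic coordinates.
      dimension x(3,natom), r2(natom)
c     with the 3-trip component loop unrolled, the loop the compiler
c     vectorizes is the long one over atoms, well above any
c     break-even point.
      do 10 i = 1, natom
         r2(i) = x(1,i)**2 + x(2,i)**2 + x(3,i)**2
   10 continue
      return
      end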
From: gawboy@sodium.mps.ohio-state.edu

I am also curious about what Joe Leonard may have to say. The COLUMBUS
package has its own library of number-crunching routines, which have been
optimized for a wide variety of machines, including the Titan. If we look at
the Titan blocks for a general matrix multiply (below) and compare them to
the code for scalar machines, I think some insight can be gained into what
Joe was driving at.

Notice that it was found optimal to use the library routine only above a
cutoff threshold. Also, for optimal performance, no tests are performed
inside the do loops, and microtasking is required. Based on my experience
optimizing Cray codes, the structure of this optimized code tells me that
great care must be taken to ensure that clock cycles are not wasted. In other
words, there is a significant overhead in running non-vector code in vector
mode.

The all-Fortran version is for sequential machines. Notice that there is an
if block inside the nest of do loops in the matrix multiply. This will kill
any possibility of vectorization. Now, if the compiler is told to run this
version of gmxm as if it were vectorized, you will still be paying the
overhead of vector mode yet getting none of the benefits. The computer will
do only one flop per clock cycle rather than n, where n is the number of
operands in the pipe.

So here are two possible guesses at Joe Leonard's statements:

1. If a procedure requires multiplying a lot of small matrices, the gmxma
   overhead will outweigh the benefits of vectorization. (I think of this
   type of phenomenon as bubbles of air in a pipe: you still pay for the
   pumping, but nothing useful comes out at the end.) Also, the compiler may
   not always recognise that it is supposed to vectorise the loop unless it
   is explicitly told to inline it, but I am not sure, since I don't use the
   Titan.

2. For a scalar processor you usually win by determining whether or not a
   column of values is zero, whereas on a vector machine making the necessary
   test can kill vectorization. So giving the compiler an optimized
   all-Fortran routine and telling it to run in vector mode will at best give
   you only the scalar speed of the processor.

Regards,
Galen F. Gawboy

*deck gmxm
      subroutine gmxm(a,nar, b,nbr, c,ncc, alpha)
c
c  generalized matrix-matrix product to compute  c = c + alpha*(a * b)
c
c  input:
c    a(*),b(*) = input matrices.
c    nar   = number of rows in a(*) and c(*).
c    nbr   = number of columns of a(*) and rows of b(*).
c    ncc   = number of columns of b(*) and c(*).
c            all dimensions must be > 0.
c    alpha = matrix product scale factor.
c
c  output:
c    c(*)  = c(*) + alpha * (a * b)  updated matrix.
c
c  (the declarations below follow from the header comments above.)
      dimension a(nar,nbr), b(nbr,ncc), c(nar,ncc)
      parameter ( zero = 0.0, one = 1.0 )
*mdc*elseif titan
*c  16-feb-90 tuned for 16mhz r2000 single processor. -rls
*      if ( nar*nbr .ge. 400 .or. nar*nbr*ncc .ge. 4096 ) then
*c        # interface to the library gmxma().
*         call gmxma(a,1,nar, b,1,nbr, c,1,nar, nar,nbr,ncc, alpha,one)
*      else
*c$doit vbest
*         do 50 i = 1, nar
*            do 40 j = 1, ncc
*               do 30 k = 1, nbr
*                  c(i,j) = c(i,j) + alpha * a(i,k) * b(k,j)
*30             continue
*40          continue
*50       continue
*      endif
*mdc*else
*c
*c  all-fortran version...
*c  note: *** this code should not be modified ***
*c  a saxpy inner loop is used to exploit sparseness in the matrix
*c  b(*), and this results in sequential access to a(*), b(*),
*c  and c(*). -rls
*c
*      do 50 j = 1, ncc
*         do 40 k = 1, nbr
*            bkj = b(k,j)*alpha
*            if ( bkj .ne. zero ) then
*               do 30 i = 1, nar
*                  c(i,j) = c(i,j) + a(i,k) * bkj
*30             continue
*            endif
*40       continue
*50    continue
*mdc*endif
      return
      end

From molsol@wucmd.wustl.edu Tue Oct 20 08:30:48 1992
Date: Tue, 20 Oct 92 13:30:48 -0500
From: molsol@wucmd.wustl.edu (Molecular Solutions)
To: chemistry@ccl.net
Subject: ACS meeting - Denver

The Computers in Chemistry Division of the American Chemical Society is
currently planning its sessions for the ACS meeting in Denver (March 28
through April 2). I am organizing a session which will explore the topic of
solvation models. If you are interested in submitting a paper, please contact
me. Abstracts are due by December 15, 1992.

Thanks,

Allen Richon
Molecular Solutions, Inc.
412 Carolina Blvd.
Isle of Palms, SC 29451
803-886-8775, 803-886-5924 (FAX)
molsol@wucmd.wustl.edu