From tony@wucmd.wustl.edu Tue Oct 20 05:28:48 1992
Date: Tue, 20 Oct 92 10:28:48 -0500
From: tony@wucmd.wustl.edu (Tony Dueben)
To: chemistry@oscsunb.ccl.net
Subject: 1993 Summer Computer Simulation Conf.

During the 1993 Summer Computer Simulation Conference, to be held in Boston
19-21 July 1993, there will be sessions on the application of computer
simulation to chemical, physical, and engineering problems. I have been asked
to solicit submissions in computational chemistry. If you are interested in
presenting a paper, please send me an abstract (NOT a full paper) by 15 Nov.
I am also interested in proposals for panel discussions on topics in
computational chemistry. Again, please send me a proposal by 15 Nov.

Anthony J. Duben
Center for Molecular Design
Washington University
Box 1099 -- 1 Brookings Drive
St. Louis MO 63130-4899
e-mail: tony@wucmd.wustl.edu

From U15327@UICVM.bitnet Tue Oct 20 12:50:02 1992
Date: 20 October 1992 10:49:33 CDT
From: "Steve Roy 6-2489"
To: , , , ,
Subject: Supercomputing Workshops Announcement

UIC Workshops on Scientific Supercomputing

Need to perform a large calculation, but your program is just too slow?
Modern supercomputers are a powerful addition to the researcher's arsenal,
but as with other tools, full benefit cannot be obtained without a good
working knowledge of the major techniques and limitations. The University of
Illinois at Chicago Workshops on Scientific Supercomputing form a hands-on,
immersion program, developed at UIC over a six-year period, that provides
participants with the expertise necessary for successful, state-of-the-art
computations.

Each participant attends seminars and then applies the techniques discussed
to his or her own program. Set-piece exercises are not used; instead, each
participant brings a real, substantial program to work on. Enrollment is
limited so that each participant can spend substantial time with our
consultants. This combination of formal presentations, open discussions, and
practical experience with real programs on several different supercomputers
is designed to give the participant the ability to deal with today's and
tomorrow's supercomputing challenges.

The workshops are divided into three modules, each four weeks long, covering
vectorization, parallelization, and graphics. In addition, we offer a
one-week course, principally for researchers who supervise students and
others doing large-scale calculations.

Participants in the Workshops are provided with office space at UIC, accounts
on supercomputers at the Cornell National Supercomputer Center and the
National Center for Supercomputing Applications, accounts on UIC's mainframe
and workstations, extensive consultation, and a library of documentation. The
UIC Workshops on Scientific Supercomputing program is supported by the Joint
Center for Scientific and Technical Computing.

For a syllabus or further details, contact Dr. Stephen Roy at 312-996-2489 or
u15327 at uicvm.bitnet or uicvm.uic.edu.

From fredvc@esvax.dnet.dupont.com Tue Oct 20 09:04:54 1992
Date: Tue, 20 Oct 92 13:04:54 -0400
From: fredvc@esvax.dnet.dupont.com
To: ,chem@chem
Subject: NONVECTOR CODE ON VECTOR MACHINES - II

Sunday (18-OCT-92) night I wrote:
-----------------------------------------------------------------------------
*Joe Leonard, in connection with some performance disappointments on
*Stardent Titan machines, writes:
*
*>>Since the Titan is a vector machine, taking non-vector code and telling
*>>all of it to run vectorized can result in slowing the code down. Assuming
*>>that you are running the latest software and compilers, and run a P3
*>>(R3000) version of the processor cards, one should expect to see
*>>performance equivalent to an IBM 530 (1 proc).
*
*I am having some problems understanding just what he is driving at. Perhaps
*he could expand upon this.
-----------------------------------------------------------------------------

I have since gotten several responses and had a chat with Joe Leonard. The
responses (see below) suggest that things I have come to take for granted are
best described as "well-known, but not widely known", i.e., users who are not
heavily into the "code porting" business may not appreciate the land mines
along the road. Since I was mostly told things that I already knew, it
follows that:

(1) I have led a sheltered life; all the vector machines I have worked on
    have been CRAYs (1A, X-MP, Y-MP). I find the tool set there for code
    optimization to be exceedingly rich. There is little justification for
    having poorly optimized code in that environment.

(2) Since I have been in the "code porting" business for a number of years,
    there are things I do as a matter of course that are unlikely to occur
    to the novice. For example, assuming the code compiles successfully, I
    make 3 runs before spending any appreciable time on the code:

    (a) Run with vectorization OFF (check output).
    (b) Run with vectorization ON (check output, check speedup).
    (c) Run with vectorization & flowtrace ON (find "hot" routines, check
        for excessive subroutine calling, etc.).

Only at this point do I feel ready to begin to think about how to optimize
the code. I know better than to "leave it up to the compiler".

The responses (unedited) are attached.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
FREDERIC A. VAN-CATLEDGE                ||
Scientific Computing Division           || Office: (302) 695-1187
Central Research & Development Dept.    || FAX:    (302) 695-9658
The Du Pont Company                     ||
P. O. Box 80320                         || Internet: fredvc@esvax.dnet.dupont.com
Wilmington DE 19880-0320                ||
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
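Before the responses, a made-up fragment may help frame them (the names x, y,
f, h, a, b, and n are invented for illustration; this is not from any code
under discussion). The first loop carries a dependence -- each x(i) needs the
x(i-1) computed on the previous trip -- so it cannot be vectorized safely,
and a compiler pushed to vectorize everything spends its effort, and your
cycles, on loops like it. The second loop is independent across iterations
and vectorizes cleanly.

      subroutine demo(x, y, f, n, h, a, b)
c     illustrative sketch only -- all names are invented.
      dimension x(n), y(n), f(n)
c     loop 10 has a loop-carried dependence (x(i) uses x(i-1)), so it
c     must run scalar; forcing vector mode gains nothing and can cost
c     extra setup overhead.
      do 10 i = 2, n
         x(i) = x(i-1) + h*f(i)
   10 continue
c     loop 20 is independent across i and vectorizes cleanly.
      do 20 i = 1, n
         y(i) = a*x(i) + b
   20 continue
      return
      end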
From: egaines@bermuda.triangle.sgi.com

He is referring to the fact that if one allows a vectorizing compiler to make
its own decisions about which loops to vectorize (or parallelize, for that
matter), it may choose loops for which the vectorization payback does not
exceed the vectorization overhead (i.e., breaking up the unit of work,
synchronizing temporary variables after the execution of the loop, etc.). In
that event, performance will actually degrade, sometimes markedly.

While it is true that vectorizing/parallelizing compilers have become
markedly better in the last 5 years, it is also true that the very best
results come from profiling the code to determine those segments which
account for the majority of the execution (wallclock) time, and then
hand-inserting compiler directives to effect the vectorization or
parallelization.

Please forgive me if I misunderstood your question.
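A minimal sketch of that last point about hand-inserted directives (the array
names here are invented, and the sentinel shown is Cray-style; Convex,
Stardent, and other vendors each use their own spelling). On a Cray, CDIR$
IVDEP tells the compiler to ignore an apparent vector dependence. Profiling
would first identify the loop as hot; the programmer, who knows something the
compiler cannot prove -- here, that the invented index list idx(*) contains
no repeated entries -- then asserts that vectorization is safe:

      subroutine scat(y, a, x, idx, n)
c     sketch only; y(*), a(*), x(*), idx(*) are invented names.
      dimension y(*), a(n), x(n)
      integer idx(n)
c     the compiler sees a possible dependence through y(idx(i)) and
c     will not vectorize this loop on its own.  the directive asserts
c     that idx(*) never repeats, so the gather/scatter is safe.
CDIR$ IVDEP
      do 10 i = 1, n
         y(idx(i)) = y(idx(i)) + a(i)*x(i)
   10 continue
      return
      end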
From: virtual@quantum.larc.nasa.gov "Don H. Phillips"

In reply to Frederic Van-Catledge's question on vectorization slowdowns:

Vectorizing compilers are capable of doing significant code rearrangement in
order to vectorize do loops. Advanced preprocessors, such as KAP, do similar
work in order to minimize memory accesses on scalar machines. The slowdowns
occur sometimes (frequently on old code) when a lot of work is done to
vectorize do loops with variable trip counts. If the loop limits are smaller
than the vector-length break-even point, the code will run slower as vector
operations than as scalar ones. Even at or above the break-even point, the
extra code inserted to enable vectorization (or efficient memory utilization)
may cause a net slowdown of the processing of the code. We have seen one
widely used preprocessor insert 49 consecutive gather statements in order to
vectorize a nested do loop for the Cyber 205. In that case the vectorized
code ran several (7, I believe) times slower than the scalar code for a test
case. It is advisable to check the speed of the most compute-intensive
routines with and without vectorization turned on.

From: padrick@gibbs.oit.unc.edu "Mike Padrick"

Although I have never used a Stardent, I have managed a Convex vector
supercomputer for several years, and I think I know what Joe Leonard was
talking about. All vector architectures have a "break-even" point below which
it is not worth bothering to set up a vector load/store and loop sequence; it
is quicker just to do a scalar loop. On a Cray Y-MP this can be as little as
2 or 3 iterations; on a Convex C240 it is around 5-7. Smart compilers won't
vectorize an otherwise vectorizable loop if they can tell that the number of
iterations wouldn't make it worthwhile (e.g., DO I=1,4). But sometimes they
can't tell (DO I=1,N), and if the loop is vectorized and N is always 2 or 3,
execution time can really suffer.

Sometimes these are inner loops, and the next level out is a big loop that
SHOULD be vectorized. If the inner loop is vectorized, then the outer loop
can't be. But if the inner loop is forced to be scalar, then vectorizing the
outer loop can be a big win.

(I'm not a chemist, nor do I portray one on TV. I just monitor this list to
help my users.)
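A made-up example in the spirit of Mike's inner/outer point (natom, x, and r2
are invented names). The natural inner loop, DO K=1,3 over coordinate
components, has a trip count far below any break-even point; if the compiler
vectorizes it, the long outer loop over atoms stays scalar. Writing the
three-trip loop out by hand (or marking it scalar with a directive along the
lines of Cray's CDIR$ NEXTSCALAR) leaves the outer loop free to vectorize:

      subroutine dist2(x, r2, natom)
c     invented example.  x(3,natom) holds atomic coordinates.
      dimension x(3,natom), r2(natom)
c     with the 3-trip component loop unrolled, the loop the compiler
c     vectorizes is the long one over atoms, well above any
c     break-even point.
      do 10 i = 1, natom
         r2(i) = x(1,i)**2 + x(2,i)**2 + x(3,i)**2
   10 continue
      return
      end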
From: gawboy@sodium.mps.ohio-state.edu

I am also curious about what Joe Leonard may have to say. The COLUMBUS
package has its own library of number-crunching routines, which have been
optimized for a wide variety of machines, including the Titan. If we look at
the Titan blocks for a general matrix multiply (below) and compare them to
the code for scalar machines, I think some insight can be gained into what
Joe was driving at.

Notice that it was found optimal to use the library routine only above a
cutoff threshold. Also, for optimal performance, no tests are performed
inside the do loops, and microtasking is required. Based on my experience
optimizing Cray codes, the structure of this optimized code tells me that
great care must be taken to ensure that clock cycles are not wasted. In other
words, there is a significant overhead in running non-vector code in vector
mode.

The all-Fortran version is for sequential machines. Notice that there is an
if block inside the nest of do loops in the matrix multiply. This will kill
any possibility of vectorization. Now, if the compiler is told to run this
version of gmxm as if it were vectorized, you will still be paying the
overhead of vector mode yet getting none of the benefits. The computer will
do only one flop per clock cycle rather than n, where n is the number of
operands in the pipe.

So here are two possible guesses at Joe Leonard's statements:

1. If a procedure requires multiplying a lot of small matrices, the gmxma
   overhead will outweigh the benefits of vectorization. (I think of this
   type of phenomenon as bubbles of air in a pipe: you still pay for the
   pumping, but nothing useful comes out at the end.) Also, the compiler may
   not always recognise that it is supposed to vectorise the loop unless it
   is explicitly told to inline it, but I am not sure, since I don't use the
   Titan.

2. For a scalar processor you usually win by determining whether or not a
   column of values is zero, whereas on a vector machine making the necessary
   test can kill vectorization. So giving the compiler an optimized
   all-Fortran routine and telling it to run in vector mode will at best give
   you only the scalar speed of the processor.

Regards,
Galen F. Gawboy

*deck gmxm
      subroutine gmxm(a,nar, b,nbr, c,ncc, alpha)
c
c  generalized matrix-matrix product to compute  c = c + alpha*(a * b)
c
c  input:
c    a(*),b(*) = input matrices.
c    nar   = number of rows in a(*) and c(*).
c    nbr   = number of columns of a(*) and rows of b(*).
c    ncc   = number of columns of b(*) and c(*).
c            all dimensions must be > 0.
c    alpha = matrix product scale factor.
c
c  output:
c    c(*)  = c(*) + alpha * (a * b)  updated matrix.
c
c  (the declarations below follow from the header comments above.)
      dimension a(nar,nbr), b(nbr,ncc), c(nar,ncc)
      parameter ( zero = 0.0, one = 1.0 )
*mdc*elseif titan
*c  16-feb-90 tuned for 16mhz r2000 single processor. -rls
*      if ( nar*nbr .ge. 400 .or. nar*nbr*ncc .ge. 4096 ) then
*c        # interface to the library gmxma().
*         call gmxma(a,1,nar, b,1,nbr, c,1,nar, nar,nbr,ncc, alpha,one)
*      else
*c$doit vbest
*         do 50 i = 1, nar
*            do 40 j = 1, ncc
*               do 30 k = 1, nbr
*                  c(i,j) = c(i,j) + alpha * a(i,k) * b(k,j)
*30             continue
*40          continue
*50       continue
*      endif
*mdc*else
*c
*c  all-fortran version...
*c  note: *** this code should not be modified ***
*c  a saxpy inner loop is used to exploit sparseness in the matrix
*c  b(*), and this results in sequential access to a(*), b(*),
*c  and c(*). -rls
*c
*      do 50 j = 1, ncc
*         do 40 k = 1, nbr
*            bkj = b(k,j)*alpha
*            if ( bkj .ne. zero ) then
*               do 30 i = 1, nar
*                  c(i,j) = c(i,j) + a(i,k) * bkj
*30             continue
*            endif
*40       continue
*50    continue
*mdc*endif
      return
      end

From molsol@wucmd.wustl.edu Tue Oct 20 08:30:48 1992
Date: Tue, 20 Oct 92 13:30:48 -0500
From: molsol@wucmd.wustl.edu (Molecular Solutions)
To: chemistry@ccl.net
Subject: ACS meeting - Denver

The Computers in Chemistry Division of the American Chemical Society is
currently planning its sessions for the ACS meeting in Denver (March 28
through April 2). I am organizing a session which will explore the topic of
solvation models. If you are interested in submitting a paper, please contact
me. Abstracts are due by December 15, 1992.

Thanks,

Allen Richon
Molecular Solutions, Inc.
412 Carolina Blvd.
Isle of Palms, SC 29451
803-886-8775, 803-886-5924 (FAX)
molsol@wucmd.wustl.edu