From owner-chemistry@ccl.net Tue Oct 26 08:15:00 2010
From: "George Lawrence geoe2##hotmail.com"
To: CCL
Subject: CCL: Descriptors
Message-Id: <-43003-101026065128-18090-njxQlyXgkExmTJLOxkWSyw]=[server.ccl.net>
X-Original-From: "George Lawrence"
Date: Tue, 26 Oct 2010 06:51:26 -0400

Sent to CCL by: "George Lawrence" [geoe2%hotmail.com]
While building a model for a set of compounds, how does one choose the molecular descriptors? I am using MOE, which has about 333 different descriptors. I noticed that some share the same suffix or prefix.
For example: GCUT (which could be SlogP, SMR or PEOE), and then there are SlogP_vsa, SMR_vsa and PEOE_vsa, which have different numbers attached to them. What does this mean?

Do they describe the same thing? How do the numbers relate to each descriptor?
What are the best methods for deciding on the right choice of descriptors?

George Lawrence
Geoe2]_[hotmail.com
Kent U.K.

From owner-chemistry@ccl.net Tue Oct 26 08:50:00 2010
From: "George Lawrence geoe2__hotmail.com"
To: CCL
Subject: CCL: Descriptor defination
Message-Id: <-43004-101026061924-27099-8jiVtsqljuZtRc5Gs2Jbrw^_^server.ccl.net>
X-Original-From: "George Lawrence"
Date: Tue, 26 Oct 2010 06:19:17 -0400

Sent to CCL by: "George Lawrence" [geoe2(!)hotmail.com]
I am new to computational chemistry and would like some information on descriptor definitions. I have formulated several QSAR models with equations I don't understand well. For example, what does it mean to have a + or a - AM1-HOMO? Where can I find easy-to-understand information on descriptor definitions?
Thank you

From owner-chemistry@ccl.net Tue Oct 26 09:31:00 2010
From: "Chris Swain swain|a|mac.com"
To: CCL
Subject: CCL: Descriptors
Message-Id: <-43005-101026092940-24569-O94Snn+rO+ShSnqXfe7hxQ%server.ccl.net>
X-Original-From: Chris Swain
Content-transfer-encoding: 7BIT
Content-type: text/plain; CHARSET=US-ASCII
Date: Tue, 26 Oct 2010 14:29:16 +0100
MIME-version: 1.0

Sent to CCL by: Chris Swain [swain%%mac.com]
Use the help function in MOE. There is a search function; just type in "descriptors" to get the details of each.

Chris

From owner-chemistry@ccl.net Tue Oct 26 12:33:00 2010
From: "Kimberley Cousins kcousins=csusb.edu"
To: CCL
Subject: CCL: Descriptors
Message-Id: <-43006-101026095014-26056-oj7Mg8XXulgAHnqT96mKKw\a/server.ccl.net>
X-Original-From: Kimberley Cousins
Content-disposition: inline
Content-language: en
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Date: Tue, 26 Oct 2010 06:50:05 -0700
MIME-version: 1.0

Sent to CCL by: Kimberley Cousins [kcousins%csusb.edu]
George,

Since you are an MOE user, I suggest you consult its rather robust online documentation (which came with it) for the definition of each descriptor.
You should also do the tutorial provided for QSAR, as it will lead you through an example of descriptor selection. I have asked several of my research students to start by doing the tutorials, and it has been a good introduction.

In fact, selecting descriptors is usually done either statistically (best fit, or functions that normalize sets of descriptors while discarding descriptors that are not orthogonal) or with neural networks. There is no one "best way" that I know of; therein lies the "art" of QSAR.

Kimberley R. Cousins
Professor of Chemistry
California State University, San Bernardino
kcousins_+_csusb.edu

-= This is automatically added to each message by the mailing script =-
To recover the email address of the author of the message, please change the strange characters on the top line to the _+_ sign.
From owner-chemistry@ccl.net Tue Oct 26 13:07:00 2010
From: "Victor Rosas Garcia rosas.victor * gmail.com"
To: CCL
Subject: CCL: NBO+ECP GAMESS error
Message-Id: <-43007-101026123052-3982-biH/79e7en1uwlNl2OIuVw{=}server.ccl.net>
X-Original-From: Victor Rosas Garcia
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=ISO-8859-1
Date: Tue, 26 Oct 2010 11:30:40 -0500
MIME-Version: 1.0

Sent to CCL by: Victor Rosas Garcia [rosas.victor%x%gmail.com]
To answer my own question, and to keep this for future reference:

It turns out that I had forgotten to make a change in the source code of the DAWRIT subroutine. I had not made this change:

iolib.src: In the routine DAWRIT, change the following instruction
      IF (N .GT. 0 .AND. LEN .NE. IFILEN(NREC)) GO TO 800
to read
      IF (N .GT. 0 .AND. LEN .GT. IFILEN(NREC)) GO TO 800

Now the original error has disappeared (and a new one has appeared, but I am working on that one).

Victor

2010/10/25 Victor Rosas Garcia rosas.victor=gmail.com :
>
> Sent to CCL by: Victor Rosas Garcia [rosas.victor],[gmail.com]
> Hello everybody,
>
> I am trying to run a Natural Energy Decomposition Analysis on an
> RHF/LANL2DZdp wavefunction using GAMESS VERSION = 25 MAR 2010 (R2)
> but the job crashes.  The smallest case I have found that reproduces
> the error is as follows:
>
>  $CONTRL SCFTYP=RHF RUNTYP=ENERGY COORD=ZMT NOSYM=1 $END
>  $CONTRL UNITS=ANGS PP=READ EXETYP=RUN $END
>  $INTGRL NOPK=1 $END
>  $BASIS EXTFIL=.TRUE. GBASIS=LANL2DZD $END
>  $NBO $END
>  $DEL NEDA END
>       NEDA (1,2) (3) END
>  $END
>  $DATA
> LiCl..H2O...rhf/lanl2dzdp
> C1
>  LI
>  CL   1   2.0993427
>  O    1   1.9020655   2  179.9605863
>  H    3    .9500458   1  126.3708465  2  178.7919157
>  H    3    .9500458   4  107.2871778  1 -179.9941078
>  $END
>  $ECP
>   NONE
>   CL-ECP GEN   10   2
>   5           ----- d   potential            -----
>         -10.0000000        1             94.8130000
>          66.2729170        2            165.6440000
>         -28.9685950        2             30.8317000
>         -12.8663370        2             10.5841000
>          -1.7102170        2              3.7704000
>   5           ----- s-d potential            -----
>           3.0000000        0            128.8391000
>          12.8528510        1            120.3786000
>         275.6723980        2             63.5622000
>         115.6777120        2             18.0695000
>          35.0606090        2              3.8142000
>   6           ----- p-d potential            -----
>           5.0000000        0            216.5263000
>           7.4794860        1             46.5723000
>         613.0320000        2            147.4685000
>         280.8006850        2             48.9869000
>         107.8788240        2             13.2096000
>          15.3439560        2              3.1831000
>   NONE
>   NONE
>   NONE
>  $END
>
> which is basically an example from the NBO manual modified to use the
> LANL2DZdp basis set.  If I use the 6-31G* basis set, no problem.  The
> job fails with the following error message:
>
>  --------------
>   Fragment  1:
>  --------------
>
>           ECP ANGULAR INTS.........     0.00 SECONDS
>  *** ERROR *** IN -DAWRIT- ROUTINE ON NODE   0
>  DAWRIT HAS REQUESTED A RECORD WITH LENGTH DIFFERENT THAN BEFORE - ABORT FORCED.
>  DAF RECORD    92 NEW LENGTH =         0 OLD LENGTH =       162
>  EXECUTION OF GAMESS TERMINATED -ABNORMALLY- AT Mon Oct 25 08:16:00 2010
>       580000  WORDS OF DYNAMIC MEMORY USED
>  STEP CPU TIME =     0.00 TOTAL CPU TIME =        1.1 (    0.0 MIN)
>  TOTAL WALL CLOCK TIME=        1.4 SECONDS, CPU UTILIZATION IS  82.22%
>  ddikick.x: application process 0 quit unexpectedly.
>  ddikick.x: Sending kill signal to DDI processes.
>  ddikick.x: Execution terminated due to error(s).
>
> so the NBO analysis has no problem, it fails only on reaching the NEDA
> part.  Looking through the CCL archives, I found a message from 2002
> (http://ftp.ccl.net/chemistry/resources/messages/2002/05/27.005-dir/index.html)
> with an almost identical problem, but I have not located any responses
> to it.
>
> Any ideas?
>
> Victor

From owner-chemistry@ccl.net Tue Oct 26 13:42:00 2010
From: "Aniko Simon aniko+*+simbiosys.ca"
To: CCL
Subject: CCL: CLiDE (Chemical Literature Data Extraction) version 4 was released
Message-Id: <-43008-101026091414-9667-j8kt1egSn+HN11jh5UgsEw[a]server.ccl.net>
X-Original-From: "Aniko Simon"
Date: Tue, 26 Oct 2010 09:14:13 -0400

Sent to CCL by: "Aniko Simon" [aniko*o*simbiosys.ca]
CLiDE is a sophisticated chemical OCR application that can extract molecular structures from images or PDF files and save them in several chemical file formats such as MOL, SDF and XML, or transfer them seamlessly to various chemical editors.

CLiDE is available in three packages: Standard, Professional and Batch. Each serves a slightly different need at a different price.
CLiDE-Standard and CLiDE-Professional offer a contemporary graphical user interface, with all the recognition features embedded in a user-friendly document/image viewer. For more information see:
http://www.simbiosys.com/products/leaflets/CLiDE_folder.pdf

Release 4 contains major improvements to the product, including:
* improved recognition accuracy
* improved input/output file format handling: support for several new raster image file formats (GIF, JPEG, PBM, PGM, PNG, PNM, PPM, XBM, XPM), and a new XML output file format that can be loaded for later editing
* the capability to extract chemical structures from tables
* saving molecules with super atoms in expanded or contracted form

A complete listing of changes is available upon request.

Feel free to contact me if you have any questions about this new CLiDE release, or sign up for a no-fee evaluation of CLiDE on our website:
http://www.simbiosys.com/products/demo_request.html

Best wishes,
Aniko
--
Aniko Simon, Ph.D. | SimBioSys Inc. | Tel: 1-416-741-4263
http://www.simbiosys.com/ | blog: http://www.simbiosys.com/blog/
Check out the most recent customer quotes about SimBioSys:
http://www.simbiosys.com/support/index.html#quotes

From owner-chemistry@ccl.net Tue Oct 26 14:25:00 2010
From: "Erik-Jan Ras Erik-Jan.Ras..avantium.com"
To: CCL
Subject: CCL: Descriptors
Message-Id: <-43009-101026142328-26538-gJGN5W2lUYJGB82wYne4Bg_._server.ccl.net>
X-Original-From: Erik-Jan Ras
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 26 Oct 2010 20:23:15 +0200
MIME-Version: 1.0

Sent to CCL by: Erik-Jan Ras [Erik-Jan.Ras~~avantium.com]
Dear George,

As already indicated by others, there is no uniform selection method for choosing which descriptors to use. Some guidelines, depending on the modeling method you use, may still be helpful.

If you're using PLS models, a good starting point is the variable importance in the projection (VIP) for each of the variables in your model.
A variable with a high VIP will have a high impact on your model performance. Typically you start your modeling exercise with all available variables. After that, in small iterative steps, you reduce your model. At each stage you have to carefully evaluate the predictive power of your model. Ideally you would use a substantially large external validation set to assess predictive power.

Also keep in mind that, in theory, only one latent variable per response (Y) should be required in your model. If (many) more latent variables are required, you're dealing with variations in your descriptor space (X) that are orthogonal (uncorrelated) to your response (Y). In this case you may want to consider using OPLS instead of PLS.

Generally speaking, these methods are implemented in commercial packages like Simca-P and work quite well (they are also pretty well documented and referenced). With a bit more effort, many open source libraries are available as well in environments like Matlab, Scilab or R.

Regards,
Erik-Jan

This email (including its attached files and other content) is confidential and intended only for the use by named addressee. Unauthorized use, dissemination, disclosure and/or copying are prohibited.

This email, attachments and (any part of) its content are (1) intended for the named addressee(s) only, and (2) strictly confidential and proprietary. All rights are reserved by Avantium Holding B.V. and its subsidiaries ('Avantium'). Any unauthorized use, dissemination, disclosure and/or copying is strictly prohibited, except after prior and express written permission by Avantium. Avantium is not responsible for the correct transmission and timely receipt of this email and its content. Should you have received this email, attachments and its content by mistake, please bring this to our attention and destroy this email in full. Thank you. http://www.avantium.com/about/legal-disclaimer/

From owner-chemistry@ccl.net Tue Oct 26 16:28:01 2010
From: "Andreas Klamt klamt~~cosmologic.de"
To: CCL
Subject: CCL: Descriptors
Message-Id: <-43010-101026162652-14602-BtDTh5VEM6DS1zO811ALQw]~[server.ccl.net>
X-Original-From: Andreas Klamt
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8; format=flowed
Date: Tue, 26 Oct 2010 22:26:51 +0200
MIME-Version: 1.0

Sent to CCL by: Andreas Klamt [klamt~~cosmologic.de]
Dear George,

I would like to send a kind of warning: the large number of molecular descriptors that some programs nowadays make easily available also poses a kind of danger. If you have thousands of descriptors available for a property for which you have, let's say, 50 experimental data points, then the chance that some of them correlate purely by accident is quite high. If they correlate accidentally, no statistical method will detect that the correlation is accidental.
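The size of this effect can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the "activity" and the "descriptors" are independent Gaussian random numbers, so any correlation found is accidental by construction, and the function names pearson_r and best_chance_r2 are made up for this example (they are not part of MOE or any QSAR package). It reports the best r^2 found among 5 and among 2000 random descriptors for 50 compounds.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_chance_r2(n_compounds, n_descriptors, rng):
    """Largest r^2 between one random 'activity' vector and many random,
    unrelated 'descriptor' vectors (hypothetical stand-ins for real data)."""
    activity = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
    best = 0.0
    for _ in range(n_descriptors):
        descriptor = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
        best = max(best, pearson_r(descriptor, activity) ** 2)
    return best


rng = random.Random(2010)  # fixed seed so that repeated runs agree
few = best_chance_r2(n_compounds=50, n_descriptors=5, rng=rng)
many = best_chance_r2(n_compounds=50, n_descriptors=2000, rng=rng)
print(f"best r^2 among    5 random descriptors: {few:.3f}")
print(f"best r^2 among 2000 random descriptors: {many:.3f}")
```

With a pool of thousands of candidates, the best single random descriptor typically reaches an r^2 that would look respectable in a QSAR table, although nothing in it carries any information about the property.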
Therefore I strongly recommend that you first decide rationally whether a descriptor can have any reasonable relation to the target property. There are a few criteria which can be used:

If you want to describe a local property of a molecule, for example the reactivity of a certain functional group, do not use global molecular descriptors, because they cannot be the right descriptors. Vice versa, do not use local descriptors for global properties (e.g. a logP).

Do not use orbital descriptors when you want to describe molecular mobility (viscosity, diffusion coefficients, ...).

It is best to use a small set of descriptors which is known to include the relevant information; e.g. for any kind of log partition coefficient you may use either the 5 Abraham descriptors or the 5 COSMO-RS sigma-moments.

...

Blind QSAR based on large numbers of descriptors, selected only by sophisticated statistical methods, will lead to QSAR equations which look significant but most often include no physics. They will fail as soon as you apply them to a novel situation.

Best regards
Andreas

--
PD. Dr. Andreas Klamt
CEO / Geschäftsführer
COSMOlogic GmbH & Co. KG
Burscheider Strasse 515
D-51381 Leverkusen, Germany
phone +49-2171-731681
fax +49-2171-731689
e-mail klamt,cosmologic.de
web www.cosmologic.de
HRA 20653 Amtsgericht Koeln, GF: Dr. Andreas Klamt
Komplementaer: COSMOlogic Verwaltungs GmbH
HRB 49501 Amtsgericht Koeln, GF: Dr. Andreas Klamt