From: d.turner -A_T- sheffield.ac.uk
To: "Dr. Dave Winkler"
Date: Thu, 9 Sep 1999 13:50:33 +0100
Subject: Re: CCL:cross-validation and prediction with PLS
CC: chemistry&$at$&www.ccl.net, qsar_society&$at$&unil.ch, davel(+ at +)chmqst.demon.co.uk

Dear Dr Winkler

I have noted your comments with interest. I am very curious about your
suggestion that Bayesian NN QSAR methods don't (may not) require
validation; that cross-validation (CV) may not be needed I can understand,
but NO validation at all? How do you know that your model has any value
without some external criterion? Or have I missed something here?

I have some more questions/comments:-

> There are a whole raft of problems which arise with cross-validation and testing,

Agree.

> for example:
>
> 1. What is the best way to chose a validation set (or test for that matter) is
> leave-one-out, leave-N-out etc?

As far as I know one doesn't use CV (LOO/LNO) to choose validation/test
sets. What do you mean here? The only thing I can think of is that CV can
be used to help identify training set outliers.

> 2. Should clustering be used to choose a representative test set and, if so,
> what kind of clustering algorithm and do you cluster on X, Y or X/Y?

Are you using the term "clustering" here to cover PCA/PLS score-based
experimental design as well as what is more usually called clustering?

> 3. Effort involved in the cross validation process scales as the square of the
> number of compounds and the square of the number of independent variables,
> which has important implications for large data sets (eg combichem/HTS data)

With the SAMPLS algorithm, fitted/CV PLS run-times are independent of the
number of independent variables (P). (There will be some overhead for
calculating the inter-compound covariance matrix.) SAMPLS should be used
where P>>M (M = number of compounds). PLS with large numbers of compounds
will always be a problem; one's definition of "large" will obviously
change as more computational power becomes available ...

> 4. Once a cross validation process has been completed, there are N QSAR
> models, all slightly different. Which is the 'best' or 'true' model?

What do you mean by N models? With PLS, CV is used to establish the number
of PLS latent variables (LV) to use in a final all-observations-included
analysis, from whose regression coefficients predictions can be made. Or
is your "N" an LV indicator?

I look forward to hearing your comments.

Regards

Dave Turner

> My colleague, Frank Burden (Monash University) and I have used Bayesian
> regularized neural nets for QSAR. We find that they overcome virtually all of
> the problems with PLS QSAR models as they give the single statistically best
> model possible for the data set. In addition there are good theoretical reasons
> why they do not require cross validation or test sets. We are investigating this
> for QSAR and preliminary results suggest that this is the case.
>
> We have published some of this work recently:
>
> [74] New QSAR Methods Applied to Structure-Activity Mapping and
> Combinatorial Chemistry, Burden, F.R. and Winkler, D.A. J. Chem. Inf.
> Comput. Sci. 39, 236 (1999).
> [75] The Computer Simulation of High Throughput Screening of Bioactive
> Molecules, F.R. Burden, D.A. Winkler, in Molecular Modelling and Prediction
> of Bioactivity (K. Gundertofte and F.S. Jorgensen eds), Plenum Press 1998.
> [80] Robust QSAR Models Using Bayesian Regularised Artificial Neural
> Networks, Burden, F.R. and Winkler, D.A. J. Med. Chem., 1999; 42(16);
> 3183-3187 (1999).
> [81] A QSAR Model for the Acute Toxicity of Substituted Benzenes towards
> Tetrahymena Pyriformis using Bayesian Regularized Neural Networks. F R.
> Burden* David A. Winkler, Chem. Res. Toxicol., in press.
> [82] Robust QSAR Models from Novel Descriptors and Bayesian Regularized
> Neural Networks, Winkler, D.A, Burden, F.R. Mol. Simul. 1999 in press.
> [87] Do QSAR Models using Bayesian Regularized Artificial Neural Networks
> Really Need Validation? Winkler, D.A. and Burden, F.R. J.Chem. Inf.
> Comput. Sci in preparation.
>
> Cheers,
>
> Dave
>
> Dr. David A. Winkler                   Email: dave.winkler&$at$&molsci.csiro.au
> Senior Principal Research Scientist    Voice: 61-3-9545-2477
> CSIRO Molecular Science                Fax: 61-3-9545-2446
> Private Bag 10,Clayton South MDC 3169  http://www.csiro.au
> Australia                              http://www.molsci.csiro.au
>

Dr David Turner
Dept of Information Studies, Sheffield University
Sheffield, S10 2TN, UK
Tel. 0114 2 222 650
E-mail: D.Turner;at;sheffield.ac.uk
Fax: 0114 2 780 300
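
Illustrative sketch (not from either correspondent): the reply to point 4
describes using CV only to settle the number of PLS latent variables, then
fitting one final model to all observations. The Python below is a minimal
sketch of that workflow, assuming a single response, a plain NIPALS PLS1,
leave-one-out CV and PRESS as the selection criterion. The function names
(fit_pls1, predict_by_lv, loo_press) and the toy data are hypothetical, and
this is not the SAMPLS algorithm.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_pls1(X, y, n_lv):
    """Fit single-response PLS by NIPALS on mean-centred data."""
    m, k = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / m for j in range(k)]
    y_mean = sum(y) / m
    E = [[row[j] - x_mean[j] for j in range(k)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    comps = []
    for _ in range(n_lv):
        w = [dot([row[j] for row in E], f) for j in range(k)]  # weights ~ X'y
        norm = sqrt(dot(w, w))
        if norm < 1e-12:          # nothing left in y to explain
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                         # scores
        tt = dot(t, t)
        if tt < 1e-12:
            break
        p = [dot([row[j] for row in E], t) / tt for j in range(k)]  # X loadings
        q = dot(f, t) / tt                                     # y loading
        E = [[row[j] - t[i] * p[j] for j in range(k)] for i, row in enumerate(E)]
        f = [f[i] - q * t[i] for i in range(m)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps

def predict_by_lv(model, x, n_lv):
    """Predictions for one compound using 1, 2, ..., n_lv latent variables."""
    x_mean, y_mean, comps = model
    e = [a - b for a, b in zip(x, x_mean)]
    y_hat, out = y_mean, []
    for a in range(n_lv):
        if a < len(comps):        # if fewer LVs were extracted, repeat the last value
            w, p, q = comps[a]
            t = dot(e, w)
            y_hat += q * t
            e = [ei - t * pj for ei, pj in zip(e, p)]
        out.append(y_hat)
    return out

def loo_press(X, y, max_lv):
    """Leave-one-out PRESS for models with 1..max_lv latent variables."""
    press = [0.0] * max_lv
    for i in range(len(X)):
        # One throwaway model per left-out compound; none of these is kept.
        model = fit_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:], max_lv)
        for a, y_hat in enumerate(predict_by_lv(model, X[i], max_lv)):
            press[a] += (y[i] - y_hat) ** 2
    return press

# Hypothetical toy data: 8 compounds, 3 descriptors.
X = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 2.0], [4.0, 3.0, 3.5],
     [5.0, 6.0, 4.0], [6.0, 5.0, 5.5], [7.0, 8.0, 6.0], [8.0, 7.0, 7.5]]
y = [2.1, 2.9, 5.2, 5.8, 8.1, 9.0, 11.2, 11.7]
MAX_LV = 3

press = loo_press(X, y, MAX_LV)
best_lv = press.index(min(press)) + 1     # number of LVs with lowest PRESS
final_model = fit_pls1(X, y, best_lv)     # single model on all observations
print("PRESS by number of LVs:", press)
print("Chosen number of LVs:", best_lv)
```

In this sketch the leave-one-out models exist only to produce PRESS; the one
model kept for prediction is final_model. Taking the minimum PRESS is just one
possible rule for choosing the number of latent variables.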