Re: CCL:cross-validation and prediction with PLS

Dear Dr Winkler

I have noted your comments with interest. I am very curious about your
suggestion that Bayesian NN QSAR methods do not (or may not) require
validation. That cross-validation (CV) may not be needed I can understand,
but NO validation at all? How do you know that your model has any value
without some external criterion? Or have I missed something here?

I have some more questions/comments:-

> There are a whole raft of problems which arise with cross-validation and testing,
> for example:
> 1. What is the best way to choose a validation set (or test set, for that
> matter): leave-one-out, leave-N-out, etc.?
As far as I know one doesn't use CV (LOO/LNO) to choose validation/test sets.
What do you mean here? The only thing I can think of is that CV can be used to help identify training set outliers.

> 2. Should clustering be used to choose a representative test set and, if so,
> what kind of clustering algorithm, and do you cluster on X, Y or X/Y?
Are you using the term "clustering" here to cover PCA/PLS score-based
experimental design as well as what is more usually called clustering?

> 3. Effort involved in the cross validation process scales as the square of the
> number of compounds and the square of the number of independent variables,
> which has important implications for large data sets (eg combichem/HTS data)
With the SAMPLS algorithm, fitted/CV PLS run-times are independent of the
number of independent variables (P), apart from some overhead for calculating
the inter-compound covariance matrix. SAMPLS should be used where P >> M
(M = number of compounds). PLS with large numbers of compounds will always be
a problem; one's definition of "large" will obviously change as more
computational power becomes available ...
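
To illustrate the point about run-times, here is a minimal sketch of a
single-response PLS fit that works only from the M x M inter-compound
matrix C = Xc Xc' (Xc being the column-centred descriptor matrix). It is
written in the spirit of SAMPLS but is not the published SAMPLS code; the
function names kernel_pls1_fit and nipals_pls1_fit are hypothetical, and
the NIPALS routine is included only as a reference to compare against.

```python
import numpy as np


def kernel_pls1_fit(C, y, n_lv):
    """Fitted y values from PLS1 using only the M x M matrix C = Xc @ Xc.T.

    Once C has been built (an O(M^2 * P) step done once), each latent
    variable costs O(M^2), independent of the number of descriptors P.
    """
    y_mean = y.mean()
    resid = y - y_mean                      # centred response residual
    fitted = np.full(len(y), y_mean, dtype=float)
    scores = []
    for _ in range(n_lv):
        t = C @ resid                       # score vector, no X needed
        for s in scores:                    # orthogonalise to earlier scores
            t = t - s * (s @ t) / (s @ s)
        b = (t @ resid) / (t @ t)           # regress residual on the score
        fitted = fitted + b * t
        resid = resid - b * t
        scores.append(t)
    return fitted


def nipals_pls1_fit(X, y, n_lv):
    """Reference PLS1 fit that deflates the full M x P matrix each step."""
    Xh = X - X.mean(axis=0)
    y_mean = y.mean()
    resid = y - y_mean
    fitted = np.full(len(y), y_mean, dtype=float)
    for _ in range(n_lv):
        w = Xh.T @ resid                    # weight vector (length P)
        t = Xh @ w                          # score vector (length M)
        p = Xh.T @ t / (t @ t)              # X loading
        b = (t @ resid) / (t @ t)
        fitted = fitted + b * t
        resid = resid - b * t
        Xh = Xh - np.outer(t, p)            # deflate X
    return fitted
```

With, say, M = 20 compounds and P = 500 descriptors, the two routines
should give the same fitted values to within rounding, while only the
second touches the P-dimensional descriptor space inside the loop.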

> 4. Once a cross validation process has been completed, there are N QSAR
> models, all slightly different. Which is the 'best' or 'true' model?
What do you mean by N models? With PLS, CV is used to establish the number of
PLS latent variables (LV) to use in a final all-observations-included analysis,
and predictions are made from the regression coefficients of that final
analysis. Or is your "N" an LV indicator?

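The procedure described above (CV to pick the number of LVs, then one final
fit on all observations) can be sketched as follows. This is an
illustrative NumPy-only version using leave-one-out and the minimum PRESS as
the selection rule, which is an assumption; other rules are in common use.
The function names are hypothetical.

```python
import numpy as np


def pls1_coefficients(X, y, n_lv):
    """NIPALS PLS1. Returns (x_mean, y_mean, beta) such that
    a prediction is y_mean + (x - x_mean) @ beta."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xh, resid = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xh.T @ resid
        w = w / np.linalg.norm(w)           # unit-length weight vector
        t = Xh @ w                          # scores
        tt = t @ t
        p = Xh.T @ t / tt                   # X loadings
        c = (t @ resid) / tt                # y loading
        Xh = Xh - np.outer(t, p)            # deflate X
        resid = resid - c * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)  # coefficients in original X space
    return x_mean, y_mean, beta


def loo_press(X, y, n_lv):
    """Leave-one-out predictive residual sum of squares for n_lv LVs."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        x_mean, y_mean, beta = pls1_coefficients(X[keep], y[keep], n_lv)
        pred = y_mean + (X[i] - x_mean) @ beta
        press += (y[i] - pred) ** 2
    return press


def choose_lv_and_refit(X, y, max_lv):
    """Pick the LV count with the lowest LOO PRESS, then refit on all data.

    The leave-one-out sub-models are used only to score each LV count and
    are then discarded; the returned coefficients come from the single
    all-observations fit.
    """
    press = [loo_press(X, y, a) for a in range(1, max_lv + 1)]
    best = int(np.argmin(press)) + 1
    return best, press, pls1_coefficients(X, y, best)
```

In this sketch the only model kept for prediction is the one returned by
the final call to pls1_coefficients on the full data set.
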
I look forward to hearing your comments


Dave Turner

> My colleague, Frank Burden (Monash University) and I have used Bayesian
> regularized neural nets for QSAR. We find that they overcome virtually all of
> the problems with PLS QSAR models as they give the single statistically best
> model possible for the data set. In addition there are good theoretical reasons
> why they do not require cross validation or test sets. We are investigating this
> for QSAR and preliminary results suggest that this is the case.
> We have published some of this work recently:
> [74] New QSAR Methods Applied to Structure-Activity Mapping and
> Combinatorial Chemistry, Burden, F.R. and Winkler, D.A. J. Chem. Inf.
> Comput. Sci. 39, 236 (1999).
> [75] The Computer Simulation of High Throughput Screening of Bioactive
> Molecules, F.R. Burden, D.A. Winkler, in Molecular Modelling and Prediction
> of Bioactivity (K. Gundertofte and F.S. Jorgensen eds), Plenum Press 1998.
> [80] Robust QSAR Models Using Bayesian Regularised Artificial Neural
> Networks, Burden, F.R. and Winkler, D.A. J. Med. Chem. 42(16), 3183-3187
> (1999).
> [81] A QSAR Model for the Acute Toxicity of Substituted Benzenes towards
> Tetrahymena Pyriformis using Bayesian Regularized Neural Networks,
> Burden, F.R. and Winkler, D.A. Chem. Res. Toxicol., in press.
> [82] Robust QSAR Models from Novel Descriptors and Bayesian Regularized
> Neural Networks, Winkler, D.A, Burden, F.R. Mol. Simul. 1999 in press.
> [87] Do QSAR Models using Bayesian Regularized Artificial Neural Networks
> Really Need Validation? Winkler, D.A. and Burden, F.R. J. Chem. Inf.
> Comput. Sci., in preparation.
> Cheers,
> Dave
> Dr. David A. Winkler Email: dave.winkler -AatT-
> Senior Principal Research Scientist Voice: 61-3-9545-2477
> CSIRO Molecular Science Fax: 61-3-9545-2446
> Private Bag 10,Clayton South MDC 3169
> Australia

 Dr David Turner
 Dept of Information Studies, Sheffield University
 Sheffield, S10 2TN, UK        Tel. 0114 2 222 650
 E-mail: D.Turner -AatT-
 Fax: 0114 2 780 300