Re: CCL:cross-validation and prediction with PLS

Dear Dr Winkler

I have noted your comments with interest. I am very curious about your
suggestion that Bayesian NN QSAR methods do not (or may not) require
validation. That cross-validation (CV) may not be needed I can understand,
but NO validation at all? How do you know that your model has any value
without some external criterion? Or have I missed something here?

I have some more questions/comments:-

> There are a whole raft of problems which arise with cross-validation and testing,
> for example:
> 1. What is the best way to choose a validation set (or test set, for that
> matter): leave-one-out, leave-N-out, etc.?
As far as I know one doesn't use CV (LOO/LNO) to choose validation/test sets.
What do you mean here? The only thing I can think of is that CV can be used to help identify training set outliers.

> 2. Should clustering be used to choose a representative test set and, if so,
> what kind of clustering algorithm and do you cluster on X, Y or X/Y?
Are you using the term "clustering" here to cover PCA/PLS score-based
experimental design as well as what is more usually called clustering?

> 3. Effort involved in the cross validation process scales as the square of the
> number of compounds and the square of the number of independent variables,
> which has important implications for large data sets (eg combichem/HTS data)
With the SAMPLS algorithm, fitted and cross-validated PLS run-times are independent of the number of independent variables (P), apart from some overhead for calculating the inter-compound covariance matrix. SAMPLS should be used where P >> M (M = number of compounds). PLS with large numbers of compounds will always be a problem; one's definition of "large" will obviously change as more computational power becomes available ...
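
To make the run-time argument concrete, here is a minimal sketch of a sample-space (kernel) PLS1 fit in Python. It is written in the same spirit as SAMPLS but is NOT the published SAMPLS code; the function names (gram_matrix, kernel_pls1_fit) and the stopping tolerance are assumptions made for illustration. The M x M matrix C = Xc Xc' is built once, at a cost that does depend on P; after that, each latent variable uses only C and y, so the per-LV work does not depend on P.

```python
def gram_matrix(X):
    """Inter-compound matrix C = Xc Xc' from column-centred X (cost O(M^2 P), paid once)."""
    m, p = len(X), len(X[0])
    means = [sum(row[k] for row in X) / m for k in range(p)]
    Xc = [[row[k] - means[k] for k in range(p)] for row in X]
    return [[sum(Xc[i][k] * Xc[j][k] for k in range(p)) for j in range(m)]
            for i in range(m)]


def kernel_pls1_fit(C, y, n_lv, tol=1e-12):
    """Fitted y values and score vectors for PLS1, using only the M x M matrix C.

    tol is an arbitrary, scale-dependent stopping threshold (an assumption).
    """
    m = len(y)
    y_mean = sum(y) / m
    f = [v - y_mean for v in y]          # current y residual
    C = [row[:] for row in C]            # work on a copy; it is deflated below
    fitted = [y_mean] * m
    scores = []
    for _ in range(n_lv):
        # Score vector: t is proportional to X X' f = C f (no P-sized vectors needed)
        t = [sum(C[i][j] * f[j] for j in range(m)) for i in range(m)]
        tt = sum(v * v for v in t)
        if tt < tol:                     # nothing left to extract
            break
        q = sum(a * b for a, b in zip(t, f)) / tt   # regress y residual on t
        fitted = [fitted[i] + q * t[i] for i in range(m)]
        f = [f[i] - q * t[i] for i in range(m)]
        # Deflate: C <- P C P with P = I - t t' / (t' t)
        Ct = [sum(C[i][j] * t[j] for j in range(m)) for i in range(m)]
        tCt = sum(a * b for a, b in zip(t, Ct))
        C = [[C[i][j] - t[i] * Ct[j] / tt - Ct[i] * t[j] / tt
              + t[i] * t[j] * tCt / (tt * tt) for j in range(m)]
             for i in range(m)]
        scores.append(t)
    return fitted, scores
```

For a leave-one-out run one would rebuild (or downdate) C without the left-out compound; that bookkeeping is omitted here to keep the sketch short.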

> 4. Once a cross validation process has been completed, there are N QSAR
> models, all slightly different. Which is the 'best' or 'true' model?
What do you mean by N models? With PLS, CV is used to establish the number of
PLS latent variables (LVs) to use in a final analysis with all observations
included; the regression coefficients from that analysis are then used to make
predictions. Or is your "N" an LV indicator?
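
The procedure I have in mind can be sketched as follows: a minimal, pure-Python illustration, assuming a single y variable (PLS1), a textbook NIPALS fit, leave-one-out CV and selection of the LV count by minimum PRESS. The function names (fit_pls1, predict_pls1, loo_press, select_n_lv) and the toy data are hypothetical, and real packages may apply different selection rules. The leave-one-out models are used only to accumulate PRESS; the model actually reported is the single final fit on all compounds.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pls1(X, y, n_lv, tol=1e-12):
    """NIPALS PLS1 fit; returns (x_mean, y_mean, [(weights, loadings, q), ...])."""
    m, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / m for j in range(p)]
    y_mean = sum(y) / m
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                 # y residuals
    comps = []
    for _ in range(n_lv):
        w = [_dot([E[i][j] for i in range(m)], f) for j in range(p)]
        norm = _dot(w, w) ** 0.5
        if norm < tol:                   # y residual no longer covaries with X
            break
        w = [v / norm for v in w]
        t = [_dot(row, w) for row in E]
        tt = _dot(t, t)
        load = [_dot([E[i][j] for i in range(m)], t) / tt for j in range(p)]
        q = _dot(f, t) / tt
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(m)]
        f = [f[i] - q * t[i] for i in range(m)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps


def predict_pls1(model, x, n_lv=None):
    """Predict y for one compound using the first n_lv components (all if None)."""
    x_mean, y_mean, comps = model
    e = [a - b for a, b in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in comps[:n_lv]:
        t = _dot(e, w)
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat


def loo_press(X, y, max_lv):
    """PRESS for 1..max_lv LVs; the M leave-one-out models are then discarded."""
    press = [0.0] * max_lv
    for i in range(len(X)):
        model = fit_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:], max_lv)
        for k in range(1, max_lv + 1):
            press[k - 1] += (y[i] - predict_pls1(model, X[i], k)) ** 2
    return press


def select_n_lv(press):
    """Number of LVs with the smallest PRESS (one simple rule among several)."""
    return press.index(min(press)) + 1


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    X = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 0.0],
         [4.0, 3.0, 2.5], [5.0, 6.0, 1.0], [6.0, 5.0, 3.0]]
    y = [0.6, 3.9, 1.8, 6.4, 4.7, 8.3]
    press = loo_press(X, y, max_lv=3)
    best = select_n_lv(press)
    final_model = fit_pls1(X, y, best)   # the one model used for prediction
    print(press, best, predict_pls1(final_model, [3.5, 3.5, 1.0]))
```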

I look forward to hearing your comments.


Dave Turner

> My colleague, Frank Burden (Monash University) and I have used Bayesian
> regularized neural nets for QSAR. We find that they overcome virtually all of
> the problems with PLS QSAR models as they give the single statistically best
> model possible for the data set. In addition there are good theoretical reasons
> why they do not require cross validation or test sets. We are investigating this
> for QSAR and preliminary results suggest that this is the case.
> We have published some of this work recently:
> Cheers,
> Dave
