For detailed information, please see Yale NHLBI/Proteomics Center

Relative performance of different discriminant methods applied to the ovarian cancer MS data

The ovarian cancer data used in the paper

The data can be downloaded at http://bioinformatics.med.yale.edu/MSDATA2.

Discriminant methods and internal validation

please see our previous Bioinformatics paper

External validation

The data are partitioned into training+testing set, with testing consisting of 1/3 of the total samples. Different methods are trained on the training data, including preselecting important m/z features, the classifier are then applied to the testing set. The splitting is repeated 50 times and different methods are compared considering preselecting 10,20,30,50,100,300 m/z features.

QDA needs to estimate covariance matrix and hence the number of selected m/z features is limited by the sample size.
LDA may have singularity problems in estimating variance matrix, we also included the stabilized LDA (SLDA) analysis methods.

This plot summaries the classification error estimations for different methods. The last page compares the best from each method using different number of features, with the numbers on the bottom being the classification error averaged over 50 splits. We can see that the RF indeed has the best performance.

R functions for all the discriminant methods used in the comparison

All analyses were done within R, which provides a flexible environment for statistical modeling.

1) LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis), SLDA
R package: MASS, ipred
function: lda, qda, slda

2) KNN (k-nearest neighbor)
R package: class
function: knn

3) classification trees, Bagging, boosting
R package: rpart, tree, ipred, boost
function: rpart, tree, bagging, adaboost

4) SVM (Support Vector Machine)
R package: e1071
function: svm
The underlying C code is from libsvm

5) RF (Random forest)
R package: randomForest
function: randomForest
The underlying Fortran code is from Leo Breiman

Inverse-power-law learning curve estimation

This plot shows the parameter estimations. This file contains the numerical values of the parameter estimations. The estimations are very stable.

Sample R codes for quantifying the randomness of classification errors

Purely by chance each sample has an equal chance of being classified as a disease or normal sample. The total number of errors E1 for the disease group is a Binom(N1,1/2); and is E2 ~ Binom(N2,1/2) for the normal group (N1=93,N2=77). The pvalue for 25% classification error can be estimated by
(1) Pr((E1/N1+E2/N2)/2<=0.25) for balanced error
(2) Pr((E1+E2)/(N1+N2)<=0.25) for counting error
For (1), the pvalue is the convolution of two binomial probabilities; for (2), it is just the tail probabilities of binomial distributions since E1+E2 ~ Binom(N1+N2,1/2).
R codes for the calculations:
n1 = 93; n2 = 77
pr = rep(0,47)
for(i in 0:46){
e2 = as.integer(n2*(0.5i/n1))
pr[i+1] = pbinom(i,n1,1/2)*sum(pbinom(0:e2,n2,1/2))
}
sum(pr)
## 3.4e-11; highly significant
pbinom(as.integer(0.25*(n1+n2)),n1+n2,1/2)
## 1.3e-11
The difference reflects the effects of unbalanced sample size (which is not dramatic for this dataset).
As an extreme example
n1 = 100; n2 = 10
pr = rep(0,51)
for(i in 0:50){
e2 = as.integer(n2*(0.5i/n1))
pr[i+1] = pbinom(i,n1,1/2)*sum(pbinom(0:e2,n2,1/2))
}
sum(pr)
## 0.003
pbinom(as.integer(0.25*(n1+n2)),n1+n2,1/2)
## 4.2e-8

Sample R codes for some of the preprocessing

Assume x is the m/z ratio, and y is the log intensity profile for one sample

## bakground subtraction
yb = y - lowess(x, y, f=1000/length(y))$y
## normalization
## Yb is the matrix combining the background subtracted profile for n subjects: yb1, ..., ybn
## first calculate median intensity profile
ym = apply(Yb, 1, median)
## normalization factor and normalized intensity profile
nf = Yb%*%ym
Ybn = t(t(Yb)/nf)
### peak identification for sample ybni
pk = rep(0, n-9) ## n is the length of intensity profile
for(j in 5:(n-5)){
yL = ybni[(j-5):j]
iL = sum(diff(yL)>0)
yR = ybni[j:(j+5)]
iR = sum(diff(yR)<0)
pk[j-4] = (iL>=4)*(iR>=4)
}