For detailed information, please see Yale NHLBI/Proteomics Center
Relative performance of different discriminant methods applied to the ovarian cancer MS data
The ovarian cancer data used in the paper
The data can be downloaded at http://bioinformatics.med.yale.edu/MSDATA2.Discriminant methods and internal validation
please see our previous Bioinformatics paperExternal validation
The data are partitioned into training+testing set, with testing consisting of 1/3 of the total samples. Different methods are trained on the training data, including preselecting important m/z features, the classifier are then applied to the testing set. The splitting is repeated 50 times and different methods are compared considering preselecting 10,20,30,50,100,300 m/z features.
QDA needs to estimate covariance matrix and hence the number of selected m/z features is limited by the sample size.
LDA may have singularity problems in estimating variance matrix, we also included
the stabilized LDA (SLDA) analysis methods.
This plot summaries the classification error estimations for different methods. The last page compares the best from each method using different number of features, with the numbers on the bottom being the classification error averaged over 50 splits. We can see that the RF indeed has the best performance.
R functions for all the discriminant methods used in the comparison
All analyses were done within R, which provides a flexible environment for statistical modeling.
1) LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis), SLDA
R package: MASS,
ipred
function: lda, qda, slda
2) KNN (k-nearest neighbor)
R package: class
function: knn
3) classification trees, Bagging, boosting
R package:
rpart,
tree,
ipred,
boost
function: rpart, tree, bagging, adaboost
4) SVM (Support Vector Machine)
R package: e1071
function: svm
The underlying C code is from libsvm
5) RF (Random forest)
R package:
randomForest
function: randomForest
The underlying Fortran code is from
Leo Breiman
Inverse-power-law learning curve estimation
This plot shows the parameter estimations. This file contains the numerical values of the parameter estimations. The estimations are very stable.Sample R codes for quantifying the randomness of classification errors
Purely by chance each sample has an equal chance of being classified as a disease or normal sample. The total number of errors E1 for the disease group is a Binom(N1,1/2); and is E2 ~ Binom(N2,1/2) for the normal group (N1=93,N2=77). The pvalue for 25% classification error can be estimated bySample R codes for some of the preprocessing
Assume x is the m/z ratio, and y is the log intensity profile for one sample
## bakground subtraction
yb = y - lowess(x, y, f=1000/length(y))$y
## normalization
## Yb is the matrix combining the background subtracted profile for n subjects: yb1, ..., ybn
## first calculate median intensity profile
ym = apply(Yb, 1, median)
## normalization factor and normalized intensity profile
nf = Yb%*%ym
Ybn = t(t(Yb)/nf)
### peak identification for sample ybni
pk = rep(0, n-9) ## n is the length of intensity profile
for(j in 5:(n-5)){
yL = ybni[(j-5):j]
iL = sum(diff(yL)>0)
yR = ybni[j:(j+5)]
iR = sum(diff(yR)<0)
pk[j-4] = (iL>=4)*(iR>=4)
}