For detailed information, please see Yale NHLBI/Proteomics Center

Algorithms used for analysis of MALDI-MS spectra

All analyses were done within R, which provides a flexible environment for statistical modeling.

0) R statistical analyses software
see http://www.r-project.org/ for general information, and http://cran.r-project.org/ for downloading.

1) LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis)
R package: MASS
function: lda, qda

2) KNN (k-nearest neighbor)
R package: class
function: knn

3) Bagging, boosting classification trees
R package: rpart, tree
function: rpart, tree
Our bagging/boosting programs are based on functions "rpart, tree" from these two packages.

4) SVM (Support Vector Machine)
R package: e1071
function: svm
The underlying C code is from libsvm

5) RF (Random forest)
R package: randomForest
function: randomForest
The underlying Fortran code is from Leo Breiman

6) In the paper, we used two error estimations:
cv-10 (10-fold cross-validation); .632+
These two errors can be readily estimated using the ipred package within R, which provides very convenient wrappers to various statistical methods.

A sample R session for predicting error estimations
Note: all error estimations need to be repeated N times (we used N=100) to generate an error summary.

### start R and load necessary R packages
library(MASS); library(class); library(e1071); library(randomForest); library(rpart); library(tree); library(ipred)
### load the famous Edgar Anderson's Iris Data; replace this data with your actual MALDI-MS dataset
data(iris)
### parameter values setup
cv.k = 10 ## 10-fold cross-validation
B = 100 ## using 100 Bootstrap samples in .632+ error estimation
C.svm = 10 ## Cost parameters for svm, needs to be tuned for different datasets
### setup a random seed for reproducing results later, the seed can be any integer
set.seed(800325)

################## LDA ####################
ip.lda <- function(object, newdata) predict(object, newdata = newdata)$class
## cv-10
errorest(Species ~ ., data=iris, model=lda, estimator="cv", est.para=control.errorest(k=cv.k), predict=ip.lda)$err
## .632+
errorest(Species ~ ., data=iris, model=lda, estimator="632plus", est.para=control.errorest(nboot=B), predict=ip.lda)$err

################## QDA ####################
ip.qda <- function(object, newdata) predict(object, newdata = newdata)$class
## cv-10
errorest(Species ~ ., data=iris, model=qda, estimator="cv", est.para=control.errorest(k=cv.k), predict=ip.qda)$err
## .632+
errorest(Species ~ ., data=iris, model=qda, estimator="632plus", est.para=control.errorest(nboot=B), predict=ip.qda)$err

################## KNN ####################
Note:
at the time of writing this draft, there was an error in the underlying wrapper code for "knn" in package ipred. The error was due to the name conflict of variable "k" used in the wrapper function "ipredknn" and the original function "knn". The underlying R code is "ipredknn <- function(formula, data, subset, na.action, k=5, ...)", and function "knn" also has the same variable "k". We needed to change variable "k" to something else (I used "ik") to avoid conflict.

bwpredict.knn <- function(object, newdata) predict.ipredknn(object, newdata, type="class")
## cv-10
errorest(Species ~ ., data=iris, model=ipredknn, estimator="cv", est.para=control.errorest(k=cv.k), predict=bwpredict.knn, ik=1)$err
errorest(Species ~ ., data=iris, model=ipredknn, estimator="cv", est.para=control.errorest(k=cv.k), predict=bwpredict.knn, ik=3)$err
## .632+
errorest(Species ~ ., data=iris, model=ipredknn, estimator="632plus", est.para=control.errorest(nboot=B), predict=bwpredict.knn, ik=1)$err
errorest(Species ~ ., data=iris, model=ipredknn, estimator="632plus", est.para=control.errorest(nboot=B), predict=bwpredict.knn, ik=3)$err

################## RF #####################
## oob error estimation
randomForest(Species ~ ., data=iris, mtry=2, ntree=B, keep.forest=FALSE)$err.rate[B]
## we can also try cv-10
errorest(Species ~ ., data=iris, model=randomForest, estimator = "cv", est.para=control.errorest(k=cv.k), ntree=B, mtry=2)$err

################## bagging #####################
## For complete flexibility programs can be written to explore a wide range of options.
## Besides implementing bagging using functions "rpart, tree" ourselves, there is a
## convenient function "bagging" in package "ipred" which calls "rpart" for classification.
## The error returned is out-of-bag estimation.
bagging(Species ~ ., data=iris, nbagg=B, control=rpart.control(minsplit=2, cp=0, xval=0), comb=NULL, coob=TRUE, ns=dim(iris)[1], keepX=TRUE)

################## boosting #####################
## To our knowledge, there are no ready-to-use boosting functions implemented within R.
## It's not difficult to program boosting functions using "rpart, tree" following the general algorithm descriptions.

################## SVM #####################
## cv-10
errorest(Species ~ ., data=iris, model=svm, estimator="cv", est.para=control.errorest(k = cv.k), cost=C.svm)$error
## .632+
errorest(Species ~ ., data=iris, model=svm, estimator="632plus", est.para=control.errorest(nboot = B), cost=C.svm)$error