VALIDATION OF CLASSIFICATION MODELS AND DATA REDUCTION METHODS BASED ON GENE EXPRESSION DATA
Background
Microarray technology enables simultaneous monitoring of the expression levels of thousands of genes. Analyzing the resulting high-dimensional datasets remains a central problem in bioinformatics. Classification methods from data mining, machine learning and regression have been widely applied to distinguish normal from abnormal samples in gene expression datasets.
Method
In this study, the classification accuracy of the support vector machine (SVM), least squares support vector machine (LSSVM), radial basis function neural network (RBFNN), Bayesian probit kernel regression (BPKR) and Bayesian logistic kernel regression (BLKR) models on normal and abnormal samples was calculated for two gene expression datasets and for three reduced-dimension sets derived by multivariate median gene set analysis (MMGSA), PCA with Karhunen-Loève transform (PCA-KL) and auto-encoder networks.
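The comparison described above can be illustrated with a minimal sketch. It is not the study's pipeline: it assumes scikit-learn is available, uses a small synthetic matrix in place of the two gene expression datasets, uses scikit-learn's standard PCA as a stand-in for PCA-KL, and covers only the SVM model with linear and Gaussian (RBF) kernels. The 5-fold cross-validation, the number of components and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for an expression matrix: 60 samples x 500 genes.
n_samples, n_genes = 60, 500
X = rng.normal(size=(n_samples, n_genes))
y = np.repeat([0, 1], n_samples // 2)  # 0 = normal, 1 = abnormal
X[y == 1, :20] += 1.5  # shift 20 hypothetical "informative" genes

def cv_accuracy(model, X, y, folds=5):
    """Mean accuracy over stratified k-fold cross-validation."""
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()

results = {}
for kernel in ("linear", "rbf"):  # "rbf" is the Gaussian kernel
    # Full data: scale each gene, then classify.
    full = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    # Reduced data: PCA is fitted inside each training fold, so no
    # information from the held-out samples leaks into the projection.
    reduced = make_pipeline(
        StandardScaler(),
        PCA(n_components=10, random_state=0),
        SVC(kernel=kernel),
    )
    results[(kernel, "full")] = cv_accuracy(full, X, y)
    results[(kernel, "pca")] = cv_accuracy(reduced, X, y)

for (kernel, data), acc in results.items():
    print(f"SVM kernel={kernel:6s} data={data:4s} accuracy={acc:.2f}")
```

Placing the reduction step inside the pipeline is a design choice of this sketch: it keeps the accuracy estimate for the reduced data comparable to that for the full data, since both are computed on samples the model has not seen.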
Results
The BPKR method achieves high accuracy (up to 94%) on the full and PCA-KL data with Gaussian and linear kernels, 83% on the encoder data with a Gaussian kernel, and 92% on the MMGSA data with a linear kernel. The SVM method achieves accuracy of up to 94% on the full, PCA-KL and MMGSA data. The LSSVM method performs acceptably on the full and MMGSA data. In MMGSA data, the highest accuracy is 85%, obtained by the SVM method and by the BPKR method with a Gaussian kernel.
Conclusion
MMGSA or other gene set analysis approaches are recommended for data reduction (if needed) because they improve the interpretability of the results, and the BPKR and SVM methods are recommended for classification.
Keywords: data mining, data reduction, gene expression.