An integrative ChIP-chip and gene expression profiling to model SMAD regulatory modules

BMC Systems Biology

Table 1 Misclassification rates by CART and RF modeling

		Error rate
	Number of independent variables	Class 1	Class 2
Dataset 1: Down/Up		Down	Up
Sample Size		51	65
CART	164	0.41	0.46
RF	164	0.59	0.31
RF + CART	4	0.37	0.23
Dataset 2: Transient/Sustained		Transient	Sustained
Sample Size		23	41
CART	159	0.22	0.68
RF	159	0.86	0.19
RF + CART	3	0.17	0.27

For each dataset, the synexpression group label was the dependent variable and the TFBSs were the independent variables. The CART model was derived using the Gini splitting criterion, equal priors, unit misclassification costs, and 10-fold cross-validation. The best tree was the one with minimum cost. The CART error rates are the cross-validated rates on the test sample. RF was run with stratified sampling and an equal sample size for both classes, set to the number of observations in the smaller class. The RF error rates are out-of-bag error rates averaged over 100 runs of RF, each with 1000 trees. RF + CART denotes a CART model built on the most important variables selected by RF. For both datasets, RF + CART gave the best overall classification: it had the lowest error rate in every class except Sustained in Dataset 2, where RF alone was lower (0.19 versus 0.27) but misclassified 86% of the Transient class.
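
Two ingredients named in the caption, the Gini splitting criterion and stratified sampling with equal per-class sample sizes, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`gini_impurity`, `split_gini`, `balanced_bootstrap`) are hypothetical, and sampling with replacement is an assumption, since the caption does not state how the per-class samples were drawn. Only the class sizes (51 Down, 65 Up) are taken from Table 1.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left, right):
    """Size-weighted Gini impurity of a binary split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / n


def balanced_bootstrap(labels, rng):
    """Draw the same number of indices from every class.

    The per-class sample size is the size of the smallest class.
    Sampling WITH replacement is an assumption of this sketch.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n_min))
    return sample


# Class sizes of Dataset 1 in Table 1: 51 Down, 65 Up.
labels = ["Down"] * 51 + ["Up"] * 65
rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = balanced_bootstrap(labels, rng)
sample_counts = Counter(labels[i] for i in sample)
```

Here `sample_counts` contains 51 draws from each class, so a tree grown on this sample sees both classes equally often even though the full dataset is imbalanced. A full RF run would repeat this draw once per tree and evaluate each tree on the observations it did not see (the out-of-bag samples).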

ISSN: 1752-0509