I have a small dataset (280 rows) with missing values. I used multiple imputation (the mice package, m = 5) to impute it.

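I then applied several regression algorithms (e.g. SVM, rpart, etc.) to each imputed dataset using 10-fold cross-validation. I will use the resulting RMSE (root mean squared error) values to compare the algorithms.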
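To make the setup concrete, here is a minimal sketch of the k-fold RMSE computation for a single completed dataset. The helper name `cv_rmse`, the toy data frame, and the use of `lm` as a stand-in learner are assumptions for illustration only; any fitting function whose result has a `predict` method could be passed instead.

```r
# Hypothetical helper: k-fold cross-validated RMSE for one completed dataset.
# `fit_fun` is any function(formula, data) returning a model with a predict() method.
cv_rmse <- function(formula, data, fit_fun, folds = 10, seed = 7) {
  set.seed(seed)
  n <- nrow(data)
  # Balanced random fold labels 1..folds; valid even when n is not a multiple of folds
  k <- sample(rep(seq_len(folds), length.out = n))
  response <- all.vars(formula)[1]  # name of the response column
  errs <- vapply(seq_len(folds), function(i) {
    fit  <- fit_fun(formula, data = data[k != i, , drop = FALSE])
    pred <- predict(fit, newdata = data[k == i, , drop = FALSE])
    sqrt(mean((data[k == i, response] - pred)^2))  # RMSE on the held-out fold
  }, numeric(1))
  mean(errs)  # average RMSE over the folds
}

# Toy data (made up), with lm standing in for SVM, rpart, etc.
set.seed(1)
toy <- data.frame(a = rnorm(100), b = rnorm(100))
toy$target <- 2 * toy$a - toy$b + rnorm(100, sd = 0.5)
rmse_lm <- cv_rmse(target ~ ., toy, lm)
```

Because the seed is fixed inside the helper, every algorithm passed as `fit_fun` is evaluated on the same folds, which keeps the comparison paired.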

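The thing is, I end up with five mean RMSE values for each algorithm, because the dataset has been imputed five times. My question is: how do I combine the five RMSEs that belong to one algorithm, so that I can compare the algorithms? In other words, I want something like the averaged coefficients. I know the pool() function can do this, but I am not sure whether I can use it for machine learning methods such as SVM and random forest.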
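One simple option I am considering is sketched below: average the five cross-validated RMSEs per algorithm and look at the spread across imputations. This is only a sketch. The numbers are made up, `rmse_by_imp` is a hypothetical name, and this is plain averaging rather than the pooling of model estimates that `pool()` performs.

```r
# Made-up RMSE values: rows = the 5 imputed datasets, columns = algorithms (illustrative only)
rmse_by_imp <- cbind(
  rf    = c(1.92, 1.88, 1.95, 1.90, 1.85),
  svm   = c(2.10, 2.05, 2.12, 2.08, 2.00),
  rpart = c(2.40, 2.35, 2.45, 2.38, 2.32)
)

# One summary RMSE per algorithm: the mean over the m = 5 imputations
pooled_rmse <- colMeans(rmse_by_imp)

# Between-imputation standard deviation: how much the imputation itself moves the RMSE
between_sd <- apply(rmse_by_imp, 2, sd)

# Algorithms ordered by pooled RMSE (lower is better)
ranking <- names(sort(pooled_rmse))
```

If the between-imputation spread is large relative to the gaps between algorithms, the ranking should probably not be trusted on the pooled means alone.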

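One solution I thought of is to stack all the imputed data frames in long format and then apply my algorithms, so that I end up with a single mean RMSE. However, I am worried about overfitting, because the long format may contain duplicate records. Please correct me if I am wrong.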
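The duplicate-record concern can be made concrete with a sketch: if the stacked data were used, one way to keep copies of the same original row out of both the training and the test part would be to assign folds by original row id. The toy data frame below is made up; its `.imp` and `.id` columns are meant to mimic what `complete(imp, action = "long")` returns in mice, and `fold_of_id` is a hypothetical name.

```r
# Toy stacked ("long") data: 3 imputations of 20 original rows (values made up)
set.seed(7)
m <- 3
n <- 20
long <- data.frame(
  .imp   = rep(seq_len(m), each = n),   # imputation number
  .id    = rep(seq_len(n), times = m),  # original row id
  target = rnorm(m * n)
)

folds <- 5
# Assign each ORIGINAL row to one fold, then copy that label to all of its imputed versions
fold_of_id <- sample(rep(seq_len(folds), length.out = n))
long$fold  <- fold_of_id[long$.id]

# Check: for every split, does any original row sit in both the test and the training part?
leak <- vapply(seq_len(folds), function(i) {
  length(intersect(long$.id[long$fold == i], long$.id[long$fold != i])) > 0
}, logical(1))
```

With folds assigned per row of the stacked data instead, near-identical copies of a record could land on both sides of a split, which is the optimistic bias I am worried about.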

Thanks a lot, I hope you can help me.

Here is my code.

library(mice)
library(randomForest)

x <- data
form <- target ~ .   # model formula; 'target' is the response column
fold <- 10           # number of folds for cross-validation

imp <- mice(x, method = "pmm", m = 5) # imputation using mice pmm (5 imputed datasets)

impSetsVector <- list() # will hold the 5 imputed sets
for (i in seq(5)) {
  impSetsVector[[i]] <- complete(imp, action = i, include = FALSE)
}

## Next I applied randomForest with 10-fold cross-validation to each imputed set
## and computed the RMSE for each dataset

avg.rmse <- matrix(data = NA, nrow = 5, ncol = 1) # mean RMSE for each imputed dataset
y <- all.vars(form)[1] # name of the response variable

for (j in seq(5)) { # as we have 5 imputed datasets
  x <- impSetsVector[[j]] # x holds the j-th imputed dataset
  n <- nrow(x)
  set.seed(7) # same fold assignment for every imputed dataset
  k <- sample(rep(seq_len(fold), length.out = n)) # fold label (1..fold) for each row
  vec.error <- numeric(fold)
  ## start modelling with 10-fold cross-validation
  for (i in seq(fold)) {
    # Fit randomForest on the training folds
    fit <- randomForest(form, data = x[k != i, ], ntree = 500, keep.forest = TRUE, importance = TRUE, na.action = na.omit)

    fcast <- predict(fit, newdata = x[k == i, ]) # predict on the held-out fold
    rmse <- sqrt(mean((x[k == i, y] - fcast)^2))
    vec.error[i] <- rmse # RMSE for the held-out fold
  } # end of the inner loop

  avg.rmse[j] <- mean(vec.error) # mean of the 10 fold RMSEs
} # end of the outer loop