
Caret repeated CV Kappa doesn't match home-coded foreach repeated CV Kappa


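I have been using 10x10-fold (10 repeats of 10-fold) cross-validation for most of my modeling work, and was hoping to simplify my life by letting caret do it for me.

However, when I run repeated CV in caret, the results look odd.

Most notably, the Kappa value is far from what I expected.

  • caret repeatedcv kappa = 0.0308791

  • home-built repeated CV kappa = 0.4137178

(To be fair, the home-built version also uses some function calls from caret, but the cross-validation is done explicitly rather than embedded in the caret train call.)

That is a big difference.

Any thoughts on what is going on here?

The dataset is available here.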

# --- Begin caret cv test ---

library(caret)

dataset <- read.csv("Sample Data.csv")

my_control <- trainControl(
  method="repeatedcv",
  number=10,
  repeats = 10,
  savePredictions="final",
  classProbs=TRUE
)

dataset$Temp <- "Yes"
dataset$Temp[which(dataset$Dep.Var=="0")] <- "No"
dataset$Temp <- as.factor(dataset$Temp)

my.formula <- as.formula("Temp ~ Param.F + Param.C")

testmodel <- train(my.formula, data = dataset,
               method = "glm",
               trControl = my_control,
               metric = "Kappa")

# --- End of caret cv test ---
# --- will reference the model "testmodel" later to show comparison
# --- with home built version

# --- Now for the home built version: ---

library(foreach)

out <- foreach(i = 1:10, .combine = rbind, .inorder = FALSE) %do% {
  folds <- caret::createFolds(dataset$Temp, k = 10, list = FALSE)

  part.out <- foreach(j = 1:10, .combine = rbind, .inorder = FALSE) %do% {
    deve <- dataset[folds != j, ]
    test <- dataset[folds == j, ]

    temp_model <- glm(my.formula, data=deve, family=binomial(link='logit'))
    pred <- predict(temp_model,newdata=test,type="response")
    data.frame(y = test$Dep.Var, prob = pred)
  }
  part.out
}

c.kappa <- foreach (i = 1:1000, .combine = rbind) %do% {
  pred2 <- as.factor((out$prob>(quantile(out$prob, i/1000)))*1)
  c(quantile(out$prob, i/1000), confusionMatrix(pred2, out$y)$overall[2])
}

pred2 <- as.factor((out$prob>c.kappa[which.max(c.kappa[,2]),1])*1)

# --- End of home built version ---

# --- Now to see the results of each: ---

# --- Home Built ---
caret::confusionMatrix(pred2, out$y)$overall[2]

# --- Caret Repeated CV ---
testmodel$results[3]

1 Answer

  • 2

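    You never set the seed anywhere, so there is no way to reproduce or verify the resampling results.

    If you set the seed before running train, you can reuse the same resampling indices by referencing the control object: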

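    # Regex suffixes "Rep01$" ... "Rep10$" that match the index names of each repeat
    suffix <- paste0("Rep", gsub(" ", "0", format(1:10)), "$")
    out <- foreach(i = 1:10, .combine = rbind, .inorder = FALSE) %do% {
        # Training-row indices of the folds that belong to repeat i
        in_model <- testmodel$control$index[grepl(suffix[i], names(testmodel$control$index))]
        # ... inner loop over j as before, using in_model[[j]] ...
    }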
    

    Then use in_model[[j]] to get the rows used for modeling; -in_model[[j]] gives you negative integers that select the same holdout set.
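The index-reuse idea can be sketched end to end as follows. This is only an illustration under stated assumptions: the data frame is simulated as a stand-in for Sample Data.csv (which is not reproduced here), both seed values are arbitrary, a 0.5 probability cutoff is assumed for assigning classes, and the loop runs over all 100 stored index sets at once instead of grouping them by repeat.

```r
library(caret)

# Simulated stand-in for the question's data (assumption, not the real file)
set.seed(1)
n <- 300
dataset <- data.frame(Param.F = rnorm(n), Param.C = rnorm(n))
signal <- dataset$Param.F + 0.5 * dataset$Param.C + rnorm(n)
dataset$Temp <- factor(ifelse(signal > 0, "Yes", "No"), levels = c("No", "Yes"))

my.formula <- Temp ~ Param.F + Param.C
my_control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Setting the seed right before train() makes the resampling reproducible
set.seed(123)
testmodel <- train(my.formula, data = dataset, method = "glm",
                   trControl = my_control, metric = "Kappa")

# Refit by hand on exactly the training rows caret used for each resample
manual_kappa <- sapply(testmodel$control$index, function(in_rows) {
  deve <- dataset[in_rows, ]    # rows used for modeling
  test <- dataset[-in_rows, ]   # the matching holdout rows
  fit  <- glm(my.formula, data = deve, family = binomial(link = "logit"))
  prob <- predict(fit, newdata = test, type = "response")
  pred <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = levels(dataset$Temp))
  confusionMatrix(pred, test$Temp)$overall[["Kappa"]]
})

# Average the per-resample Kappa values and compare with caret's summary
mean(manual_kappa)
testmodel$results$Kappa
```

Because both paths now share the same folds, the same model, and the same cutoff, the two printed values should agree closely. Any remaining gap would point to a difference in how predictions are labeled or summarized.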

    Also, it looks like you are assigning the predictions incorrectly. You probably want something like:

    pred <- predict(temp_model,newdata=test,type="response")
    pred <- factor(ifelse(pred > .5, "Yes", "No"))
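To make the class-assignment step concrete, here is a small base-R sketch that labels probabilities at a 0.5 cutoff and computes Cohen's kappa by hand. The probabilities and observed labels are made-up toy values, and cohen_kappa is a hypothetical helper, not a caret function. It is meant to mirror the statistic that caret reports as the Kappa entry of confusionMatrix(...)$overall.

```r
# Toy predicted probabilities of "Yes" and observed classes (illustrative values)
prob <- c(0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.7, 0.4)
obs  <- factor(c("Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes"),
               levels = c("No", "Yes"))

# Fix the levels explicitly so both classes exist even if every
# prediction happens to fall on one side of the cutoff
pred <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = levels(obs))

# Hypothetical helper: unweighted Cohen's kappa from two factors
# that share the same levels
cohen_kappa <- function(pred, obs) {
  tab <- table(pred, obs)
  n   <- sum(tab)
  po  <- sum(diag(tab)) / n                       # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

kappa_val <- cohen_kappa(pred, obs)
kappa_val  # 6 of 8 agree, chance agreement is 0.5, so kappa is 0.5
```

Comparing labels that share one set of factor levels matters here: a 0/1 prediction compared against a "Yes"/"No" outcome, or against a non-factor column, would not line up in the table.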
    

    Max
