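I have been using 10x10-fold (repeated) cross-validation for most of my modeling work, and was hoping to simplify my life by letting caret do it for me.
However, when I try to run repeated CV in caret, the results look strange.
Most notably, the Kappa value is far from what I expected.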
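- caret repeatedcv kappa = 0.0308791
- home-built repeated CV kappa = 0.4137178
(To be fair, the home-built version also uses some function calls from caret, but the cross-validation is done explicitly rather than embedded in the caret train call.)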
That is a big difference.
Any ideas about what is going on here?
The dataset is located here.
# --- Begin caret cv test ---
library(caret)
dataset <- read.csv("Sample Data.csv")
my_control <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10,
  savePredictions = "final",
  classProbs = TRUE
)
dataset$Temp <- "Yes"
dataset$Temp[which(dataset$Dep.Var=="0")] <- "No"
dataset$Temp <- as.factor(dataset$Temp)
my.formula <- as.formula("Temp ~ Param.F + Param.C")
testmodel <- train(my.formula, data = dataset,
                   method = "glm",
                   trControl = my_control,
                   metric = "Kappa")
# --- End of caret cv test ---
# --- will reference the model "testmodel" later to show comparison
# --- with home built version
# --- Now for the home built version: ---
library(foreach)
out <- foreach(i = 1:10, .combine = rbind, .inorder = FALSE) %do% {
  folds <- caret::createFolds(dataset$Temp, k = 10, list = FALSE)
  part.out <- foreach(j = 1:10, .combine = rbind, .inorder = FALSE) %do% {
    deve <- dataset[folds != j, ]
    test <- dataset[folds == j, ]
    temp_model <- glm(my.formula, data = deve, family = binomial(link = 'logit'))
    pred <- predict(temp_model, newdata = test, type = "response")
    data.frame(y = test$Dep.Var, prob = pred)
  }
  part.out
}
c.kappa <- foreach(i = 1:1000, .combine = rbind) %do% {
  pred2 <- as.factor((out$prob > (quantile(out$prob, i/1000))) * 1)
  c(quantile(out$prob, i/1000), confusionMatrix(pred2, out$y)$overall[2])
}
pred2 <- as.factor((out$prob>c.kappa[which.max(c.kappa[,2]),1])*1)
# --- End of home built version ---
# --- Now to see the results of each: ---
# --- Home Built ---
caret::confusionMatrix(pred2, out$y)$overall[2]
# --- Caret Repeated CV ---
testmodel$results[3]
1 Answer
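You did not set the seed anywhere, so there is no way to confirm the resampling results.
If you set the seed before running train, you can reuse the same resampling indices by referencing the control object. Then use in_model[[j]] to get the data used for modeling; -in_model[[j]] gives you the negative integers to get the same holdout set.
Also, it looks like you are assigning the predictions incorrectly. You probably want to use something like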
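A minimal sketch of how these two points might fit together. It assumes a 0.5 probability cutoff for turning glm probabilities into "Yes"/"No" labels (which, as far as I know, is what caret's glm method does), and it uses a simulated stand-in for the data because "Sample Data.csv" is not reproduced here. The seed values and the simulated dataset are illustrative assumptions; in_model is read from testmodel$control$index, which holds the training-row indices for each resample.

```r
library(caret)

# Hypothetical stand-in for "Sample Data.csv" (simulated, not the asker's data)
set.seed(1)
n <- 200
dataset <- data.frame(Param.F = rnorm(n), Param.C = rnorm(n))
dataset$Dep.Var <- rbinom(n, 1, plogis(dataset$Param.F - dataset$Param.C))
dataset$Temp <- factor(ifelse(dataset$Dep.Var == 0, "No", "Yes"))

my.formula <- as.formula("Temp ~ Param.F + Param.C")
my_control <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                           savePredictions = "final", classProbs = TRUE)

# Set the seed before train() so the resamples are reproducible
set.seed(123)
testmodel <- train(my.formula, data = dataset, method = "glm",
                   trControl = my_control, metric = "Kappa")

# Training-row indices for every fold of every repeat (10 x 10 = 100 entries)
in_model <- testmodel$control$index

# Refit by hand on exactly the same resamples
kappas <- vapply(seq_along(in_model), function(j) {
  deve <- dataset[in_model[[j]], ]    # rows used for modeling
  test <- dataset[-in_model[[j]], ]   # matching holdout rows
  fit  <- glm(my.formula, data = deve, family = binomial(link = "logit"))
  prob <- predict(fit, newdata = test, type = "response")  # P(Temp == "Yes")
  # Assumed 0.5 cutoff to assign class labels
  pred <- factor(ifelse(prob >= 0.5, "Yes", "No"), levels = levels(dataset$Temp))
  confusionMatrix(pred, test$Temp)$overall[["Kappa"]]
}, numeric(1))

mean(kappas)             # manual repeated-CV Kappa, averaged over resamples
testmodel$results$Kappa  # caret's resampled Kappa, for comparison
```

Because the manual loop reuses caret's own resamples and the same cutoff, the two numbers should land close to each other. The home-built version in the question does something different: it pools all holdout predictions and then picks the cutoff that maximizes Kappa, which would be expected to produce a noticeably higher figure than a fixed 0.5 cutoff.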
Max