
Obtaining predictions on the CV test-fold partitions with the R caret package?


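I am using caret to find and compare predictions from multiple models. I first partition my data into 5 cross-validation folds, and then use 10-fold CV within each of the 5 training data sets to select the best model parameters.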

Example code for a single glmnet model on a small (n = 400) test data set:

library(caret)

# Load data & factor admit variable.
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata$admit <- as.factor(mydata$admit)

# Create levels yes/no to make sure the classProbs columns get valid names.
# Note: this maps the original 0 to "yes" and 1 to "no".
levels(mydata$admit) <- c("yes", "no")

# Partition data into 5 folds.
set.seed(123)
folds <- createFolds(mydata$admit, k = 5)

# Train elastic net logistic regression via 10-fold CV on each of 5 training folds using index argument.
set.seed(123)
train_control <- trainControl(method = "cv",
                              number = 10,
                              index = folds,
                              classProbs = TRUE,
                              savePredictions = TRUE)

glmnetGrid <- expand.grid(alpha = c(0, .5, 1), lambda = c(.1, 1, 10))
model <- train(admit ~ .,
               data = mydata,
               trControl = train_control,
               method = "glmnet",
               family = "binomial",
               tuneGrid = glmnetGrid,
               metric = "Accuracy",
               preProcess = c("center", "scale"))

> model
glmnet 

400 samples
  3 predictor
  2 classes: 'yes', 'no' 

Pre-processing: centered (3), scaled (3) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 79, 80, 80, 81, 80 
Resampling results across tuning parameters:

  alpha  lambda  Accuracy      Kappa          Accuracy SD     Kappa SD     
  0.0     0.1    0.6918972780  0.08970669720  0.016425551472  0.08416581606
  0.0     1.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  0.0    10.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  0.5     0.1    0.6818893800  0.04127002380  0.008252409699  0.04052581228
  0.5     1.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  0.5    10.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  1.0     0.1    0.6800085023  0.02149826881  0.005876570847  0.04807159045
  1.0     1.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000
  1.0    10.0    0.6825007141  0.00000000000  0.001368477994  0.00000000000

Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were alpha = 0 and lambda = 0.1. 
> summary(model$pred)
  pred        obs          rowIndex           yes                  no                 alpha         lambda       Resample        
 yes:14192   yes:9828   Min.   :  1.00   Min.   :0.2650250   Min.   :0.03333769   Min.   :0.0   Min.   : 0.1   Length:14400      
 no :  208   no :4572   1st Qu.:100.75   1st Qu.:0.6750000   1st Qu.:0.31250000   1st Qu.:0.0   1st Qu.: 0.1   Class :character  
                        Median :200.50   Median :0.6835443   Median :0.31645570   Median :0.5   Median : 1.0   Mode  :character  
                        Mean   :200.50   Mean   :0.6840322   Mean   :0.31596777   Mean   :0.5   Mean   : 3.7                     
                        3rd Qu.:300.25   3rd Qu.:0.6875000   3rd Qu.:0.32500000   3rd Qu.:1.0   3rd Qu.:10.0                     
                        Max.   :400.00   Max.   :0.9666623   Max.   :0.73497501   Max.   :1.0   Max.   :10.0

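Question: Does caret's syntax allow me to obtain, for each of the 5 training-fold partitions, the test-fold predictions from the corresponding best-fit model?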

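As it stands, model$pred returns 14,400 predictions, and the best-fit model is chosen across the entire data set. I would like model$pred to return n = 5 x 80 = 400 predictions, one set for each of 5 separate models best fit to each training fold.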

1 Answer

    You only need to set savePredictions = "final". This should restrict the output to the predictions you need.
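
    A minimal sketch of that suggestion follows; it is not the asker's exact setup. It assumes the caret and glmnet packages are installed. The data frame `sim` is a hypothetical simulated stand-in with the same shape as the question's data (400 rows, 3 predictors), so the example runs without downloading binary.csv. It also passes returnTrain = TRUE to createFolds, on the assumption that the intent is to train on four fifths of the data and predict on the remaining fifth. Note that savePredictions = "final" keeps the held-out predictions for the single tuning combination selected across all folds, not for a separately tuned model per fold.

```r
library(caret)  # the "glmnet" method also needs the glmnet package installed

# Hypothetical stand-in data (NOT the original binary.csv): 400 rows,
# 3 predictors, and a two-level outcome with valid R names as levels.
set.seed(123)
n <- 400
sim <- data.frame(gre  = rnorm(n, 580, 115),
                  gpa  = rnorm(n, 3.4, 0.4),
                  rank = sample(1:4, n, replace = TRUE))
p <- plogis(-3.5 + 0.002 * sim$gre + 0.8 * sim$gpa - 0.5 * sim$rank)
sim$admit <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("yes", "no"))

# returnTrain = TRUE makes each list element the TRAINING rows of a fold,
# which is what trainControl(index = ...) expects.
set.seed(123)
folds <- createFolds(sim$admit, k = 5, returnTrain = TRUE)

ctrl <- trainControl(method = "cv",
                     index = folds,
                     classProbs = TRUE,
                     savePredictions = "final")  # keep only the selected tuning values

grid <- expand.grid(alpha = c(0, 0.5, 1), lambda = c(0.1, 1, 10))

fit <- train(admit ~ ., data = sim,
             method = "glmnet",
             family = "binomial",
             trControl = ctrl,
             tuneGrid = grid,
             metric = "Accuracy",
             preProcess = c("center", "scale"))

nrow(fit$pred)            # number of saved held-out predictions
table(fit$pred$Resample)  # how many of them came from each fold
```

    With the question's original settings, createFolds returned the held-out rows (its default), so each model was trained on about 80 rows and predicted on about 320. That matches the reported sample sizes (79, 80, 80, 81, 80) and explains the row count: 9 tuning combinations x 5 folds x 320 held-out rows = 14,400.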
