How to avoid failure when a factor has new levels in the test set

I have a dataset that I split into training and test subsets as follows:

train_ind <- sample(seq_len(nrow(dataset)), size=(2/3)*nrow(dataset))
train <- dataset[train_ind, ]   # row subset: note the comma
test <- dataset[-train_ind, ]

Then I use it to train a glm:

glm.res <- glm(response ~ ., data=dataset, subset=train_ind, family = binomial(link=logit))

Finally, I use the model to predict on my test set:

preds <- predict(glm.res, test, type="response")

Depending on the random sample, this fails with the error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor has new levels

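Note that the value appears in the full dataset but evidently not in the training set. What I would like is for the predict function to ignore these new levels. Given that the factor has been turned into binary dummy variables anyway, I don't understand why predict can't simply assume that the new values (which therefore are not variables in the linear model) are 0, which would give the correct behavior.

Is there a way to do this?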

1 Answer

  • 1

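    I start from the following data-generating process (a binary response variable, one numeric independent variable, and 3 categorical independent variables):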

    set.seed(1)
    n <- 500
    y <- factor(rbinom(n, size=1, p=0.7))
    x1 <- rnorm(n)
    x2 <- cut(runif(n), breaks=seq(0,1,0.2))
    x3 <- cut(runif(n), breaks=seq(0,1,0.25))
    x4 <- cut(runif(n), breaks=seq(0,1,0.1))
    df <- data.frame(y, x1, x2, x3, x4)
    

    Here I build the training and test sets so that some categorical covariates (x2 and x3) have more categories in the test set than in the training set:

    idx <- which(df$x2!="(0.6,0.8]" & df$x3!="(0,0.25]")
    train_ind <- sample(idx, size=(2/3)*length(idx))
    train <- df[train_ind,]
    train$x2 <- droplevels(train$x2)
    train$x3 <- droplevels(train$x3)
    test <- df[-train_ind,]
    
    table(train$x2)
    (0,0.2] (0.2,0.4] (0.4,0.6]   (0.8,1] 
         55        40        53        49 
    
    table(test$x2)
    (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1] 
         58        48        45        90        62 
    
    table(train$x3)
    (0.25,0.5] (0.5,0.75]   (0.75,1] 
            66         61         70 
    
    table(test$x3)
    (0,0.25] (0.25,0.5] (0.5,0.75]   (0.75,1] 
         131         63         47         62
    

    Of course, predict produces the error message described above by @Setzer22:

    glm.res <- glm(y ~ ., data=train, family = binomial(link=logit)) 
    preds <- predict(glm.res, test, type="response")
    

    Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor x2 has new levels (0.6,0.8]
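
    Before filtering anything, it can help to list which levels are actually new. A fitted glm stores the factor levels it saw in its xlevels component, so comparing those against the test data shows the offending values. A minimal sketch, assuming a small made-up dataset (train, test, fit and the column f below are illustrative names, not the data from this answer):

    ```r
    # Toy data (hypothetical, not the data from this answer)
    train <- data.frame(y = c(0, 1, 0, 1, 1, 0),
                        f = factor(c("a", "b", "a", "b", "a", "b")))
    test  <- data.frame(f = factor(c("a", "b", "c")))
    fit <- glm(y ~ f, data = train, family = binomial)

    # fit$xlevels is a named list: one character vector of levels per factor
    new_levels <- lapply(names(fit$xlevels), function(v)
      setdiff(unique(as.character(test[[v]])), fit$xlevels[[v]]))
    names(new_levels) <- names(fit$xlevels)

    # Each element lists the test values that the model never saw
    print(new_levels)
    ```

    An empty character vector for a factor means predict should not complain about that column.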

    Here is an (inelegant) way to drop the rows of test that have new levels in the covariates:

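    dropcats <- function(k) {
       xtst <- test[,k]
       xtrn <- train[,k]
       # for each distinct test value: does it also occur in the training set?
       cmp.tst.trn <- (unique(xtst) %in% unique(xtrn))
       if (is.factor(xtst) && any(!cmp.tst.trn)) {
          cat.tst <- unique(xtst)
          # keep the test rows whose value equals one of the levels seen in training
          apply(test[,k]==matrix(rep(cat.tst[cmp.tst.trn],each=nrow(test)),
                          nrow=nrow(test)),1,any)
       } else {
          # not a factor, or no new levels: keep every row
          rep(TRUE,nrow(test))
       }
    }   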
    filt <- apply(sapply(2:ncol(df),dropcats),1,all)
    subset.test <- test[filt,]
    

    In subset.test, the filtered subset of the test set, x2 and x3 no longer contain any new categories:

    table(subset.test[,"x2"])
      (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1] 
           26        25        20         0        28
    
    table(subset.test[,"x3"])
      (0,0.25] (0.25,0.5] (0.5,0.75]   (0.75,1] 
             0         29         29         41
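
    The same row filter can probably be written more compactly by testing each factor column against the levels stored in the fitted model. This is only a sketch of that idea on made-up data, not the code used to produce the tables above; train, test, fit, keep and the columns x1 and f are illustrative names:

    ```r
    # Toy data (hypothetical, not the data from this answer)
    train <- data.frame(y  = c(0, 1, 0, 1, 1, 0),
                        x1 = c(0.2, 1.5, -0.3, 0.8, -1.1, 0.4),
                        f  = factor(c("a", "b", "a", "b", "a", "b")))
    test  <- data.frame(x1 = c(0.1, -0.5, 0.9, 1.2),
                        f  = factor(c("a", "c", "b", "c")))
    fit <- glm(y ~ ., data = train, family = binomial)

    # Start by keeping every row, then drop rows with an unseen level
    keep <- rep(TRUE, nrow(test))
    for (v in names(fit$xlevels)) {
      keep <- keep & as.character(test[[v]]) %in% fit$xlevels[[v]]
    }

    # Predict only for the rows whose factor values were seen in training
    preds <- predict(fit, test[keep, , drop = FALSE], type = "response")
    ```

    Looping over names(fit$xlevels) means only the factors that entered the model are checked, so numeric columns are left alone.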
    

    Now predict works:

    preds <- predict(glm.res, subset(test,filt), type="response")
    head(preds)
    
           30        39        41        49        55        56 
    0.7732564 0.8361226 0.7576259 0.5589563 0.8965357 0.8058025
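
    If the predictions need to stay aligned with the original test rows, an alternative to dropping rows is to recode unseen levels as NA. With the default na.action = na.pass in predict, those rows should then come back as NA instead of raising an error. A minimal sketch on made-up data (train, test, test_na, fit and the columns x1 and f are illustrative names):

    ```r
    # Toy data (hypothetical, not the data from this answer)
    train <- data.frame(y  = c(0, 1, 0, 1, 1, 0),
                        x1 = c(0.2, 1.5, -0.3, 0.8, -1.1, 0.4),
                        f  = factor(c("a", "b", "a", "b", "a", "b")))
    test  <- data.frame(x1 = c(0.1, -0.5, 0.9, 1.2),
                        f  = factor(c("a", "c", "b", "c")))
    fit <- glm(y ~ ., data = train, family = binomial)

    # Replace every factor value the model has not seen with NA
    test_na <- test
    for (v in names(fit$xlevels)) {
      unseen <- !(as.character(test_na[[v]]) %in% fit$xlevels[[v]])
      test_na[[v]][unseen] <- NA
    }

    # One prediction per test row; rows with an unseen level yield NA
    preds <- predict(fit, test_na, type = "response")
    ```

    Note that this does not treat the new level as 0, as the question suggests; it only marks those rows as unpredictable.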
    

    Hope this helps.
