R - Extracting factor predictor names from caret and glmnet lasso model objects

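In the example below, I set up a df with 3 variables: predict, var1, and var2 (a factor).

When I run the model in caret or glmnet, the factor is converted to a dummy variable, e.g. var2b.

I would like to programmatically extract the variable names and match them to the original variable names rather than the dummy variable names. Is there a way to do this?

This is just an example; my real-world problem has many variables with varying numbers of levels, so I would like to avoid doing this manually, e.g. by trying to substring out the "b".

Thanks!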

library(caret)
library(glmnet)

df <- data.frame(predict = c('Y','Y','N','Y','N','Y','Y','N','Y','N'), var1 = c(1,2,5,1,6,7,3,4,5,6),
              var2 = c('a','a','b','b','a','a','a','b','b','a'),
              stringsAsFactors = TRUE)  # needed in R >= 4.0 to get the factors shown by str() below

str(df)

# 'data.frame': 10 obs. of  3 variables:
# $ predict: Factor w/ 2 levels "N","Y": 2 2 1 2 1 2 2 1 2 1
# $ var1   : num  1 2 5 1 6 7 3 4 5 6
# $ var2   : Factor w/ 2 levels "a","b": 1 1 2 2 1 1 1 2 2 1

test <- train(predict ~ .,
           data = df,
           method = 'glmnet',
           trControl = trainControl(classProbs = TRUE,
                                    summaryFunction = twoClassSummary,
                                    allowParallel = FALSE),
           metric = 'ROC',
           tuneGrid = expand.grid(alpha = 1,
                                  lambda = .005))

predictors(test)
# [1] "var1"  "var2b"
varImp(test)
# glmnet variable importance

# Overall
# var2b     100
# var1        0

coef(test)
# NULL
#################
x <- model.matrix(as.formula(predict~.),data=df)
x <-  x[,-1] ##remove intercept

df$predict <- ifelse(df$predict == 'Y', TRUE, FALSE)

glmnet1 <- glmnet::cv.glmnet(x = x,
                          y = df$predict,
                          type.measure='auc',
                          nfolds=3,
                          alpha=1,
                          parallel = FALSE)

rownames(coef(glmnet1))
# [1] "(Intercept)" "var1"        "var2b"

2 Answers

  • 1

    The formula method for 'train' objects returns a 'formula' object that carries the attributes you are looking for.

    f1 <- formula(test)
    f1
    # predict ~ var1 + var2
    # attr(,"variables")
    # list(predict, var1, var2)
    # attr(,"factors")
    #         var1 var2
    # predict    0    0
    # var1       1    0
    # var2       0    1
    # attr(,"term.labels")
    # [1] "var1" "var2"
    # attr(,"order")
    # [1] 1 1
    # attr(,"intercept")
    # [1] 1
    # attr(,"response")
    # [1] 1
    # attr(,"predvars")
    # list(predict, var1, var2)
    # attr(,"dataClasses")
    #   predict      var1      var2 
    #  "factor" "numeric"  "factor" 
    attr(f1, "term.labels")
    # [1] "var1" "var2"
    

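    The variable names do not appear to be available in the 'cv.glmnet' object. I don't know of an elegant way to collect them. The glmnetUtils package may have some quality-of-life functions.

    Here is some code you can try. Note that it can return false positives, because it uses the names from the input data as search patterns against the model's column names (e.g. the pattern "var1" also matches a column called "var11").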

    # a generic method
    termLabels <- function(object, ...) {
        UseMethod("termLabels")
    }
    # add for the train object too to save typing
    termLabels.train <- function(object, ...) {
        attr(formula(object), "term.labels")
    }
    # try to find term labels for cv.glmnet object
    # lambda must be provided and snaps to search grid
    # allowed column names must be provided from corresponding data object
    termLabels.cv.glmnet <- function(object, lambda, names, ...) {
        if (missing(lambda)) { stop("lambda is missing") }
        if (missing(names)) { stop("names is missing") }
        # match lambda against the fitted lambda sequence
        # (glmnet.fit$a0 holds the intercepts, not the lambdas)
        lambdaArray <- object$glmnet.fit$lambda
        if (lambda > max(lambdaArray) || lambda < min(lambdaArray)) {
            stop(paste("lambda must be in range", 
                paste(range(lambdaArray), collapse = ":")))
        }
        # find closest lambda
        whichLambda <- which.min(abs(lambdaArray - lambda))
        message(paste("using lambda", lambdaArray[whichLambda]))
        # matrix of parameter estimates
        betaLambda <- object$glmnet.fit$beta[, whichLambda, drop = FALSE]
        # non-zero estimates
        betaLambda <- betaLambda[betaLambda[, 1] != 0, , drop = FALSE]
        vars <- rownames(betaLambda)
        # search with names as pattern
        # note, does not account for nested names, e.g. var1 and var11
        # vapply keeps this working when only one coefficient is non-zero
        matchNames <- vapply(names, function(nm) any(grepl(nm, vars, fixed = TRUE)),
            logical(1))
        names[matchNames]
    }
    # lambda must lie within the fitted sequence, so reuse the cross-validated choice
    termLabels(glmnet1, lambda = glmnet1$lambda.min, names = colnames(df))
    # returns the names in colnames(df) that match a non-zero coefficient at that lambda
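
    The false positives mentioned above can be avoided by not matching on strings at all. As a minimal sketch using base R only: the "assign" attribute of a model.matrix records which formula term produced each column, so it can serve as an exact lookup from dummy names to original names. The helper name dummy_to_term and the extra var11 column are hypothetical, added here only to illustrate the nested-name case.

    ```r
    # Hypothetical helper (not part of caret or glmnet): map design-matrix
    # column names back to the formula terms that generated them.
    dummy_to_term <- function(formula, data) {
      tt <- terms(formula, data = data)       # expands the "." in the formula
      mm <- model.matrix(tt, data = data)
      assign_idx <- attr(mm, "assign")        # 0 = intercept, k = k-th term
      labels <- attr(tt, "term.labels")
      keep <- assign_idx > 0                  # drop the intercept column
      setNames(labels[assign_idx[keep]], colnames(mm)[keep])
    }

    # Same layout as the question's data, plus an assumed var11 column
    df <- data.frame(
      predict = factor(c("Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y", "N")),
      var1  = c(1, 2, 5, 1, 6, 7, 3, 4, 5, 6),
      var11 = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9),
      var2  = factor(c("a", "a", "b", "b", "a", "a", "a", "b", "b", "a"))
    )

    lookup <- dummy_to_term(predict ~ ., df)
    lookup
    #   var1   var11   var2b
    # "var1" "var11"  "var2"
    ```

    With such a lookup, something like unique(unname(lookup[predictors(test)])) or indexing by the non-zero row names of coef(glmnet1) should give the original variable names, provided the model was fitted on the same formula and data.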
    
  • 1

    Per @CSJCampbell's answer: the glmnetUtils package lets you do this with glmnet and cv.glmnet objects.

    library(glmnetUtils)
    m <- glmnet(mpg ~ ., data=mtcars)
    all.vars(m$terms)
    
    m2 <- cv.glmnet(mpg ~ ., data=mtcars)
    all.vars(m2$terms)
    

    Note that all.vars also works with most other R model objects:

    m3 <- lm(mpg ~ ., data=mtcars)
    all.vars(delete.response(m3$terms))
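
    One point that may be worth keeping in mind, shown here as a small base R sketch (the model with log(hp) and factor(cyl) is made up for illustration): all.vars returns the underlying variable names, while the "term.labels" attribute used in the other answer returns the terms exactly as written in the formula. The two differ as soon as a term is transformed.

    ```r
    # Illustrative model with transformed terms (not from the question)
    m <- lm(mpg ~ log(hp) + factor(cyl), data = mtcars)
    tt <- delete.response(terms(m))

    vars_used   <- all.vars(tt)             # underlying variables
    term_labels <- attr(tt, "term.labels")  # terms as written in the formula

    vars_used
    # [1] "hp"  "cyl"
    term_labels
    # [1] "log(hp)"     "factor(cyl)"
    ```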
    

    glmnetUtils is available on CRAN, or you can get the dev version from GitHub. I am currently finishing a major update, which should be released to CRAN soon.
