Alternatives to bestglm for a dataset with many variables

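R version 2.15.0 (2012-03-30), RStudio 0.96.316, Windows XP, latest updates.

I have a dataset with 40 variables and 15,000 observations. I would like to use bestglm to search for possibly good models (logistic regression). I tried bestglm, but it does not work for a medium-sized dataset like this. After several trials, I think bestglm fails when there are more than 30 variables, at least on my computer (4 GB RAM, dual core).

You can try the limits of bestglm yourself: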

library(bestglm)

bestBIC_test <- function(number_of_vars) {
  # Simulate data frame for logistic regression
  glm_sample <- as.data.frame(matrix(rnorm(100 * number_of_vars), 100))

  # Get some 1/0 variable (the last column becomes the response)
  glm_sample[, number_of_vars][glm_sample[, number_of_vars] > mean(glm_sample[, number_of_vars])] <- 1
  glm_sample[, number_of_vars][glm_sample[, number_of_vars] != 1] <- 0

  # Try to calculate best model
  bestBIC <- bestglm(glm_sample, IC = "BIC", family = binomial)
}

# Test bestglm with increasing number of variables
bestBIC_test(10) # OK, running
bestBIC_test(20) # OK, running
bestBIC_test(25) # OK, running
bestBIC_test(28) # Error: cannot allocate vector of size 1024.0 Mb
bestBIC_test(30) # Error: cannot allocate vector of size 2.0 Gb
bestBIC_test(40) # Error in rep(-Inf, 2^p) : invalid 'times' argument

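Are there any alternatives in R that I can use to search for possibly good models?

2 Answers

You could explore the caret package, which also has model selection tools. I was able to fit a model with 15,000 observations without any problems: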

number_of_vars <- 40

dat <- as.data.frame(matrix(rnorm(15000*number_of_vars), 15000))
dat[,number_of_vars][dat[,number_of_vars] > mean(dat[,number_of_vars]) ] <- 1
dat[,number_of_vars][dat[,number_of_vars] != 1 ] <- 0

library(caret)
result <- train(dat[,1:39], dat[,40], family = "binomial", method = "glm")
result$finalModel

I would consult the extensive documentation to get better control over the model fitting.

Answered 2024-04-27T07:46:40+08:00


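Well, for starters, an exhaustive search for the best subset of 40 variables requires fitting 2^40 models, which is over a trillion. That is likely your problem.

Exhaustive best-subset search is generally not considered practical for more than 20 or so variables.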
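To make the growth concrete, here is a rough sketch of the arithmetic for the variable counts tried in the question. The memory column only shows what a single numeric vector with one 8-byte entry per candidate model would need; it is an illustration, not a model of what bestglm actually allocates internally.

```r
# Number of candidate subsets for p variables is 2^p
p <- c(10, 20, 25, 28, 30, 40)
n_models <- 2^p

# Memory for one double-precision vector with one entry per subset
# (8 bytes per element), expressed in GiB. Illustrative assumption only.
gib_needed <- n_models * 8 / 2^30

print(data.frame(p, n_models, gib_needed))

# 2^40 also exceeds the largest 32-bit integer R can represent, which is
# plausibly related to the "invalid 'times' argument" error in the question.
exceeds_int <- 2^40 > .Machine$integer.max
```

Every additional variable doubles the count, so no realistic amount of RAM rescues exhaustive search at 40 variables.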

A better choice is something like forward stepwise selection, which fits about (40^2 + 40)/2 models, so around 800 (820, to be exact).
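As a minimal sketch of what forward selection could look like, the following uses `step()` from base R's stats package on simulated data. The data, the column name `y`, and the choice of `k = log(n)` (which makes `step()` use a BIC-style penalty) are assumptions for illustration, not something taken from the question's real dataset.

```r
set.seed(1)
n <- 500
p <- 10

# Simulated data in the same spirit as the question: the last column
# is turned into a 0/1 response (here renamed to "y" for readability).
dat <- as.data.frame(matrix(rnorm(n * p), n))
names(dat)[p] <- "y"
dat$y <- as.integer(dat$y > mean(dat$y))

# Start from the intercept-only model and let step() add predictors.
null_model <- glm(y ~ 1, data = dat, family = binomial)
upper_scope <- reformulate(setdiff(names(dat), "y"))  # ~ V1 + V2 + ... + V9

fwd <- step(null_model,
            scope = list(lower = ~ 1, upper = upper_scope),
            direction = "forward",
            k = log(n),   # BIC-style penalty instead of the default AIC (k = 2)
            trace = 0)

print(formula(fwd))

# Upper bound on models fitted by forward selection with 40 candidates:
# 40 + 39 + ... + 1 = (40^2 + 40) / 2
max_models <- (40^2 + 40) / 2
```

Because the simulated predictors are pure noise, the selected model may well stay at the intercept; the point is only the mechanics of the search.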

Even better (best, in my opinion) is to use regularized logistic regression with the lasso, via the glmnet package.

Good overview here.

Answered 2024-04-27T07:46:40+08:00
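For reference, a lasso fit along these lines might look like the sketch below. It assumes the glmnet package is installed and uses simulated data in which only the first three predictors carry signal; the coefficients in `eta` and the choice of `lambda.1se` are illustrative assumptions, not recommendations for the real dataset.

```r
library(glmnet)

set.seed(1)
n <- 1000
p <- 40

# glmnet expects a numeric predictor matrix rather than a data frame.
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("V", seq_len(p))

# Hypothetical truth: only V1, V2 and V3 influence the outcome.
eta <- 1.0 * x[, 1] - 1.0 * x[, 2] + 0.5 * x[, 3]
y <- rbinom(n, 1, plogis(eta))

# alpha = 1 selects the lasso penalty; cv.glmnet picks lambda by
# cross-validation.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the more conservative lambda; exact zeros are the
# variables the lasso dropped.
coef_mat <- as.matrix(coef(cv_fit, s = "lambda.1se"))
selected <- rownames(coef_mat)[coef_mat[, 1] != 0]
print(selected)
```

With the real data, `x` could be built from the 39 predictor columns with `as.matrix()` and `y` taken from the response column.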
