首页 文章

如何在R中绘制逻辑glm预测值和置信区间

提问于
浏览
3

我有一个存在/不存在响应变量的二项式 glm 和一个9级的因子变量,如下所示:

data$y<-factor(data$y,levels=c(0,1),labels=c("absent","present"))
table(data$y,data$site_name)

          Andulay Antulang Basak Dauin Poblacion District 1 Guinsuan Kookoo's Nest Lutoban Pier Lutoban South Malatapay Pier
  absent        4        4     1                          0        3             1            5             5              2
  present       2        2     5                          6        3             5            1             1              4

model <- glm(y~site_name,data=data,binomial)

为了简洁起见,只是跳过模型推理和验证,如何在每个站点上绘制一个以其置信区间在箱图中获得"present"的概率?我想要的是Plot predicted probabilities and confidence intervals in R中显示的内容,但我想用箱线图显示它,因为我的回归变量site_name是一个9级的因子,而不是连续变量 .

我想我可以按如下方式计算必要的值(但不是100%肯定正确性):

将模型系数转换回成功概率的函数:

calc_val <- function(x){return(round(1/(1+1/(exp(x))),3))}

基于模型的预测概率:

prob <- tapply(predict(model,type="response"),data$site_name,function(x){round(mean(x),3)})
means <- as.data.frame(prob)

预测概率的75%和95%置信区间:

ci <- cbind(confint(model,level=0.9),confint(model,level=0.5))
rownames(ci) <- gsub("site_name","",rownames(ci))
ci <- t(apply(ci,1,calc_val))

将它们一起加入一个表中

ci<-cbind(means,ci)
ci
                            prob   5 %  95 %  25 %  75 %   Pr(>|z|) stderr
Andulay                    0.333 0.091 0.663 0.214 0.469 0.42349216  0.192
Antulang                   0.333 0.112 0.888 0.304 0.696 1.00000000  0.192
Basak                      0.833 0.548 0.993 0.802 0.964 0.09916496  0.152
Dauin Poblacion District 1 1.000 0.000    NA 0.000 1.000 0.99097988  0.000
Guinsuan                   0.500 0.223 0.940 0.474 0.819 0.56032414  0.204
Kookoo's Nest              0.833 0.548 0.993 0.802 0.964 0.09916496  0.152
Lutoban Pier               0.167 0.028 0.788 0.130 0.501 0.51171512  0.152
Lutoban South              0.167 0.028 0.788 0.130 0.501 0.51171512  0.152
Malatapay Pier             0.667 0.364 0.972 0.640 0.903 0.25767454  0.192

所以我的问题是双重的:

  • 概率和置信区间的计算是否正确?

  • 如何在bloxplot中绘制此图(框和胡须图)?

EDIT 以下是一些来自 dput 的示例数据(它还修改了上面的表格以匹配数据):

# dput(data[c("y", "site_name")])
data <- structure(list(y = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("absent", "present"), class = "factor"), site_name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 9L, 9L, 9L, 9L, 9L, 9L, 4L, 4L, 4L, 4L, 4L, 4L, 8L, 8L, 8L, 8L, 8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("Andulay", "Antulang", "Basak", "Dauin Poblacion District 1", "Guinsuan", "Kookoo's Nest", "Lutoban Pier", "Lutoban South", "Malatapay Pier"), class = "factor")), .Names = c("y", "site_name"), row.names = c(125L, 123L, 126L, 124L, 128L, 127L, 154L, 159L, 157L, 158L, 156L, 155L, 111L, 114L, 116L, 115L, 112L, 113L, 152L, 151L, 148L, 150L, 153L, 149L, 143L, 146L, 144L, 147L, 142L, 145L, 164L, 165L, 161L, 163L, 160L, 162L, 120L, 122L, 121L, 117L, 118L, 119L, 137L, 136L, 139L, 141L, 140L, 138L, 129L, 134L, 131L, 135L, 133L, 130L), class = "data.frame")
#

1 回答

  • 5

    这是一种最低通用分母,仅基于包的解决方案 .

    适合模型:

    mm <- glm(y~site_name,data=dd,family=binomial)
    

    使用站点名称组成预测框架:

    pframe <- data.frame(site_name=unique(dd$site_name))
    

    预测(在logit /线性预测器标度上),带有标准误差

    pp <- predict(mm,newdata=pframe,se.fit=TRUE)
    linkinv <- family(mm)$linkinv ## inverse-link function
    

    将预测,下限和上限以及后向变换放在概率比例中:

    pframe$pred0 <- pp$fit
    pframe$pred <- linkinv(pp$fit)
    alpha <- 0.95
    sc <- abs(qnorm((1-alpha)/2))  ## Normal approx. to likelihood
    alpha2 <- 0.5
    sc2 <- abs(qnorm((1-alpha2)/2))  ## Normal approx. to likelihood
    pframe <- transform(pframe,
                        lwr=linkinv(pred0-sc*pp$se.fit),
                        upr=linkinv(pred0+sc*pp$se.fit),
                        lwr2=linkinv(pred0-sc2*pp$se.fit),
                        upr2=linkinv(pred0+sc2*pp$se.fit))
    

    情节 .

    with(pframe,
    {
        plot(site_name,pred,ylim=c(0,1))
        arrows(as.numeric(site_name),lwr,as.numeric(site_name),upr,
               angle=90,code=3,length=0.1)
    })
    

    作为箱线图:

    with(pframe,
    {
        bxp(list(stats=rbind(lwr,lwr2,pred,upr2,upr),
                 n = rep(1,nrow(pframe)),
                 conf = NA,
                 out = NULL,
                 group = NULL,
                 names=as.character(site_name)))
    })
    

    还有很多其他方法可以做到这一点;我会推荐

    library("ggplot2")
    ggplot(pframe,aes(site_name,pred))+
         geom_pointrange(aes(ymin=lwr,ymax=upr))+
         geom_linerange(aes(ymin=lwr2,ymax=upr2),lwd=1.5)+
         coord_flip()
    

    另一种解决方案是通过 y~site_name-1 拟合模型,在这种情况下,它将为每个站点的概率分配一个单独的参数,并使用 profile() / confint() 来查找置信区间;这将比依赖于上面答案中所做的参数/预测的采样分布的正态性稍微准确一些 .

相关问题