首页 文章

线性回归和R中的分组

提问于
浏览
74

我想使用 lm() 函数在R中进行线性回归 . 我的数据是一年一度的时间序列,一年(22年),另一个州(50个州) . 我想为每个状态拟合一个回归,以便最后我有一个lm响应的向量 . 我可以想象为每个状态做循环然后在循环内进行回归并将每个回归的结果添加到向量 . 然而,这似乎并不像R一样 . 在SAS中我会做一个'by'语句,在SQL中我会做一个'group by' . R的做法是什么?

10 回答

  • 10

    这是使用plyr包的方法:

    d <- data.frame(
      state = rep(c('NY', 'CA'), 10),
      year = rep(1:10, 2),
      response= rnorm(20)
    )
    
    library(plyr)
    # Break up d by state, then fit the specified model to each piece and
    # return a list
    models <- dlply(d, "state", function(df) 
      lm(response ~ year, data = df))
    
    # Apply coef to each model and return a data frame
    ldply(models, coef)
    
    # Print the summary of each model
    l_ply(models, summary, .print = TRUE)
    
  • 30

    在我看来,混合线性模型是这类数据的更好方法 . 下面的代码给出了固定效应的整体趋势 . 随机效应表明每个州的趋势与全球趋势的差异 . 相关结构考虑了时间自相关 . 看看Pinheiro&Bates(S和S-Plus中的混合效果模型) .

    library(nlme)
    lme(response ~ year, random = ~year|state, correlation = corAR1(~year))
    
  • 23

    使用 data.table 的一个很好的解决方案在@Zach的CrossValidated中发布了here . 我只想补充一点,也可以迭代获得回归系数r ^ 2:

    ## make fake data
        library(data.table)
        set.seed(1)
        dat <- data.table(x=runif(100), y=runif(100), grp=rep(1:2,50))
    
    ##calculate the regression coefficient r^2
        dat[,summary(lm(y~x))$r.squared,by=grp]
           grp         V1
        1:   1 0.01465726
        2:   2 0.02256595
    

    以及 summary(lm) 的所有其他输出:

    dat[,list(r2=summary(lm(y~x))$r.squared , f=summary(lm(y~x))$fstatistic[1] ),by=grp]
       grp         r2        f
    1:   1 0.01465726 0.714014
    2:   2 0.02256595 1.108173
    
  • 3

    这是使用 lme4 包的一种方法 .

    library(lme4)
     d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                     year=rep(1:10, 2),
                     response=c(rnorm(10), rnorm(10)))
    
     xyplot(response ~ year, groups=state, data=d, type='l')
    
     fits <- lmList(response ~ year | state, data=d)
     fits
    #------------
    Call: lmList(formula = response ~ year | state, data = d)
    Coefficients:
       (Intercept)        year
    CA -1.34420990  0.17139963
    NY  0.00196176 -0.01852429
    
    Degrees of freedom: 20 total; 16 residual
    Residual standard error: 0.8201316
    
  • 56

    自2009年以来, dplyr 已经发布,实际上提供了一种非常好的方式来进行这种分组,与SAS的做法非常相似 .

    library(dplyr)
    
    d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                    year=rep(1:10, 2),
                    response=c(rnorm(10), rnorm(10)))
    fitted_models = d %>% group_by(state) %>% do(model = lm(response ~ year, data = .))
    # Source: local data frame [2 x 2]
    # Groups: <by row>
    #
    #    state   model
    #   (fctr)   (chr)
    # 1     CA <S3:lm>
    # 2     NY <S3:lm>
    fitted_models$model
    # [[1]]
    # 
    # Call:
    # lm(formula = response ~ year, data = .)
    # 
    # Coefficients:
    # (Intercept)         year  
    #    -0.06354      0.02677  
    #
    #
    # [[2]]
    # 
    # Call:
    # lm(formula = response ~ year, data = .)
    # 
    # Coefficients:
    # (Intercept)         year  
    #    -0.35136      0.09385
    

    要检索系数和Rsquared / p.value,可以使用 broom 包 . 该套餐提供:

    三个S3泛型:整洁,它总结了模型的统计结果,如回归系数; augment,它为原始数据添加了列,例如预测,残差和集群分配;和glance,它提供了模型级统计的一行摘要 .

    library(broom)
    fitted_models %>% tidy(model)
    # Source: local data frame [4 x 6]
    # Groups: state [2]
    # 
    #    state        term    estimate  std.error  statistic   p.value
    #   (fctr)       (chr)       (dbl)      (dbl)      (dbl)     (dbl)
    # 1     CA (Intercept) -0.06354035 0.83863054 -0.0757668 0.9414651
    # 2     CA        year  0.02677048 0.13515755  0.1980687 0.8479318
    # 3     NY (Intercept) -0.35135766 0.60100314 -0.5846187 0.5749166
    # 4     NY        year  0.09385309 0.09686043  0.9689519 0.3609470
    fitted_models %>% glance(model)
    # Source: local data frame [2 x 12]
    # Groups: state [2]
    # 
    #    state   r.squared adj.r.squared     sigma statistic   p.value    df
    #   (fctr)       (dbl)         (dbl)     (dbl)     (dbl)     (dbl) (int)
    # 1     CA 0.004879969  -0.119510035 1.2276294 0.0392312 0.8479318     2
    # 2     NY 0.105032068  -0.006838924 0.8797785 0.9388678 0.3609470     2
    # Variables not shown: logLik (dbl), AIC (dbl), BIC (dbl), deviance (dbl),
    #   df.residual (int)
    fitted_models %>% augment(model)
    # Source: local data frame [20 x 10]
    # Groups: state [2]
    # 
    #     state   response  year      .fitted   .se.fit     .resid      .hat
    #    (fctr)      (dbl) (int)        (dbl)     (dbl)      (dbl)     (dbl)
    # 1      CA  0.4547765     1 -0.036769875 0.7215439  0.4915464 0.3454545
    # 2      CA  0.1217003     2 -0.009999399 0.6119518  0.1316997 0.2484848
    # 3      CA -0.6153836     3  0.016771076 0.5146646 -0.6321546 0.1757576
    # 4      CA -0.9978060     4  0.043541551 0.4379605 -1.0413476 0.1272727
    # 5      CA  2.1385614     5  0.070312027 0.3940486  2.0682494 0.1030303
    # 6      CA -0.3924598     6  0.097082502 0.3940486 -0.4895423 0.1030303
    # 7      CA -0.5918738     7  0.123852977 0.4379605 -0.7157268 0.1272727
    # 8      CA  0.4671346     8  0.150623453 0.5146646  0.3165112 0.1757576
    # 9      CA -1.4958726     9  0.177393928 0.6119518 -1.6732666 0.2484848
    # 10     CA  1.7481956    10  0.204164404 0.7215439  1.5440312 0.3454545
    # 11     NY -0.6285230     1 -0.257504572 0.5170932 -0.3710185 0.3454545
    # 12     NY  1.0566099     2 -0.163651479 0.4385542  1.2202614 0.2484848
    # 13     NY -0.5274693     3 -0.069798386 0.3688335 -0.4576709 0.1757576
    # 14     NY  0.6097983     4  0.024054706 0.3138637  0.5857436 0.1272727
    # 15     NY -1.5511940     5  0.117907799 0.2823942 -1.6691018 0.1030303
    # 16     NY  0.7440243     6  0.211760892 0.2823942  0.5322634 0.1030303
    # 17     NY  0.1054719     7  0.305613984 0.3138637 -0.2001421 0.1272727
    # 18     NY  0.7513057     8  0.399467077 0.3688335  0.3518387 0.1757576
    # 19     NY -0.1271655     9  0.493320170 0.4385542 -0.6204857 0.2484848
    # 20     NY  1.2154852    10  0.587173262 0.5170932  0.6283119 0.3454545
    # Variables not shown: .sigma (dbl), .cooksd (dbl), .std.resid (dbl)
    
  • 38
    ## make fake data
     ngroups <- 2
     group <- 1:ngroups
     nobs <- 100
     dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups))
     head(dta)
    #--------------------
      group          y         x
    1     1  0.6482007 0.5429575
    2     1 -0.4637118 0.7052843
    3     1 -0.5129840 0.7312955
    4     1 -0.6612649 0.9028034
    5     1 -0.5197448 0.1661308
    6     1  0.4240346 0.8944253
    #------------ 
    ## function to extract the results of one model
     foo <- function(z) {
       ## coef and se in a data frame
       mr <- data.frame(coef(summary(lm(y~x,data=z))))
       ## put row names (predictors/indep variables)
       mr$predictor <- rownames(mr)
       mr
     }
     ## see that it works
     foo(subset(dta,group==1))
    #=========
                  Estimate Std..Error   t.value  Pr...t..   predictor
    (Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
    x           -0.3669890  0.3321875 -1.104765 0.2719666           x
    #----------
    ## one option: use command by
     res <- by(dta,dta$group,foo)
     res
    #=========
    dta$group: 1
                  Estimate Std..Error   t.value  Pr...t..   predictor
    (Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
    x           -0.3669890  0.3321875 -1.104765 0.2719666           x
    ------------------------------------------------------------ 
    dta$group: 2
                   Estimate Std..Error    t.value  Pr...t..   predictor
    (Intercept) -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
    x            0.06286456  0.3020321  0.2081387 0.8355526           x
    
    ## using package plyr is better
     library(plyr)
     res <- ddply(dta,"group",foo)
     res
    #----------
      group    Estimate Std..Error    t.value  Pr...t..   predictor
    1     1  0.21764767  0.1919140  1.1340897 0.2595235 (Intercept)
    2     1 -0.36698898  0.3321875 -1.1047647 0.2719666           x
    3     2 -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
    4     2  0.06286456  0.3020321  0.2081387 0.8355526           x
    
  • 7

    我认为值得为此问题添加 purrr::map 方法 .

    library(tidyverse)
    
    d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                                     year=rep(1:10, 2),
                                     response=c(rnorm(10), rnorm(10)))
    
    d %>% 
      group_by(state) %>% 
      nest() %>% 
      mutate(model = map(data, ~lm(response ~ year, data = .)))
    

    有关将 broom 包与这些结果一起使用的进一步想法,请参阅@Paul Hiemstra的答案 .

  • 0

    问题似乎是关于如何使用在循环内修改的公式调用回归函数 .

    以下是使用钻石数据集的方法:

    attach(ggplot2::diamonds)
    strCols = names(ggplot2::diamonds)
    
    formula <- list(); model <- list()
    for (i in 1:1) {
      formula[[i]] = paste0(strCols[7], " ~ ", strCols[7+i])
      model[[i]] = glm(formula[[i]]) 
    
      #then you can plot the results or anything else ...
      png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
      par(mfrow = c(2, 2))      
      plot(model[[i]])
      dev.off()
      }
    
  • 5

    上面的 lm() 函数就是一个简单的例子 . 顺便说一句,我想你的数据库有如下列形式的列:

    年状态var1 var2 y ...

    在我看来,您可以使用以下代码:

    require(base) 
    library(base) 
    attach(data) # data = your data base
                 #state is your label for the states column
    modell<-by(data, data$state, function(data) lm(y~I(1/var1)+I(1/var2)))
    summary(modell)
    
  • 2

    我现在的答案有点迟了,但我正在寻找类似的功能 . 看起来R中的内置函数'by'也可以轻松地进行分组:

    ?by包含以下示例,该示例适合每个组并使用sapply提取系数:

    require(stats)
    ## now suppose we want to extract the coefficients by group 
    tmp <- with(warpbreaks,
                by(warpbreaks, tension,
                   function(x) lm(breaks ~ wool, data = x)))
    sapply(tmp, coef)
    

相关问题