Why don't my regression coefficients make sense in my R model?

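I am building a regression model on a fairly large data set (2146 observations). These are repeated measures, so I will eventually use a mixed model, but I always like to start with simpler models to get a feel for the data. The problem is that my regression coefficients don't make sense to me, and I can't figure out why they change so drastically when I add terms to the model.

Here is the first simple regression model: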

fit1 <- lm(Outcome.Variable ~ Group, data = dat)
summary(fit1)

Call:
lm(formula = Outcome.Variable ~ Group, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-225.63  -75.96   -4.60   67.78  356.84 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  364.104      4.677  77.847  < 2e-16 ***
GroupB       -65.187      7.268  -8.969  < 2e-16 ***
GroupC       -31.776      6.982  -4.551 5.63e-06 ***
GroupD       -37.268      6.337  -5.881 4.73e-09 ***
GroupE       -11.172      7.661  -1.458 0.144902    
GroupF       -29.707      8.188  -3.628 0.000292 ***
GroupG       -10.443      6.963  -1.500 0.133853    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 91.42 on 2139 degrees of freedom
Multiple R-squared:  0.0464,    Adjusted R-squared:  0.04372 
F-statistic: 17.35 on 6 and 2139 DF,  p-value: < 2.2e-16

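These coefficients make sense to me: the intercept is the mean of Group A, and each of the other estimates is that group's difference from Group A. A quick check of the data confirms this interpretation: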

library(dplyr)

dat %>%
    group_by(Group) %>%
    summarize(Outcome.Variable.Mean = mean(Outcome.Variable))
# A tibble: 7 × 2
  Group Outcome.Variable.Mean
  <chr>                 <dbl>
1     A              364.1045
2     B              298.9173
3     C              332.3286
4     D              326.8360
5     E              352.9324
6     F              334.3972
7     G              353.6617

I can build another simple linear regression with my second variable, Day:

fit2 <- lm(Outcome.Variable ~ Day, data = dat)
summary(fit2)

Call:
lm(formula = Outcome.Variable ~ Day, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-228.56  -43.45   -4.70   44.41  321.77 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  388.003      2.598 149.367   <2e-16 ***
Day2          -5.278      3.668  -1.439     0.15    
Day3        -136.108      3.589 -37.921   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 68.28 on 2143 degrees of freedom
Multiple R-squared:  0.4669,    Adjusted R-squared:  0.4664 
F-statistic: 938.6 on 2 and 2143 DF,  p-value: < 2.2e-16

Again, a quick check of the data confirms that I am interpreting these regression coefficients correctly:

dat %>%
    group_by(Day) %>%
    summarize(Outcome.Variable.Mean = mean(Outcome.Variable))

# A tibble: 3 × 2
     Day Outcome.Variable.Mean
  <fctr>                 <dbl>
1      1              388.0027
2      2              382.7242
3      3              251.8942

The problem appears when I put both variables into the model:

fit3 <- lm(Outcome.Variable ~ Day + Group, data = dat)
summary(fit3)

Call:
lm(formula = Outcome.Variable ~ Day + Group, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-212.456  -43.442   -2.864   41.000  305.607 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  413.942      3.912 105.806  < 2e-16 ***
Day2          -5.801      3.504  -1.656   0.0979 .  
Day3        -136.663      3.429 -39.859  < 2e-16 ***
GroupB       -66.126      5.185 -12.753  < 2e-16 ***
GroupC       -31.813      4.980  -6.388 2.06e-10 ***
GroupD       -37.654      4.521  -8.329  < 2e-16 ***
GroupE        -9.777      5.465  -1.789   0.0738 .  
GroupF       -24.570      5.842  -4.206 2.71e-05 ***
GroupG       -10.067      4.967  -2.027   0.0428 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 65.21 on 2137 degrees of freedom
Multiple R-squared:  0.5152,    Adjusted R-squared:  0.5134 
F-statistic: 283.9 on 8 and 2137 DF,  p-value: < 2.2e-16

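These regression coefficients don't make sense to me. The intercept should be the mean of Group A on Day 1, but a look at the data shows that this is not the case at all: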

as.data.frame(dat %>%
    group_by(Day, Group) %>%
    summarize(Outcome.Variable.Mean = mean(Outcome.Variable)))

   Day Group Outcome.Variable.Mean
1    1     A              420.5681
2    1     B              331.6633
3    1     C              380.9213
4    1     D              382.2743
5    1     E              405.1115
6    1     F              392.5020
7    1     G              400.5005
8    2     A              405.3756
9    2     B              339.2346
10   2     C              389.3252
11   2     D              374.0798
12   2     E              388.7488
13   2     F              377.9685
14   2     G              395.5381
15   3     A              273.7767
16   3     B              229.6742
17   3     C              234.4119
18   3     D              230.6635
19   3     E              275.2313
20   3     F              254.7107
21   3     G              272.6063

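What is going on here? I don't want to move on to a mixed model without first understanding what is happening in this more basic model. Why doesn't the intercept represent the Group A mean on Day 1? Even the differences between the intercept and the other estimates don't seem right. For example, the Day2 coefficient is -5.8, but the difference between Group A on Day 1 and Group A on Day 2 is about 15 points.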
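
To make the mismatch concrete, the values implied by the additive fit can be set next to the observed cell means. This is only arithmetic on numbers copied from the output above, not a refit; the variable names are chosen here purely for illustration.

```r
# Coefficients copied from summary(fit3) above (additive model, no interaction)
intercept <- 413.942
day2      <- -5.801
day3      <- -136.663
group_b   <- -66.126

# Values the additive model implies for three Day/Group cells
implied <- c(day1_A = intercept,
             day2_A = intercept + day2,
             day3_B = intercept + day3 + group_b)

# Observed cell means copied from the Day-by-Group table above
observed <- c(day1_A = 420.5681, day2_A = 405.3756, day3_B = 229.6742)

# Gap between what the additive model implies and what was observed
round(implied - observed, 3)
```

The gaps come out at roughly -6.6, 2.8 and -18.5. The additive model applies the same Day2 shift of -5.801 to every group, so it has no way to match the roughly 15-point drop seen in Group A alone.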

Any help understanding what is going on here would be greatly appreciated.

1 Answer

  • 4

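    You are ignoring the interaction between these terms. Let me demonstrate with the mtcars data:

    First, I run the regression disp ~ factor(cyl) (I have to call factor because all variables in mtcars are numeric by default):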

    library(dplyr)
    
    lm(disp ~ factor(cyl), mtcars)
    #> 
    #> Call:
    #> lm(formula = disp ~ factor(cyl), data = mtcars)
    #> 
    #> Coefficients:
    #>  (Intercept)  factor(cyl)6  factor(cyl)8  
    #>       105.14         78.18        247.96
    
    mtcars %>% group_by(cyl) %>% summarize(mean = mean(disp))
    #> # A tibble: 3 x 2
    #>     cyl     mean
    #>   <dbl>    <dbl>
    #> 1     4 105.1364
    #> 2     6 183.3143
    #> 3     8 353.1000
    

    As you can see, the regression sets the intercept to the mean disp of the cyl = 4 group.

    Next, I run the regression disp ~ factor(gear):

    lm(disp ~ factor(gear), mtcars)
    #> 
    #> Call:
    #> lm(formula = disp ~ factor(gear), data = mtcars)
    #> 
    #> Coefficients:
    #>   (Intercept)  factor(gear)4  factor(gear)5  
    #>         326.3         -203.3         -123.8
    
    mtcars %>% group_by(gear) %>% summarize(mean = mean(disp))
    #> # A tibble: 3 x 2
    #>    gear     mean
    #>   <dbl>    <dbl>
    #> 1     3 326.3000
    #> 2     4 123.0167
    #> 3     5 202.4800
    

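    Once again, the regression output reproduces the group means: the intercept is the mean disp for gear = 3, and the other coefficients are differences from it.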
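
    The link between coefficients and group means can also be checked in code rather than by eye. The following is a small sketch, assuming R's default treatment contrasts (first factor level as the reference); the object names are illustrative.

    ```r
    # One-factor model: the intercept is the mean of the reference level (gear = 3)
    fit_gear <- lm(disp ~ factor(gear), data = mtcars)
    b <- coef(fit_gear)

    # Rebuild each group mean as intercept + that level's coefficient
    implied <- c(b[["(Intercept)"]],
                 b[["(Intercept)"]] + b[["factor(gear)4"]],
                 b[["(Intercept)"]] + b[["factor(gear)5"]])

    # Observed group means, ordered by gear (3, 4, 5)
    observed <- as.numeric(tapply(mtcars$disp, mtcars$gear, mean))

    # TRUE when the rebuilt means match the observed ones up to numerical tolerance
    all.equal(implied, observed)
    ```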

    Now, to combine them, my regression formula is disp ~ factor(cyl) * factor(gear), which is equivalent to disp ~ factor(cyl) + factor(gear) + factor(cyl):factor(gear):

    lm(disp ~ factor(cyl)*factor(gear), mtcars)
    #> 
    #> Call:
    #> lm(formula = disp ~ factor(cyl) * factor(gear), data = mtcars)
    #> 
    #> Coefficients:
    #>                (Intercept)                factor(cyl)6  
    #>                     120.10                      121.40  
    #>               factor(cyl)8               factor(gear)4  
    #>                     237.52                      -17.47  
    #>              factor(gear)5  factor(cyl)6:factor(gear)4  
    #>                     -12.40                      -60.23  
    #> factor(cyl)8:factor(gear)4  factor(cyl)6:factor(gear)5  
    #>                         NA                      -84.10  
    #> factor(cyl)8:factor(gear)5  
    #>                     -19.22
    
    
    mtcars %>% group_by(cyl, gear) %>% summarize(mean(disp))
    #> # A tibble: 8 x 3
    #> # Groups:   cyl [?]
    #>     cyl  gear `mean(disp)`
    #>   <dbl> <dbl>        <dbl>
    #> 1     4     3     120.1000
    #> 2     4     4     102.6250
    #> 3     4     5     107.7000
    #> 4     6     3     241.5000
    #> 5     6     4     163.8000
    #> 6     6     5     145.0000
    #> 7     8     3     357.6167
    #> 8     8     5     326.0000
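
    To tie the two outputs together, here is a hedged sketch that compares the interaction model with an additive one (the same structure as Outcome.Variable ~ Day + Group). It checks whether each model's fitted values equal the observed cell means; the object names are illustrative.

    ```r
    # Model with interaction: one estimable parameter per non-empty cyl/gear cell
    fit_int <- lm(disp ~ factor(cyl) * factor(gear), data = mtcars)
    # Additive model: main effects only, no interaction
    fit_add <- lm(disp ~ factor(cyl) + factor(gear), data = mtcars)

    # Observed cell mean repeated for every row of mtcars
    cell_mean_per_row <- ave(mtcars$disp, mtcars$cyl, mtcars$gear)

    # Do the fitted values reproduce the cell means?
    int_matches <- isTRUE(all.equal(unname(fitted(fit_int)), cell_mean_per_row))
    add_matches <- isTRUE(all.equal(unname(fitted(fit_add)), cell_mean_per_row))

    # Rebuild one cell mean (cyl = 6, gear = 4) from the interaction coefficients
    b <- coef(fit_int)
    mean_6_4 <- b[["(Intercept)"]] + b[["factor(cyl)6"]] +
      b[["factor(gear)4"]] + b[["factor(cyl)6:factor(gear)4"]]

    c(int_matches = int_matches, add_matches = add_matches)
    mean_6_4
    ```

    With the interaction, the intercept is the mean of the reference cell (cyl = 4, gear = 3), and the fitted values match the cell means. Without it, they do not. Applied to the question's data, the analogous formula would presumably be Outcome.Variable ~ Day * Group; that model is not fitted here because dat is not available.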
    
