首页 文章

优化样条回归的自由度

提问于
浏览
2

我有两个基因表达时间过程数据集:

首先,在4组中的14个时间点测量基因表达:

df1 <- structure(list(val = c(-0.1, -0.13, -0.4, -0.3, -0.3, -0.2, -0.24, 
                            0.1, 0.2, 0.13, 0, 0.63, 0.83, 0.85, -0.07, -0.07, -0.27, -0.2, 
                            -0.2, -0.1, 0.2, 0.1, 0.07, 0.17, 0.6, 0.75, 1.1, 1.1, -0.13, 
                            -0.15, -0.26, -0.25, -0.14, 0.04, 0.2, 0.24, 0.23, 0.2, 0.1, 
                            0.73, 1, 1.3, 0, 0.06, -0.24, -0.17, -0.17, -0.04, 0.16, 0.1, 
                            0.14, 0.27, 0.34, 0.9, 0.97, 1.04), 
                    time = c(-1, 0, 1, 1.58,2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                             -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                             -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17,7.39, 
                             -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58,6.17, 7.39), 
                    group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                        2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 
                                        3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,3L, 3L, 3L, 
                                        4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,4L), 
                                      .Label = c("a", "b", "c", "d"), class = "factor")), .Names = c("val","time", "group"), 
               row.names = c(NA, -56L), class = "data.frame")


df1$group <- factor(df1$group,levels=c("a","b","c","d"))

看起来像这样(添加 loess 平滑的趋势线):

library(ggplot2)
ggplot(df1,aes(x=time,y=val,color=group))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

enter image description here

其次,在相似的14个时间点测量基因表达,但现在来自2个不同的组,每个组由两个性别代表:

df2 <- structure(list(val = c(-0.23, -0.01, -0.14, -0.01, -0.21, -0.16, 
                       -0.24, -0.11, 0.02, -0.11, -0.01, -0.25, -0.47, -1.25, 0.02, 
                       -0.3, -0.02, 0.14, 0.25, -0.05, 0.15, 0.11, -0.24, -0.18, -0.39, 
                       -0.49, -0.5, -0.65, -0.06, 0.09, 0.1, 0.15, 0.08, 0.15, 0.4, 
                       0.24, 0.07, 0.08, -0.18, -0.35, -0.19, -0.81, -0.16, 0.29, -0.05, 
                       0.14, 0.14, 0.48, 0.34, 0.11, -0.07, -0.13, -0.41, -0.22, -0.54, 
                       -0.76, 0.35, 0.34, -0.06, 0.21, 0.14, 0.14, 0.25, 0.22, 0.25, 
                       0.16, 0.3, 0.44, 0.08, 0.48, 0.1, 0.16, -0.03, -0.22, 0.2, 0.01, 
                       -0.09, -0.02, -0.01, 0.06, -0.13, 0.19, 0.11, -0.04, -0.39, 0.03, 
                       -0.01, 0.09, 0.1, -0.14, -0.12, -0.1, 0.36, 0.08, 0.09, 0.09, 
                       0.42, 0.37, -0.14, 0.12, 0.09, 0.03, 0.06, -0.25, 0.2, -0.06, 
                       -0.44, 0.23, 0.03, 0.16, 0.81, 0.83),
               time = c(-1, 0, 1, 1.58,2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                        -1, 0,1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17,7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58,6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58,5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17,4.58, 5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39), 
               sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                                 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
                               .Label = c("F", "M"), class = "factor"), group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                                                                            2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
                                                                                          .Label = c("a", "b"), class = "factor")), .Names = c("val", "time", "sex", "group"), row.names = c(NA, -112L), class = "data.frame")
df2$sex <- ordered(df2$sex,levels=c("M","F"))

df2$group <- ordered(df2$group,levels=c("a","b"))

df2$col <- factor(paste0(df2$group,":",df2$sex))

看起来像这样(添加黄土平滑趋势线):

ggplot(df2,aes(x=time,y=val,color=col))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

enter image description here

对于 df1 ,我想估计 timeval 的影响,调整 group .

对于 df2 ,我想估计 time:groupval 的影响,调整 sex .

查看我认为使用 spline regression 的数据是合适的,所以我使用了 mgcv 包中的 gam 函数,据我所知,它可以优化适合数据的 spline 的自由度 .

这是我适合 df1

mgcv1.fit <- mgcv::gam(val ~ group+s(time),data=df1)

这使:

Family: gaussian 
Link function: identity 

Formula:
val ~ group + s(time)

Estimated degrees of freedom:
7.18  total = 11.18 

GCV score: 0.01258176

但这些数据的7.18自由度似乎太多了 .

对于 df2

mgcv2.fit <- mgcv::gam(val ~ sex+s(time,by=group),data=df2)

这使:

Family: gaussian 
Link function: identity 

Formula:
val ~ sex + s(time, by = group)

Estimated degrees of freedom:
1.72  total = 3.72 

GCV score: 0.08522094

我想在这种情况下我会想象自由度略高一些 .

还有一点 . 绘制这两个数据集的拟合值:

df1$mgcv <- mgcv1.fit$fitted.values
ggplot(df1,aes(x=time,y=mgcv,color=group))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

enter image description here

看起来很好 .

但对于 df2

df2$mgcv <- mgcv2.fit$fitted.values
ggplot(df2,aes(x=time,y=mgcv,color=col))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

enter image description here

看起来它翻转了组标签 .

所以我的问题是:

  • 我是否正确使用 mgcv::gam 来优化我的问题的样条自由度?

  • mgcv 重新排序 fitted.values 中的样本吗?

1 回答

  • 4

    首先, mgcv 在因子水平上做了正确的事情 . 如果你检查 str(df2$sex) ,你会看到"M"(男性)是第一级,"F"(女性)是第二级 . 但是从 str(df2$col) 开始,"F"是第一个,因此在制作情节时你会得到一些错误的标签 .

    其次,你的第二个模型没有正确指定 .

    • 当没有"by"变量或"by"是一个因子时,样条曲线 s(time) 处于居中约束下 . 因此,您需要在模型公式中将"by"变量 group 作为单独的术语来捕捉其边际效应;

    • 由于"by"变量 group 是有序变量, mgcv 对其应用对比,在构造 s(time, by = group) 时删除第一级"a" . 因此,您需要提供单独的 s(time) 作为基线平滑 .

    你当前的 mgcv2.fit 是一个相当差的模型(并不奇怪),给出了9%的解释偏差 . 但如果你这样做,你得到64% .

    gam(val ~ sex + s(time) + group + s(time, by = group), data = df2, method = "REML")
    

    ggplot 现在看起来正确(我没有更改 df2$col 所以着色仍然可能会逆转) .

    gam 默认使用"GCV.Cp"作为平滑参数选择方法 . 但建议使用"REML",因为它不太容易过度拟合 .


    Remark 1

    如果"by"变量 group 是(非有序)因子,则不受对比影响 . 所以模型公式应该是:

    val ~ sex + group + s(time, by = group)
    

    以下引用自 ?gam.models 的'by'变量部分:

    If a ‘by’ variable is a ‘factor’ then it generates an indicator
     vector for each level of the factor, unless it is an ‘ordered’
     factor. In the non-ordered case, the model matrix for the smooth
     term is then replicated for each factor level, and each copy has
     its rows multiplied by the corresponding rows of its indicator
     variable. The smoothness penalties are also duplicated for each
     factor level.  In short a different smooth is generated for each
     factor level (the ‘id’ argument to ‘s’ and ‘te’ can be used to
     force all such smooths to have the same smoothing parameter).
     ‘ordered’ ‘by’ variables are handled in the same way, except that
     no smooth is generated for the first level of the ordered factor
     (see ‘b3’ example below).  This is useful for setting up
     identifiable models when the same smooth occurs more than once in
     a model, with different factor ‘by’ variables.
    

    Remark 2

    我不是要判断你的模型,但是"F"和"M"之间似乎存在明显的组内差异 . 从您的数据中我们看到"F"和"M"在组"b"中的区别比组"a"中的差异更大 . 目前, sex 的效果在两组中都是相同的,它只是一个垂直移位 . 您可以在上面的答案中的 ggplot 中观察到这一点 . 由您决定最终的模型,但是如果您想要建模这个 sex-group 交互,您可以做

    df2$sex_group <- with(df2, interaction(sex, group))  ## the new variable is unordered
    test <- gam(val ~ sex + group + s(time, by = sex_group), data = df2, method = "REML")
    

    请注意我如何为 by 提供两个因子变量 . 创建辅助变量 sex_group .

相关问题