
Between/within standard deviations in R


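When working with hierarchical/multilevel/panel datasets, it can be very useful to have a package that returns the within-group and between-group standard deviations of the available variables.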

In Stata this is easily done with the following command:

xtsum, i(momid)

I searched, but I could not find any R package that does this.

Edit:

To clarify, an example of a hierarchical dataset might look like this:

son_id       mom_id      hispanic     mom_smoke     son_birthweigth

  1            1            1            1              3950
  2            1            1            0              3890
  3            1            1            0              3990
  1            2            0            1              4200
  2            2            0            1              4120
  1            3            0            0              2975
  2            3            0            1              2980

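The "multilevel" structure comes from the fact that each mother (the higher level) has two or more sons (the lower level). Each mother therefore defines a group of observations.

Accordingly, each variable in the dataset can vary both between and within mothers, or only between mothers. son_birthweigth varies between mothers but also within the same mother. In contrast, hispanic is fixed for a given mother.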
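
As a rough illustration of that distinction, the sketch below rebuilds the example table as a data frame and flags which variables change inside at least one mother. The helper name `varies_within` is hypothetical and not part of any package.

```r
# Example data from the question (column names kept exactly as in the table)
dat <- data.frame(
  son_id          = c(1, 2, 3, 1, 2, 1, 2),
  mom_id          = c(1, 1, 1, 2, 2, 3, 3),
  hispanic        = c(1, 1, 1, 0, 0, 0, 0),
  mom_smoke       = c(1, 0, 0, 1, 1, 0, 1),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)

# Hypothetical helper: TRUE if the variable takes more than one value
# inside at least one group (here, one mother)
varies_within <- function(x, group) {
  any(tapply(x, group, function(v) length(unique(v)) > 1))
}

measured <- c("hispanic", "mom_smoke", "son_birthweigth")
within_flags <- sapply(dat[measured], varies_within, group = dat$mom_id)
print(within_flags)
```

A variable flagged FALSE here can only contribute to between-mother variation.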

For example, the within-mother variance of son_birthweigth is:

# Per-mother means
    bwt_mean1 <- (3950+3890+3990)/3
    bwt_mean2 <- (4200+4120)/2
    bwt_mean3 <- (2975+2980)/2

# Within-mother variance for birthweigth
    ((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 + 
    (4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 + 
    (2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)
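
The same hand calculation can be written without typing the numbers out, as a minimal sketch using base R's `ave()`, which returns each row's group mean. It assumes the example data above and the same N - 1 denominator used in the hand calculation.

```r
dat <- data.frame(
  mom_id          = c(1, 1, 1, 2, 2, 3, 3),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)

# ave() defaults to FUN = mean, so this is each son's mother-level mean
group_mean <- ave(dat$son_birthweigth, dat$mom_id)

# Sum of squared deviations from the own-mother mean, divided by N - 1
within_var <- sum((dat$son_birthweigth - group_mean)^2) / (nrow(dat) - 1)
within_sd  <- sqrt(within_var)
print(c(within_var = within_var, within_sd = within_sd))
```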

while the between-mother variance is:

# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
    mean <- (3950+3890+3990+4200+4120+2975+2980)/7

# between variance:
    ((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)
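
A possible general version of this calculation uses `tapply()` to get one mean per mother. This is only a sketch on the example data. It follows the formula above, which centers the mother means on the overall mean of all observations; `var(group_means)` would instead center on the mean of the mother means, which gives a different number when mothers have different numbers of sons.

```r
dat <- data.frame(
  mom_id          = c(1, 1, 1, 2, 2, 3, 3),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)

# One mean per mother
group_means <- tapply(dat$son_birthweigth, dat$mom_id, mean)

# Overall mean of all observations (not the mean of the mother means)
grand_mean <- mean(dat$son_birthweigth)

# Squared deviations of mother means from the overall mean, over (groups - 1)
between_var <- sum((group_means - grand_mean)^2) / (length(group_means) - 1)
between_sd  <- sqrt(between_var)
print(c(between_var = between_var, between_sd = between_sd))
```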

2 Answers

  • 1

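    I don't know what your Stata command is supposed to reproduce, but to answer the second part of the question, about the hierarchical structure, this is easy to do with list. For example, you can define a structure like this: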

    tree = list(
          "var1" = list(
             "panel" = list(type ='p',mean = 1,sd=0)
             ,"cluster" = list(type = 'c',value = c(5,8,10)))
          ,"var2" = list(
              "panel" = list(type ='p',mean = 2,sd=0.5)
             ,"cluster" = list(type="c",value =c(1,2)))
    )
    

    To create such a structure, lapply over a list is a natural fit:

    tree <- lapply(list('var1','var2'),function(x){ 
      ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
                 cluster= list(type = 'c',value = rnorm(3)))  ## R prefer symbols
    })
    names(tree) <-c('var1','var2')
    

    You can inspect its structure with str:

    str(tree)
    List of 2
     $ var1:List of 2
      ..$ panel  :List of 3
      .. ..$ type: chr "p"
      .. ..$ mean: num 0.284
      .. ..$ sd  : num 0
      ..$ cluster:List of 2
      .. ..$ type : chr "c"
      .. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
     $ var2:List of 2
      ..$ panel  :List of 3
      .. ..$ type: chr "p"
      .. ..$ mean: num -0.144
      .. ..$ sd  : num 0
      ..$ cluster:List of 2
      .. ..$ type : chr "c"
      .. ..$ value: num [1:3] -0.595 -1.795 -0.439
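
To show how such a nested list might be read back, here is a small sketch that rebuilds the fixed-value version of `tree` and pulls fields out with `$`, `[[` and `sapply()`. The variable names `panel_means` and `cluster_sizes` are only illustrative.

```r
tree <- list(
  var1 = list(panel   = list(type = "p", mean = 1, sd = 0),
              cluster = list(type = "c", value = c(5, 8, 10))),
  var2 = list(panel   = list(type = "p", mean = 2, sd = 0.5),
              cluster = list(type = "c", value = c(1, 2)))
)

# Direct access to one leaf, by name
print(tree$var1$cluster$value)
print(tree[["var2"]][["panel"]][["sd"]])

# Collect the same field from every top-level entry
panel_means   <- sapply(tree, function(v) v$panel$mean)
cluster_sizes <- sapply(tree, function(v) length(v$cluster$value))
print(panel_means)
print(cluster_sizes)
```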
    

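    Edit after the OP's clarification

    I think the reshape2 package is what you want. I'll demonstrate it here.

    The idea is that, to do a multilevel analysis, we first need to reshape the data.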

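    First, split the variables into two groups: identifier variables and measured variables.

    library(reshape2)
    dat.m <- melt(dat, id.vars = c('son_id', 'mom_id'))  ## the other columns are treated as measured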

    str(dat.m)
    'data.frame':   21 obs. of  4 variables:
     $ son_id  : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
     $ mom_id  : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
     $ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
     $ value   : num  1 1 1 0 0 0 0 1 0 0 ..
    

    Once the data is in "molten" form, you can "cast" it to rearrange it into whatever shape you want:

    # per-mother means for all variables
     acast(dat.m,variable~mom_id,mean)
                               1    2      3
    hispanic           1.0000000    0    0.0
    mom_smoke          0.3333333    1    0.5
    son_birthweigth 3943.3333333 4160 2977.5
    # Within-mother sum of squared deviations for each variable (not yet divided by N - 1)
    
    acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
                               1    2    3
    hispanic           0.0000000    0  0.0
    mom_smoke          0.6666667    0  0.5
    son_birthweigth 5066.6666667 3200 12.5
    
    ## overall mean of each variable
    acast(dat.m,variable~.,mean)
    [,1]
    hispanic           0.4285714
    mom_smoke          0.5714286
    son_birthweigth 3729.2857143
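
The per-mother sums of squares in the acast() table are one step short of the within variance the question asks for. The sketch below, which uses base R only rather than reshape2 and assumes the question's N - 1 denominator, adds them up over mothers and divides, for all measured variables at once.

```r
dat <- data.frame(
  son_id          = c(1, 2, 3, 1, 2, 1, 2),
  mom_id          = c(1, 1, 1, 2, 2, 3, 3),
  hispanic        = c(1, 1, 1, 0, 0, 0, 0),
  mom_smoke       = c(1, 0, 0, 1, 1, 0, 1),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)
measured <- c("hispanic", "mom_smoke", "son_birthweigth")

# Rows = mothers, columns = variables (the transpose of the acast() table)
ss_within <- sapply(dat[measured], function(x) {
  tapply(x, dat$mom_id, function(v) sum((v - mean(v))^2))
})

# Total within sum of squares per variable, divided by N - 1
within_var <- colSums(ss_within) / (nrow(dat) - 1)
within_sd  <- sqrt(within_var)
print(round(cbind(within_var, within_sd), 4))
```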
    
  • 1

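    I know this question is four years old, but I recently wanted to do the same thing in R and came up with the following function. It depends on dplyr and tibble. Here df is the data frame, columns is a numeric vector of column indices selecting the variables to summarize, and individuals is the name of the column that identifies the individuals.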

    xtsumR<-function(df,columns,individuals){
      df<-dplyr::arrange_(df,individuals)
      panel<-tibble::tibble()
      for (i in columns){
        v<-df %>% dplyr::group_by_() %>%
          dplyr::summarize_(
            mean=mean(df[[i]]),
            sd=sd(df[[i]]),
            min=min(df[[i]]),
            max=max(df[[i]])
          )
        v<-tibble::add_column(v,variacao="overall",.before=-1)
        v2<-aggregate(df[[i]],list(df[[individuals]]),"mean")[[2]]
        sdB<-sd(v2)
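        varW<-df[[i]]-ave(df[[i]],df[[individuals]]) # subtract each individual's own mean (no fixed 12 observations per individual assumed)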
        varW<-varW+mean(df[[i]])
        sdW<-sd(varW)
        minB<-min(v2)
        maxB<-max(v2)
        minW<-min(varW)
        maxW<-max(varW)
        v<-rbind(v,c("between",NA,sdB,minB,maxB),c("within",NA,sdW,minW,maxW))
        panel<-rbind(panel,v)
      }
      var<-rep(names(df)[columns])
      n1<-rep(NA,length(columns))
      n2<-rep(NA,length(columns))
      var<-c(rbind(var,n1,n1))
      panel$var<-var
      panel<-panel[c(6,1:5)]
      names(panel)<-c("variable","variation","mean","standard.deviation","min","max")
      panel[3:6]<-as.numeric(unlist(panel[3:6]))
      panel[3:6]<-round(unlist(panel[3:6]),2)
      return(panel)
    }
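
For comparison, here is a dependency-free sketch of the same idea in base R. The name `xtsum_sketch` is hypothetical, it takes column names rather than indices, it does not handle missing values, and it is not claimed to reproduce Stata's output exactly. The between SD is centered on the overall mean, as in the question's formula, and the within values are shifted back by the overall mean.

```r
# Hypothetical helper: overall / between / within summary per variable
xtsum_sketch <- function(df, columns, individuals) {
  rows <- lapply(columns, function(col) {
    x      <- df[[col]]
    g      <- df[[individuals]]
    grand  <- mean(x)
    gmeans <- tapply(x, g, mean)     # one mean per individual
    xw     <- x - ave(x, g) + grand  # within-transformed values
    data.frame(
      variable  = col,
      variation = c("overall", "between", "within"),
      mean      = c(grand, NA, NA),
      sd        = c(sd(x),
                    sqrt(sum((gmeans - grand)^2) / (length(gmeans) - 1)),
                    sd(xw)),
      min       = c(min(x), min(gmeans), min(xw)),
      max       = c(max(x), max(gmeans), max(xw))
    )
  })
  do.call(rbind, rows)
}

# Usage on the example data from the question
dat <- data.frame(
  mom_id          = c(1, 1, 1, 2, 2, 3, 3),
  mom_smoke       = c(1, 0, 0, 1, 1, 0, 1),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)
res <- xtsum_sketch(dat, c("mom_smoke", "son_birthweigth"), "mom_id")
print(res)
```

Because group means are matched to rows with `ave()`, the sketch does not require the data to be sorted or the panel to be balanced.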
    
