Cluster rows by group based on column value

I have the following:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
             Obs = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1))

I want this:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
             Obs = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1),
             Cluster = c(0,1,1,1,2,2,2,3,3,3,0,0,1))

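How can I get the 'Cluster' column with dplyr? Each run of 1s should be numbered sequentially, and that number should carry on through the following rows until the next run of 1s starts.

Consecutive 0s must keep the current value until a new run of 1s appears.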

EDIT

How can I do this with many columns?

Suppose I have 99 Obs columns and I want to create 99 cluster columns, one per Obs column. Like this:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
Obs1 = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1),
Obs2 = c(0,0, 0, 1, 1,1,0, 1, 0, 1, 0,0,1),
ClusterObs1 = c(0,1,1,1,2,2,2,3,3,3,0,0,1),
ClusterObs2 = c(0,0,0,1,1,1,1,2,2,3,0,0,1))

2 Answers

  • 7

    Here is an option using rle:

    library(dplyr)

    df %>% 
      group_by(ID) %>% 
      mutate(clust = with(rle(Obs), rep(cumsum(values == 1), lengths)))
    # # A tibble: 13 x 4
    # # Groups:   ID [2]
    # ID   Obs Cluster clust
    # <dbl> <dbl>   <dbl> <int>
    # 1    1.    0.      0.     0
    # 2    1.    1.      1.     1
    # 3    1.    1.      1.     1
    # 4    1.    0.      1.     1
    # 5    1.    1.      2.     2
    # 6    1.    0.      2.     2
    # 7    1.    0.      2.     2
    # 8    1.    1.      3.     3
    # 9    1.    1.      3.     3
    # 10    1.    1.      3.     3
    # 11    2.    0.      0.     0
    # 12    2.    0.      0.     0
    # 13    2.    1.      1.     1
    

    Here is the main part of it:

    rle(df$Obs)
    #Run Length Encoding
    #  lengths: int [1:8] 1 2 1 1 2 3 2 1
    #  values : num [1:8] 0 1 0 1 0 1 0 1
    

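    This tells you how long each run of 1s or 0s in the Obs column is (I'm ignoring the ID grouping for now).

    What we need now is a running count of the stretches of 1s. To get it, we take the cumulative sum over the runs whose value is 1: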

    with(rle(df$Obs), cumsum(values == 1))
    #[1] 0 1 1 2 2 3 3 4
    

    So far so good. Now each of those counts has to be repeated as many times as its run is long, so we use rep together with the lengths information from rle:

    with(rle(df$Obs), rep(cumsum(values == 1), lengths))
    # [1] 0 1 1 1 2 2 2 3 3 3 3 3 4
    

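    Finally, we do this per ID group, so the count restarts for each ID. That is why the ungrouped result above ends in 3 3 4 instead of 0 0 1.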
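
    The steps above can also be collected into a small function and applied per ID without dplyr. This is only a minimal sketch: the helper name run_cluster is an assumption introduced here for illustration, and base R's ave() stands in for group_by() and mutate().

    ```r
    # Hypothetical helper (the name is not from the answer): number the runs of 1s
    # and carry each number through the 0s that follow the run.
    run_cluster <- function(x) {
      r <- rle(x)                             # run lengths and run values
      rep(cumsum(r$values == 1), r$lengths)   # count runs of 1s, expand back to rows
    }

    df <- data.frame(ID  = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
                     Obs = c(0,1,1,0,1,0,0,1,1,1,0,0,1))

    # ave() applies run_cluster within each ID and keeps the original row order
    df$Cluster <- ave(df$Obs, df$ID, FUN = run_cluster)
    ```

    Because the function is applied group by group, the counter starts again from 0 at the first row of ID 2.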


    If you need to create several cluster columns, one for each Obs column, this can easily be done as follows:

    df %>% 
      group_by(ID) %>% 
      mutate_at(vars(starts_with("Obs")), 
                funs(cluster= with(rle(.), rep(cumsum(values == 1), lengths))))
    
    # # A tibble: 13 x 7
    # # Groups:   ID [2]
    # ID  Obs1  Obs2 ClusterObs1 ClusterObs2 Obs1_cluster Obs2_cluster
    # <dbl> <dbl> <dbl>       <dbl>       <dbl>        <int>        <int>
    # 1    1.    0.    0.          0.          0.            0            0
    # 2    1.    1.    0.          1.          0.            1            0
    # 3    1.    1.    0.          1.          0.            1            0
    # 4    1.    0.    1.          1.          1.            1            1
    # 5    1.    1.    1.          2.          1.            2            1
    # 6    1.    0.    1.          2.          1.            2            1
    # 7    1.    0.    0.          2.          1.            2            1
    # 8    1.    1.    1.          3.          2.            3            2
    # 9    1.    1.    0.          3.          2.            3            2
    # 10    1.    1.    1.          3.          3.            3            3
    # 11    2.    0.    0.          0.          0.            0            0
    # 12    2.    0.    0.          0.          0.            0            0
    # 13    2.    1.    1.          1.          1.            1            1
    

    where df is:

    df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2), Obs1 = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1), Obs2 = c(0,0, 0, 1, 1,1,0, 1, 0, 1, 0,0,1), ClusterObs1 = c(0,1,1,1,2,2,2,3,3,3,0,0,1), ClusterObs2 = c(0,0,0,1,1,1,1,2,2,3,0,0,1))
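
    In dplyr 1.0.0 and later, mutate_at() is superseded by across(), and funs() has been deprecated since dplyr 0.8.0. The following is a sketch of how the multi-column step might look with across(); the helper name run_cluster is an assumption made for this example, and the .names pattern is chosen to reproduce the Obs1_cluster and Obs2_cluster column names shown above.

    ```r
    library(dplyr)

    # Hypothetical helper (name not from the answer): same rle logic as above
    run_cluster <- function(x) {
      r <- rle(x)
      rep(cumsum(r$values == 1), r$lengths)
    }

    df <- data.frame(ID   = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
                     Obs1 = c(0,1,1,0,1,0,0,1,1,1,0,0,1),
                     Obs2 = c(0,0,0,1,1,1,0,1,0,1,0,0,1))

    res <- df %>%
      group_by(ID) %>%
      # apply the helper to every column whose name starts with "Obs"
      mutate(across(starts_with("Obs"), run_cluster,
                    .names = "{.col}_cluster")) %>%
      ungroup()
    ```

    The same call should scale to 99 Obs columns unchanged, as long as their names all start with "Obs".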
    
  • 2

    This is a really interesting problem, so here is a data.table solution:

    # Packages used
    library(data.table)
    library(magrittr)
    
    # Setup
    setDT(df)
    df[, Obs := as.integer(Obs)]
    
    # Calculations
    df[, Cluster := cumsum(!Obs), by = ID] %>%
      .[, Cluster := Cluster - rowid(Obs) * !Obs, by = rleid(Obs)] %>%
      .[, Cluster := frank(Cluster, ties.method = "dense") - 1L, by = ID]
    
    df
        ID Obs Cluster
     1:  1   0       0
     2:  1   1       1
     3:  1   1       1
     4:  1   0       1
     5:  1   1       2
     6:  1   0       2
     7:  1   0       2
     8:  1   1       3
     9:  1   1       3
    10:  1   1       3
    11:  2   0       0
    12:  2   0       0
    13:  2   1       1
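
    The code above handles a single Obs column. For the many-column case from the question's EDIT, one possible data.table sketch is shown below. It does not use the cumsum/frank steps of this answer; it swaps in the rle idea from the other answer and applies it through .SDcols. The names run_cluster, obs_cols and new_cols are assumptions introduced for illustration.

    ```r
    library(data.table)

    # Hypothetical helper (name not from either answer): rle-based run counter
    run_cluster <- function(x) {
      r <- rle(x)
      rep(cumsum(r$values == 1), r$lengths)
    }

    dt <- data.table(ID   = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
                     Obs1 = c(0,1,1,0,1,0,0,1,1,1,0,0,1),
                     Obs2 = c(0,0,0,1,1,1,0,1,0,1,0,0,1))

    obs_cols <- grep("^Obs", names(dt), value = TRUE)  # input columns
    new_cols <- paste0("Cluster", obs_cols)            # output column names

    # one new cluster column per Obs column, computed within each ID
    dt[, (new_cols) := lapply(.SD, run_cluster), by = ID, .SDcols = obs_cols]
    ```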
    
