
Filter a grouped variable while maintaining sequence


I have a data frame:

df <- data.frame(
        Group=c('A','A','A','A','B','B','B','B'),
        Activity = c('EOSP','NOR','EOSP','COSP','NOR','EOSP','WL','NOR'),
        TimeLine=c(1,2,3,4,1,2,3,4)
      )

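I want to filter each group for two activities, and also respect the order in which they occur. For example, I am looking only for the activities EOSP and NOR, but also in that order. This code: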

df %>% group_by(Group) %>% 
        filter(all(c('EOSP','NOR') %in% Activity) & Activity %in% c('EOSP','NOR'))

gives:

# A tibble: 6 x 3
# Groups:   Group [2]
  Group Activity TimeLine
  <fct> <fct>       <dbl>
1 A     EOSP            1
2 A     NOR             2
3 A     EOSP            3
4 B     NOR             1
5 B     EOSP            2
6 B     NOR             4

I do not want row 3, where EOSP occurs after NOR. Likewise for group B, I do not want row 4, because that NOR occurs before EOSP. How can I achieve this?

3 Answers

  • 3

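    You can use match to get the first instance of Activity == 'EOSP', and use slice to remove everything before it. Once you have done that, you can remove duplicates and filter for EOSP and NOR, i.e.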

    library(tidyverse)
    
    df %>% 
     group_by(Group) %>% 
     mutate(new = match('EOSP', Activity)) %>% 
     slice(new:n()) %>% 
     distinct(Activity, .keep_all = TRUE) %>% 
     filter(Activity %in% c('EOSP', 'NOR'))
    

    This gives,

    # A tibble: 4 x 4
    # Groups:   Group [2]
      Group Activity TimeLine   new
      <fct> <fct>       <dbl> <int>
    1 A     EOSP            1     1
    2 A     NOR             2     1
    3 B     EOSP            2     2
    4 B     NOR             4     2


    **NOTE 1:** You can `ungroup()` and `select(-new)` afterwards to drop the grouping and the helper column.

    **NOTE 2:** The warning messages issued here

    > Warning messages: 1: In new:4L : numerical expression has 4 elements: only the first used 2: In new:4L : numerical expression has 4 elements: only the first used

    do not affect us, since only the first element is needed and all elements are the same within a group anyway.
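
    If the warning and the helper column are unwanted, the same idea can be written without them. The following is only a sketch, not the original answer's code: it assumes dplyr is installed and that the rows within each group are already ordered by TimeLine. The name `res` is illustrative.

    ```r
    library(dplyr)

    df <- data.frame(
      Group    = c('A','A','A','A','B','B','B','B'),
      Activity = c('EOSP','NOR','EOSP','COSP','NOR','EOSP','WL','NOR'),
      TimeLine = c(1,2,3,4,1,2,3,4)
    )

    res <- df %>%
      group_by(Group) %>%
      # keep rows from the first EOSP onward; match() returns a single index,
      # so no length warning is raised (a group without EOSP is dropped entirely)
      filter(row_number() >= match('EOSP', Activity)) %>%
      # keep the first occurrence of each activity within the group
      distinct(Activity, .keep_all = TRUE) %>%
      filter(Activity %in% c('EOSP', 'NOR')) %>%
      ungroup()
    ```

    Comparing `row_number()` with the matched position replaces `slice(new:n())`, so no intermediate column needs to be removed afterwards.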
  • 3

    Here is an option with the data.table package: join df with a subset of itself that keeps only the EOSP rows and computes the minimum TimeLine per group. Then keep only the rows whose TimeLine is greater than or equal to that minimum, so that NOR is kept only if an EOSP occurs before it. Finally, drop duplicated Group and Activity combinations if you want to keep only two activities per group:

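    library(data.table)
    setDT(df)  # the syntax below requires df to be a data.table

    # join df with the first EOSP time per group (column V1), then filter and de-duplicate
    df[df[Activity == "EOSP", min(TimeLine), by = Group], on = "Group"][
      Activity %in% c("NOR", "EOSP") & TimeLine >= V1][
      !duplicated(paste(Group, Activity))]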
    
    #   Group Activity TimeLine V1
    #1:     A     EOSP        1  1
    #2:     A      NOR        2  1
    #3:     B     EOSP        2  2
    #4:     B      NOR        4  2
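
    The one-liner above can be hard to follow, so here is the same sequence of steps unpacked. This is a sketch under the same assumptions as the answer (data.table installed, rows ordered by TimeLine within each group); the names `dt`, `first_eosp`, `first_time` and `res` are illustrative and replace the auto-generated column V1.

    ```r
    library(data.table)

    dt <- data.table(
      Group    = c('A','A','A','A','B','B','B','B'),
      Activity = c('EOSP','NOR','EOSP','COSP','NOR','EOSP','WL','NOR'),
      TimeLine = c(1,2,3,4,1,2,3,4)
    )

    # step 1: time of the first EOSP in each group
    first_eosp <- dt[Activity == "EOSP", .(first_time = min(TimeLine)), by = Group]

    # step 2: attach that time to every row of the group, then keep only
    # EOSP/NOR rows that occur at or after it
    res <- dt[first_eosp, on = "Group"][
      Activity %in% c("EOSP", "NOR") & TimeLine >= first_time]

    # step 3: keep the first row per Group/Activity pair and drop the helper column
    res <- unique(res, by = c("Group", "Activity"))
    res[, first_time := NULL]
    ```

    Groups that contain no EOSP never appear in `first_eosp`, so the join leaves them out of the result.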
    
  • 1

    Here is a dplyr idea:

    df %>%
      filter(Activity %in% c('EOSP','NOR')) %>%
      group_by(Group) %>%
      mutate(tmp = which(Activity == 'EOSP' & !duplicated(Activity))) %>%
      filter(row_number() %in%  c(tmp, tmp+1)) 
    
    # A tibble: 4 x 4
    # Groups:   Group [2]
      Group Activity TimeLine   tmp
      <fct> <fct>       <dbl> <int>
    1 A     EOSP            1     1
    2 A     NOR             2     1
    3 B     EOSP            2     2
    4 B     NOR             4     2
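
    The code above relies on NOR being the row directly after the first EOSP once the other activities are filtered out. If a group could contain EOSP, EOSP, NOR, one possible variant is to look up the first NOR after the first EOSP explicitly. This is a sketch only; `first_pair` is a hypothetical helper, not a dplyr function, and `res` is an illustrative name.

    ```r
    library(dplyr)

    df <- data.frame(
      Group    = c('A','A','A','A','B','B','B','B'),
      Activity = c('EOSP','NOR','EOSP','COSP','NOR','EOSP','WL','NOR'),
      TimeLine = c(1,2,3,4,1,2,3,4)
    )

    # hypothetical helper: TRUE for the first `first` and the first `second` after it
    first_pair <- function(activity, first = 'EOSP', second = 'NOR') {
      pos <- seq_along(activity)
      i <- match(first, activity)                    # position of the first EOSP
      if (is.na(i)) return(rep(FALSE, length(activity)))
      j <- which(activity == second & pos > i)[1]    # first NOR after it (NA if none)
      pos %in% c(i, j)
    }

    res <- df %>%
      group_by(Group) %>%
      filter(first_pair(Activity)) %>%
      ungroup()
    ```

    In this sketch a group with an EOSP but no later NOR keeps only its EOSP row; add a check on `j` if such groups should be dropped instead.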
    
