Select first and last row from grouped data

Question

Using dplyr, how do I select the top and bottom observations/rows of grouped data in one statement?

Data & Example

Given a data frame:

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3), 
                 stopId=c("a","b","c","a","b","c","a","b","c"), 
                 stopSequence=c(1,2,3,3,1,4,3,1,2))

I can get the top and bottom observations from each group using slice, but only with two separate statements:

firstStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(1) %>%
  ungroup

lastStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(n()) %>%
  ungroup

Can I combine these two statements into one that selects both the top and bottom observations?

6 Answers

  • 79

    Something like:

    library(dplyr)
    
    df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                     stopId=c("a","b","c","a","b","c","a","b","c"),
                     stopSequence=c(1,2,3,3,1,4,3,1,2))
    
    first_last <- function(x) {
      bind_rows(slice(x, 1), slice(x, n()))
    }
    
    df %>%
      group_by(id) %>%
      arrange(stopSequence) %>%
      do(first_last(.)) %>%
      ungroup
    
    ## Source: local data frame [6 x 3]
    ## 
    ##   id stopId stopSequence
    ## 1  1      a            1
    ## 2  1      c            3
    ## 3  2      b            1
    ## 4  2      c            4
    ## 5  3      b            1
    ## 6  3      a            3
    

    With do you can perform pretty much any number of operations on the group, but @jeremycg's answer is better suited to this task.

  • 6

    There is probably a faster way:

    df %>%
      group_by(id) %>%
      arrange(stopSequence) %>%
      filter(row_number()==1 | row_number()==n())
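
A self-contained version of the filter approach, as a minimal sketch using the data frame from the question. It assumes dplyr is installed; the explicit `arrange(id, stopSequence)` before grouping and the result name `first_last` are choices made for this illustration, not part of the original answer.

```r
library(dplyr)

# Data frame from the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Sort first, then keep the first and last row of each id group
first_last <- df %>%
  arrange(id, stopSequence) %>%
  group_by(id) %>%
  filter(row_number() == 1 | row_number() == n()) %>%
  ungroup()

print(first_last)
```

Because the condition is a logical test per row, a group with a single row contributes that row once rather than twice.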
    
  • 6

    Just for completeness: you can pass slice a vector of indices:

    df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))
    

    This gives

    id stopId stopSequence
    1  1      a            1
    2  1      c            3
    3  2      b            1
    4  2      c            4
    5  3      b            1
    6  3      a            3
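
One caveat with the index-vector approach: when a group has only one row, `c(1, n())` evaluates to `c(1, 1)`, and slice then returns that row twice. A minimal sketch of one way around this is to wrap the indices in `unique()`. The data frame `df2` and its single-row group (id 4) are hypothetical, added only to illustrate the case.

```r
library(dplyr)

# Hypothetical data: id 4 has a single row
df2 <- data.frame(id = c(1, 1, 1, 4),
                  stopId = c("a", "b", "c", "a"),
                  stopSequence = c(1, 2, 3, 1))

# unique() collapses c(1, 1) to 1, so a one-row group is returned once
res <- df2 %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(unique(c(1, n()))) %>%
  ungroup()

print(res)
```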
    
  • 14

    Not dplyr, but it is more direct with data.table:

    library(data.table)
    setDT(df)
    df[ df[order(id, stopSequence), .I[c(1L,.N)], by=id]$V1 ]
    #    id stopId stopSequence
    # 1:  1      a            1
    # 2:  1      c            3
    # 3:  2      b            1
    # 4:  2      c            4
    # 5:  3      b            1
    # 6:  3      a            3
    

    A more detailed explanation:

    # 1) get row numbers of first/last observations from each group
    #    * basically, we sort the table by id/stopSequence, then,
    #      grouping by id, name the row numbers of the first/last
    #      observations for each id; since this operation produces
    #      a data.table
    #    * .I is data.table shorthand for the row number
    #    * here, to be maximally explicit, I've named the variable V1
    #      as row_num to give other readers of my code a clearer
    #      understanding of what operation is producing what variable
    first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id]
    idx = first_last$row_num
    
    # 2) extract rows by number
    df[idx]
    

    Be sure to check out the Getting Started wiki for data.table basics.

  • 166

    I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too:

    Base package:

    df <- df[with(df, order(id, stopSequence, stopId)), ]
    merge(df[!duplicated(df$id), ], 
          df[!duplicated(df$id, fromLast = TRUE), ], 
          all = TRUE)
    

    data.table:

    df <-  setDT(df)
    df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]
    

    sqldf:

    library(sqldf)
    min <- sqldf("SELECT id, stopId, min(stopSequence) AS StopSequence
          FROM df GROUP BY id 
          ORDER BY id, StopSequence, stopId")
    max <- sqldf("SELECT id, stopId, max(stopSequence) AS StopSequence
          FROM df GROUP BY id 
          ORDER BY id, StopSequence, stopId")
    sqldf("SELECT * FROM min
          UNION
          SELECT * FROM max")
    

    In one query:

    sqldf("SELECT * 
            FROM (SELECT id, stopId, min(stopSequence) AS StopSequence
                  FROM df GROUP BY id 
                  ORDER BY id, StopSequence, stopId)
            UNION
            SELECT *
            FROM (SELECT id, stopId, max(stopSequence) AS StopSequence
                  FROM df GROUP BY id 
                  ORDER BY id, StopSequence, stopId)")
    

    Output:

    id stopId StopSequence
    1  1      a            1
    2  1      c            3
    3  2      b            1
    4  2      c            4
    5  3      a            3
    6  3      b            1
    
  • 0

    Using data.table in 2018:

    # convert to data.table
    setDT(df) 
    # order, group, filter
    df[order(stopSequence)][, .SD[c(1, .N)], by = id]
    
       id stopId stopSequence
    1:  1      a            1
    2:  1      c            3
    3:  2      b            1
    4:  2      c            4
    5:  3      b            1
    6:  3      a            3
    
