dplyr::filter() based on dplyr::lag() without losing the first value

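When I filter a dataset based on the lag() function, I lose the first row of each group, because those rows have no lagged value. How can I avoid this, so that the first rows are kept even though they have no lagged value?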

library(dplyr)

# sample data (dput output)
ds <- 
  structure(list(mpg = c(21, 21, 21.4, 18.7, 14.3, 16.4), cyl = c(6, 
  6, 6, 8, 8, 8), hp = c(110, 110, 110, 175, 245, 180)), class = c("tbl_df", 
  "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("mpg", 
  "cyl", "hp"))

# example of filter based on lag that drops first rows
ds %>% 
  group_by(cyl) %>% 
  arrange(-mpg) %>% 
  filter(hp <= lag(hp))

2 Answers

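filter(hp <= lag(hp)) excludes the rows where lag(hp) is NA, because the comparison evaluates to NA there and filter() keeps only rows where the condition is TRUE. You can instead filter for either that inequality or lag(hp) being NA, which is the case for the top row of each group.

For clarity and debugging, I included prev = lag(hp) to store the lagged values in a separate variable.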

library(tidyverse)

ds %>%
    group_by(cyl) %>%
    arrange(-mpg) %>%
    mutate(prev = lag(hp)) %>%
    filter(hp <= prev | is.na(prev))

This yields:

# A tibble: 4 x 4
# Groups:   cyl [2]
    mpg   cyl    hp  prev
  <dbl> <dbl> <dbl> <dbl>
1  21.4    6.  110.   NA 
2  21.0    6.  110.  110.
3  21.0    6.  110.  110.
4  18.7    8.  175.   NA
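
To see why the first row of each group disappears in the first place, here is a minimal sketch on a plain vector (the vector `hp` and the names `cond` and `cond_keep_first` are only for illustration; they are not part of the original answer). It assumes dplyr is installed.

```r
library(dplyr)

# illustrative values: hp for the cyl == 6 group
hp <- c(110, 110, 110)

# lag() shifts the vector by one position; the first element has no
# predecessor, so it becomes NA
prev <- lag(hp)

# comparing against NA gives NA, and filter() keeps only rows where the
# condition is TRUE, so the first row would be dropped
cond <- hp <= prev

# OR-ing with is.na(prev) turns the leading NA into TRUE
cond_keep_first <- hp <= prev | is.na(prev)

print(cond)
print(cond_keep_first)
```

The same logic applies per group once the data is grouped with group_by().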

Answered 2024-05-03T12:27:24+08:00

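Since the OP intends to compare with <= (less than or equal to) against the previous value, using lag with default = +Inf is sufficient: the first row of each group is then compared against +Inf and always passes.

Moreover, since lag offers an order_by argument, there is no need for a separate arrange call in the dplyr chain.

The solution can therefore be written as: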

ds %>% 
  group_by(cyl) %>% 
  filter(hp <= lag(hp, default = +Inf, order_by = -mpg))

# The result below is in the original order of the data.frame, even though lag
# was computed on the values ordered by descending mpg
# # A tibble: 4 x 3
# # Groups: cyl [2]
#     mpg   cyl    hp
#    <dbl> <dbl> <dbl>
# 1  21.0  6.00   110
# 2  21.0  6.00   110
# 3  21.4  6.00   110
# 4  18.7  8.00   175
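
As a rough check of how default and order_by interact, here is a small sketch on the cyl == 8 group alone (the standalone vectors `hp` and `mpg` and the names `prev` and `keep` are only for illustration). It assumes dplyr is installed.

```r
library(dplyr)

# illustrative values: the cyl == 8 rows in their original order
hp  <- c(175, 245, 180)
mpg <- c(18.7, 14.3, 16.4)

# order_by = -mpg computes the lag as if the rows were sorted by
# descending mpg, but returns the result in the original row order;
# default = Inf fills the slot that has no predecessor
prev <- lag(hp, default = Inf, order_by = -mpg)

# the row with the highest mpg is compared against Inf and is always kept
keep <- hp <= prev

print(prev)
print(keep)
```

Note that default = Inf only fits comparisons of the form <= or <; for a >= or > comparison, -Inf would presumably play the same role.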

Answered 2024-05-03T12:27:24+08:00
