
Writing a custom function that works inside dplyr::mutate()


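I am trying to write a function that works inside dplyr::mutate().

Because rowwise() %>% sum() is very slow on large datasets, the suggested alternative is to fall back to base R. I would like to simplify that pattern as shown below, but I am having trouble getting the data passed into my function when it is called inside mutate().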

require(tidyverse)
#> Loading required package: tidyverse
#I'd like to write a function that works inside mutate and replaces the rowSums(select()).
cars <- as_tibble(cars)

cars %>% 
  mutate(sum = rowSums(select(., speed, dist), na.rm = T))
#> # A tibble: 50 x 3
#>    speed  dist   sum
#>    <dbl> <dbl> <dbl>
#>  1    4.    2.    6.
#>  2    4.   10.   14.
#>  3    7.    4.   11.
#>  4    7.   22.   29.
#>  5    8.   16.   24.
#>  6    9.   10.   19.
#>  7   10.   18.   28.
#>  8   10.   26.   36.
#>  9   10.   34.   44.
#> 10   11.   17.   28.
#> # ... with 40 more rows

#Here is my first attempt.
rowwise_sum <- function(data, ..., na.rm = FALSE) {
  columns <- rlang::enquos(...)

  data %>% 
    select(!!!columns) %>% 
    rowSums(na.rm = na.rm)
}

#Doesn't work as expected:
cars %>% mutate(sum = rowwise_sum(speed, dist, na.rm = T))
#> Error in mutate_impl(.data, dots): Evaluation error: no applicable method for 'select_' applied to an object of class "c('double', 'numeric')".

#But alone it is creating a vector.
cars %>% rowwise_sum(speed, dist, na.rm = T)
#>  [1]   6  14  11  29  24  19  28  36  44  28  39  26  32  36  40  39  47
#> [18]  47  59  40  50  74  94  35  41  69  48  56  49  57  67  60  74  94
#> [35] 102  55  65  87  52  68  72  76  84  88  77  94 116 117 144 110

#Appears to not be getting the data passed.  Specifying with a dot works.
cars %>% mutate(sum = rowwise_sum(., speed, dist, na.rm = T))
#> # A tibble: 50 x 3
#>    speed  dist   sum
#>    <dbl> <dbl> <dbl>
#>  1    4.    2.    6.
#>  2    4.   10.   14.
#>  3    7.    4.   11.
#>  4    7.   22.   29.
#>  5    8.   16.   24.
#>  6    9.   10.   19.
#>  7   10.   18.   28.
#>  8   10.   26.   36.
#>  9   10.   34.   44.
#> 10   11.   17.   28.
#> # ... with 40 more rows

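So the question becomes: how can I pass the data from inside the function, so that the dot does not have to be included on every call?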

rowwise_sum2 <- function(data, ..., na.rm = FALSE) {
  columns <- rlang::enquos(...)

  data %>% 
    select(!!!columns) %>% 
    rowSums(., na.rm = na.rm)
}

#Same error
cars %>% mutate(sum = rowwise_sum2(speed, dist, na.rm = T))
#> Error in mutate_impl(.data, dots): Evaluation error: no applicable method for 'select_' applied to an object of class "c('double', 'numeric')".

#Same result
cars %>% rowwise_sum2(speed, dist, na.rm = T)
#>  [1]   6  14  11  29  24  19  28  36  44  28  39  26  32  36  40  39  47
#> [18]  47  59  40  50  74  94  35  41  69  48  56  49  57  67  60  74  94
#> [35] 102  55  65  87  52  68  72  76  84  88  77  94 116 117 144 110

#Same result
cars %>% mutate(sum = rowwise_sum2(., speed, dist, na.rm = T))
#> # A tibble: 50 x 3
#>    speed  dist   sum
#>    <dbl> <dbl> <dbl>
#>  1    4.    2.    6.
#>  2    4.   10.   14.
#>  3    7.    4.   11.
#>  4    7.   22.   29.
#>  5    8.   16.   24.
#>  6    9.   10.   19.
#>  7   10.   18.   28.
#>  8   10.   26.   36.
#>  9   10.   34.   44.
#> 10   11.   17.   28.
#> # ... with 40 more rows

Created on 2018-05-22 by the reprex package (v0.2.0).


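From akrun's answer (please upvote it):

In other words: drop mutate() and do everything inside the new function.

Here is my final function, an update on his, which also allows naming the sum column if desired.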

rowwise_sum <- function(data, ..., sum_col = "sum", na.rm = FALSE) {

  columns <- rlang::enquos(...)

  data %>%
    select(!!! columns) %>%
    transmute(!!sum_col := rowSums(., na.rm = na.rm)) %>%
    bind_cols(data, .)
}
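
As a usage sketch for the final function above: the definition is repeated only so the snippet runs on its own, and the column name "total" and the object name cars_tbl are arbitrary choices for illustration.

```r
library(dplyr)

# Same definition as above, repeated so this snippet is self-contained.
rowwise_sum <- function(data, ..., sum_col = "sum", na.rm = FALSE) {
  columns <- rlang::enquos(...)

  data %>%
    select(!!!columns) %>%
    transmute(!!sum_col := rowSums(., na.rm = na.rm)) %>%
    bind_cols(data, .)
}

cars_tbl <- as_tibble(cars)

# No mutate() needed: the function returns the original columns
# plus one new column named by `sum_col`.
result <- cars_tbl %>% rowwise_sum(speed, dist, sum_col = "total", na.rm = TRUE)
names(result)
```

Because sum_col is a named argument placed after ..., it has to be passed by name; an unnamed string would be captured by ... and treated as a column selection.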

1 Answer

  • 3

    We can place ... at the end

    rowwise_sum <- function(data, na.rm = FALSE,...) {
      columns <- rlang::enquos(...)
      data %>%
         select(!!!columns) %>%
         rowSums(na.rm = na.rm)
    }
    
    cars %>% 
         mutate(sum = rowwise_sum(., na.rm = TRUE, speed, dist))
    # A tibble: 50 x 3
    #   speed  dist   sum
    #   <dbl> <dbl> <dbl>
    # 1     4     2     6
    # 2     4    10    14
    # 3     7     4    11
    # 4     7    22    29
    # 5     8    16    24
    # 6     9    10    19
    # 7    10    18    28
    # 8    10    26    36
    # 9    10    34    44
    #10    11    17    28
    # ... with 40 more rows
    

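    It also works without changing the position of ... (although placing it last is generally recommended). The main problem here is that the data (.) was not specified in the argument list inside mutate.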
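
To make that point concrete, here is a small sketch with a hypothetical helper, first_arg_class (not part of the question or of dplyr), that only reports the class of whatever arrives as its first argument. Without the dot, mutate() evaluates speed to the column vector and that vector lands in the data slot, which is why select() then fails on a numeric.

```r
library(dplyr)

# Hypothetical helper for illustration: returns the class of its first argument.
# The remaining arguments are accepted but never evaluated.
first_arg_class <- function(data, ...) class(data)[1]

cars_tbl <- as_tibble(cars)

# Without the dot: `speed` (a numeric column vector) is matched to `data`.
without_dot <- cars_tbl %>% mutate(first_arg = first_arg_class(speed, dist))

# With the dot: the whole tibble is matched to `data`.
with_dot <- cars_tbl %>% mutate(first_arg = first_arg_class(., speed, dist))

unique(without_dot$first_arg)  # "numeric"
unique(with_dot$first_arg)     # "tbl_df"
```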


    It would be easier to build the whole pipeline inside the function rather than doing only part of it

    rowwise_sum2 <- function(data, na.rm = FALSE, ...) {
      columns <- rlang::enquos(...)
      data %>%
          select(!!! columns) %>%
          transmute(sum = rowSums(., na.rm = na.rm)) %>%
          bind_cols(data, .)
    
    }
    
    rowwise_sum2(cars, na.rm = TRUE, speed, dist)
    # A tibble: 50 x 3
    #   speed  dist   sum
    #   <dbl> <dbl> <dbl>
    # 1     4     2     6
    # 2     4    10    14
    # 3     7     4    11
    # 4     7    22    29
    # 5     8    16    24
    # 6     9    10    19
    # 7    10    18    28
    # 8    10    26    36
    # 9    10    34    44
    #10    11    17    28
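
If calling the helper inside mutate() without the dot is the real goal, one possible alternative (not from the answer above) is to have the function accept the column vectors themselves instead of a data frame. The sketch below uses a hypothetical name, row_sum_vec, and only base R's cbind() and rowSums(); it assumes the columns are listed explicitly, so tidyselect helpers such as starts_with() are not supported.

```r
library(dplyr)

# Hypothetical helper: takes column vectors, binds them into a matrix,
# and sums across each row. No data frame argument is needed because
# mutate() already evaluates `speed` and `dist` to vectors.
row_sum_vec <- function(..., na.rm = FALSE) {
  rowSums(cbind(...), na.rm = na.rm)
}

cars_tbl <- as_tibble(cars)

# Works inside mutate() without the dot.
result <- cars_tbl %>% mutate(sum = row_sum_vec(speed, dist, na.rm = TRUE))
head(result$sum, 3)  # 6 14 11
```

The trade-off is that na.rm must always be passed by name here, since any unnamed argument is treated as another column.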
    
