
Replace NA in a dplyr chain


Question has been edited from the original.

After reading this interesting discussion, I was wondering how to replace NAs in a column using dplyr, for example in the Lahman batting data:

Source: local data frame [96,600 x 3]
Groups: teamID

   yearID teamID G_batting
1    2004    SFN        11
2    2006    CHN        43
3    2007    CHA         2
4    2008    BOS         5
5    2009    SEA         3
6    2010    SEA         4
7    2012    NYA        NA

The following does not work as I expected:

library(dplyr)
library(Lahman)

df <- Batting[ c("yearID", "teamID", "G_batting") ]
df <- group_by(df, teamID )
df$G_batting[is.na(df$G_batting)] <- mean(df$G_batting, na.rm = TRUE)

Source: local data frame [20 x 3]
Groups: yearID, teamID

yearID teamID G_batting
1    2004    SFN  11.00000
2    2006    CHN  43.00000
3    2007    CHA   2.00000
4    2008    BOS   5.00000
5    2009    SEA   3.00000
6    2010    SEA   4.00000
7    2012    NYA  **49.07894**

> mean(Batting$G_batting, na.rm = TRUE)
[1] **49.07894**

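Indeed, it imputes the overall mean and not the group mean. How would you do this in a dplyr chain? Using transform from base R does not work either, as it also imputes the overall mean rather than the group mean. That approach also converts the data to a regular data.frame. Is there a better way to do this?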

df %.% 
  group_by( yearID ) %.%
  transform(G_batting = ifelse(is.na(G_batting), 
    mean(G_batting, na.rm = TRUE), 
    G_batting)
  )
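For what it's worth, here is a base R sketch of why both attempts above end up with the overall mean: `df$G_batting` and `transform()` treat the column as one plain vector, so `mean()` never sees the grouping. The toy `team` and `games` vectors are made up for illustration and are not taken from Lahman.

```r
# Toy data (hypothetical, not from Lahman): two teams, one NA in each
team  <- c("A", "A", "A", "B", "B", "B")
games <- c(10, 20, NA, 100, NA, 300)

# What the attempts above effectively compute: a single mean over all rows
overall_mean <- mean(games, na.rm = TRUE)

# Per-team means, then looked up by team name so there is one value per row
team_means <- tapply(games, team, mean, na.rm = TRUE)
per_row    <- as.vector(team_means[team])

filled_overall <- ifelse(is.na(games), overall_mean, games)
filled_by_team <- ifelse(is.na(games), per_row, games)

print(filled_overall)
print(filled_by_team)
```

The two results differ only in the rows that were NA, which is the same symptom as the NYA row in the output above.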

Edit: Replacing transform with mutate gives the following error

Error in mutate_impl(.data, named_dots(...), environment()) : 
  INTEGER() can only be applied to a 'integer', not a 'double'

Edit: Adding as.integer seems to resolve the error and does produce the expected result. See also @eddi's answer.

df %.% 
  group_by( teamID ) %.%
  mutate(G_batting = ifelse(is.na(G_batting), as.integer(mean(G_batting, na.rm = TRUE)), G_batting))

Source: local data frame [96,600 x 3]
Groups: teamID

   yearID teamID G_batting
1    2004    SFN        11
2    2006    CHN        43
3    2007    CHA         2
4    2008    BOS         5
5    2009    SEA         3
6    2010    SEA         4
7    2012    NYA        47

> mean_NYA <- mean(filter(df, teamID == "NYA")$G_batting, na.rm = TRUE)
> as.integer(mean_NYA)
[1] 47
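One caveat with the as.integer workaround, sketched below: `as.integer()` drops the fractional part instead of rounding. The value `47.9` is a made-up stand-in for a group mean, chosen only to make the difference visible.

```r
# Hypothetical group mean, not the actual NYA value
m <- 47.9

truncated <- as.integer(m)         # drops the fractional part
rounded   <- as.integer(round(m))  # rounds first, then converts

print(c(truncated = truncated, rounded = rounded))
```

If rounding to the nearest game count matters, wrapping the mean in `round()` before `as.integer()` is one option.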

Edit: Following @Romain's comment I installed dplyr from github:

> head(df,10)
   yearID teamID G_batting
1    2004    SFN        11
2    2006    CHN        43
3    2007    CHA         2
4    2008    BOS         5
5    2009    SEA         3
6    2010    SEA         4
7    2012    NYA        NA
8    1954    ML1       122
9    1955    ML1       153
10   1956    ML1       153

> df %.% 
+   group_by(teamID)  %.%
+   mutate(G_batting = ifelse(is.na(G_batting), mean(G_batting, na.rm = TRUE), G_batting))
Source: local data frame [96,600 x 3]
Groups: teamID

   yearID teamID  G_batting
1    2004    SFN          0
2    2006    CHN          0
3    2007    CHA          0
4    2008    BOS          0
5    2009    SEA          0
6    2010    SEA 1074266112
7    2012    NYA   90693125
8    1954    ML1        122
9    1955    ML1        153
10   1956    ML1        153
..    ...    ...        ...

So I don't get the error (good), but I get a (seemingly) strange result.
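One possible reading of those numbers, offered as a guess and not as something confirmed here: they look like the raw bytes of doubles being read back as 32-bit integers. The base R sketch below only demonstrates that arithmetic for the value 3 (the SEA entry one row above the odd value); it says nothing definite about what that dplyr build did internally.

```r
# Write the double 3 as 8 little-endian bytes
bytes <- writeBin(3, raw(), endian = "little")

# Read the same 8 bytes back as two 4-byte integers
as_ints <- readBin(bytes, what = "integer", n = 2L, size = 4L, endian = "little")

print(as_ints)
```

The second integer matches the 1074266112 shown for row 6, and the first is 0, which is at least consistent with the zeros in the rows before it.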

1 Answer

  • 32

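    The main issue you're having is that mean returns a double while the G_batting column is an integer. So wrapping the mean in as.integer would work, or you'd need to convert the entire column to numeric, I guess.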
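A small base R sketch of the type mismatch described above, and of the two ways around it. The vector `g` is a made-up stand-in for an integer column like G_batting.

```r
# Hypothetical integer column with one missing value
g <- c(11L, 43L, NA, 5L)

col_type  <- typeof(g)                       # the column is integer
m         <- mean(g, na.rm = TRUE)
mean_type <- typeof(m)                       # the mean is a double

# Option 1: keep the column integer and convert the mean
g_int <- g
g_int[is.na(g_int)] <- as.integer(m)

# Option 2: convert the whole column to numeric, then impute the exact mean
g_num <- as.numeric(g)
g_num[is.na(g_num)] <- mean(g_num, na.rm = TRUE)

print(g_int)
print(g_num)
```

Option 1 keeps the integer type at the cost of truncating the mean; option 2 keeps the exact mean at the cost of changing the column type.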

    That said, here are a couple of data.table alternatives - I didn't check which one is faster.

    library(data.table)
    
    # using ifelse
    dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
    dt[, b := ifelse(is.na(b), mean(b, na.rm = TRUE), b), by = a]
    
    # using a temporary column
    dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
    dt[, b.mean := mean(b, na.rm = TRUE), by = a][is.na(b), b := b.mean][, b.mean := NULL]
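As a cross-check on what the two snippets above should produce, here is a base R sketch with the same toy values. It assumes `a = 1:2` is recycled to alternate 1, 2, 1, 2, ... across the ten rows, and it uses `ave()` in place of data.table grouping, so it is a stand-in and not the data.table code itself.

```r
# Same toy values as above, with the recycling of `a` written out
a <- rep(1:2, 5)
b <- c(1, 2, NA, NA, 3, 4, 5, 6, 7, 8)

# Group mean repeated for every row of its group
group_mean <- ave(b, a, FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries with their group's mean
b_filled <- ifelse(is.na(b), group_mean, b)

print(b_filled)
```

The third value is filled from group 1 and the fourth from group 2, so the two NAs receive different means.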
    

    This is what I'd ideally want to do (there is an FR about this):

    # again, atm this is pure fantasy and will not work
    dt[, b[is.na(b)] := mean(b, na.rm = T), by = a]
    

    And the dplyr version of ifelse (as in OP):

    dt %>% group_by(a) %>% mutate(b = ifelse(is.na(b), mean(b, na.rm = T), b))
    

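    I'm not sure how to implement the second data.table idea in a single line in dplyr. I'm also not sure how to stop dplyr from scrambling/ordering the data (aside from creating an index column).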
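For readers on a more recent dplyr than the one discussed here, the temporary-column idea might look roughly like the sketch below. It assumes a dplyr version that provides `coalesce()` and the `%>%` pipe; the data frame `df` and the helper column name `b_mean` are made up for illustration.

```r
library(dplyr)

# Same toy values as the data.table example, recycling written out
df <- data.frame(a = rep(1:2, 5), b = c(1, 2, NA, NA, 3, 4, 5, 6, 7, 8))

out <- df %>%
  group_by(a) %>%
  mutate(
    b_mean = mean(b, na.rm = TRUE),  # temporary per-group mean
    b      = coalesce(b, b_mean)     # fill NAs from the temporary column
  ) %>%
  ungroup() %>%
  select(-b_mean)                    # drop the temporary column again

print(out)
```

As far as I know, a grouped `mutate()` in current dplyr keeps rows in their original order, so the index-column workaround should not be needed, but that is worth checking against the version you have installed.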
