Indexing NA values in a data frame [duplicate]

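When you subset a data frame by some condition, the condition can evaluate to NA if the data frame contains NA values. That in turn causes problems in the resulting subset: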

# data generation
set.seed(123)
df <- data.frame(a = 1:100, b = sample(c("moon", "venus"), 100, replace = TRUE), c = sample(c('a', 'b', NA), 100, replace = TRUE))

# indexing
with(df, df[a < 30 & b == "moon" & c == "a",])

You get:

      a    b    c
NA   NA <NA> <NA>
10   10 moon    a
12   12 moon    a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29   29 moon    a

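This happens because the condition evaluates to a logical vector that contains NA, and those NA values produce the all-NA rows shown above when the vector is used to index the data frame.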
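
The mechanism can be reproduced on a much smaller example. This is only a sketch: the data frame `toy` and the variable names `cond` and `res` are made up for illustration and are not part of the original data.

```r
# Hypothetical toy data: three rows, one missing value in column c
toy <- data.frame(a = 1:3, c = c("a", NA, "b"))

# Comparing NA with anything yields NA, not FALSE
cond <- toy$c == "a"
cond                      # TRUE NA FALSE

# A logical NA in the row index produces a row filled with NA
res <- toy[cond, ]
nrow(res)                 # 2: the matching row plus one all-NA row
```

So every NA in the condition adds one all-NA row to the result instead of being dropped.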

One solution is to use either of the following fixes:

with(df, df[a < 30 & b == "moon" & (c == "a" & !is.na(c)),])  # exclude NAs
with(df, df[a < 30 & b == "moon" & (c == "a" | is.na(c)),])  # include NAs

But these are very clumsy. Imagine you have a long condition like df[A == x1 & B == x2 & C == x3 & D == x4,]: you would have to wrap every element this way, as in df[(A == x1 | is.na(A)) & (B == x2 | is.na(B)) ...,].

Is there an elegant solution to this problem, so that you don't have to type all of this at the console when you just want to inspect a data frame?

3 Answers

  • 1

    Well, if you want to omit the NA rows, a quick and hackish solution is to wrap the condition in which:

    > with(df, df[a < 30 & b == "moon" & c == "a",])
          a    b    c
    NA   NA <NA> <NA>
    10   10 moon    a
    12   12 moon    a
    NA.1 NA <NA> <NA>
    NA.2 NA <NA> <NA>
    29   29 moon    a
    > with(df, df[which(a < 30 & b == "moon" & c == "a"),])
        a    b c
    10 10 moon a
    12 12 moon a
    29 29 moon a
    

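    Edit: another option in cases like this, which some people may frown upon but which I personally find very useful, is to define a local variable inside the brackets: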

    > with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i | is.na(i)},])
        a    b    c
    6   6 moon <NA>
    10 10 moon    a
    12 12 moon    a
    15 15 moon <NA>
    18 18 moon <NA>
    29 29 moon    a
    > with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i & !is.na(i)},])
        a    b c
    10 10 moon a
    12 12 moon a
    29 29 moon a
    

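    This is more concise than writing a special function or defining the index on a separate line, and it works in many situations where no R function does exactly what you want.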
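
    Why both tricks work can be seen on a small example. This is a sketch only: `toy`, `kept` and `with_na` are hypothetical names and do not appear in the question.

    ```r
    # Hypothetical toy data with one missing value in column c
    toy <- data.frame(a = 1:4, c = c("a", NA, "b", "a"))

    cond <- toy$c == "a"      # TRUE NA FALSE TRUE

    # which() returns the positions of TRUE only, so NA is dropped
    which(cond)               # 1 4
    kept <- toy[which(cond), ]

    # A braced block evaluates to its last expression, so it can
    # compute a helper variable and then return the final index
    with_na <- with(toy, toy[{i <- c == "a"; i | is.na(i)}, ])
    ```

    Here `kept` holds rows 1 and 4, while `with_na` also keeps row 2, where the condition was NA.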

  • 4

    You can use the data.table package. It simplifies the code because you don't have to wrap everything in with(df, ...), and it treats NAs in the condition as FALSE.

    require(data.table)
    dt <- data.table(df)
    dt[a < 30 & b == "moon" & c == "a",] # exclude NAs
    dt[a < 30 & b == "moon" & (c == "a"|is.na(c)),] # include NAs
    
  • 1
    clean <- function(x, include = FALSE){
        x[is.na(x)] <- include
        x
    }
    
    # Original output
    with(df, df[a < 30 & b == "moon" & c == "a",])
    # Clean it up and remove NAs
    with(df, df[clean(a < 30 & b == "moon" & c == "a"),])
    # Clean it up but include NAs
    with(df, df[clean(a < 30 & b == "moon" & c == "a", include = TRUE),])
    

    This gives:

    > with(df, df[a < 30 & b == "moon" & c == "a",])
          a    b    c
    NA   NA <NA> <NA>
    10   10 moon    a
    12   12 moon    a
    NA.1 NA <NA> <NA>
    NA.2 NA <NA> <NA>
    29   29 moon    a
    > 
    > with(df, df[clean(a < 30 & b == "moon" & c == "a"),])
        a    b c
    10 10 moon a
    12 12 moon a
    29 29 moon a
    > with(df, df[clean(a < 30 & b == "moon" & c == "a", include = TRUE),])
        a    b    c
    6   6 moon <NA>
    10 10 moon    a
    12 12 moon    a
    15 15 moon <NA>
    18 18 moon <NA>
    29 29 moon    a
    

    Using which works too, but it only lets you exclude the NA values.
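
    The difference can be checked on a bare logical vector. A minimal sketch, assuming the clean function defined at the top of this answer; the vector `cond` is a made-up stand-in for a condition result.

    ```r
    # clean() as defined in this answer: replace NA in a logical vector
    clean <- function(x, include = FALSE) {
      x[is.na(x)] <- include
      x
    }

    cond <- c(TRUE, NA, FALSE)   # hypothetical condition result

    which(cond)                  # 1: the NA position is always dropped
    clean(cond)                  # TRUE FALSE FALSE: NA excluded
    clean(cond, include = TRUE)  # TRUE TRUE FALSE: NA included
    ```

    which has no way to keep the NA position, whereas clean lets the caller choose through the include argument.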
