
Filter a data frame by minimum and maximum values


I have a data frame like this:

df
      A     B     C     D     E     F
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1   24.    6.   16.    5. 1.20     6.
 2   21.    2.   19.    2. 1.09     2.
 3   12.    2.   12.   79. 0.860    2.
 4   39.    7.   39.   39. 1.90     7.
 5   51.    1.   82.   27. 2.30     1.
 6   24.    9.   24.   40. 1.60     9.
 7   48.    1.   32.    5. 1.60     1.
 8   44.    1.   44.   12. 1.70     1.
 9   14.    1.   18.    6. 0.880    1.
10   34.    2.   51.    5. 2.70     2.
# ... with 4,688 more rows

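I want to filter this data frame against a list, so that for each column of df the allowed minimum and maximum are the minimum and maximum of the corresponding element of the list Neighb: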

[[1]]
[1] 15.7 15.9 16.0 16.1 16.2

[[2]]
[1] 0 1 2 3 4

[[3]]
[1] 15.0 15.3 16.0 16.3 16.5

[[4]]
[1] 3 4 5 6 7

[[5]]
[1] 1.08 1.09 1.10 1.11 1.12

[[6]]
[1] 0 1 2 3 4

Is there an efficient way to do this with dplyr / base R? So far I have been using a loop and filtering each column of df.

4 Answers

  • 1

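    We can use Map from base R to keep, for each column, the values that fall within the range of the matching list element: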

    Map(function(x, y) x[x >= min(y) & x <= max(y)], df, Neighb)
    #$A
    #numeric(0)
    
    #$B
    #[1] 2 2 1 1 1 1 2
    
    #$C
    #[1] 16
    
    #$D
    #[1] 5 5 6 5
    
    #$E
    #[1] 1.09
    
    #$F
    #[1] 2 2 1 1 1 1 2
    

    If we need to filter the rows of the dataset with a logical index, i.e. keep the rows for which every comparison with 'Neighb' is TRUE:

    df[Reduce(`&`, Map(function(x, y) x >= min(y) & x <= max(y), df, Neighb)), ]
    

    If any TRUE in a row is enough:

    df[Reduce(`|`, Map(function(x, y) x >= min(y) & x <= max(y), df, Neighb)),]
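
    The "all TRUE" row filter above can also be wrapped in a small helper. This is only a sketch: the function name `filter_by_ranges` and the widened `wide` ranges are hypothetical, introduced here so that the example returns some rows (with the original 'Neighb', no row of the sample data passes all six columns).

    ```r
    # Sample data from this answer
    df <- data.frame(A = c(24, 21, 12, 39, 51, 24, 48, 44, 14, 34),
                     B = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2),
                     C = c(16, 19, 12, 39, 82, 24, 32, 44, 18, 51),
                     D = c(5, 2, 79, 39, 27, 40, 5, 12, 6, 5),
                     E = c(1.2, 1.09, 0.86, 1.9, 2.3, 1.6, 1.6, 1.7, 0.88, 2.7),
                     F = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2))
    Neighb <- list(c(15.7, 15.9, 16.0, 16.1, 16.2), c(0, 1, 2, 3, 4),
                   c(15.0, 15.3, 16.0, 16.3, 16.5), c(3, 4, 5, 6, 7),
                   c(1.08, 1.09, 1.10, 1.11, 1.12), c(0, 1, 2, 3, 4))

    # Hypothetical helper: keep rows whose value in every column lies
    # within [min, max] of the matching element of 'ranges'
    filter_by_ranges <- function(data, ranges) {
      stopifnot(length(ranges) == ncol(data))
      in_range <- Map(function(x, y) x >= min(y) & x <= max(y), data, ranges)
      data[Reduce(`&`, in_range), , drop = FALSE]
    }

    strict <- filter_by_ranges(df, Neighb)   # 0 rows for the sample data

    # Hypothetical, wider ranges (not from the question), for illustration only
    wide <- list(c(10, 60), c(0, 4), c(10, 60), c(0, 10), c(0, 3), c(0, 4))
    loose <- filter_by_ranges(df, wide)      # keeps rows 2, 7, 9 and 10
    ```

    `drop = FALSE` is not strictly needed for a six-column data frame, but it keeps the result a data frame if the helper is ever called on a single column.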
    

    Data

    df <- structure(list(A = c(24, 21, 12, 39, 51, 24, 48, 44, 14, 34), 
                         B = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2), 
                         C = c(16, 19, 12, 39, 82, 24, 32, 44, 18, 51),
                         D = c(5, 2, 79, 39, 27, 40, 5, 12, 6, 5), 
                         E = c(1.2, 1.09, 0.86, 1.9, 2.3, 1.6, 1.6, 1.7, 0.88, 2.7), 
                         F = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2)), 
                    .Names = c("A","B", "C", "D", "E", "F"), 
                    class = "data.frame", 
                    row.names = c(NA, -10L))
    
    
    Neighb <- list(c(15.7, 15.9, 16.0, 16.1, 16.2),
                   c(0, 1, 2, 3, 4),
                   c(15.0, 15.3, 16.0, 16.3, 16.5),
                   c(3, 4, 5, 6, 7),
                   c(1.08, 1.09, 1.10, 1.11, 1.12),
                   c(0, 1, 2, 3, 4))
    
  • 4

    You can use purrr together with between from dplyr to get the result you want.

    library(purrr)
    library(dplyr)
    
    map2(df, Neighb, function(x, y) x[between(x, min(y), max(y))] )
    $A
    numeric(0)
    
    $B
    [1] 2 2 1 1 1 1 2
    
    $C
    [1] 16
    
    $D
    [1] 5 5 6 5
    
    $E
    [1] 1.09
    
    $F
    [1] 2 2 1 1 1 1 2
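
    The call above returns the matching values per column. If whole rows should be kept instead, the same building blocks can be combined with `reduce()` and `filter()`. This is a sketch under the assumption that a row must pass the check in every column considered; the restriction to columns B and F is only there to show a non-empty result on the sample data.

    ```r
    library(purrr)
    library(dplyr)

    df <- data.frame(A = c(24, 21, 12, 39, 51, 24, 48, 44, 14, 34),
                     B = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2),
                     C = c(16, 19, 12, 39, 82, 24, 32, 44, 18, 51),
                     D = c(5, 2, 79, 39, 27, 40, 5, 12, 6, 5),
                     E = c(1.2, 1.09, 0.86, 1.9, 2.3, 1.6, 1.6, 1.7, 0.88, 2.7),
                     F = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2))
    Neighb <- list(c(15.7, 15.9, 16.0, 16.1, 16.2), c(0, 1, 2, 3, 4),
                   c(15.0, 15.3, 16.0, 16.3, 16.5), c(3, 4, 5, 6, 7),
                   c(1.08, 1.09, 1.10, 1.11, 1.12), c(0, 1, 2, 3, 4))

    # One logical vector per column: is the value inside the range
    # of the matching list element?
    in_range <- map2(df, Neighb, function(x, y) between(x, min(y), max(y)))

    # Rows that are in range in every column (empty for this sample data)
    all_cols <- filter(df, reduce(in_range, `&`))

    # Illustration only: check just columns B and F, which keeps 7 rows
    bf_only <- filter(df, reduce(in_range[c("B", "F")], `&`))
    ```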
    

    Data:

    df <- structure(list(A = c(24, 21, 12, 39, 51, 24, 48, 44, 14, 34), 
                         B = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2), 
                         C = c(16, 19, 12, 39, 82, 24, 32, 44, 18, 51),
                         D = c(5, 2, 79, 39, 27, 40, 5, 12, 6, 5), 
                         E = c(1.2, 1.09, 0.86, 1.9, 2.3, 1.6, 1.6, 1.7, 0.88, 2.7), 
                         F = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2)), 
                    .Names = c("A","B", "C", "D", "E", "F"), 
                    class = "data.frame", 
                    row.names = c(NA, -10L))
    
    
    Neighb <- list(c(15.7, 15.9, 16.0, 16.1, 16.2),
                   c(0, 1, 2, 3, 4),
                   c(15.0, 15.3, 16.0, 16.3, 16.5),
                   c(3, 4, 5, 6, 7),
                   c(1.08, 1.09, 1.10, 1.11, 1.12),
                   c(0, 1, 2, 3, 4))
    
  • 2

    A possible solution:

    # needed packages
    library(data.table)
    
    # get the minimum and maximum for each list item
    nr <- lapply(Neighb, range)
    
    # create a matrix with the 'inrange' function from 'data.table'
    m <- mapply(function(x, y) x %inrange% y, df, nr)
    

    This gives:

    m
              A     B     C     D     E     F
     [1,] FALSE FALSE  TRUE  TRUE FALSE FALSE
     [2,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
     [3,] FALSE  TRUE FALSE FALSE FALSE  TRUE
     [4,] FALSE FALSE FALSE FALSE FALSE FALSE
     [5,] FALSE  TRUE FALSE FALSE FALSE  TRUE
     [6,] FALSE FALSE FALSE FALSE FALSE FALSE
     [7,] FALSE  TRUE FALSE  TRUE FALSE  TRUE
     [8,] FALSE  TRUE FALSE FALSE FALSE  TRUE
     [9,] FALSE  TRUE FALSE  TRUE FALSE  TRUE
    [10,] FALSE  TRUE FALSE  TRUE FALSE  TRUE

    Now you can filter df with the rowSums function, keeping the rows where every column is in range:

    df[rowSums(m) == ncol(df),]
    

    Applying this to the example data shown (df) results in an empty data frame, but on the original dataset it will most likely give a non-empty one.
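
    To see why the result is empty for the sample data, it can help to look at the row sums themselves, i.e. the number of columns in which each row falls inside its range. The sketch below rebuilds the matrix with plain comparisons instead of `%inrange%` (an assumption made only so the snippet runs without data.table); the name `hits` is hypothetical.

    ```r
    df <- data.frame(A = c(24, 21, 12, 39, 51, 24, 48, 44, 14, 34),
                     B = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2),
                     C = c(16, 19, 12, 39, 82, 24, 32, 44, 18, 51),
                     D = c(5, 2, 79, 39, 27, 40, 5, 12, 6, 5),
                     E = c(1.2, 1.09, 0.86, 1.9, 2.3, 1.6, 1.6, 1.7, 0.88, 2.7),
                     F = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2))
    Neighb <- list(c(15.7, 15.9, 16.0, 16.1, 16.2), c(0, 1, 2, 3, 4),
                   c(15.0, 15.3, 16.0, 16.3, 16.5), c(3, 4, 5, 6, 7),
                   c(1.08, 1.09, 1.10, 1.11, 1.12), c(0, 1, 2, 3, 4))

    # c(min, max) for each list element
    nr <- lapply(Neighb, range)

    # Logical matrix: TRUE where the value lies inside the closed range
    m <- mapply(function(x, r) x >= r[1] & x <= r[2], df, nr)

    # Number of in-range columns per row
    hits <- rowSums(m)
    hits
    # [1] 2 3 2 0 2 0 3 2 3 3

    # No row reaches ncol(df) == 6, so the filter selects nothing
    which(hits == ncol(df))
    # integer(0)
    ```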


    Data used:

    df <- read.table(text="     A     B     C     D     E     F
                       1   24    6   16    5 1.20     6
                       2   21    2   19    2 1.09     2
                       3   12    2   12   79 0.860    2
                       4   39    7   39   39 1.90     7
                       5   51    1   82   27 2.30     1
                       6   24    9   24   40 1.60     9
                       7   48    1   32    5 1.60     1
                       8   44    1   44   12 1.70     1
                       9   14    1   18    6 0.880    1
                       10   34   2   51    5 2.70     2", header=TRUE, stringsAsFactors=FALSE)
    Neighb <- list(c(15.7,15.9,16.0,16.1,16.2),c(0:4),c(15.0,15.3,16.0,16.3,16.5),c(3:7),seq(1.08,1.12,0.01),c(0:4))
    
  • 1

    Another approach could be to replace out-of-range values with NA and then drop the incomplete rows:

    #minimum and maximum value from given list
    filter_criteria <- lapply(lookup_list, function(x) c(min(x), max(x)))
    
    df1 <- as.data.frame(mapply(function(x, y) replace(x, !(x>=y[1] & x<=y[2]), NA), 
                                df, filter_criteria))
    
    df1
    #    A  B  C  D    E  F
    #1  NA NA 16  5   NA NA
    #2  NA  2 NA NA 1.09  2
    #3  NA  2 NA NA   NA  2
    #4  NA NA NA NA   NA NA
    #5  NA  1 NA NA   NA  1
    #6  NA NA NA NA   NA NA
    #7  NA  1 NA  5   NA  1
    #8  NA  1 NA NA   NA  1
    #9  NA  1 NA  6   NA  1
    #10 NA  2 NA  5   NA  2
    
    #final output
    df1 <- na.omit(df1)   #as per given sample data it's empty
    

    Sample data

    df <- structure(list(A = c(24, 21, 12, 39, 51, 24, 48, 44, 14, 34), 
        B = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2), C = c(16, 19, 12, 39, 
        82, 24, 32, 44, 18, 51), D = c(5, 2, 79, 39, 27, 40, 5, 12, 
        6, 5), E = c(1.2, 1.09, 0.86, 1.9, 2.3, 1.6, 1.6, 1.7, 0.88, 
        2.7), F = c(6, 2, 2, 7, 1, 9, 1, 1, 1, 2)), class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8", "9", "10"))
    
    lookup_list <- list(c(15.7, 15.9, 16, 16.1, 16.2), c(0, 1, 2, 3, 4), c(15, 15.3, 
    16, 16.3, 16.5), c(3, 4, 5, 6, 7), c(1.08, 1.09, 1.1, 1.11, 1.12
    ), c(0, 1, 2, 3, 4))
    
