
Efficient way to filter low-frequency data from a data frame in R


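I have a data.frame with several columns and would like to filter out low-frequency data based on combinations of variables. For example, take a Sex variable with Male/Female and a cholesterol variable with High/Low (called Age in the code below). My data frame looks like this: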

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df


  index    Sex  Age
1      1   Male High
2      2 Female High
3      3   Male High
4      4 Female High
5      5 Female High
6      6   Male High
7      7 Female High
8      8 Female High
9      9 Female  Low
10    10   Male  Low
11    11 Female High
12    12   Male High
13    13 Female High
14    14 Female High
15    15   Male  Low
16    16 Female  Low
17    17   Male High
18    18   Male  Low
19    19   Male  Low
20    20 Female  Low

Now I want to keep only the Sex/Age combinations whose frequency is higher than 3:

table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4

In other words, I want to keep the indexes for Female-High, Male-Low, and Male-High.

Note that 1) my data frame has several variables (unlike the example above), 2) I do not want to use any third-party R packages, and 3) I want it to be fast.

5 Answers

  • 1

    Here's a simple approach in base R:

    lvls <- interaction(df$Sex, df$Age)
    counts <- table(lvls)
    df[lvls %in% names(counts)[counts > 3], ]
    
    #   index    Sex  Age
    #1      1   Male High
    #2      2 Female High
    #3      3   Male High
    #4      4 Female High
    #5      5 Female High
    #6      6   Male High
    #7      7 Female High
    #8      8 Female High
    #10    10   Male  Low
    #11    11 Female High
    #12    12   Male High
    #13    13 Female High
    #14    14 Female High
    #15    15   Male  Low
    #17    17   Male High
    #18    18   Male  Low
    #19    19   Male  Low
    

    If you have more variables, you can store their names in a vector:

    vars <- c("Age", "Sex") # add more
    lvls <- interaction(df[, vars])
    counts <- table(lvls)
    df[lvls %in% names(counts)[counts > 3], ]
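
    As a rough sketch, the vector-of-variables version could be wrapped into a reusable function. `filter_frequent` is a hypothetical name, not something from this answer, and the data frame is rebuilt by hand from the printout in the question so the result does not depend on the random number generator of a particular R version. The sketch assumes the grouping columns contain no missing values.

    ```r
    # Example data, typed in from the printed data frame in the question
    df <- data.frame(
      index = 1:20,
      Sex = c("Male", "Female", "Male", "Female", "Female", "Male", "Female",
              "Female", "Female", "Male", "Female", "Male", "Female", "Female",
              "Male", "Female", "Male", "Male", "Male", "Female"),
      Age = c(rep("High", 8), "Low", "Low", rep("High", 4), "Low", "Low",
              "High", "Low", "Low", "Low")
    )

    # Hypothetical helper: keep rows whose combination of `vars`
    # occurs more than `min_count` times (assumes no NA in `vars`)
    filter_frequent <- function(data, vars, min_count) {
      # drop = TRUE keeps only the combinations that actually occur
      grp <- as.integer(interaction(data[vars], drop = TRUE))
      # tabulate() counts each group code; indexing by the codes
      # expands those counts back to one value per row
      n_per_row <- tabulate(grp)[grp]
      data[n_per_row > min_count, , drop = FALSE]
    }

    res <- filter_frequent(df, c("Sex", "Age"), 3)
    res
    ```

    Passing `drop = TRUE` to `interaction` avoids creating levels for combinations that never occur, which may matter when many variables are involved.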
    

    Here's a second base R approach using ave:

    subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)
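
    The `ave` call may be easier to follow when the per-row counts are made explicit. The sketch below assumes the same example data (rebuilt by hand from the question's printout). The first argument to `ave` is only a placeholder, since `length` ignores the values and just counts the rows in each Sex/Age group.

    ```r
    # Example data, typed in from the printed data frame in the question
    df <- data.frame(
      index = 1:20,
      Sex = c("Male", "Female", "Male", "Female", "Female", "Male", "Female",
              "Female", "Female", "Male", "Female", "Male", "Female", "Female",
              "Male", "Female", "Male", "Male", "Male", "Female"),
      Age = c(rep("High", 8), "Low", "Low", rep("High", 4), "Low", "Low",
              "High", "Low", "Low", "Low")
    )

    # ave() applies FUN within each Sex/Age group and returns a vector
    # aligned with the rows of df, so every row carries its group size
    n_per_row <- ave(seq_len(nrow(df)), df$Sex, df$Age, FUN = length)

    # Inspect the counts next to the data, then filter on them
    head(cbind(df, n = n_per_row))
    kept <- df[n_per_row > 3, ]
    ```

    Because the counts stay aligned with the original rows, the filtered result keeps the original row order.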
    
  • 1

    OK, here's a base R option:

    set.seed(123)
    Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
    Age = sample(c('Low','High'),size = 20,replace = TRUE)
    Index = 1:20
    df = data.frame(index = Index,Sex=Sex,Age=Age)
    df
    
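    # Count rows per Sex/Age combination; aggregate() names the count column "x"
    counts <- aggregate(rep(1, nrow(df)), by = df[, c("Sex", "Age")], sum)
    merged <- merge(df, counts, by = c("Sex", "Age"))
    # Keep only the combinations that occur more than 3 times
    merged[merged$x > 3, ]
    

    The aggregate function sums the 1s within each combination, which gives a count per Sex/Age group; the final line applies the frequency filter.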

  • 7

    We can do this with data.table, which should also be efficient:

    library(data.table)
    setDT(df)[, .SD[.N > 3], .(Sex, Age)]
    

    Or with .I:

    setDT(df)[df[, .I[.N >3], .(Sex, Age)]$V1]
    
  • 4

    vars     <- c("Sex","Age")
    max_freq <- 3
    new_df   <- merge(df, subset(as.data.frame(table(df[,vars])),Freq>max_freq)[1:2])
    
    new_df
    #       Sex  Age index
    # 1  Female High     2
    # 2  Female High     7
    # 3  Female High    14
    # 4  Female High    11
    # 5  Female High     5
    # 6  Female High     4
    # 7  Female High    13
    # 8  Female High     8
    # 9    Male High     6
    # 10   Male High     3
    # 11   Male High     1
    # 12   Male High    17
    # 13   Male High    12
    # 14   Male  Low    10
    # 15   Male  Low    15
    # 16   Male  Low    18
    # 17   Male  Low    19
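
    Note that `merge` returns the rows sorted by the key columns and moves those columns to the front, as the output above shows. If the original row and column order matters, one possible follow-up is to reorder afterwards. This is a sketch that assumes `index` uniquely identifies rows, with the example data rebuilt by hand from the question's printout.

    ```r
    # Example data, typed in from the printed data frame in the question
    df <- data.frame(
      index = 1:20,
      Sex = c("Male", "Female", "Male", "Female", "Female", "Male", "Female",
              "Female", "Female", "Male", "Female", "Male", "Female", "Female",
              "Male", "Female", "Male", "Male", "Male", "Female"),
      Age = c(rep("High", 8), "Low", "Low", rep("High", 4), "Low", "Low",
              "High", "Low", "Low", "Low")
    )

    vars <- c("Sex", "Age")

    # Frequency of each combination, keeping only those above the threshold
    counts   <- as.data.frame(table(df[, vars]))
    frequent <- counts[counts$Freq > 3, vars]

    # merge() acts as a filter here: rows without a match in `frequent` are dropped
    new_df <- merge(df, frequent)

    # Restore the original row order and column order
    new_df <- new_df[order(new_df$index), names(df)]
    rownames(new_df) <- NULL
    new_df
    ```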
    
  • 4

    An answer with dplyr:

    library(dplyr)
    df %>% 
      group_by(Sex, Age) %>% 
      filter(n() > 3)
    

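    This is not a base R solution, even though the OP asked for one, but I thought it might be useful to future users who don't have that restriction.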
