
An elegant way to drop rare factor levels from a data frame


I want to subset a data frame by a factor, keeping only the factor levels that occur at least a certain number of times.

df <- data.frame(factor = c(rep("a",5),rep("b",5),rep("c",2)), variable = rnorm(12), stringsAsFactors = TRUE)

This code creates the data frame:

factor    variable
1       a -1.55902013
2       a  0.22355431
3       a -1.52195456
4       a -0.32842689
5       a  0.85650212
6       b  0.00962240
7       b -0.06621508
8       b -1.41347823
9       b  0.08969098
10      b  1.31565582
11      c -1.26141417
12      c -0.33364069

I want to drop the factor levels that occur fewer than 5 times. I wrote a for loop, and it works:

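counts <- table(df$factor)  # frequency of each level
df.new <- df                # start from the full data so removals accumulate
for (i in seq_along(counts)) {
  if (counts[i] < 5) {
    # drop every row belonging to this rare level
    df.new <- df.new[df.new$factor != names(counts)[i], ]
  }
}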

But is there a faster, more elegant solution?

6 Answers

  • 5
    require(dplyr)
    
    df %>% group_by(factor) %>% filter(n() >= 5)
    #factor   variable
    #1       a  2.0769363
    #2       a  0.6187513
    #3       a  0.2426108
    #4       a -0.4279296
    #5       a  0.2270024
    #6       b -0.6839748
    #7       b -0.3285610
    #8       b  0.2625743
    #9       b -0.9532957
    #10      b  1.4526317
    
  • 6
    library(data.table)
    setDT(df)[, variable[.N >= 5], by = factor]
    
    ##    factor         V1
    ## 1:      a -0.8204684
    ## 2:      a  0.4874291
    ## 3:      a  0.7383247
    ## 4:      a  0.5757814
    ## 5:      a -0.3053884
    ## 6:      b  1.5117812
    ## 7:      b  0.3898432
    ## 8:      b -0.6212406
    ## 9:      b -2.2146999
    ## 10:     b  1.1249309
    
  • 11

    Maybe join against a filtered count of the factor levels:

    library(dplyr)
    common.factors <- df %>% group_by(factor) %>% tally() %>% filter(n >= 5)
    df.1 <- semi_join(df, common.factors, by = "factor")
    
  • 0

    Trying with base functions...

    lvl = as.data.frame(table(df$factor))
    colnames(lvl) = c('factor','count')
    lvl
      factor count
    1      a     5
    2      b     5
    3      c     2
    
    df[df$factor %in% lvl[lvl$count>=5,]$factor,]
       factor    variable
    1       a -0.01619026
    2       a  0.94383621
    3       a  0.82122120
    4       a  0.59390132
    5       a  0.91897737
    6       b  0.78213630
    7       b  0.07456498
    8       b -1.98935170
    9       b  0.61982575
    10      b -0.05612874
    
  • 3

    What about

    df.new <- df[!(as.integer(as.factor(df$factor)) %in% which(table(df$factor) < 5)), ]
    
  • 0

    This worked for me:

    df = df[df$factor %in% names(table(df$factor)) [table(df$factor) >=5],]
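
One thing the answers above have in common: they remove the rows, but with a factor column the rare level itself usually stays in `levels()` with a count of zero. Here is a minimal sketch that combines the `table()` approach with `droplevels()` to remove it as well. The helper name `drop_rare_levels` and its arguments `col` and `min_n` are made up for illustration and are not from any of the answers.

```r
# Hypothetical helper: keep rows whose level occurs at least `min_n` times,
# then drop the levels that no longer have any rows.
drop_rare_levels <- function(data, col, min_n) {
  counts <- table(data[[col]])                  # frequency of each level
  keep <- names(counts)[counts >= min_n]        # levels that are common enough
  out <- data[data[[col]] %in% keep, , drop = FALSE]
  droplevels(out)                               # remove unused levels from factor columns
}

df <- data.frame(factor = factor(c(rep("a", 5), rep("b", 5), rep("c", 2))),
                 variable = rnorm(12))

df.new <- drop_rare_levels(df, "factor", 5)
levels(df.new$factor)
# [1] "a" "b"
```

Without the `droplevels()` call, `levels(df.new$factor)` would still list "c", which matters if you go on to tabulate, plot, or model by that factor.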
    
