Sort only one column after grouping

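How can I use dplyr to line up the individual entries within a single column? I was hoping to group by email and then sort each column A-Z on its own, but I can't figure out how to do this without sorting the entire data frame. Thanks so much in advance!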

Sample Data

df <- data.frame(
  cleanname = c("Steven Smith", "Rob Tan", 'Zachary', "Matthew"),
  dirtyname = c('rob Tan', 'stevesmith','zach', "Matthew"),
  email = c('hello@email.com', 'hello@email.com', 'email2@email.com', 'email2@email.com')
)

Desired End Result

desireddf <- data.frame(
  cleanname = c("Rob Tan", "Steven Smith", "Zachary", "Matthew"),
  dirtyname = c('rob Tan', 'stevesmith','zach', 'Matthew'),
  email = c('hello@email.com', 'hello@email.com', 'email2@email.com', 'email2@email.com')
)

Edit

Thanks to Sotos for pointing out that my problem can be solved with fuzzy name matching.

2 Answers

  • 1

    You can use the amatch function from the stringdist package:

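    library(dplyr)       # provides %>% and mutate()
    library(stringdist)  # provides amatch()
    # For each clean name, find the closest dirty name (case-insensitive, distance <= 3).
    # Computed once, before mutate(), so both columns are reordered with the same index.
    idx <- amatch(tolower(df$cleanname), tolower(df$dirtyname), maxDist = 3)
    df %>% 
      mutate(dirtyname = dirtyname[idx],
             email = email[idx])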
    

    This gives:

         cleanname  dirtyname            email
    1 Steven Smith stevesmith  hello@email.com
    2      Rob Tan    rob Tan  hello@email.com
    3      Zachary       zach email2@email.com
    4      Matthew    Matthew email2@email.com

    The same logic with data.table:

    library(data.table)
    setDT(df)[, `:=` (dirtyname = dirtyname[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)],
                      email = email[amatch(tolower(cleanname), tolower(dirtyname), maxDist = 3)])]
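
    To see what the matching step is doing, here is a rough base R sketch. It is an illustration, not part of the original answer. It assumes that `adist()` from base R (generalized Levenshtein distance) is a close enough stand-in for the edit distance that `amatch()` uses, which holds for this sample data. The names `d`, `idx`, `max_dist` and `aligned` are made up for the example.

    ```r
    # Sample data from the question
    df <- data.frame(
      cleanname = c("Steven Smith", "Rob Tan", "Zachary", "Matthew"),
      dirtyname = c("rob Tan", "stevesmith", "zach", "Matthew"),
      email = c("hello@email.com", "hello@email.com",
                "email2@email.com", "email2@email.com"),
      stringsAsFactors = FALSE
    )

    # Pairwise edit distances: rows are clean names, columns are dirty names
    d <- adist(tolower(df$cleanname), tolower(df$dirtyname))

    # For each clean name, take the position of the nearest dirty name
    idx <- apply(d, 1, which.min)

    # Drop matches that are farther away than the threshold (mirrors maxDist = 3)
    max_dist <- 3
    idx[d[cbind(seq_along(idx), idx)] > max_dist] <- NA

    # Reorder dirtyname and email with the same index so they stay paired
    aligned <- transform(df, dirtyname = dirtyname[idx], email = email[idx])
    print(aligned)
    ```

    Using one index for both columns keeps each dirty name attached to its own email while it is moved next to the matching clean name.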
    
  • 0

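    If the rows of the data frame represent distinct observations, sorting each column independently is not appropriate: once the vectors are sorted separately, a row no longer represents a single observation.

    A vector can be sorted in several ways, for example with the order() function.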

    # sort the dirtyname column on its own (the other columns stay where they are)
    df$dirtyname <- df$dirtyname[order(df$dirtyname)]
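
    To make the warning above concrete, here is a minimal base R sketch (not from the original answers) that sorts `cleanname` A-Z within each `email` group using `ave()` while leaving the other columns untouched. The object name `sorted` is made up for the example.

    ```r
    # Sample data from the question
    df <- data.frame(
      cleanname = c("Steven Smith", "Rob Tan", "Zachary", "Matthew"),
      dirtyname = c("rob Tan", "stevesmith", "zach", "Matthew"),
      email = c("hello@email.com", "hello@email.com",
                "email2@email.com", "email2@email.com"),
      stringsAsFactors = FALSE
    )

    # Sort cleanname alphabetically inside each email group;
    # ave() writes the sorted values back into the group's original positions
    sorted <- df
    sorted$cleanname <- ave(df$cleanname, df$email, FUN = sort)
    print(sorted)
    ```

    In the second group this puts "Matthew" next to "zach" and "Zachary" next to "Matthew", so the row pairing is lost. That is why sorting alone cannot produce the desired result here and a matching step is needed instead. With dplyr, the equivalent would be `group_by(email) %>% mutate(cleanname = sort(cleanname))`.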
    
