首页 文章

在加入Dplyr之前比较两个数据帧之间的名称列

提问于
浏览
0

我想知道在dplyr中进行连接之前是否有一种比较列的简单方法 . 以下是两个简单的数据帧 . 我想根据名字和姓氏进行全面加入,但是有一些拼写错误或不同格式,例如“Elizabeth Ray”和“Elizabeth” .

我想在加入之前比较这些列 . 我希望有一种方法可以产生一个列表或向量,包含索引的所有差异,所以我可以在加入之前纠正它们 .

如果有一种更简单的方法,我也会对此持开放态度,但我希望有最简单的方法 . 我想要一个基于dplyr,tidyr和stringr的解决方案 .

FirstNames<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames<-c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
Pets<-c("Cat","Dog","Cat","Dog","Cat")
Names1<-data.frame(FirstNames,LastNames,Pets)

FirstNames2<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames2<-c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
Dwelling<-c("House","House","Apartment","Condo","House")
Names2<-data.frame(FirstNames2,LastNames2,Dwelling)

3 回答

  • 0

    为了比较您的记录之间的相似性,我想您可能正在寻找一种方法来将一定量的模糊逻辑匹配应用于您的名称比较任务 . AKA:在执行Record Linkage任务时应用String Distance Function . (如果你已经知道这一切,请原谅我 - 但这些关键词在开始时对我有很大的帮助 . )

    有一个名为stringdist的优秀软件包可以很好地用于这些应用程序,但recordlinkage可能会帮助您最快地完成对齐数据框架的工作 .

    如果您希望查看最不同的名字和名字最相似的值,您可以使用如下代码:

    library(RecordLinkage)
    library(dplyr)
    
    id <- c(1:5) # added in to allow joining of data tables & comparison results
    firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
    lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
    pet <- c("Cat","Dog","Cat","Dog","Cat")
    Names1 <- data.frame(id, firstName, lastName, pet)
    
    id <- c(1:5) # added in to allow joining of data tables & comparison results
    firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
    lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
    dwelling <- c("House","House","Apartment","Condo","House")
    Names2 <- data.frame(id, firstName2, lastName2, dwelling)
    
    # RecordLinkage function that calculates string distance b/w records in two data frames
    Results <- compare.linkage(Names1, Names2, blockfld = 1, strcmp = T, exclude = 4)
    Results
    #  $data1
    #    firstName      lastName  pet
    # 1      Chris    MacDougall  Cat
    # 2       Doug       Shapiro  Dog
    # 3   Shintaro      Yamazaki  Cat
    # 4    Bubbles        Murphy  Dog
    # 5       Elsa Elizabeth Ray  Cat
    
    # $data2
    #    firstName2  lastName2  dwelling
    # 1       Chris  MacDougal     House
    # 2        Doug    Shapiro     House
    # 3    Shintaro   Yamazaku Apartment
    # 4     Bubbles     Murphy     Condo
    # 5        Elsa  Elizabeth     House
    
    # $pairs
    # id1 id2 id firstName  lastName is_match
    # 1   1   1  1         1 0.9800000       NA
    # 2   2   2  1         1 1.0000000       NA
    # 3   3   3  1         1 0.9500000       NA
    # 4   4   4  1         1 1.0000000       NA
    # 5   5   5  1         1 0.9384615       NA
    
    # $frequencies
    # id firstName  lastName 
    # 0.200     0.200     0.125 
    # $type
    # [1] "linkage"
    
    # attr(,"class")
    # [1] "RecLinkData"
    
    # Trim $pairs dataframe (seen above) to contain just id's & similarity measures
    PairsSelect <- 
        Results$pairs %>% 
        select(id = id1, firstNameSim = firstName, lastNameSim = lastName)
    
    # Join original data & string comparison results together
    # reorganize data to facilitate review
    JoinedResults <-
        left_join(Names1, Names2) %>% 
        left_join(PairsSelect) %>% 
        select(id, firstNameSim, firstName, firstName2, lastNameSim, lastName, lastName2) %>% 
        arrange(desc(lastNameSim), desc(firstNameSim), id)
    JoinedResults
    # id firstNameSim firstName firstName2 lastNameSim      lastName lastName2
    # 1  2            1      Doug       Doug   1.0000000       Shapiro   Shapiro
    # 2  4            1   Bubbles    Bubbles   1.0000000        Murphy    Murphy
    # 3  1            1     Chris      Chris   0.9800000    MacDougall MacDougal
    # 4  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku
    # 5  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth
    
    # If you want to collect just the perfect matches
    PerfectMatches <- 
        JoinedResults %>% 
        filter(firstNameSim == 1 & lastNameSim == 1) %>% 
        select(id, firstName, lastName)
    PerfectMatches
    #   id firstName lastName
    # 1  2      Doug  Shapiro
    # 2  4   Bubbles   Murphy
    
    # To collect the matches that are going to need alignment
    ImperfectMatches <- 
        JoinedResults %>% 
        filter(firstNameSim < 1 | lastNameSim < 1) %>% 
        mutate(flgFrstNm = 0, flgLstNm = 0)
    ImperfectMatches
    #   id firstNameSim firstName firstName2 lastNameSim      lastName lastName2 flgFrstNm flgLstNm
    # 1  1            1     Chris      Chris   0.9800000    MacDougall MacDougal         0        0
    # 2  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku         0        0
    # 3  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth         0        0
    # 
    
    # If you want to enter your column preference in a flag column to facilitate faster rectification...
    write.csv(ImperfectMatches, "ImperfectMatches.csv", na = "", row.names = F)
    ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
    # Flag data externally - save file to new name with '_reviewed' appended to filename
    ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
    #reload results
    FlaggedMatches <- read.csv("ImperfectMatches_reviewed.csv", stringsAsFactors = F)
    FlaggedMatches
    ## Where a 1 is the 1st data set preferred and 0 (or 2 if that is easier for the 'data processor') means the 2nd data set is preferred.
    #   id firstNameSim firstName firstName2 lastNameSim      lastName lastName2 flgFrstNm flgLstNm
    # 1  1            1     Chris      Chris   0.9800000    MacDougall MacDougal         1        0
    # 2  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku         1        1
    # 3  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth         1        0
    
    ## Executing Assembly of preferred/rectified firstName and lastName columns
    ResolvedMatches <- 
        FlaggedMatches %>% 
        mutate(rectifiedFirstName = ifelse(flgFrstNm == 1,firstName, firstName2),
               rectifiedLastName = ifelse(flgLstNm == 1, lastName, lastName2)) %>% 
        select(id, starts_with("rectified"))
    
    ResolvedMatches
    # id rectifiedFirstName rectifiedLastName
    # 1  1              Chris         MacDougal
    # 2  3           Shintaro          Yamazaki
    # 3  5               Elsa         Elizabeth
    

    dplyr非常直观,但 compare.linkage() 函数可以使用一些解释 .

    前两个参数是显而易见的:您要比较的两个数据帧(dataframe1和dataframe2) . [如果您只想将onedataframe内的记录与自身进行比较(重复记录集记录集),则可以使用 compare.dedup() ,而只引用一个数据帧 .

    在这种情况下,将 blockfld 设置为1或2将分别指定First Name或Last Name上的匹配项必须为100% . 相反,您可能希望在数据集中包含主/外键并在blckfld参数中引用该列 . 或者,如果您的记录实际上并非如此等效构建,则可以完全保留此参数(默认为 FALSE ),然后将比较所有可能的组合[数据帧的叉积] .

    strcmpTRUE 获取应用于您正在比较的数据列的字符串距离距离函数;如果你把它假,那么它只是测试精确的1:1字符串对应 .

    exclude 也是避免构建中间数据帧并仅选择您希望彼此比较的列的好方法:排除 3 只是允许我们从结果中删除Pets和Dwelling比较 .

    上面代码中的4列,键控数据帧(不是原始问题3列数据帧)产生的结果如下:

    #  $data1
    #    firstName      lastName  pet
    # 1      Chris    MacDougall  Cat
    # 2       Doug       Shapiro  Dog
    # 3   Shintaro      Yamazaki  Cat
    # 4    Bubbles        Murphy  Dog
    # 5       Elsa Elizabeth Ray  Cat
    
    # $data2
    #    firstName2  lastName2  dwelling
    # 1       Chris  MacDougal     House
    # 2        Doug    Shapiro     House
    # 3    Shintaro   Yamazaku Apartment
    # 4     Bubbles     Murphy     Condo
    # 5        Elsa  Elizabeth     House
    
    # $pairs
    # id1 id2 id firstName  lastName is_match
    # 1   1   1  1         1 0.9800000       NA
    # 2   2   2  1         1 1.0000000       NA
    # 3   3   3  1         1 0.9500000       NA
    # 4   4   4  1         1 1.0000000       NA
    # 5   5   5  1         1 0.9384615       NA
    
    # $frequencies
    # id firstName  lastName 
    # 0.200     0.200     0.125 
    # $type
    # [1] "linkage"
    
    # attr(,"class")
    # [1] "RecLinkData"
    

    上面的每个部分(例如$ pair)都是它自己的数据框 . 添加一个键,您可以将它们连接在一起,然后引用并使用成对的df作为切换级别门,然后甚至将data1值复制到data2帧中,例如,当您在配对等级中具有> 0.95的值时 . (注意:is_match看起来很重要,但是它用于训练匹配工具,并且与我们的任务无关 . )

    在任何情况下,我希望你发现这些图书馆能够突然增加你的工作量,就像我第一次遇到它们时一样令人头疼 .

    顺便说一句:我还发现这个Comparison of String Distance Algorithms对目前可用的字符串距离指标进行了很好的调查 .

  • 0

    我正在回答,因为我无法访问评论

    df = Names1[!(Names1$LastNames %in% Names2$LastNames2), ]
    

    试试关于代码 .

  • 0

    使用标准 adist() 函数作为@alistaire建议提供了一种非常有效的方法(并且可能更有可能是教师试图看到的 . ) adist 的字符串度量仅限于广义的Levenshtein(编辑)距离,但这看起来确切你在寻找什么 .

    代码如下:(因为这看起来像是一个专门用于数据处理的R编码类的介绍,我在一些最佳实践中添加了抛光到可重现/提出的问题 . )

    library(dplyr)
    
    id <- c(1:5)
    firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
    lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
    pet <- c("Cat","Dog","Cat","Dog","Cat")
    Names1 <- data.frame(id, firstName, lastName, pet)
    
    id <- c(1:5)
    firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
    lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
    dwelling <- c("House","House","Apartment","Condo","House")
    Names2 <- data.frame(id, firstName2, lastName2, dwelling)
    # NB: technically you could merge these data frames later with `bind_cols()` but best 
    # datahandling practices dictate joining/comparing data based on keys (instead of 
    # binding columns together based upon the order in which tables are initially arranged.)
    #[also preference is for column headers to be singular and lower case, and tables/dataframes to be uppercase and plural - from (or extension from principles in): https://google.github.io/styleguide/Rguide.xml]
    
    ## adist() calculates string distance b/w records in two data frames
    # Matrix between all columns is great way to ascertain similarity of data
    # on overall column to column basis.
    # 0 is closest resemblance, higher numbers are lowest resemblance
    ResultsInterColumnComparison <-
        adist(Names1, Names2, partial = T)
    ResultsInterColumnComparison
    
    # firstName to firstName2 & lastName to LastName2 are similar columns.
    #           id firstName2 lastName2 dwelling
    # id         0          2         2        2
    # firstName 15          0         3        4
    # lastName  15          3         0        5
    # pet       15          5         4        3
    
    # adist column to column DifferenceCount (using dplyr)
    dltFrstN <- diag(adist(Names1$firstName, Names2$firstName2, partial = T))
    dltLstN <- diag(adist(Names1$lastName, Names2$lastName2, partial = T))
    
    # Join all info together
    DFcompilation <- 
        data.frame(id, dltFrstN, firstName, firstName2, dltLstN, lastName, lastName2) %>% 
        arrange(desc(dltLstN), desc(dltLstN))
    DFcompilation
    #   id dltFrstN firstName firstName2 dltLstN      lastName lastName2
    # 1  5        0      Elsa       Elsa       4 Elizabeth Ray Elizabeth
    # 2  1        0     Chris      Chris       1    MacDougall MacDougal
    # 3  3        0  Shintaro   Shintaro       1      Yamazaki  Yamazaku
    # 4  2        0      Doug       Doug       0       Shapiro   Shapiro
    # 5  4        0   Bubbles    Bubbles       0        Murphy    Murphy
    

    这种方法更简单,并且在所需的编码中更加简洁 . 我希望这对你的目的更有帮助 .

相关问题