Home Articles

基于从多个数据集匹配多个条件和日期范围添加列

Asked
Viewed 1468 times
2

我一直在努力寻找解决这个问题的最佳方法 .

为了概括这个问题并帮助其他可能发现自己需要执行类似任务的人,我试图找到将列添加到第三个数据集的最佳方法,即基于中间数据集中的匹配,并且属于第三个数据集的日期范围 . 最终结果是将第三个数据集中的匹配值返回到第一个数据集 .

以下是示例数据框的头部,以增加一些清晰度:

> head(SalesData, 10)
   sale_id sale_amt int_rate  sale_date sale_status
1        1     7000    10.71 2008-05-01  Fully Paid
2        2    10800    13.57 2009-11-01  Fully Paid
3        3     7500    10.08 2008-04-01  Fully Paid
4        4     3000    14.26 2009-09-01  Fully Paid
5        5     5600    14.96 2010-02-01 Charged Off
6        6     2800    11.49 2010-08-01  Fully Paid
7        7    10000     8.59 2009-10-01  Fully Paid
8        8    18000    10.39 2008-03-01  Fully Paid
9        9     5000    15.13 2008-04-01  Fully Paid
10      10     9600    12.29 2008-03-01  Fully Paid

> head(EmployeeSales, 10)
   sale_id empl_name empl_num
1        1    Dakota        4
2        2    Dakota        4
3        3      Kami        9
4        4      Adel        1
5        5      Adel        1
6        6     Farah        6
7        7      Kami        9
8        8      Kami        9
9        9       Ida        7
10      10      Kami        9

> head(EmployeeMap, 10)
   empl_num empl_name skill_lvl team start_date   end_date
1         1      Adel       Beg  Red 2007-06-01 2008-05-31
2         1      Adel       Int  Red 2008-06-01 2010-10-31
3         1      Adel       Adv  Red 2010-11-01 2999-12-12
4         2    Bailey       Beg Blue 2010-08-01 2011-04-30
5         2    Bailey       Beg  Red 2011-05-01 2999-12-12
6         3     Casey       Beg Blue 2010-08-01 2010-12-31
7         3     Casey       Int Blue 2011-01-01 2999-12-12
8         4    Dakota       Beg  Red 2007-06-01 2009-08-30
9         4    Dakota       Int  Red 2009-09-01 2010-08-30
10        4    Dakota       Adv  Red 2010-09-01 2011-08-30

所需的输出会将EmployeeMap的empl_num,sales_team和skill_level添加到每个sale_id的SalesData中 .

在尝试概念化步骤时,这就是我的想法,但也许有更好的方法:从SalesData获取sale_id,将其与Employee Sales中的sale_id匹配并获得empl_num . 获取empl_num并将其与Employee Map中的empl_num匹配 . 现在我们需要从SalesData获取sale_date并找到它所属的“start_date,end_date”范围 . 然后我们将获取匹配的团队和技能级别,并将其添加到SalesData .

见下表:

> head(df2,10)
    sale_id sale_amt int_rate  sale_date sale_status empl_num  team skill_lvl
 1        1     7000    10.71 2008-05-01  Fully Paid        4   Red       Beg
 2        2    10800    13.57 2009-11-01  Fully Paid        4   Red       Int
 3        3     7500    10.08 2008-04-01  Fully Paid        9  Blue       Beg
 4        4     3000    14.26 2009-09-01  Fully Paid        1   Red       Int
 5        5     5600    14.96 2010-02-01 Charged Off        1   Red       Int
 6        6     2800    11.49 2010-08-01  Fully Paid        6   Red       Beg
 7        7    10000     8.59 2009-10-01  Fully Paid        9  Blue       Int
 8        8    18000    10.39 2008-03-01  Fully Paid        9  Blue       Beg
 9        9     5000    15.13 2008-04-01  Fully Paid        7  Blue       Beg
 10      10     9600    12.29 2008-03-01  Fully Paid        9  Blue       Int

让我复杂化的是,在EmployeeMap中,start_date和end_date告诉我们每个员工开始和结束属于特定技能级别和团队的日期 . 但是每个员工都改变了技能水平和/或团队,因此每个员工都有多行 .

例如,在empl_id 1的EmployeeMap中,我们可以看到3行告诉我们他们的start_date和end_date,而他们在Red Team上有一个skill_level Beg,Int,Adv all . 但有些人,比如empl_id 2,在保持相同技能水平的同时改变团队 . 而其他人则会改变技能水平和团队 .

我将非常感谢您对解决此问题的最佳方法的任何见解 .

3 Answers

  • 1

    在SQL术语中,这是一个3向连接 . 它可以在单个SQL select中完成,如下所示:

    library(sqldf)
    
    sqldf("
      select s.*, es.empl_num, em.team, em.skill_lvl
      from SalesData s
      left join EmployeeSales es 
        using (sale_id)
      left join EmployeeMap em
        on es.empl_num = em.empl_num and s.sale_date between em.start_date and em.end_date
    ")
    

    最后使用Note中的数据(基于所讨论的数据),我们得到以下结果 . 问题中显示的EmployeeMap数据中只存在前4个员工编号,左连接确保我们获得团队的NA值和其他人的技能级别,而不是由于不匹配而丢弃的SalesData行 .

    sale_id sale_amt int_rate  sale_date sale_status empl_num team skill_lvl
    1        1     7000    10.71 2008-05-01  Fully Paid        4  Red       Beg
    2        2    10800    13.57 2009-11-01  Fully Paid        4  Red       Int
    3        3     7500    10.08 2008-04-01  Fully Paid        9 <NA>      <NA>
    4        4     3000    14.26 2009-09-01  Fully Paid        1  Red       Int
    5        5     5600    14.96 2010-02-01 Charged Off        1  Red       Int
    6        6     2800    11.49 2010-08-01  Fully Paid        6 <NA>      <NA>
    7        7    10000     8.59 2009-10-01  Fully Paid        9 <NA>      <NA>
    8        8    18000    10.39 2008-03-01  Fully Paid        9 <NA>      <NA>
    9        9     5000    15.13 2008-04-01  Fully Paid        7 <NA>      <NA>
    10      10     9600    12.29 2008-03-01  Fully Paid        9 <NA>      <NA>
    

    注意

    以可重复的形式输入数据:

    SalesData <- structure(list(sale_id = 1:10, sale_amt = c(7000L, 10800L, 7500L, 
    3000L, 5600L, 2800L, 10000L, 18000L, 5000L, 9600L), int_rate = c(10.71, 
    13.57, 10.08, 14.26, 14.96, 11.49, 8.59, 10.39, 15.13, 12.29), 
        sale_date = structure(c(3L, 6L, 2L, 4L, 7L, 8L, 5L, 1L, 2L, 
        1L), .Label = c("2008-03-01", "2008-04-01", "2008-05-01", 
        "2009-09-01", "2009-10-01", "2009-11-01", "2010-02-01", "2010-08-01"
        ), class = "factor"), sale_status = structure(c(2L, 2L, 2L, 
        2L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("Charged Off", "Fully Paid"
        ), class = "factor")), class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8", "9", "10"))
    
    EmployeeSales <-
    structure(list(sale_id = 1:10, empl_name = structure(c(2L, 2L, 
    5L, 1L, 1L, 3L, 5L, 5L, 4L, 5L), .Label = c("Adel", "Dakota", 
    "Farah", "Ida", "Kami"), class = "factor"), empl_num = c(4L, 
    4L, 9L, 1L, 1L, 6L, 9L, 9L, 7L, 9L)), class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8", "9", "10"))
    
    EmployeeMap <- structure(list(empl_num = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 
    4L), empl_name = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 
    4L, 4L), .Label = c("Adel", "Bailey", "Casey", "Dakota"), class = "factor"), 
        skill_lvl = structure(c(2L, 3L, 1L, 2L, 2L, 2L, 3L, 2L, 3L, 
        1L), .Label = c("Adv", "Beg", "Int"), class = "factor"), 
        team = structure(c(2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L
        ), .Label = c("Blue", "Red"), class = "factor"), start_date = structure(c(1L, 
        2L, 6L, 4L, 8L, 4L, 7L, 1L, 3L, 5L), .Label = c("2007-06-01", 
        "2008-06-01", "2009-09-01", "2010-08-01", "2010-09-01", "2010-11-01", 
        "2011-01-01", "2011-05-01"), class = "factor"), end_date = structure(c(1L, 
        4L, 8L, 6L, 8L, 5L, 8L, 2L, 3L, 7L), .Label = c("2008-05-31", 
        "2009-08-30", "2010-08-30", "2010-10-31", "2010-12-31", "2011-04-30", 
        "2011-08-30", "2999-12-12"), class = "factor")), class = "data.frame", 
        row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
    
  • 0

    考虑按日期运行 merge 两次然后 subset . 下面将呼叫嵌在一个长的单行中,但可以在单独的行中分开 . 此外,由于您发布的数据是样本行,因此输出小于您想要的结果 .

    # MERGE TWICE AND SUBSET BY DATE
    finaldf <- subset(merge(merge(SalesData, EmployeeSales, by="sale_id"), 
                            EmployeeMap, "empl_num", suffixes=c('', '_')),
                      sale_date >= start_date & sale_date <= end_date)
    
    # SELECT NEEDED COLUMNS
    finaldf <- finaldf[c("sale_id", "sale_amt", "int_rate", "sale_date", 
                         "sale_status", "empl_num", "team", "skill_lvl")]
    
    # RE-ORDER BY SALE_ID AND RESET ROW NAMES
    finaldf <- with(finaldf, finaldf[order(sale_id),])
    row.names(finaldf) <- NULL
    
    finaldf
    #   sale_id sale_amt int_rate  sale_date sale_status empl_num team skill_lvl
    # 1       1     7000    10.71 2008-05-01  Fully Paid        4  Red       Beg
    # 2       2    10800    13.57 2009-11-01  Fully Paid        4  Red       Int
    # 3       4     3000    14.26 2009-09-01  Fully Paid        1  Red       Int
    # 4       5     5600    14.96 2010-02-01 Charged Off        1  Red       Int
    
  • 0

    也许最简单的方法是使用两个类似SQL的连接(如果你不熟悉连接/关系代数,我建议你给一些like this读取) .

    可以使用基本R中的 merge 函数执行许多连接,并且许多其他流行的包( dplyrdata.tablesqldf ,仅举几例)在连接操作中提供替代语法或扩展功能 .

    您可以使用 merge 轻松完成两个连接中的第一个( SalesData 和_1844207之间):

    merge(SalesData, EmployeeSales, by = "sale_id")
    
    #    sale_id sale_amt int_rate  sale_date sale_status empl_name empl_num
    # 1        1     7000    10.71 2008-05-01  Fully Paid    Dakota        4
    # 2        2    10800    13.57 2009-11-01  Fully Paid    Dakota        4
    # 3        3     7500    10.08 2008-04-01  Fully Paid      Kami        9
    # ...
    

    然而,第二次连接更复杂,因为它不是典型的equi-join . 相反,连接逻辑需要在 EmployeeMap 中查找 start_date 小于 sale_dateend date 大于它的行(除了 empl_num 上的相等条件外) .

    幸运的是,前面提到的data.table包提供了应用所述逻辑的能力 .

    library(data.table)
    
    # convert all three dataframes to data.table objects
    setDT(SalesData) ; setDT(EmployeeSales) ; setDT(EmployeeMap)
    
    EmployeeMap[SalesData[EmployeeSales[, c("sale_id","empl_num")],
                          on = "sale_id"], 
                on = .(empl_num, start_date <= sale_date, end_date >= sale_date)]
    
    #    empl_num empl_name skill_lvl team start_date   end_date sale_id sale_amt int_rate sale_status
    # 1:        4    Dakota       Beg  Red 2008-05-01 2008-05-01       1     7000    10.71  Fully Paid
    # 2:        4    Dakota       Int  Red 2009-11-01 2009-11-01       2    10800    13.57  Fully Paid
    # 3:        9        NA        NA   NA 2008-04-01 2008-04-01       3     7500    10.08  Fully Paid
    # ...
    

    请注意,所有三个日期列都应该是日期类型,而不是字符串,以便进行比较 . 另请注意,上面输出中的 NA 值是问题中提供的 EmployeeMap 快照的结果,该快照仅映射 empl_num 1-4 .

    我还建议阅读this question的答案,了解如何加入日期范围的更多背景信息 .

Related