
Using dplyr to apply the output of one data.frame to another


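I have two data.frames. One is a lookup table that tells me which products belong to each group. Every group contains at least one product of Type 1 and one product of Type 2.

The second data.frame gives me details about transactions. Each transaction can contain one of the following:

a) Only products of Type 1 from one of the groups

b) Only products of Type 2 from one of the groups

c) Products of Type 1 and Type 2 from the same group

For my analysis, I am interested in case c) above, i.e. how many transactions contain products of both Type 1 and Type 2 from the same group. If Type 1 and Type 2 products from different groups are sold in the same transaction, we ignore that transaction entirely.

So the Type 1 and Type 2 products that are counted together must all belong to the same group.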

Here is my lookup table:

> P_Lookup
   Group ProductID1 ProductID2
  Group1          A          1
  Group1          B          2
  Group1          B          3
  Group2          C          4
  Group2          C          5
  Group2          C          6
  Group3          D          7
  Group3          C          8
  Group3          C          9
  Group4          E         10
  Group4          F         11
  Group4          G         12
  Group5          H         13
  Group5          H         14
  Group5          H         15

For example, I will not have product G and product 15 in one transaction, because they belong to different groups.

Here are the transactions:
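
The group-membership rule can be checked directly against the lookup table. The snippet below is only a minimal sketch: `groups_of()` and `share_group()` are hypothetical helper names introduced for illustration, and the lookup table is rebuilt inline so the example runs on its own.

```r
# Rebuild the lookup table from the question so the sketch is self-contained.
P_Lookup <- data.frame(
  Group      = rep(paste0("Group", 1:5), each = 3),
  ProductID1 = c("A", "B", "B", "C", "C", "C", "D", "C", "C",
                 "E", "F", "G", "H", "H", "H"),
  ProductID2 = 1:15,
  stringsAsFactors = FALSE
)

# Hypothetical helper: all groups that list a product, in either ID column.
groups_of <- function(id) {
  hit <- P_Lookup$ProductID1 == id | as.character(P_Lookup$ProductID2) == id
  unique(P_Lookup$Group[hit])
}

# Hypothetical helper: TRUE if two products have at least one group in common.
share_group <- function(a, b) {
  length(intersect(groups_of(a), groups_of(b))) > 0
}

share_group("G", "15")  # G is in Group4 only, 15 is in Group5 only
share_group("H", "15")  # both are in Group5
groups_of("C")          # C is listed under more than one group
```

Note that product C appears under both Group2 and Group3, so any solution has to allow a product to map to several groups.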

TransactionID ProductID ProductType
             a1         A           1
             a1         B           1
             a1         1           2
             a2         C           1
             a2         4           2
             a2         5           2
             a3         D           1
             a3         C           1
             a3         7           2
             a3         8           2
             a4         H           1
             a5         1           2
             a5         2           2
             a5         3           2
             a5         3           2
             a5         1           2
             a6         H           1
             a6        15           2

My Code:

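So far, I have been able to write code with dplyr that shortlists the transactions for a single group. However, I am not sure how to vectorize my code across all groups.

Here is my code: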

P_Groups<-unique(P_Lookup$Group)
Chosen_Group<-P_Groups[5]

P_Group_Ind <- P_Trans %>%
group_by(TransactionID)%>%
dplyr::filter((ProductID %in% unique(P_Lookup[P_Lookup$Group==Chosen_Group,]$ProductID1)) | 
(ProductID %in% unique(P_Lookup[P_Lookup$Group==Chosen_Group,]$ProductID2)) ) %>%
mutate(No_of_PIDs = n_distinct(ProductType)) %>%
mutate(Group_Name = Chosen_Group)

P_Group_Ind<-P_Group_Ind[P_Group_Ind$No_of_PIDs>1,]

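This works fine as long as I select each group manually, i.e. by setting Chosen_Group. However, I am not sure how to automate this. One way, I think, is to use a for loop, but I know that the beauty of R is vectorization, so I would like to stay away from for loops.

I would sincerely appreciate any help. I have spent almost two days on this. I looked at "using dplyr in for loop in r", but that thread seems to discuss a different issue.
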
DATA: Here is the dput for P_Trans:

structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2", 
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5", 
"a6", "a6"), ProductID = c("A", "B", "1", "C", "4", "5", "D", 
"C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15"), ProductType = c(1, 
1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2)), .Names = c("TransactionID", 
"ProductID", "ProductType"), row.names = c(NA, 18L), class = "data.frame")

Here is the dput for P_Lookup:

structure(list(Group = c("Group1", "Group1", "Group1", "Group2", 
"Group2", "Group2", "Group3", "Group3", "Group3", "Group4", "Group4", 
"Group4", "Group5", "Group5", "Group5"), ProductID1 = c("A", 
"B", "B", "C", "C", "C", "D", "C", "C", "E", "F", "G", "H", "H", 
"H"), ProductID2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
14, 15)), .Names = c("Group", "ProductID1", "ProductID2"), row.names = c(NA, 
15L), class = "data.frame")

After adding a product that does not exist in the lookup table to P_Trans, this is the dput():

structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2", 
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5", 
"a6", "a6", "a7"), ProductID = c("A", "B", "1", "C", "4", "5", 
"D", "C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15", 
"22"), ProductType = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 
2, 2, 2, 1, 2, 3)), .Names = c("TransactionID", "ProductID", 
"ProductType"), row.names = c(NA, 19L), class = "data.frame")

2 Answers

  • 2

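    Below is a tidyverse (dplyr, tidyr and purrr) solution that I hope will help.

    Note that using map_df in the last line returns all results as a single data frame. If you would rather have a list with one element per group, just use map instead.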

    library(dplyr)
    library(tidyr)
    library(purrr)
    
    # Save unique groups for later use
    P_Groups <- unique(P_Lookup$Group)
    
    # Convert lookup table to product IDs and Groups
    P_Lookup <- P_Lookup %>% 
                  gather(ProductIDn, ProductID, ProductID1, ProductID2) %>% 
                  select(ProductID, Group) %>% 
                  distinct() %>% 
                  nest(-ProductID, .key = Group)
    
    # Bind Group information to transactions
    # and group for next analysis
    P_Trans <- P_Trans %>%
                 left_join(P_Lookup) %>%
                 filter(!map_lgl(Group, is.null)) %>%  
                 unnest(Group) %>% 
                 group_by(TransactionID)
    
    # Iterate through Groups to produce results
    map(P_Groups, ~ filter(P_Trans, Group == .)) %>% 
      map(~ mutate(., No_of_PIDs = n_distinct(ProductType))) %>% 
      map_df(~ filter(., No_of_PIDs > 1))
    #> Source: local data frame [12 x 5]
    #> Groups: TransactionID [4]
    #> 
    #>    TransactionID ProductID ProductType  Group No_of_PIDs
    #>            <chr>     <chr>       <dbl>  <chr>      <int>
    #> 1             a1         A           1 Group1          2
    #> 2             a1         B           1 Group1          2
    #> 3             a1         1           2 Group1          2
    #> 4             a2         C           1 Group2          2
    #> 5             a2         4           2 Group2          2
    #> 6             a2         5           2 Group2          2
    #> 7             a3         D           1 Group3          2
    #> 8             a3         C           1 Group3          2
    #> 9             a3         7           2 Group3          2
    #> 10            a3         8           2 Group3          2
    #> 11            a6         H           1 Group5          2
    #> 12            a6        15           2 Group5          2
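
    To illustrate the map versus map_df remark, here is a small sketch on toy data. The `pieces` list and its columns are made up for illustration and are not part of the question's data.

    ```r
    library(dplyr)
    library(purrr)

    # Toy per-group results (illustrative data only).
    pieces <- list(
      data.frame(Group = "Group1", No_of_PIDs = c(1, 2), stringsAsFactors = FALSE),
      data.frame(Group = "Group2", No_of_PIDs = 3, stringsAsFactors = FALSE)
    )

    # map() keeps one element per input, so the result is a list of data frames.
    as_list <- map(pieces, ~ filter(.x, No_of_PIDs > 1))

    # map_df() row-binds the per-element results into a single data frame.
    as_df <- map_df(pieces, ~ filter(.x, No_of_PIDs > 1))

    length(as_list)  # one element per group
    nrow(as_df)      # all surviving rows stacked together
    ```

    The list form is convenient when each group needs further separate processing, while the single data frame is easier to summarise or export.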
    
  • 4

    Here is a single-pipeline dplyr solution:

    P_DualGroupTransactionsCount <- 
        P_Lookup %>% # lookup table, which needs a single key column
        gather(IDnum, ProductID, ProductID1:ProductID2) %>% # reshape to one long ProductID-to-Group map (tidyr::)
        right_join(P_Trans, by = "ProductID") %>% # attach group info to every transaction row
        group_by(TransactionID, Group) %>% # organize by same transaction & same group
        mutate(DualGroup = n_distinct(ProductType) == 2) %>% # flag transaction/group pairs containing both product types
        filter(DualGroup) %>% # keep only the flagged pairs
        select(TransactionID, Group) %>% # remove excess columns
        distinct() %>%  # remove excess rows
        nrow() # count of unique transaction/group pairs
    
    # P_DualGroupTransactions
    # Source: local data frame [4 x 2]
    # Groups: TransactionID, Group [4]
    #     
    # TransactionID  Group
    #           <chr>  <chr>
    # 1            a1 Group1
    # 2            a2 Group2
    # 3            a3 Group3
    # 4            a6 Group5
    
    
    # P_DualGroupTransactionsCount
    # [1] 4
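
    In current tidyr, gather() is superseded by pivot_longer(), so the same idea can be sketched as below. This is only an illustration under a few assumptions: tidyr 1.0 or later and dplyr 1.0 or later are installed, the data are rebuilt inline so the block runs on its own, and `long_lookup`, `dual` and `has_both` are names chosen here rather than taken from the question. ProductID2 is converted to character first because pivot_longer() does not combine character and numeric columns on its own.

    ```r
    library(dplyr)
    library(tidyr)

    # Data from the question (the P_Trans version that includes transaction a7).
    P_Lookup <- data.frame(
      Group      = rep(paste0("Group", 1:5), each = 3),
      ProductID1 = c("A", "B", "B", "C", "C", "C", "D", "C", "C",
                     "E", "F", "G", "H", "H", "H"),
      ProductID2 = 1:15,
      stringsAsFactors = FALSE
    )
    P_Trans <- data.frame(
      TransactionID = c("a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3",
                        "a3", "a4", "a5", "a5", "a5", "a5", "a5", "a6", "a6", "a7"),
      ProductID     = c("A", "B", "1", "C", "4", "5", "D", "C", "7", "8", "H",
                        "1", "2", "3", "3", "1", "H", "15", "22"),
      ProductType   = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 3),
      stringsAsFactors = FALSE
    )

    # One row per (Group, ProductID) pair, with both ID columns as character.
    long_lookup <- P_Lookup %>%
      mutate(ProductID2 = as.character(ProductID2)) %>%
      pivot_longer(c(ProductID1, ProductID2),
                   names_to = "IDnum", values_to = "ProductID") %>%
      distinct(Group, ProductID)

    # Keep transaction/group pairs that contain both Type 1 and Type 2 products.
    # Recent dplyr versions may warn about a many-to-many join here; that is
    # expected, because product C belongs to two groups.
    dual <- P_Trans %>%
      inner_join(long_lookup, by = "ProductID") %>%
      group_by(TransactionID, Group) %>%
      summarise(has_both = all(c(1, 2) %in% ProductType), .groups = "drop") %>%
      filter(has_both)

    nrow(dual)  # number of qualifying transaction/group pairs
    ```

    Using inner_join() rather than right_join() drops products that are missing from the lookup table (such as product 22 in transaction a7) before grouping, and testing for both types explicitly avoids relying on there being exactly two distinct type values.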
    
