
Recoding similar factor levels across multiple data frames using purrr and dplyr


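Below are two simple data frames. I'd like to recode (collapse) the Sat1 and Sat2 columns so that all degrees of satisfaction are coded as Satisfied and all degrees of dissatisfaction are coded as Dissatisfied. Neutral stays Neutral. The factors will therefore have three levels: Satisfied, Dissatisfied, and Neutral.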

I would normally do this by binding the data frames and then using lapply together with recode from the car package, for example:

DF1[2:3] <- lapply(DF1[2:3], recode, c('"Somewhat Satisfied"= "Satisfied","Satisfied"="Satisfied","Extremely Dissatisfied"="Dissatisfied"........etc, etc

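I'd like to accomplish this with one of the map functions from purrr, specifically map_at (to keep the data frame, but I'm new to purrr, so feel free to suggest other versions of map), together with dplyr, tidyr, stringr and ggplot2, so that everything can be easily pipelined.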

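The example at the link below is the kind of thing I'd like to accomplish, but for recoding, and I haven't been able to make it work.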

http://www.r-bloggers.com/using-purrr-with-dplyr/

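I'd like to use map_at or a similar map function so that I can keep the original Sat1 and Sat2 columns, with the recoded columns added to the data frame under new names. It would be great if this step could also be wrapped in a function.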

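In reality I will have many data frames, so I only want to define the recoding of the factor levels once and then use functions from purrr to apply it across all the data frames with a minimal amount of code.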

Names<-c("James","Chris","Jessica","Tomoki","Anna","Gerald")
Sat1<-c("Satisfied","Very Satisfied","Dissatisfied","Somewhat Satisfied","Dissatisfied","Neutral")
Sat2<-c("Very Dissatisfied","Somewhat Satisfied","Neutral","Neutral","Satisfied","Satisfied")
Program<-c("A","B","A","C","B","D")
Pets<-c("Snake","Dog","Dog","Dog","Cat","None")

DF1<-data.frame(Names,Sat1,Sat2,Program,Pets)

Names<-c("Tim","John","Amy","Alberto","Desrahi","Francesca")
Sat1<-c("Extremely Satisfied","Satisfied","Satisfed","Somewhat Dissatisfied","Dissatisfied","Satisfied")
Sat2<-c("Dissatisfied","Somewhat Dissatisfied","Neutral","Extremely Dissatisfied","Somewhat Satisfied","Somewhat Dissatisfied")
Program<-c("A","B","A","C","B","D")


DF2<-data.frame(Names,Sat1,Sat2,Program)

2 Answers

  • 1

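    One way to do this is to let mutate_each do the work, in combination with one of the map functions to go through a list of data.frames. Using mutate_each from dplyr_0.4.3.9001 or an equivalent version allows the new columns to be renamed.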

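    In this case you could use string manipulation instead of recoding. I believe you want to pull Satisfied, Dissatisfied, or Neutral out of the current strings. You can do this with sub and a regular expression. For example,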

    sub(".*(Satisfied|Dissatisfied|Neutral).*$", "\\1", DF2$Sat2)
    "Dissatisfied" "Dissatisfied" "Neutral"      "Dissatisfied" "Satisfied"    "Dissatisfied"
    

    The stringr package has a nice function for extracting specific strings, str_extract.

    library(stringr)
    str_extract(DF2$Sat2, "Satisfied|Neutral|Dissatisfied")
     "Dissatisfied" "Dissatisfied" "Neutral"      "Dissatisfied" "Satisfied"    "Dissatisfied"
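
    Since the question asks for factors with three levels, the extracted strings can also be wrapped in factor(). The following is only a sketch: collapse_sat is a hypothetical helper name and the input vector is made up for illustration. It also shows that a value matching none of the three words (such as the misspelled "Satisfed" in DF2) comes back as NA rather than raising an error.

    ```r
    library(stringr)

    # Hypothetical helper (not part of the original answer): collapse the
    # satisfaction strings into a three-level factor.
    collapse_sat <- function(x) {
      factor(str_extract(as.character(x), "Satisfied|Neutral|Dissatisfied"),
             levels = c("Satisfied", "Neutral", "Dissatisfied"))
    }

    # Illustrative input; "Satisfed" is misspelled on purpose.
    sat <- c("Extremely Satisfied", "Satisfed", "Somewhat Dissatisfied", "Neutral")
    res <- collapse_sat(sat)

    levels(res)   # the three target levels
    is.na(res)    # TRUE only for the misspelled entry
    ```

    Checking is.na() on the result is a cheap way to spot typos in the raw data before they silently turn into missing values.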
    

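    You can use one of these functions on multiple columns inside mutate_each. The name you give the function in funs is appended to the new column names. I used recode. For one of your datasets: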

    DF1 %>% 
        mutate_each( funs(recode = str_extract(., "Satisfied|Neutral|Dissatisfied") ), 
                  starts_with("Sat") )
    
        Names               Sat1               Sat2 Program  Pets  Sat1_recode  Sat2_recode
    1   James          Satisfied  Very Dissatisfied       A Snake    Satisfied Dissatisfied
    2   Chris     Very Satisfied Somewhat Satisfied       B   Dog    Satisfied    Satisfied
    3 Jessica       Dissatisfied            Neutral       A   Dog Dissatisfied      Neutral
    4  Tomoki Somewhat Satisfied            Neutral       C   Dog    Satisfied      Neutral
    5    Anna       Dissatisfied          Satisfied       B   Cat Dissatisfied    Satisfied
    6  Gerald            Neutral          Satisfied       D  None      Neutral    Satisfied
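
    mutate_each and funs were deprecated in later dplyr releases. Assuming dplyr 1.0.0 or newer, roughly the same result can be sketched with across(), where the .names argument controls the suffix of the new columns. The small data frame here is a cut-down stand-in for DF1, used only for illustration.

    ```r
    library(dplyr)
    library(stringr)

    # Cut-down stand-in for DF1 (illustrative only).
    df <- data.frame(
      Names = c("James", "Chris", "Jessica"),
      Sat1  = c("Satisfied", "Very Satisfied", "Dissatisfied"),
      Sat2  = c("Very Dissatisfied", "Somewhat Satisfied", "Neutral"),
      stringsAsFactors = FALSE
    )

    # Keep Sat1/Sat2 and add Sat1_recode/Sat2_recode next to them.
    df_recoded <- df %>%
      mutate(across(starts_with("Sat"),
                    ~ str_extract(as.character(.x), "Satisfied|Neutral|Dissatisfied"),
                    .names = "{.col}_recode"))

    df_recoded
    ```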
    

    To go through many datasets stored in a list, you can use a map function from purrr to apply a function to each element of the list.

    list(DF1, DF2) %>%
        map(~mutate_each(.x, 
                      funs(recode = str_extract(., "Satisfied|Neutral|Dissatisfied") ), 
                      starts_with("Sat")) )
    
    [[1]]
        Names               Sat1               Sat2 Program  Pets  Sat1_recode  Sat2_recode
    1   James          Satisfied  Very Dissatisfied       A Snake    Satisfied Dissatisfied
    2   Chris     Very Satisfied Somewhat Satisfied       B   Dog    Satisfied    Satisfied
    ...
    [[2]]
          Names                  Sat1                   Sat2 Program  Sat1_recode  Sat2_recode
    1       Tim   Extremely Satisfied           Dissatisfied       A    Satisfied Dissatisfied
    2      John             Satisfied  Somewhat Dissatisfied       B    Satisfied Dissatisfied
    ...
    

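    Using map_df will bind all the elements of the list into a single data.frame, which may or may not be what you want. Use the .id argument to add an identifier for each original dataset.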

    list(DF1, DF2) %>%
        map_df(~mutate_each(.x, 
                      funs(recode = str_extract(., "Satisfied|Neutral|Dissatisfied")), 
                      starts_with("Sat")), .id = "Group")
    
       Group     Names                  Sat1                   Sat2 Program  Pets  Sat1_recode
    1      1     James             Satisfied      Very Dissatisfied       A Snake    Satisfied
    2      1     Chris        Very Satisfied     Somewhat Satisfied       B   Dog    Satisfied
    3      1   Jessica          Dissatisfied                Neutral       A   Dog Dissatisfied
    4      1    Tomoki    Somewhat Satisfied                Neutral       C   Dog    Satisfied
    5      1      Anna          Dissatisfied              Satisfied       B   Cat Dissatisfied
    6      1    Gerald               Neutral              Satisfied       D  None      Neutral
    7      2       Tim   Extremely Satisfied           Dissatisfied       A  <NA>    Satisfied
    8      2      John             Satisfied  Somewhat Dissatisfied       B  <NA>    Satisfied
    ...
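
    The question also asks for the step to be wrapped in a function. A minimal sketch of that, assuming dplyr 1.0.0 or newer: recode_sat is a hypothetical helper name, and the two small data frames and their list names are invented for illustration. With a named list, the .id column carries the names instead of 1, 2, and so on. As far as I know, map_df is superseded in recent purrr releases, so this sketch uses map() followed by dplyr::bind_rows().

    ```r
    library(dplyr)
    library(purrr)
    library(stringr)

    # Hypothetical helper: add a *_recode column for every column starting with "Sat".
    recode_sat <- function(df) {
      df %>%
        mutate(across(starts_with("Sat"),
                      ~ str_extract(as.character(.x), "Satisfied|Neutral|Dissatisfied"),
                      .names = "{.col}_recode"))
    }

    # Illustrative named list; the names end up in the Group column.
    dfs <- list(
      survey_a = data.frame(Names = c("James", "Chris"),
                            Sat1  = c("Satisfied", "Very Satisfied"),
                            Pets  = c("Snake", "Dog"),
                            stringsAsFactors = FALSE),
      survey_b = data.frame(Names = c("Tim", "John"),
                            Sat1  = c("Extremely Satisfied", "Somewhat Dissatisfied"),
                            stringsAsFactors = FALSE)
    )

    # Recode each data frame, then stack them; missing columns are filled with NA.
    combined <- dfs %>%
      map(recode_sat) %>%
      bind_rows(.id = "Group")

    combined
    ```

    Dropping the bind_rows() step keeps the result as a named list of recoded data frames.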
    
  • 1

    I do big recodings like this with a join, and in this case I think converting to a long data frame makes the problem easier to think about.

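    library(tidyr)
    library(dplyr)
    
    # Reshape to long format: one row per person and Sat column
    mdf <- DF1 %>% 
      gather(var, value, starts_with("Sat"))
    
    # Lookup table mapping each original level to its recoded value
    recode_df <- tibble( value = c("Extremely Satisfied","Satisfied","Somewhat Dissatisfied","Dissatisfied"),
                         recode = 1:4)
    mdf <- left_join(mdf, recode_df, by = "value")
    # Drop the original values so that spread() returns one row per person
    mdf %>% select(-value) %>% spread(var, recode)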
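
    The same join idea can be sketched with a lookup table that maps each original level to one of the three target labels, using pivot_longer and pivot_wider (this assumes tidyr 1.0.0 or newer). The data frame, the lookup rows, and the _recode suffix are illustrative choices, not taken from the question; a real lookup table would need one row for every level that occurs in the data.

    ```r
    library(dplyr)
    library(tidyr)

    # Illustrative data, loosely modelled on DF2.
    df <- data.frame(
      Names = c("Tim", "John", "Amy"),
      Sat1  = c("Extremely Satisfied", "Satisfied", "Somewhat Dissatisfied"),
      Sat2  = c("Dissatisfied", "Neutral", "Somewhat Satisfied"),
      stringsAsFactors = FALSE
    )

    # Lookup table: defined once, reusable for every data frame.
    lookup <- tibble(
      value  = c("Extremely Satisfied", "Satisfied", "Somewhat Satisfied",
                 "Neutral", "Somewhat Dissatisfied", "Dissatisfied"),
      recode = c("Satisfied", "Satisfied", "Satisfied",
                 "Neutral", "Dissatisfied", "Dissatisfied")
    )

    # Long format -> join on the original value -> back to wide format.
    wide <- df %>%
      pivot_longer(starts_with("Sat"), names_to = "var", values_to = "value") %>%
      left_join(lookup, by = "value") %>%
      pivot_wider(id_cols = Names, names_from = var,
                  values_from = recode, names_glue = "{var}_recode")

    # Attach the recoded columns to the original data frame.
    result <- left_join(df, wide, by = "Names")
    result
    ```

    Levels missing from the lookup table come back as NA after the join, which makes unexpected spellings easy to find.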
    
