
Replacing '\\' or '\\\\' in a Spark data frame via sparklyr fails


I am trying to replace backslashes in a Spark data frame. I wrote a function that works fine on an R data frame, but when I plug it into spark_apply it does not work:

library(sparklyr)

rm(back_slash_replace_func)

## replace '\' with '/' in every character column of a data frame
back_slash_replace_func <- function(x)
{
     cbind.data.frame(
          lapply(
               x,
               function(x) {
                    if (class(x) == 'character') {
                         gsub(pattern = "\\", replacement = "/", x = x, fixed = T)
                    } else { x }
               }
          )
     , stringsAsFactors = F
     )
}

## do in R

x <- data.frame(x = rep('\\', 10), stringsAsFactors = F)

back_slash_replace_func(x)

## do in spark

r_spark_connection <- spark_connect(master = "local")

xsp <- copy_to(r_spark_connection, x, overwrite = T)

start <- Sys.time()

spark_apply(
               x = xsp
               , f = back_slash_replace_func
               , memory = F
               )

Sys.time() - start

It does not do the job: no error, no warning. What could be the cause?

1 Answer


    The first thing you should notice is that copy_to mangles your data. While x is:

    x %>% head(1)
    #    x
    # 1 \\
    

    the copied xsp is:

    xsp %>% head(1)
    # # Source:   lazy query [?? x 1]
    # # Database: spark_connection
    #   x    
    #   <chr>
    # 1 "\""
    

    This is because sparklyr dumps the data to a flat file when you use copy_to. As a result, the replacement doesn't even work locally:

    xsp %>% collect %>% back_slash_replace_func %>% head(1)
    #   x
    # 1 "
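
    To convince yourself that the mangling happens on the way into Spark rather than inside back_slash_replace_func, a quick check (just a sketch) is to round-trip the column and compare it with the original:

    identical(x$x, collect(xsp)$x)
    # expected: FALSE, because the backslash was already garbled by copy_to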
    

    If you create the data frame directly instead:

    df <- spark_session(r_spark_connection) %>%
      invoke("sql", "SELECT '\\\\' AS x FROM range(10)") %>% 
      sdf_register() 
    
    df %>% collect %>% back_slash_replace_func %>% head(1)
    #   x
    # 1 /
    

    this particular problem does not occur.
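
    If your version of sparklyr exports sdf_sql, the same frame can be built more compactly (a sketch, assuming that helper is available):

    df <- sdf_sql(r_spark_connection, "SELECT '\\\\' AS x FROM range(10)")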

    The other problem here is that spark_apply actually converts strings to factors (this is tracked, per Kevin's comment), so instead of:

    function(x) {
      if (is.character(x)) {
        gsub(pattern = "\\", replacement = "/", x = x, fixed = T)
      } else { x }
    }
    

    you would rather need:

    function(x) {
      if (is.factor(x)) {
        gsub(pattern = "\\", replacement = "/", x = as.character(x), fixed = T)
      } else { x }
    }
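
    With that adjustment you can retry spark_apply on the directly created df. A minimal sketch (results may still vary across sparklyr versions, since the factor conversion is exactly what the tracked issue covers):

    spark_apply(
         x = df
         , f = function(sdf) {
              cbind.data.frame(
                   lapply(sdf, function(col) {
                        ## cover both cases: strings may arrive as factors or as characters
                        if (is.factor(col) || is.character(col)) {
                             gsub(pattern = "\\", replacement = "/", x = as.character(col), fixed = T)
                        } else { col }
                   })
              , stringsAsFactors = F
              )
         }
    ) %>% head(1)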
    

    But in practice you can just use translate:

    df %>% mutate(x = translate(x, "\\\\", "/")) %>% head(1)
    # # Source:   lazy query [?? x 1]
    # # Database: spark_connection
    #   x    
    #   <chr>
    # 1 /
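
    Because translate is a built-in Spark SQL function, the substitution runs entirely inside Spark and avoids the serialization round trip through R that spark_apply requires, which makes it both simpler and usually faster for a plain character replacement like this.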
    
