
Convert a PySpark dataframe column from list to string


I have this PySpark dataframe:

+-----------+--------------------+
|uuid       |   test_123         |    
+-----------+--------------------+
|      1    |[test, test2, test3]|
|      2    |[test4, test, test6]|
|      3    |[test6, test9, t55o]|
+-----------+--------------------+

I want to convert the column test_123 to look like this:

+-----------+--------------------+
|uuid       |   test_123         |
+-----------+--------------------+
|      1    |"test,test2,test3"  |
|      2    |"test4,test,test6"  |
|      3    |"test6,test9,t55o"  |
+-----------+--------------------+

So, from a list to a string.

How can I do this with PySpark?

2 Answers

  • 8

    You can create a udf that joins the array/list, then apply it to the test column:

    from pyspark.sql.functions import udf, col

    # udf that joins the array elements into a single comma-separated string
    join_udf = udf(lambda x: ",".join(x))
    df.withColumn("test_123", join_udf(col("test_123"))).show()
    
    +----+----------------+
    |uuid|        test_123|
    +----+----------------+
    |   1|test,test2,test3|
    |   2|test4,test,test6|
    |   3|test6,test9,t55o|
    +----+----------------+
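
    By default, a Python udf returns StringType, so the example above works as-is. If you prefer to make the return type explicit (a minor variant, not required), you can pass it as the second argument:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    # Same join udf, with the (default) return type spelled out explicitly
    join_udf = udf(lambda x: ",".join(x), StringType())
    df.withColumn("test_123", join_udf(col("test_123"))).show()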
    

    The initial dataframe was created as follows:

    from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

    schema = StructType([
        StructField("uuid", IntegerType(), True),
        StructField("test_123", ArrayType(StringType(), True), True),
    ])
    rdd = sc.parallelize([[1, ["test","test2","test3"]], [2, ["test4","test","test6"]], [3, ["test6","test9","t55o"]]])
    df = spark.createDataFrame(rdd, schema)
    
    df.show()
    +----+--------------------+
    |uuid|            test_123|
    +----+--------------------+
    |   1|[test, test2, test3]|
    |   2|[test4, test, test6]|
    |   3|[test6, test9, t55o]|
    +----+--------------------+
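
    As a side note, the same dataframe can also be built without an intermediate RDD by passing a list of tuples straight to spark.createDataFrame (a sketch, assuming an active SparkSession named spark):

    df = spark.createDataFrame(
        [(1, ["test", "test2", "test3"]),
         (2, ["test4", "test", "test6"]),
         (3, ["test6", "test9", "t55o"])],
        schema,
    )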
    
  • 6

    While you can use a UserDefinedFunction, it is very inefficient: a Python UDF serializes every row between the JVM and a Python worker process, whereas concat_ws is a built-in that runs entirely inside the JVM. It is better to use the concat_ws function instead:

    from pyspark.sql.functions import concat_ws

    # Join the array elements with "," using the built-in concat_ws
    df.withColumn("test_123", concat_ws(",", "test_123")).show()
    
    +----+----------------+
    |uuid|        test_123|
    +----+----------------+
    |   1|test,test2,test3|
    |   2|test4,test,test6|
    |   3|test6,test9,t55o|
    +----+----------------+
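
    One practical difference to keep in mind: concat_ws skips null values, so a null array yields an empty string, while the udf from the first answer would fail on a null input because ",".join(None) raises a TypeError in Python. A minimal null-safe sketch of the udf variant, if you need one:

    from pyspark.sql.functions import udf

    # Hypothetical null-safe variant: return None for a null array instead of crashing
    join_udf = udf(lambda x: ",".join(x) if x is not None else None)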
    
