
Get the distinct elements of an ArrayType column in a Spark DataFrame


I have a DataFrame with three columns named id, feat1 and feat2. feat1 and feat2 are arrays of strings:

id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements in each feature column, so the output would be:

distinct_feat1,distinct_feat2
-----------------------------  
["feat1_1","feat1_2","feat1_3","feat1_4"], ["feat2_1","feat2_2","feat2_3"]

What is the best way to do this in Scala?

2 Answers

  • 0

    You can apply the explode function on each column to unnest the array elements in each cell, then use collect_set to collect the distinct values of the corresponding column. Suppose your DataFrame is named df:

    import org.apache.spark.sql.functions._
    
    val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
                         withColumn("feat2", explode(col("feat2"))).
                         agg(collect_set("feat1").alias("distinct_feat1"), 
                             collect_set("feat2").alias("distinct_feat2"))
    
    distinct_df.show
    +--------------------+--------------------+
    |      distinct_feat1|      distinct_feat2|
    +--------------------+--------------------+
    |[feat1_1, feat1_2...|[, feat2_1, feat2...|
    +--------------------+--------------------+
    
    
    distinct_df.take(1)
    res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
                                                    WrappedArray(, feat2_1, feat2_2, feat2_3)])
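The explode + collect_set pipeline amounts to flattening each array column and de-duplicating. A minimal plain-Python sketch of the same logic on the question's sample rows (no Spark required; sorted only to make the result deterministic):

```python
# Sample rows from the question: (id, feat1, feat2)
rows = [
    (1, ["feat1_1", "feat1_2", "feat1_3"], []),
    (2, ["feat1_2"], ["feat2_1", "feat2_2"]),
    (3, ["feat1_4"], ["feat2_3"]),
]

# "explode" flattens each array column; "collect_set" keeps distinct values.
distinct_feat1 = sorted({v for _, feat1, _ in rows for v in feat1})
distinct_feat2 = sorted({v for _, _, feat2 in rows for v in feat2})

print(distinct_feat1)  # ['feat1_1', 'feat1_2', 'feat1_3', 'feat1_4']
print(distinct_feat2)  # ['feat2_1', 'feat2_2', 'feat2_3']
```

One caveat: in Spark, explode drops rows whose array is empty, so chaining explode over both columns in the same pipeline can discard feat1 values from a row whose feat2 is empty (like row 1 above). Exploding each column in a separate aggregation, or using explode_outer where available, avoids that loss.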
    
  • 2

    The approach given by Psidom works great; here is a function that does the same thing, given a DataFrame and a list of fields:

    def array_unique_values(df, fields):
        from pyspark.sql.functions import col, collect_set, explode
        from functools import reduce
        # Explode each array column in turn, then collect the distinct
        # values of every field in a single aggregation.
        data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
        return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])
    

    Then:

    data = array_unique_values(df, my_fields)
    data.take(1)
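
The generalized helper follows the same pattern as the first answer, just folded over a list of fields. A plain-Python sketch of its semantics (no Spark required; rows modeled as dicts, output sorted for determinism):

```python
def array_unique_values(rows, fields):
    """Return {field + '_distinct': sorted distinct elements} for each field."""
    return {
        f + "_distinct": sorted({v for row in rows for v in row[f]})
        for f in fields
    }

data = [
    {"feat1": ["feat1_1", "feat1_2"], "feat2": []},
    {"feat1": ["feat1_2"], "feat2": ["feat2_1"]},
]
result = array_unique_values(data, ["feat1", "feat2"])
print(result)  # {'feat1_distinct': ['feat1_1', 'feat1_2'], 'feat2_distinct': ['feat2_1']}
```

Unlike the chained-explode Spark version, this sketch does not drop a row when one of its arrays is empty, so feat1 values survive even where feat2 is [].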
    
