Grouping a Spark DataFrame multiple times - Java 学习之路

val df = (Seq((1, "a", "10"),(1,"b", "12"),(1,"c", "13"),(2, "a", "14"),
              (2,"c", "11"),(1,"b","12" ),(2, "c", "12"),(3,"r", "11")).
          toDF("col1", "col2", "col3"))

So I have a Spark DataFrame with three columns.

I need to perform two levels of groupBy on it, as described below.

Level 1: If I group by col1 and sum col3, I get the following two columns: 1. col1 2. sum(col3). I lose col2 here.

Level 2: If I instead group by col1 and col2 and sum col3, I get the following three columns: 1. col1 2. col2 3. sum(col3)

What I actually need is to perform both levels of groupBy and have both columns in the final DataFrame: the level 1 sum(col3) and the level 2 sum(col3).

How can I do this? Can anyone explain?

Spark: 1.6.2, Scala: 2.10

1 Answer

One option is to compute the two sums separately and then join them on col1:

import org.apache.spark.sql.functions.sum

(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")).show)

+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   3|   r|      11.0|      11.0|
|   1|   a|      10.0|      47.0|
+----+----+----------+----------+
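The sums above print as doubles (23.0, 37.0) because col3 holds strings, which Spark casts to double before summing. If integer sums are preferred, one possibility is to cast col3 once up front. This is a minimal sketch, assuming a standalone Spark 1.6 application running against a local master; the object name TypedTwoLevelSum and the helper withBothSums are hypothetical and not part of the original answer:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, sum}

// Hypothetical helper object, for illustration only.
object TypedTwoLevelSum {

  // Returns col1, col2, sum_level2 and sum_level1, with col3 summed as integers.
  def withBothSums(df: DataFrame): DataFrame = {
    // Cast once so both aggregations see an integer column instead of strings.
    val typed = df.withColumn("col3", col("col3").cast("int"))
    val level2 = typed.groupBy("col1", "col2").agg(sum(col("col3")).as("sum_level2"))
    val level1 = typed.groupBy("col1").agg(sum(col("col3")).as("sum_level1"))
    // Joining on col1 attaches the per-col1 total to every (col1, col2) row.
    level2.join(level1, Seq("col1"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, just to try the sketch out.
    val conf = new SparkConf().setAppName("two-level-sum").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Same rows as in the question.
    val df = Seq((1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
                 (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11"))
      .toDF("col1", "col2", "col3")

    withBothSums(df).show()
    sc.stop()
  }
}
```

With an integer col3, Spark returns both sums as long values rather than doubles.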

Another option is to use a window function, since sum_level1 is just the sum of sum_level2 over all rows that share the same col1:

import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"col1")

(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    withColumn("sum_level1", sum($"sum_level2").over(w)).show)

+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   1|   a|      10.0|      47.0|
|   3|   r|      11.0|      11.0|
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
+----+----+----------+----------+
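Both tables can be cross-checked without Spark. The sketch below uses plain Scala collections on the same rows to confirm that summing the level 2 sums per col1 gives the same totals as summing col3 per col1 directly, which is the property the window approach relies on. The object name TwoLevelSumCheck and its member names are hypothetical, chosen only for this illustration:

```scala
// Hypothetical stand-alone check, using plain Scala collections instead of Spark.
object TwoLevelSumCheck {
  // Same rows as the question's DataFrame: (col1, col2, col3).
  val rows = Seq((1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
                 (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11"))

  // Level 2: sum col3 per (col1, col2) pair.
  val sumLevel2: Map[(Int, String), Int] =
    rows.groupBy(r => (r._1, r._2)).map { case (key, rs) => key -> rs.map(_._3.toInt).sum }

  // Level 1, computed directly from the raw rows: sum col3 per col1.
  val sumLevel1: Map[Int, Int] =
    rows.groupBy(_._1).map { case (key, rs) => key -> rs.map(_._3.toInt).sum }

  // Level 1, rebuilt from the level 2 sums, mirroring what the window does.
  val sumLevel1FromLevel2: Map[Int, Int] =
    sumLevel2.groupBy { case ((c1, _), _) => c1 }.map { case (c1, m) => c1 -> m.values.sum }

  def main(args: Array[String]): Unit = {
    // One line per (col1, col2) pair with both sums, sorted for stable output.
    sumLevel2.toSeq.sortBy(_._1).foreach { case ((c1, c2), s2) =>
      println(s"$c1 $c2 $s2 ${sumLevel1(c1)}")
    }
    // Reports whether the two ways of computing level 1 agree.
    println(sumLevel1 == sumLevel1FromLevel2)
  }
}
```

The per-col1 totals it computes (47, 37 and 11) match the sum_level1 column in both tables above.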

Answered on 2024-04-20T00:28:57+08:00
