Aggregating (sum) over a window for a list of columns

I can't find a generic way to compute a sum (or any other aggregate function) over a given window for a list of columns available in a DataFrame.

val inputDF = spark
.sparkContext
.parallelize(
    Seq(
        (1,2,1, 30, 100),
        (1,2,2, 30, 100), 
        (1,2,3, 30, 100),
        (11,21,1, 30, 100),
        (11,21,2, 30, 100), 
        (11,21,3, 30, 100)
    ),
    10)
.toDF("c1", "c2", "offset", "v1", "v2")

inputDF.show
+---+---+------+---+---+
| c1| c2|offset| v1| v2|
+---+---+------+---+---+
|  1|  2|     1| 30|100|
|  1|  2|     2| 30|100|
|  1|  2|     3| 30|100|
| 11| 21|     1| 30|100|
| 11| 21|     2| 30|100|
| 11| 21|     3| 30|100|
+---+---+------+---+---+

Given the DataFrame shown above, it is easy to compute the sum for a list of columns with groupBy, as in the snippet below:

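import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Grouping and ordering keys, built from column names
val groupKey = List("c1", "c2").map(x => col(x.trim))
val orderByKey = List("offset").map(x => col(x.trim))

// One sum(...) expression per value column, aliased back to the column name
val aggKey = List("v1", "v2").map(c => sum(c).alias(c.trim))

// Window spec, used by the per-column workaround further down
val w = Window.partitionBy(groupKey: _*).orderBy(orderByKey: _*)

val outputDF = inputDF
  .groupBy(groupKey: _*)
  .agg(aggKey.head, aggKey.tail: _*)

outputDF.show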

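But I can't seem to find a similar approach for applying aggregate functions over a window specification. So far, I have only been able to solve this by specifying each column individually, as shown below: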

val outputDF2 = inputDF
    .withColumn("cumulative_v1", sum(when($"offset".between(-1, 1), inputDF("v1")).otherwise(0)).over(w))
    .withColumn("cumulative_v3", sum(when($"offset".between(-2, 2), inputDF("v1")).otherwise(0)).over(w))

If there is a way to do this aggregation over a dynamic list of columns, I would appreciate any pointers. Thanks!

1 Answer

I think I found a better approach than the one in the question above.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

/**
    * Utility method that takes a DataFrame and a list of columns, and returns the DataFrame with one windowed aggregate column added per listed column
    * @param colsToAggregate    Seq[String] of the columns in the input DataFrame to be aggregated
    * @param inputDF            Input DataFrame
    * @param f                  aggregate function, passed as a function from column name to Column (for example c => sum(c))
    * @param partitionByColSeq  Seq[String] of column names to partition the inputDF by before applying the aggregate
    * @param orderByColSeq      Seq[String] of column names to order the inputDF by before applying the aggregate
    * @param name_prefix        String to prefix the new column names with, to avoid collisions
    * @param name               Maps an aggregated column name to the new column's base name. Defaults to the identity function, which reuses the aggregated column name
    * @return                   output DataFrame
    */
  def withRollingAggregateColumns(colsToAggregate: Seq[String],
                                  inputDF: DataFrame,
                                  f: String => Column,
                                  partitionByColSeq: Seq[String],
                                  orderByColSeq: Seq[String],
                                  name_prefix: String,
                                  name: String => String = identity): DataFrame = {

    val groupByKey = partitionByColSeq.map(x => col(x.trim))
    val orderByKey = orderByColSeq.map(x => col(x.trim))

    import org.apache.spark.sql.expressions.Window

    val w = Window.partitionBy(groupByKey: _*).orderBy(orderByKey: _*)

    colsToAggregate
      .foldLeft(inputDF)(
        (df, elementInCols) => df
          .withColumn(
            name_prefix + "_" + name(elementInCols),
            f(elementInCols).over(w)
          )
      )
  }

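Here the utility method takes a DataFrame as input and appends new columns computed with the supplied function f. It uses withColumn inside a foldLeft to iterate over the list of columns to aggregate. To avoid column name collisions, it prepends the user-supplied prefix to the name of each new aggregated column.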
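To make the call site concrete, here is a minimal sketch of how the utility might be used on the sample data from the question. The object name RollingAggregateExample, the helper runExample, the "cumulative" prefix, and the local SparkSession settings are assumptions made for illustration. The utility body is repeated in compact form only so that the sketch compiles on its own.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

object RollingAggregateExample {

  // Same logic as the utility in the answer, repeated so this sketch compiles on its own.
  def withRollingAggregateColumns(colsToAggregate: Seq[String],
                                  inputDF: DataFrame,
                                  f: String => Column,
                                  partitionByColSeq: Seq[String],
                                  orderByColSeq: Seq[String],
                                  name_prefix: String,
                                  name: String => String = identity): DataFrame = {
    val w = Window
      .partitionBy(partitionByColSeq.map(x => col(x.trim)): _*)
      .orderBy(orderByColSeq.map(x => col(x.trim)): _*)

    // Add one windowed aggregate column per entry in colsToAggregate.
    colsToAggregate.foldLeft(inputDF) { (df, c) =>
      df.withColumn(name_prefix + "_" + name(c), f(c).over(w))
    }
  }

  // Hypothetical helper: rebuilds the question's sample data and applies the utility.
  def runExample(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val inputDF = Seq(
      (1, 2, 1, 30, 100),
      (1, 2, 2, 30, 100),
      (1, 2, 3, 30, 100),
      (11, 21, 1, 30, 100),
      (11, 21, 2, 30, 100),
      (11, 21, 3, 30, 100)
    ).toDF("c1", "c2", "offset", "v1", "v2")

    withRollingAggregateColumns(
      colsToAggregate = Seq("v1", "v2"),
      inputDF = inputDF,
      f = (c: String) => sum(col(c)), // any String => Column aggregate can be passed here
      partitionByColSeq = Seq("c1", "c2"),
      orderByColSeq = Seq("offset"),
      name_prefix = "cumulative"
    )
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rolling-aggregate-example")
      .getOrCreate()
    try runExample(spark).orderBy("c1", "offset").show()
    finally spark.stop()
  }
}
```

Because the window is ordered and no frame is specified, Spark uses its default frame, which runs from the start of the partition to the current row. The new columns cumulative_v1 and cumulative_v2 therefore hold running totals within each (c1, c2) partition: 30, 60, 90 and 100, 200, 300. For a sliding window closer to the one in the question, one option would be to add a frame such as rowsBetween(-1, 1) to the window spec; the utility as written does not expose that as a parameter.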

Answered on 2024-05-16T13:18:53+08:00
