
How to convert multiple Spark DataFrame columns into a single column containing an array of strings

I would like to know how to "merge" multiple DataFrame columns into one column holding an array of strings.

For example, I have this DataFrame:

val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")

which looks like this:

scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
|  1|Jack|   125|   Text|
|  2|Mary|   152|  Text2|
+---+----+------+-------+

scala> df.printSchema
root
 |-- Id: integer (nullable = false)
 |-- Name: string (nullable = true)
 |-- Number: string (nullable = true)
 |-- Comment: string (nullable = true)

How can I transform it so that it looks like this:

scala> df.show
+---+-----------------+
| Id|             List|
+---+-----------------+
|  1|  [Jack,125,Text]|
|  2| [Mary,152,Text2]|
+---+-----------------+

scala> df.printSchema
root
 |-- Id: integer (nullable = false)
 |-- List: array (nullable = true)
 |    |-- element: string (containsNull = true)

1 Answer

  • 7

    Use org.apache.spark.sql.functions.array:

    import org.apache.spark.sql.functions._
    val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")
    
    result.show()
    // +---+------------------+
    // |Id |List              |
    // +---+------------------+
    // |1  |[Jack, 125, Text] |
    // |2  |[Mary, 152, Text2]|
    // +---+------------------+
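
    As a follow-up sketch (assuming the same imports and implicits as above), you can verify the new schema and, if you would rather not list the columns by hand, build the array from a computed column list; excluding "Id" here is just for illustration:

    // Check that List is now an array<string> column
    result.printSchema()

    // Sketch: build the array from all columns except "Id"
    val otherCols = df.columns.filter(_ != "Id").map(col)
    val result2 = df.select($"Id", array(otherCols: _*) as "List")
    result2.show(false)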
    
