How do I drop multiple columns from a Spark DataFrame?

I have a CSV in which some column headers, and their corresponding values, are null. How can I drop the columns whose header is null? A sample CSV:

"name"|"age"|"city"|"null"|"null"|"null"
"abcd"|"21" |"7yhj"|"null"|"null"|"null"
"qazx"|"31" |"iuhy"|"null"|"null"|"null"
"foob"|"51" |"barx"|"null"|"null"|"null"

I want to drop every column with a null header so that the resulting DataFrame looks like this:

"name"|"age"|"city"
"abcd"|"21" |"7yhj"
"qazx"|"31" |"iuhy"
"foob"|"51" |"barx"

When I load this CSV in Spark, Spark appends a number to each null column name, as shown below:

"name"|"age"|"city"|"null4"|"null5"|"null6"
"abcd"|"21" |"7yhj"|"null"|"null"|"null"
"qazx"|"31" |"iuhy"|"null"|"null"|"null"
"foob"|"51" |"barx"|"null"|"null"|"null"

Solution found

Thanks to @MaxU for the answer. My final solution is:

val filePath = "C:\\Users\\shekhar\\spark-trials\\null_column_header_test.csv"

val df = spark.read.format("csv")
  .option("inferSchema", "false")
  .option("header", "true")
  .option("delimiter", "|")
  .load(filePath)

// df.columns.filterNot(...) keeps only the column names that do not start
// with "null" and returns them as an Array[String].
// .map(a => df(a)) then turns each remaining name into a Column.
val q = df.columns.filterNot(c => c.startsWith("null")).map(a => df(a))

// Expand the Array[Column] into varargs for select.
df.select(q: _*).show
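
The name filtering in the solution above can be tried without a Spark session, since `df.columns` is a plain `Array[String]`. The sketch below is only an illustration: `HeaderFilterSketch`, `isPlaceholder`, and the extra column `nullable_flag` are hypothetical names, not part of the original question. It contrasts the prefix match with a stricter pattern that would keep a legitimate column whose name merely begins with "null".

```scala
object HeaderFilterSketch {
  // Hypothetical helper: true only for "null" optionally followed by digits,
  // e.g. "null" or "null4". String.matches tests the whole string.
  def isPlaceholder(name: String): Boolean = name.matches("null\\d*")

  // Prefix match, as used in the solution above.
  def keepByPrefix(columns: Array[String]): Array[String] =
    columns.filterNot(_.startsWith("null"))

  // Stricter match based on the hypothetical helper.
  def keepByPattern(columns: Array[String]): Array[String] =
    columns.filterNot(isPlaceholder)

  def main(args: Array[String]): Unit = {
    // Column names as reported in the question, plus one assumed extra column.
    val columns = Array("name", "age", "city", "null4", "null5", "null6", "nullable_flag")

    println(keepByPrefix(columns).mkString(", "))  // name, age, city
    println(keepByPattern(columns).mkString(", ")) // name, age, city, nullable_flag
  }
}
```

Whether the stricter pattern is worth it depends on the data: if no real column name starts with "null", the prefix match from the solution is enough.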

1 Answer

4
IIUC, you can do this:
```
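// drop takes column names as varargs, so the filtered array is expanded with ": _*".
// The result is bound to a new val because df is declared as a val in the question.
val cleaned = df.drop(df.columns.filter(_.startsWith("null")): _*)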
```
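
For a version that can be run end to end, here is a minimal sketch of the `drop` approach. It assumes Spark 2.x or later on the classpath and builds a small in-memory DataFrame instead of reading the CSV from the question; `DropNullColumnsSketch` and `dropPlaceholders` are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropNullColumnsSketch {
  // Hypothetical helper: drop every column whose name starts with "null".
  // Dataset.drop(colNames: String*) is a varargs method, hence the ": _*".
  def dropPlaceholders(df: DataFrame): DataFrame =
    df.drop(df.columns.filter(_.startsWith("null")): _*)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption, not from the question).
    val spark = SparkSession.builder()
      .appName("drop-null-columns-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the loaded CSV, using the column names Spark reported.
    val df = Seq(
      ("abcd", "21", "7yhj", "null", "null", "null"),
      ("qazx", "31", "iuhy", "null", "null", "null"),
      ("foob", "51", "barx", "null", "null", "null")
    ).toDF("name", "age", "city", "null4", "null5", "null6")

    val cleaned = dropPlaceholders(df)
    cleaned.show() // only name, age, and city remain

    spark.stop()
  }
}
```

Compared with `select`, `drop` names the columns to remove rather than the ones to keep, which tends to read more directly when the unwanted columns share a recognizable prefix.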
Answered on 2024-05-04T22:26:18+08:00
