Spark treats a null value in a CSV column as the null data type

My Spark application reads a CSV file, transforms it into a different format using SQL, and writes the resulting DataFrame to a different CSV file.

For example, my input CSV looks like this:

Id|FirstName|LastName|LocationId
1|John|Doe|123
2|Alex|Doe|234

My transformation is:

Select Id, 
       FirstName, 
       LastName, 
       LocationId as PrimaryLocationId,
       null as SecondaryLocationId
from Input

(I can't answer why null is being used as SecondaryLocationId; it is a business use case.) Now Spark can't figure out the data type of SecondaryLocationId, so it reports null in the schema and, when writing the output CSV, throws the error "CSV data source does not support null data type".

Below are the printSchema() output and the write options I am using.

root
     |-- Id: string (nullable = true)
     |-- FirstName: string (nullable = true)
     |-- LastName: string (nullable = true)
     |-- PrimaryLocationId: string (nullable = false)
     |-- SecondaryLocationId: null (nullable = true)

dataFrame.repartition(1).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("delimiter", "|")
      .option("nullValue", "")
      .option("inferSchema", "true")
      .csv(outputPath)

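Is there a way to default to a data type (such as string)? By the way, I can get this to work by replacing null with an empty string (''), but that is not what I want to do.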

1 Answer

Use lit(null).

Example:

import org.apache.spark.sql.functions.{lit, udf}

case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF

val dfWithFoobar = df.withColumn("foobar", lit(null: String))


scala> dfWithFoobar.printSchema
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
|-- foobar: null (nullable = true)
The foobar column has the null type here, and the CSV writer does not retain it. If it is a hard requirement, you can cast the column to a specific type (say, String):

import org.apache.spark.sql.types.StringType
df.withColumn("foobar", lit(null).cast(StringType))

Or use a UDF like this:

val getNull = udf(() => None: Option[String]) // Or some other type

df.withColumn("foobar", getNull()).printSchema

root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)

Reposting zero323's code.
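Since the question builds the column in SQL rather than with withColumn, the same cast can also be written inside the query. The following is a minimal sketch, not the asker's actual job: the local SparkSession, the app name, and the in-memory stand-in for the input CSV are assumptions made so the snippet is self-contained.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("typed-null-sql").getOrCreate()
import spark.implicits._

// Stand-in for the CSV input from the question (every column read as a string).
val input = Seq(("1", "John", "Doe", "123"), ("2", "Alex", "Doe", "234"))
  .toDF("Id", "FirstName", "LastName", "LocationId")
input.createOrReplaceTempView("Input")

// CAST(NULL AS STRING) gives the literal a concrete type, so the schema
// no longer contains a column of the null type.
val result = spark.sql("""
  SELECT Id,
         FirstName,
         LastName,
         LocationId AS PrimaryLocationId,
         CAST(NULL AS STRING) AS SecondaryLocationId
  FROM Input
""")

// SecondaryLocationId is now reported as string instead of null.
result.printSchema()
```

With the column typed as string, the write options from the question should no longer hit the "does not support null data type" error, and nullValue "" still controls how the nulls are rendered in the file.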

Now let's discuss your second question.

Question:

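"This only works if I know which columns will be treated as the null data type. When a large number of files are being read and various transformations are applied, I would not know, or is there a way I might know, which fields got treated as null?"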

Answer:

In this case you can use Option.

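The Databricks Scala style guide does not agree that null should always be banned from Scala code and says: "For performance sensitive code, prefer null over Option, in order to avoid virtual method calls and boxing."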

Example:

+------+
|number|
+------+
|     1|
|     8|
|    12|
|  null|
+------+


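import org.apache.spark.sql.functions.{col, lit, when}

// isEvenSimpleUdf is assumed to be a UDF defined elsewhere that returns
// true for even numbers; it is only called for non-null inputs.
val actualDf = sourceDf.withColumn(
  "is_even",
  when(
    col("number").isNotNull,
    isEvenSimpleUdf(col("number"))
  ).otherwise(lit(null))
)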

actualDf.show()
+------+-------+
|number|is_even|
+------+-------+
|     1|  false|
|     8|   true|
|    12|   true|
|  null|   null|
+------+-------+
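For the second question, one option that does not require knowing the affected columns in advance is to inspect the DataFrame's schema just before writing and cast every column whose type is NullType. The sketch below is an assumption-laden illustration rather than part of the original answer: castNullTypeColumns is a hypothetical helper name, the target type is arbitrarily chosen as string, and only top-level columns are handled.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{NullType, StringType}

// Hypothetical helper: casts every top-level NullType column to StringType
// and leaves all other columns untouched. Nested fields are not inspected.
def castNullTypeColumns(df: DataFrame): DataFrame = {
  val projected = df.schema.fields.map { field =>
    // Backticks keep names containing dots from being parsed as nested paths.
    val column = col(s"`${field.name}`")
    if (field.dataType == NullType) column.cast(StringType).as(field.name)
    else column
  }
  df.select(projected: _*)
}

// Assumption: a local session for illustration; in spark-shell getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("cast-null-columns").getOrCreate()
import spark.implicits._

// foobar is created with the null type, as in the first example of the answer.
val df = Seq((1, "foo"), (2, "bar")).toDF("foo", "bar").withColumn("foobar", lit(null))
val fixed = castNullTypeColumns(df)

// foobar is now reported as string; foo and bar keep their original types.
fixed.printSchema()
```

Calling such a helper once, right before write, keeps the individual transformations unchanged. String is only a convenient default here, since a CSV file carries no type information of its own.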

Answered on 2024-04-28T02:28:45+08:00
