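I have a DataFrame with a single column of type string, but the column actually contains JSON objects with 4 different schemas that have only a few fields in common. I need to convert it into JSON objects.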

This is the schema of the DataFrame:

query.printSchema()

root
 |-- test: string (nullable = true)

The values in the DataFrame look like this:

query.show(10)

+--------------------+
|                test|
+--------------------+
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
+--------------------+
only showing top 10 rows

The solution I applied:

  • Write to a text file

query.write.format("text").mode('overwrite').save("s3://bucketname/temp/")

  • Read it back as JSON

df = spark.read.json("s3a://bucketname/temp/")

  • Now print the schema; the JSON string in each row has been converted to a JSON object

df.printSchema()

root
 |-- EventDate: string (nullable = true)
 |-- EventId: string (nullable = true)
 |-- EventNotificationType: long (nullable = true)
 |-- Interaction: struct (nullable = true)
 |    |-- ContextId: string (nullable = true)
 |    |-- Created: string (nullable = true)
 |    |-- Description: string (nullable = true)
 |    |-- Id: string (nullable = true)
 |    |-- ModelContextId: string (nullable = true)
 |-- PurchaseActivity: struct (nullable = true)
 |    |-- BillingCity: string (nullable = true)
 |    |-- BillingCountry: string (nullable = true)
 |    |-- ShippingAndHandlingAmount: double (nullable = true)
 |    |-- ShippingDiscountAmount: double (nullable = true)
 |    |-- SubscriberId: long (nullable = true)
 |    |-- SubscriptionOriginalEndDate: string (nullable = true)
 |-- SubscriptionChurn: struct (nullable = true)
 |    |-- PaymentTypeCode: long (nullable = true)
 |    |-- PaymentTypeName: string (nullable = true)
 |    |-- PreviousPaidAmount: double (nullable = true)
 |    |-- SubscriptionRemoved: string (nullable = true)
 |    |-- SubscriptionStartDate: string (nullable = true)
 |-- TransactionDetail: struct (nullable = true)
 |    |-- Amount: double (nullable = true)
 |    |-- OrderShipToCountry: string (nullable = true)
 |    |-- PayPalUserName: string (nullable = true)
 |    |-- PaymentSubTypeCode: long (nullable = true)
 |    |-- PaymentSubTypeName: string (nullable = true)

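Is there a better way to get the expected output, one where I don't need to write the DataFrame out as a text file and then read it back in as JSON?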
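
One possible direction, offered as a sketch rather than a confirmed answer: `spark.read.json` also accepts an RDD of JSON strings, so the string column can be handed to it directly without the round trip through S3. The helper name `parse_json_column` is hypothetical; `spark`, `query`, and the column name `test` are taken from the question above.

```python
def parse_json_column(spark, df, column="test"):
    """Parse a string column holding JSON documents into a structured DataFrame.

    Hypothetical helper (not part of PySpark). It assumes every non-null
    value in `column` is a complete JSON document.
    """
    # Pull the raw JSON strings out of the single string column.
    json_strings = df.rdd.map(lambda row: row[column])
    # DataFrameReader.json accepts an RDD of strings and infers the schema
    # from the records, so no intermediate text file is written.
    return spark.read.json(json_strings)
```

Under these assumptions, `df = parse_json_column(spark, query)` followed by `df.printSchema()` should give the same kind of merged schema as the text-file approach, with fields that are absent from a given record left null. Schema inference still scans the data, so for large inputs it may be worth supplying an explicit schema instead.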