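I have a DataFrame with a column of type string, but it actually contains JSON objects of 4 different schemas that have only a few fields in common. I need to convert it to JSON objects.
Here is the schema of the DataFrame: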
query.printSchema()
root
|-- test: string (nullable = true)
The values of the DataFrame look like this:
query.show(10)
+--------------------+
| test|
+--------------------+
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
+--------------------+
only showing top 10 rows
The solution I applied:
- Write to a text file
query.write.format("text").mode('overwrite').save("s3://bucketname/temp/")
- Read it back as JSON
df = spark.read.json("s3a://bucketname/temp/")
- Now print the schema; the JSON string in each row has been converted to a JSON object
df.printSchema()
root
|-- EventDate: string (nullable = true)
|-- EventId: string (nullable = true)
|-- EventNotificationType: long (nullable = true)
|-- Interaction: struct (nullable = true)
|    |-- ContextId: string (nullable = true)
|    |-- Created: string (nullable = true)
|    |-- Description: string (nullable = true)
|    |-- Id: string (nullable = true)
|    |-- ModelContextId: string (nullable = true)
|-- PurchaseActivity: struct (nullable = true)
|    |-- BillingCity: string (nullable = true)
|    |-- BillingCountry: string (nullable = true)
|    |-- ShippingAndHandlingAmount: double (nullable = true)
|    |-- ShippingDiscountAmount: double (nullable = true)
|    |-- SubscriberId: long (nullable = true)
|    |-- SubscriptionOriginalEndDate: string (nullable = true)
|-- SubscriptionChurn: struct (nullable = true)
|    |-- PaymentTypeCode: long (nullable = true)
|    |-- PaymentTypeName: string (nullable = true)
|    |-- PreviousPaidAmount: double (nullable = true)
|    |-- SubscriptionRemoved: string (nullable = true)
|    |-- SubscriptionStartDate: string (nullable = true)
|-- TransactionDetail: struct (nullable = true)
|    |-- Amount: double (nullable = true)
|    |-- OrderShipToCountry: string (nullable = true)
|    |-- PayPalUserName: string (nullable = true)
|    |-- PaymentSubTypeCode: long (nullable = true)
|    |-- PaymentSubTypeName: string (nullable = true)
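Is there a better way to get the expected output, so that I don't need to write the DataFrame out as a text file and read it back in as a JSON file?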