AWS Glue PySpark: replacing NULLs

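I am running an AWS Glue job that uses the PySpark script auto-generated by Glue to load pipe-delimited files from S3 into an RDS Postgres instance.

Initially, it complained about NULL values in some columns: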

pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"

After some googling and reading, I tried to replace the NULLs in my file by converting my AWS Glue DynamicFrame to a Spark DataFrame, calling fillna(), and converting the result back to a DynamicFrame.

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "xyz_catalog",
    table_name = "xyz_staging_files",
    transformation_ctx = "datasource0")
custom_df = datasource0.toDF()
custom_df2 = custom_df.fillna(-1)
custom_df3 = custom_df2.fromDF()

applymapping1 = ApplyMapping.apply(frame = custom_df3, mappings = [("id", "string", "id", "int"),........more code

References:

https://github.com/awslabs/aws-glue-samples/blob/master/FAQ_and_How_to.md#3-there-are-some-transforms-that-i-cannot-figure-out

How to replace all Null values of a dataframe in Pyspark

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.fillna

Now, when I run my job, it throws the following error:

Log Contents:
Traceback (most recent call last):
File "script_2017-12-20-22-02-13.py", line 23, in <module>
custom_df3 = custom_df2.fromDF()
AttributeError: 'DataFrame' object has no attribute 'fromDF'
End of LogType:stdout

I am new to Python and Spark and have tried a lot, but I can't make sense of this. I would appreciate some expert help.

I tried changing my reconvert command to:

custom_df3 = glueContext.create_dynamic_frame.fromDF(frame = custom_df2)

But I still got an error:

AttributeError: 'DynamicFrameReader' object has no attribute 'fromDF'

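Update: I suspect this is not about NULL values. The message "Can't get JDBC type for null" does not seem to refer to a NULL value, but to some data/type that JDBC is unable to decipher.

I created a file with only 1 record and no NULL values, changed all Boolean types to INT (and replaced the values with 0 and 1), but still got the same error: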

pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null"

Update: Make sure DynamicFrame is imported (from awsglue.context import DynamicFrame), since fromDF / toDF are part of DynamicFrame.

See https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html

2 Answers

0
You are calling .fromDF on the wrong class. It should look like this:
```
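from awsglue.dynamicframe import DynamicFrame
# fromDF is called on the DynamicFrame class: Spark DataFrame, GlueContext, then a name for the result
custom_df3 = DynamicFrame.fromDF(custom_df2, glueContext, 'label')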
```
Answered 2024-05-06T21:08:23+08:00
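One thing to watch in the round trip from the question: fillna(-1) with an integer value only fills numeric columns, so NULLs in string columns are left as they are. A minimal sketch of one way around that, assuming you want a different default per column type: build a dict from DataFrame.dtypes and pass it to fillna(). The helper name build_fill_values and the default values are hypothetical and not part of Glue or Spark.

```python
# Spark type strings (as reported by DataFrame.dtypes) treated as numeric here.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def build_fill_values(dtypes, numeric_fill=-1, string_fill=""):
    """Map column names to a fill value that matches the column's type.

    `dtypes` is a list of (column_name, type_string) pairs, which is the
    shape returned by DataFrame.dtypes. Columns of any other type
    (boolean, timestamp, ...) are skipped and keep their NULLs.
    """
    fill = {}
    for name, dtype in dtypes:
        if dtype in NUMERIC_TYPES:
            fill[name] = numeric_fill
        elif dtype == "string":
            fill[name] = string_fill
    return fill


# Inside the Glue job this would be used roughly as:
#   custom_df2 = custom_df.fillna(build_fill_values(custom_df.dtypes))
```

The helper is plain Python, so it can be tried on a hand-written list of (name, type) pairs before it goes into the job script.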
0
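For this error, pyspark.sql.utils.IllegalArgumentException: u"Can't get JDBC type for null", you should drop the null columns.

I ran into a similar error when loading into Redshift DB tables. The problem was resolved after using the following command: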
```
Loading = DropNullFields.apply(frame = resolvechoice3, transformation_ctx = "Loading")
```
Answered 2024-05-06T21:08:23+08:00
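The wording of the error suggests that the JDBC writer met a column whose Spark type is null, not a NULL value inside an otherwise typed column. That would explain why filling values does not help while dropping the field does. A small sketch for listing such columns before writing, assuming the null type shows up in DataFrame.dtypes as "null" or "void" depending on the Spark version (treating both strings as the null type is an assumption here). The helper name null_typed_columns is hypothetical.

```python
# Type strings assumed to denote Spark's NullType in DataFrame.dtypes.
NULL_TYPE_NAMES = {"null", "void"}


def null_typed_columns(dtypes):
    """Return the names of columns whose type is the null type.

    `dtypes` is a list of (column_name, type_string) pairs, which is the
    shape returned by DataFrame.dtypes.
    """
    return [name for name, dtype in dtypes if dtype in NULL_TYPE_NAMES]


# Inside the Glue job this could be used roughly as:
#   print(null_typed_columns(datasource0.toDF().dtypes))
```

If the list is not empty, those are the columns to drop or cast to a concrete type before the JDBC write.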
