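I have been working through the AWS Glue tutorial (https://docs.aws.amazon.com/glue/latest/dg/getting-started.html), and I am now trying to configure my first job, which is meant to copy all the data from an RDS table into Parquet files on S3.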
I have successfully:
- Created the S3 VPC endpoints
- Created the Glue RDS connection and crawler
- Added the RDS table metadata to the catalog
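To create my job, I:
- Selected 'add job' from the Glue dashboard
- Gave the job a name, assigned the same role used for the RDS connection above (since it has the AWSGlueServiceRole policy attached), selected 'A proposed script generated by AWS Glue' and left the other fields at their defaults
- Selected the required RDS table from the catalog as the data source, then chose 'create tables in your data target', with S3 as the data store, Parquet as the format, and the newly created output S3 folder 'aws-glue-test-etl/data' as the target
- After clicking 'next', left all the field mappings at their defaults
- Saved the job and went on to edit the script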
When I run the job with the defaults, I get the following log output:
--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000 --conf spark.hadoop.fs.defaultFS=hdfs://ip-10-0-1-88.eu-west-1.compute.internal:8020 --conf spark.hadoop.yarn.resourcemanager.address=ip-10-0-1-88.eu-west-1.compute.internal:8032 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=18 --conf spark.executor.memory=5g --conf spark.executor.cores=4 --JOB_ID j_20380e2f5d565a53d8bd397904dd210cbca826f3825ae8ff6b5a23e8f7bca45d --JOB_RUN_ID jr_6d60e2930a43a06edf6b6e8307171e88bd754ac5f9e66f2eaf5373e570b61280 --scriptLocation s3://aws-glue-scripts-558091818291-eu-west-1/MarcFletcher/UpdateAccountsExport-py --job-bookmark-option job-bookmark-disable --job-language python --TempDir s3://aws-glue-temporary-558091818291-eu-west-1/MarcFletcher --JOB_NAME UpdateAccountsExport-py
YARN_RM_DNS=ip-10-0-1-88.eu-west-1.compute.internal
Detected region eu-west-1
JOB_NAME = UpdateAccountsExport-py
Specifying eu-west-1 while copying script.
S3 copy with region specified failed. Falling back to not specifying region.
and the following error output:
fatal error: HTTPSConnectionPool(host='aws-glue-scripts-558091818291-eu-west-1.s3.eu-west-1.amazonaws.com', port=443): Max retries exceeded with url: /MarcFletcher/UpdateAccountsExport-py (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPSConnection object at 0x7f9b11afbf10>, 'Connection to aws-glue-scripts-558091818291-eu-west-1.s3.eu-west-1.amazonaws.com timed out. (connect timeout=60)'))
Error downloading script: fatal error: HTTPSConnectionPool(host='aws-glue-scripts-558091818291-eu-west-1.s3.eu-west-1.amazonaws.com', port=443): Max retries exceeded with url: /MarcFletcher/UpdateAccountsExport-py (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPSConnection object at 0x7fe752548f10>, 'Connection to aws-glue-scripts-558091818291-eu-west-1.s3.eu-west-1.amazonaws.com timed out. (connect timeout=60)'))
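I have looked through the troubleshooting guide (https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html) but did not find any likely solution. The automatically detected region, eu-west-1, is correct.
If anyone can point out where the job is going wrong, it would be much appreciated.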
2 Answers
It is important that there is an S3 endpoint in the subnet's route table.
https://docs.aws.amazon.com/glue/latest/dg/start-development-endpoint.html https://github.com/awsdocs/aws-glue-developer-guide/blob/master/doc_source/vpc-endpoints-s3.md
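One way to check this from a script rather than the console is roughly the following. It is only a sketch: it assumes boto3 is installed with credentials allowed to call `ec2:DescribeVpcEndpoints`, and the helper names and the route table ID in the usage note are made up for illustration. The filtering is kept separate from the AWS call so it can be tried on sample data.

```python
def s3_endpoints_for_route_table(vpc_endpoints, route_table_id):
    """Return the IDs of S3 gateway endpoints associated with a route table.

    `vpc_endpoints` is the "VpcEndpoints" list from EC2 DescribeVpcEndpoints.
    """
    return [
        ep["VpcEndpointId"]
        for ep in vpc_endpoints
        # S3 service names look like "com.amazonaws.eu-west-1.s3"
        if ep.get("ServiceName", "").endswith(".s3")
        and ep.get("VpcEndpointType", "Gateway") == "Gateway"
        and route_table_id in ep.get("RouteTableIds", [])
    ]


def fetch_vpc_endpoints(region_name):
    """Fetch the VPC endpoints in a region (first page of results only)."""
    import boto3  # imported lazily so the filter above works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    return ec2.describe_vpc_endpoints()["VpcEndpoints"]
```

For example, `s3_endpoints_for_route_table(fetch_vpc_endpoints("eu-west-1"), "rtb-0123456789abcdef0")` (a hypothetical route table ID) returning an empty list would suggest that the subnet used by the Glue connection has no route to S3 through an endpoint.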
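Even so, I also found that I needed to specify the region when setting up the boto3 resource.
I couldn't find this, or the related boto.config, documented anywhere.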
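As a rough illustration of what specifying the region looks like, here is a minimal sketch. `boto3.resource("s3", region_name=...)` is the real call; `resolve_region` and `make_s3_resource` are just helper names invented for this example, and falling back to the `AWS_DEFAULT_REGION` environment variable is my own assumption about what is convenient, not something Glue requires.

```python
import os


def resolve_region(explicit=None, env=None):
    """Pick an AWS region, failing loudly instead of relying on implicit config."""
    env = os.environ if env is None else env
    region = explicit or env.get("AWS_DEFAULT_REGION")
    if not region:
        raise ValueError("no AWS region given; pass one explicitly")
    return region


def make_s3_resource(region_name=None):
    """Create an S3 resource bound to an explicitly chosen region."""
    import boto3  # imported lazily so resolve_region works without boto3

    return boto3.resource("s3", region_name=resolve_region(region_name))
```

In the job from the question this would be called as `make_s3_resource("eu-west-1")`.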
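Most likely this is a security group port-blocking issue.
Check the egress rules of the AWS security group attached to the Glue connection, and allow all TCP traffic on port 443.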
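If you would rather check the rules programmatically, something along these lines should work. It is a sketch under assumptions: boto3 with permission to call `ec2:DescribeSecurityGroups`, helper names invented for this example, and a deliberately simple check that only looks at protocol and port range, not at which destinations each rule covers.

```python
def allows_tcp_443_egress(ip_permissions_egress):
    """Return True if any egress rule covers TCP port 443.

    `ip_permissions_egress` is the "IpPermissionsEgress" list of a security
    group as returned by EC2 DescribeSecurityGroups. Destinations (CIDR
    ranges, prefix lists) are NOT inspected here.
    """
    for rule in ip_permissions_egress:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            return True
        if protocol == "tcp" and "FromPort" in rule and "ToPort" in rule:
            if rule["FromPort"] <= 443 <= rule["ToPort"]:
                return True
    return False


def fetch_egress_rules(group_id, region_name):
    """Fetch the egress rules of one security group."""
    import boto3  # imported lazily so the check above works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    groups = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"]
    return groups[0]["IpPermissionsEgress"]
```

A `False` result for the group attached to the Glue connection would be consistent with the connect timeout on port 443 shown in the question.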