
How to load several tables into Redshift with an AWS Glue job


Is it possible to load multiple tables into Redshift using a single AWS Glue job?

These are the steps I followed:

  • Crawled the JSON from S3 and transformed the data into a Data Catalog table.

  • Created a job to load the Data Catalog table into Redshift, but it only lets me load one table per job. In the job properties (Add job), the job-run option I chose was: a proposed script generated by AWS Glue.

I'm not familiar with Python and I'm new to AWS Glue, but I need to load several tables.

Here is a sample script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "sampledb", table_name = "abs", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("value", "int", "value", "int"), ("sex", "string", "sex", "string"), ("age", "string", "age", "string"), ("highest year of school completed", "string", "highest year of school completed", "string"), ("state", "string", "state", "string"), ("region type", "string", "region type", "string"), ("lga 2011", "string", "lga 2011", "string"), ("frequency", "string", "frequency", "string"), ("time", "string", "time", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("value", "int", "value", "int"), ("sex", "string", "sex", "string"), ("age", "string", "age", "string"), ("highest year of school completed", "string", "highest year of school completed", "string"), ("state", "string", "state", "string"), ("region type", "string", "region type", "string"), ("lga 2011", "string", "lga 2011", "string"), ("frequency", "string", "frequency", "string"), ("time", "string", "time", "string")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [catalog_connection = "redshift", connection_options = {"dbtable": "abs", "database": "dbmla"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "redshift", connection_options = {"dbtable": "abs", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()

AWS Glue database: sampledb
Table name in AWS Glue: abs
Redshift database: dbmla

Please show an example of how to load them. Thanks!

1 Answer


    According to the AWS Glue FAQ, you can modify the generated code and then run the job.

    Q: How can I customize the ETL code generated by AWS Glue? AWS Glue's ETL script recommendation system generates Scala or Python code. It leverages Glue's custom ETL library to simplify access to data sources as well as manage job execution. You can find more details about the library in our documentation. You can write ETL code using AWS Glue's custom library, or write arbitrary code in Scala or Python by using inline editing via the AWS Glue Console script editor, downloading the auto-generated code, and editing it in your own IDE. You can also start with one of the many samples hosted in our GitHub repository and customize that code.

    So try adding the code snippets for the other tables to the same script, as shown below. Table names such as "abs2", "abs3", and "abs4" are placeholders for your own tables, and the per-table mapping lists are elided as [..]:

    # Second table ("abs2" is a placeholder name). Every variable name and
    # transformation_ctx below is unique, so the blocks can coexist in one script.
    datasource_abs2 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs2", transformation_ctx = "datasource_abs2")
    applymapping_abs2 = ApplyMapping.apply(frame = datasource_abs2, mappings = [..], transformation_ctx = "applymapping_abs2")
    resolvechoice_abs2 = ResolveChoice.apply(frame = applymapping_abs2, choice = "make_cols", transformation_ctx = "resolvechoice_abs2")
    dropnullfields_abs2 = DropNullFields.apply(frame = resolvechoice_abs2, transformation_ctx = "dropnullfields_abs2")
    datasink_abs2 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields_abs2, catalog_connection = "redshift", connection_options = {"dbtable": "abs2", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink_abs2")
    
    # Third table ("abs3" is a placeholder name): same pattern.
    datasource_abs3 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs3", transformation_ctx = "datasource_abs3")
    applymapping_abs3 = ApplyMapping.apply(frame = datasource_abs3, mappings = [..], transformation_ctx = "applymapping_abs3")
    resolvechoice_abs3 = ResolveChoice.apply(frame = applymapping_abs3, choice = "make_cols", transformation_ctx = "resolvechoice_abs3")
    dropnullfields_abs3 = DropNullFields.apply(frame = resolvechoice_abs3, transformation_ctx = "dropnullfields_abs3")
    datasink_abs3 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields_abs3, catalog_connection = "redshift", connection_options = {"dbtable": "abs3", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink_abs3")
    
    # Fourth table ("abs4" is a placeholder name): same pattern.
    datasource_abs4 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs4", transformation_ctx = "datasource_abs4")
    applymapping_abs4 = ApplyMapping.apply(frame = datasource_abs4, mappings = [..], transformation_ctx = "applymapping_abs4")
    resolvechoice_abs4 = ResolveChoice.apply(frame = applymapping_abs4, choice = "make_cols", transformation_ctx = "resolvechoice_abs4")
    dropnullfields_abs4 = DropNullFields.apply(frame = resolvechoice_abs4, transformation_ctx = "dropnullfields_abs4")
    datasink_abs4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields_abs4, catalog_connection = "redshift", connection_options = {"dbtable": "abs4", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink_abs4")
    
    job.commit()
    

    Change the table names and variable names to match your own tables; each variable name and transformation_ctx must be unique within the script. Thanks.
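
    If there are many tables, the repeated blocks can also be replaced by a loop. The following is only a minimal sketch, not the generated script: it assumes every table lives in the same catalog database (sampledb), is written to a Redshift table of the same name in the same database (dbmla) through the same connection (redshift), and needs no ApplyMapping step; the table list is a placeholder.

    import sys
    from awsglue.transforms import ResolveChoice, DropNullFields
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # Placeholder list of catalog tables; adjust to your own table names.
    table_names = ["abs", "abs2", "abs3"]

    for name in table_names:
        # Suffix every transformation_ctx with the table name so each stays unique.
        datasource = glueContext.create_dynamic_frame.from_catalog(
            database = "sampledb", table_name = name,
            transformation_ctx = "datasource_" + name)
        resolved = ResolveChoice.apply(frame = datasource, choice = "make_cols",
            transformation_ctx = "resolvechoice_" + name)
        cleaned = DropNullFields.apply(frame = resolved,
            transformation_ctx = "dropnullfields_" + name)
        glueContext.write_dynamic_frame.from_jdbc_conf(frame = cleaned,
            catalog_connection = "redshift",
            connection_options = {"dbtable": name, "database": "dbmla"},
            redshift_tmp_dir = args["TempDir"],
            transformation_ctx = "datasink_" + name)

    job.commit()

    If the tables do need mappings, a per-table mapping list can be looked up inside the loop and applied with ApplyMapping in the same way.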
