首页 文章

AWS胶水自定义分类器Json Path

提问于
浏览
1

我有一组看起来像这样的Json数据文件

[
  {"client":"toys",
   "filename":"toy1.csv",
   "file_row_number":1,
   "secondary_db_index":"4050",
   "processed_timestamp":1535004075,
   "processed_datetime":"2018-08-23T06:01:15+0000",
   "entity_id":"4050",
   "entity_name":"4050",
   "is_emailable":false,
   "is_txtable":false,
   "is_loadable":false}
]

我用以下自定义分类器Json Path创建了一个Glue Crawler

$[*]

Glue返回正确的模式,并正确识别列 .

但是,当我在Athena上查询数据时...所有数据都在第一列中着陆而其余列都是空的 .

如何根据列显示数据?

image of Athena query

谢谢!

1 回答

  • 1

    这是一个与Hive相关的问题 . 我建议两种方法 . 首先,您可以使用结构数据类型在Athena中创建新表:

    CREATE EXTERNAL TABLE `example`(
    `row` struct<client:string,filename:string,file_row_number:int,secondary_db_index:string,processed_timestamp:int,processed_datetime:string,entity_id:string,entity_name:string,is_emailable:boolean,is_txtable:boolean,is_loadable:boolean> COMMENT 'from deserializer')
    ROW FORMAT SERDE 
    'org.openx.data.jsonserde.JsonSerDe' 
    STORED AS INPUTFORMAT 
    'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
    's3://example'
    TBLPROPERTIES (
    'CrawlerSchemaDeserializerVersion'='1.0', 
    'CrawlerSchemaSerializerVersion'='1.0', 
    'UPDATED_BY_CRAWLER'='example', 
    'averageRecordSize'='271', 
    'classification'='json', 
    'compressionType'='none', 
    'jsonPath'='$[*]', 
    'objectCount'='1', 
    'recordCount'='1', 
    'sizeKey'='271', 
    'transient_lastDdlTime'='1535533583', 
    'typeOfData'='file')
    

    然后您可以按如下方式运行查询:

    SELECT row.client, row.filename, row.file_row_number FROM "example"
    

    其次,您可以重新设计您的json文件,然后再次运行Crawler . 在这个例子中,我使用了Single-JSON-Record-Per-Line格式 .

    {"client":"toys","filename":"toy1.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false},
    {"client":"toys2","filename":"toy2.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
    

相关问题