
Unable to read data from BigQuery when a timestamp column contains years < 1900


I get this error when running a simple pipeline that reads from and writes to BigQuery tables, defined with the latest Apache Beam SDK for Python (2.2.0).

The read fails because a few rows have timestamps with years earlier than 1900. How can I patch this dataflow_worker package?

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
(4d31192aa4aec063): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
    for value in reader:
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativefileio.py", line 198, in __iter__
    for record in self.read_next_block():
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativeavroio.py", line 95, in read_next_block
    yield self.decode_record(record)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 110, in decode_record
    record, self.source.table_schema)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 104, in _fix_field_values
    record[field.name], field)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 83, in _fix_field_value
    return dt.strftime('%Y-%m-%d %H:%M:%S.%f UTC')
ValueError: year=200 is before 1900; the datetime strftime() methods require year >= 1900

1 Answer


Unfortunately, you can't patch it to handle these timestamps, because this code is part of the internal implementation of Google's runner for Apache Beam: Dataflow. You will therefore have to wait for Google to fix it (and this should be recognized as a bug). Please report it as soon as possible, even though it is arguably more a limitation of the Python version in use than a bug.

The problem comes from strftime, as you can see in the error. The documentation explicitly mentions that it does not work for any year before 1900. As a last-resort workaround, though, you can cast the timestamp to a string (you can do this in BigQuery, as specified in _1438639). Then, inside your Beam pipeline, you can re-parse it back into a timestamp, or into whatever type suits you best.
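As a sketch of that workaround, the query below casts the timestamp column to a string on the BigQuery side so the worker never calls strftime() on it, and a small helper re-parses it inside the pipeline. The table, column names, and the exact string format produced by CAST(... AS STRING) are assumptions here, not taken from the question; strptime(), unlike Python 2's strftime(), accepts years before 1900.

```python
from datetime import datetime

# Hypothetical query: cast the problematic TIMESTAMP column to STRING
# so the Dataflow worker's Avro decoding never formats it as a datetime.
QUERY = """
SELECT CAST(event_ts AS STRING) AS event_ts_str, payload
FROM my_dataset.my_table
"""

def parse_timestamp(ts_str):
    """Re-parse the string column back into a datetime inside the pipeline.

    Handles values with or without fractional seconds, and tolerates a
    trailing ' UTC' suffix (the exact CAST output format is an assumption).
    """
    ts_str = ts_str.replace(' UTC', '')
    fmt = '%Y-%m-%d %H:%M:%S.%f' if '.' in ts_str else '%Y-%m-%d %H:%M:%S'
    # strptime has no year >= 1900 restriction, so year 200 parses fine.
    return datetime.strptime(ts_str, fmt)
```

You would then apply parse_timestamp in a Map or DoFn step right after the BigQuery read.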

There is also an example of how to convert a datetime object to a string, which you can use as a template for the format in the error, in this answer. In the same question there is another answer explaining what happens with this bug, how it was addressed (in Python), and what you can do about it. Unfortunately, the fix seems to be to avoid strftime entirely and use an alternative instead.
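One such strftime-free alternative is to build the worker's target format ('%Y-%m-%d %H:%M:%S.%f UTC') by hand from the datetime's fields, which works for any year. This is a minimal sketch, not the fix Google would apply inside dataflow_worker:

```python
from datetime import datetime

def format_ts(dt):
    # Assemble '%Y-%m-%d %H:%M:%S.%f UTC' manually instead of calling
    # strftime(), which under Python 2 rejects years before 1900.
    return '{:04d}-{:02d}-{:02d} {:02d}:{:02d}:{:02d}.{:06d} UTC'.format(
        dt.year, dt.month, dt.day,
        dt.hour, dt.minute, dt.second, dt.microsecond)

# format_ts(datetime(200, 1, 1)) produces '0200-01-01 00:00:00.000000 UTC',
# the same shape the traceback shows the worker trying to emit.
```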
