Unable to read data from BigQuery when a timestamp column contains a year < 1900

Time: 2017-12-17 13:55:54

Tags: python google-cloud-dataflow apache-beam

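I get this error when running a simple pipeline, defined with the latest Apache Beam SDK for Python (2.2.0), that reads from and writes to a BigQuery table.

The read operation fails because a few rows have timestamps with a year earlier than 1900. How can I patch this dataflow_worker package?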

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
(4d31192aa4aec063): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
    for value in reader:
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativefileio.py", line 198, in __iter__
    for record in self.read_next_block():
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativeavroio.py", line 95, in read_next_block
    yield self.decode_record(record)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 110, in decode_record
    record, self.source.table_schema)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 104, in _fix_field_values
    record[field.name], field)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/nativebigqueryavroio.py", line 83, in _fix_field_value
    return dt.strftime('%Y-%m-%d %H:%M:%S.%f UTC')
ValueError: year=200 is before 1900; the datetime strftime() methods require year >= 1900

1 answer:

Answer 0 (score: 0)

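Unfortunately, you cannot patch it to handle those timestamps, because this code is internal to Dataflow, Google's runner for Apache Beam. You will have to wait until Google fixes it, provided it is recognized as a bug. Report it as soon as possible anyway, although this is more a limitation of the Python version in use than a bug.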

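The problem comes from strftime, as you can see in the error. The documentation explicitly states that it does not work for any year before 1900. However, a workaround on your side is to convert the timestamp to a string (you can do this in BigQuery, as described in the documentation). Then, in your Beam pipeline, you can convert it back to a timestamp or to whatever representation suits you best.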
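As a rough sketch of the second half of that workaround: once the column arrives as a string, it can be parsed back into a datetime inside the pipeline. The helper name parse_ts and the assumed string shape (YYYY-MM-DD HH:MM:SS with an optional fractional part and an ignored trailing zone suffix) are illustrative assumptions, not something the question or the BigQuery documentation confirms, so adjust the pattern to whatever your query actually emits.

```python
import re
from datetime import datetime

# Assumed shape of the string produced on the BigQuery side:
# "YYYY-MM-DD HH:MM:SS", optionally followed by ".ffffff" and a zone suffix.
_TS_RE = re.compile(
    r'^(\d{1,4})-(\d{2})-(\d{2})[ T](\d{2}):(\d{2}):(\d{2})(?:\.(\d{1,6}))?')


def parse_ts(text):
    """Parse a timestamp string into a naive datetime (hypothetical helper).

    Anything after the seconds/fraction (for example " UTC") is ignored.
    """
    match = _TS_RE.match(text)
    if match is None:
        raise ValueError('unrecognized timestamp: %r' % (text,))
    year, month, day, hour, minute, second = (
        int(group) for group in match.groups()[:6])
    fraction = match.group(7) or ''
    # Right-pad the fraction so that ".5" means 500000 microseconds.
    micro = int(fraction.ljust(6, '0')) if fraction else 0
    return datetime(year, month, day, hour, minute, second, micro)
```

In a pipeline this could be applied per row with something like beam.Map, reading the string from whichever field your query exposes (the field name would be your own choice).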

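There is also an example of how to convert a datetime object to a string in this answer, templated on the format from your error. In the same question, another answer explains what happened with this bug, how it was solved (in Python), and what you can do about it. Unfortunately, the solution seems to be to avoid strftime altogether and use some alternative instead.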
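One such alternative, sketched here rather than taken from the linked answers, is to build the string from the datetime's individual fields, which does not go through strftime at all. The function name format_ts is hypothetical, and zero-padding the year to four digits is a choice made for this sketch; the layout mirrors the '%Y-%m-%d %H:%M:%S.%f UTC' pattern shown in the traceback.

```python
from datetime import datetime


def format_ts(dt):
    """Format a datetime without strftime (hypothetical helper).

    Works for any year a datetime can hold, including years before 1900.
    """
    return '%04d-%02d-%02d %02d:%02d:%02d.%06d UTC' % (
        dt.year, dt.month, dt.day,
        dt.hour, dt.minute, dt.second, dt.microsecond)


# The value that triggered the error in the question (year=200).
print(format_ts(datetime(200, 1, 1)))
```

This only helps in code you control; the failing call in the traceback lives inside dataflow_worker, which is why the answer suggests converting to a string before the read.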