Google Cloud Dataflow (Python SDK): workflow fails | workers eventually lose contact with the service every time

Date: 2017-11-29 06:02:08

Tags: python google-cloud-platform google-cloud-dataflow apache-beam

I have built a workflow that pulls data from Google Cloud Storage, performs transformations in a ParDo, and dumps the output to BigQuery.

import apache_beam as beam
import json
import logging

class ParseValidateRecordDoFn(beam.DoFn):
    def process(self, context):
        # All transformations come here; parse each record and tag it
        # as PASS or ERROR so the two outputs can be routed separately.
        try:
            data = json.loads(context)
            yield beam.pvalue.TaggedOutput('PASS', data)

        except:
            print "ERROR"
            yield beam.pvalue.TaggedOutput('ERROR', context)

job_name = JOB_NAME
project = PROJECT_NAME
staging_location = STAGING_LOCATION
temp_location = TEMP_LOCATION

p = beam.Pipeline(argv=[
        '--job_name', job_name,
        '--project', project,
        '--staging_location', staging_location,
        '--temp_location', temp_location,
        '--no_save_main_session',
        '--runner', 'DataflowRunner',
        '--num_workers', '25',
        '--requirements_file', 'requirements.txt'])

text = p | "Reading Source" >> beam.io.ReadFromText('SOURCE LOCATION')

output_validate = text | beam.ParDo(ParseValidateRecordDoFn()).with_outputs('PASS','ERROR', main='main')

(output_validate.PASS | "Writing to BQ" >> beam.io.Write(beam.io.BigQuerySink(
    'Table_name',
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    validate=True)))


(output_validate.ERROR | "Writing UNPARSED File" >> beam.io.WriteToText('ERROR_LOCATION'))

logging.getLogger().setLevel(logging.INFO)
p.run().wait_until_finish()

As of this week, the code has started throwing an error:

(Screenshot of the error message omitted.)

Error stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 778, in run
    deferred_exception_details=deferred_exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 630, in do_work
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 168, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 491, in report_completion_status
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 299, in report_status
    work_executor=self._work_executor)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py", line 342, in report_status
    append_counter(work_item_status, counter)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py", line 38, in append_counter
    if isinstance(counter.name, counters.CounterName):
AttributeError: 'module' object has no attribute 'CounterName'

Things I have tried:

  • Stripping the code down to its most basic form.
  • Removing IO operations against local directories used for debugging
  • Trying Hello World code on the DF Runner (a sketch of such a test follows below)
  • Switching to high-memory workers

None of the above led to success; every attempt failed with the same error: "A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service."
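For reference, a "Hello World" test of the kind mentioned in the list above can be as small as the sketch below. The job name and output location are placeholder values, not part of the original post:

import apache_beam as beam

# Minimal pipeline for checking whether the Dataflow runner itself works,
# independent of any application logic.
p = beam.Pipeline(argv=[
        '--job_name', 'hello-world-test',
        '--project', project,
        '--staging_location', staging_location,
        '--temp_location', temp_location,
        '--runner', 'DataflowRunner'])

(p | "Create" >> beam.Create(['hello', 'world'])
   | "Write" >> beam.io.WriteToText('OUTPUT LOCATION'))

p.run().wait_until_finish()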

Thanks in advance :)

1 Answer:

Answer 0 (score: -1)

The same code runs without problems on the local machine with the DirectRunner. Once we removed the references to the pandas parts of this code, it executed without any issues.
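To reproduce the local check described above, switching the runner is enough; a minimal sketch reusing the DoFn from the question, with local file paths as placeholders:

import apache_beam as beam

# Run the same pipeline locally with the DirectRunner to rule out problems
# in the pipeline code itself; the Dataflow-specific options (workers,
# staging, BigQuery sink) are not needed for this check.
p = beam.Pipeline(argv=['--runner', 'DirectRunner'])

text = p | "Reading Source" >> beam.io.ReadFromText('LOCAL SOURCE FILE')
output_validate = text | beam.ParDo(ParseValidateRecordDoFn()).with_outputs('PASS', 'ERROR', main='main')
output_validate.ERROR | "Writing UNPARSED File" >> beam.io.WriteToText('LOCAL ERROR FILE')

p.run().wait_until_finish()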