I built a workflow that pulls data from Google Cloud Storage, performs transformations in a ParDo, and dumps the output to BigQuery.
import json
import logging

import apache_beam as beam


class ParseValidateRecordDoFn(beam.DoFn):
    def process(self, context):
        # All transformations come here; records that parse are tagged
        # 'PASS', everything else is tagged 'ERROR'.
        try:
            data = json.loads(context)
            yield beam.pvalue.TaggedOutput('PASS', data)
        except Exception:
            logging.error("Failed to parse record: %s", context)
            yield beam.pvalue.TaggedOutput('ERROR', context)


job_name = 'JOB_NAME'
project = 'PROJECT_NAME'
staging_location = 'STAGING_LOCATION'
temp_location = 'TEMP_LOCATION'

p = beam.Pipeline(argv=[
    '--job_name', job_name,
    '--project', project,
    '--staging_location', staging_location,
    '--temp_location', temp_location,
    '--no_save_main_session',
    '--runner', 'DataflowRunner',
    '--num_workers', '25',
    '--requirements_file', 'requirements.txt'])

text = p | "Reading Source" >> beam.io.ReadFromText('SOURCE LOCATION')
output_validate = text | beam.ParDo(
    ParseValidateRecordDoFn()).with_outputs('PASS', 'ERROR', main='main')

# Parsed records are appended to BigQuery; unparseable lines go to GCS.
(output_validate.PASS | "Writing to BQ" >> beam.io.Write(beam.io.BigQuerySink(
    'Table_name',
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    validate=True)))
(output_validate.ERROR
 | "Writing UNPARSED File" >> beam.io.WriteToText('ERROR_LOCATION'))

logging.getLogger().setLevel(logging.INFO)
p.run().wait_until_finish()
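For local debugging, here is a minimal sketch that exercises the same DoFn under the DirectRunner; the input path 'local_input.txt' and output prefix 'local_errors' are placeholders I introduced, not paths from the original job:

import apache_beam as beam

# Local smoke test of the same DoFn; paths below are hypothetical.
p_local = beam.Pipeline(argv=['--runner', 'DirectRunner'])
parsed = (p_local
          | beam.io.ReadFromText('local_input.txt')
          | beam.ParDo(ParseValidateRecordDoFn()).with_outputs(
              'PASS', 'ERROR', main='main'))
parsed.ERROR | beam.io.WriteToText('local_errors')
p_local.run().wait_until_finish()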
Starting this week, the code has begun throwing the following error.
Error stack trace:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 778, in run
deferred_exception_details=deferred_exception_details)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 630, in do_work
exception_details=exception_details)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 168, in wrapper
return fun(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 491, in report_completion_status
exception_details=exception_details)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 299, in report_status
work_executor=self._work_executor)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py", line 342, in report_status
append_counter(work_item_status, counter)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py", line 38, in append_counter
if isinstance(counter.name, counters.CounterName):
AttributeError: 'module' object has no attribute 'CounterName'
Things I have tried:
None of the above led to success; every attempt failed with the same error: "A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service."
Thanks in advance :)
Answer 0 (score: -1)
The same code runs without problems on the local machine under the DirectRunner. Once we removed the references to the pandas portion of this code, it executed without any issues.
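If pandas has to stay in the job, one hedged direction consistent with this finding is to pin pandas (and its heavy transitive dependencies) in the requirements.txt that the job stages, so the workers do not pull in versions that conflict with the SDK preinstalled on them. The version numbers below are placeholders for illustration, not known-good values:

# requirements.txt -- hypothetical pins; choose versions compatible with
# the apache_beam release preinstalled on the Dataflow workers
pandas==0.19.2
numpy==1.11.3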