Deploying a multi-file Apache Beam job that includes protobufs to Google Dataflow

Asked: 2019-09-02 15:19:15

Tags: python python-3.x protocol-buffers google-cloud-dataflow apache-beam

I am trying to deploy an Apache Beam job that includes generated proto definitions (_pb2 modules) to Google Dataflow, but I am running into a pickling error:

_pickle.PicklingError: Can't pickle <class 'test_pb2.Example'>: import of module 'test_pb2' failed [while running 'Convert to Proto']

My project structure follows the approach recommended in this document and in the juliaset example:

root/
  main.py
  setup.py
  pipeline/
    __init__.py
    pipeline.py
    test_pb2.py
    input.txt
  proto/
    test.proto
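
For completeness, setup.py is minimal, along the lines of the juliaset example (a sketch; the package name is a placeholder, and the real file may also list install_requires):

import setuptools

setuptools.setup(
    name='beam-proto-example',  # placeholder name
    version='0.0.1',
    packages=setuptools.find_packages(),
)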

test_pb2 was generated from test.proto with protoc and is used in a transform to convert dictionaries into protos.
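
For reference, the generated module comes from an invocation along these lines (paths assumed from the layout above):

protoc --proto_path=proto --python_out=pipeline proto/test.proto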

Contents of main.py:

import logging

from pipeline import pipeline

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    pipeline.run()

Contents of pipeline.py:

from __future__ import absolute_import

import apache_beam as beam
import argparse
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from google.protobuf import json_format

from pipeline import test_pb2


class TransformDictToProto(beam.DoFn):
    """Wraps the input value in a dict and parses it into a test_pb2.Example."""

    def process(self, row, **kwargs):
        d = {'identifier': row}
        result = json_format.ParseDict(d, test_pb2.Example())
        yield result


class ConvertProtoToJson(beam.DoFn):
    """Serializes a proto message to its JSON string representation."""

    def process(self, row, **kwargs):
        yield json_format.MessageToJson(row)


def run(argv=None):
    """Run the workflow."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', default="pipeline/input.txt")
    parser.add_argument('--output', default="pipeline/output.txt")

    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True

    with beam.Pipeline(options=pipeline_options) as p:
        lines = (p
                 | 'Read' >> beam.io.ReadFromText(known_args.input)
                 | 'Convert to Proto' >> beam.ParDo(TransformDictToProto())
                 | 'Convert to Bytes' >> beam.ParDo(ConvertProtoToJson())
                 )

        lines | beam.io.WriteToText(known_args.output)
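
For deployment I pass the usual Dataflow options plus --setup_file, roughly as follows (project, region, and bucket names are placeholders):

python main.py \
  --runner DataflowRunner \
  --project my-gcp-project \
  --region europe-west1 \
  --temp_location gs://my-bucket/tmp \
  --setup_file ./setup.py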

I expect this to work both locally and on Google Dataflow. Possible directions I have been investigating are type hints on the custom ParDos, but to no avail. Has anyone run into something similar, or has anyone seen a working Apache Beam pipeline on GCP that includes protobuf-generated files?
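
Concretely, the variations I tried look roughly like the sketch below; whether ProtoCoder is even the right coder for a generated message class here is my assumption, not something I have confirmed:

from apache_beam.coders import ProtoCoder

# Attempt 1: register an explicit proto coder so Beam does not fall back
# to pickling the generated class (assumption: ProtoCoder handles this).
beam.coders.registry.register_coder(test_pb2.Example, ProtoCoder)

# Attempt 2: annotate the ParDo with an output type hint.
protos = (lines
          | 'Convert to Proto' >> beam.ParDo(TransformDictToProto())
              .with_output_types(test_pb2.Example))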

This is the full stack trace obtained from the example above:

/usr/local/Cellar/pyenv/1.2.13/versions/research/bin/python /Users/kmevissen/src/private/beam_multifile_with_proto_example/main.py
/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/__init__.py:84: UserWarning: Some syntactic constructs of Python 3 are not yet fully supported by Apache Beam.
  'Some syntactic constructs of Python 3 are not yet fully supported by '
INFO:root:Missing pipeline option (runner). Executing pipeline using the default runner: DirectRunner.
INFO:root:==================== <function annotate_downstream_side_inputs at 0x110330320> ====================
INFO:root:==================== <function fix_side_input_pcoll_coders at 0x110330440> ====================
INFO:root:==================== <function lift_combiners at 0x1103304d0> ====================
INFO:root:==================== <function expand_sdf at 0x110330560> ====================
INFO:root:==================== <function expand_gbk at 0x1103305f0> ====================
INFO:root:==================== <function sink_flattens at 0x110330710> ====================
INFO:root:==================== <function greedily_fuse at 0x1103307a0> ====================
INFO:root:==================== <function read_to_impulse at 0x110330830> ====================
INFO:root:==================== <function impulse_to_input at 0x1103308c0> ====================
INFO:root:==================== <function inject_timer_pcollections at 0x110330a70> ====================
INFO:root:==================== <function sort_stages at 0x110330b00> ====================
INFO:root:==================== <function window_pcollection_coders at 0x110330b90> ====================
INFO:root:Running (((ref_AppliedPTransform_WriteToText/Write/WriteImpl/DoOnce/Read_10)+(ref_AppliedPTransform_WriteToText/Write/WriteImpl/InitializeWrite_11))+(ref_PCollection_PCollection_4/Write))+(ref_PCollection_PCollection_5/Write)
INFO:root:Running ((((((ref_AppliedPTransform_Read/Read_3)+(ref_AppliedPTransform_Convert to Proto_4))+(ref_AppliedPTransform_Convert to Bytes_5))+(ref_AppliedPTransform_WriteToText/Write/WriteImpl/WriteBundles_12))+(ref_AppliedPTransform_WriteToText/Write/WriteImpl/Pair_13))+(ref_AppliedPTransform_WriteToText/Write/WriteImpl/WindowInto(WindowIntoFn)_14))+(WriteToText/Write/WriteImpl/GroupByKey/Write)
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 782, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 453, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 921, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 142, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 122, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
  File "apache_beam/runners/worker/opcounters.py", line 196, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
  File "apache_beam/runners/worker/opcounters.py", line 214, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
  File "apache_beam/coders/coder_impl.py", line 1014, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 1023, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 330, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 385, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 200, in apache_beam.coders.coder_impl.CallbackCoderImpl.encode_to_stream
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/coders/coders.py", line 594, in <lambda>
    lambda x: dumps(x, HIGHEST_PROTOCOL), pickle.loads)
_pickle.PicklingError: Can't pickle <class 'test_pb2.Example'>: import of module 'test_pb2' failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/kmevissen/src/private/beam_multifile_with_proto_example/main.py", line 7, in <module>
    pipeline.run()
  File "/Users/kmevissen/src/private/beam_multifile_with_proto_example/pipeline/pipeline.py", line 43, in run
    bts | beam.io.WriteToText(known_args.output)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/pipeline.py", line 426, in __exit__
    self.run().wait_until_finish()
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/pipeline.py", line 406, in run
    self._options).run(False)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/pipeline.py", line 419, in run
    return self.runner.run_pipeline(self, self._options)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 128, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 294, in run_pipeline
    default_environment=self._default_environment))
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 301, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 383, in run_stages
    stage_context.safe_coders)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 655, in _run_stage
    result, splits = bundle_manager.process_bundle(data_input, data_output)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 1471, in process_bundle
    result_future = self._controller.control_handler.push(process_bundle_req)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 990, in push
    response = self.worker.do_instruction(request)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 342, in do_instruction
    request.instruction_id)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 368, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 593, in process_bundle
    data.ptransform_id].process_encoded(data.data)
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 143, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 255, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 256, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 143, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 428, in apache_beam.runners.worker.operations.ImpulseReadOperation.process
  File "apache_beam/runners/worker/operations.py", line 435, in apache_beam.runners.worker.operations.ImpulseReadOperation.process
  File "apache_beam/runners/worker/operations.py", line 256, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 143, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 593, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 594, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 778, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 784, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 851, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/future/utils/__init__.py", line 421, in raise_with_traceback
    raise exc.with_traceback(traceback)
  File "apache_beam/runners/common.py", line 782, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 453, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 921, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 142, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 122, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
  File "apache_beam/runners/worker/opcounters.py", line 196, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
  File "apache_beam/runners/worker/opcounters.py", line 214, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
  File "apache_beam/coders/coder_impl.py", line 1014, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 1023, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 330, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 385, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 200, in apache_beam.coders.coder_impl.CallbackCoderImpl.encode_to_stream
  File "/usr/local/Cellar/pyenv/1.2.13/versions/research/lib/python3.7/site-packages/apache_beam/coders/coders.py", line 594, in <lambda>
    lambda x: dumps(x, HIGHEST_PROTOCOL), pickle.loads)
_pickle.PicklingError: Can't pickle <class 'test_pb2.Example'>: import of module 'test_pb2' failed [while running 'Convert to Proto']

Process finished with exit code 1

Just to add: this pipeline is only an example that demonstrates the challenge I am facing with a more complex pipeline, so the functionality in this example is deliberately contrived.
