I am new to Apache Beam in Python with the Dataflow runner. I am interested in creating a batch pipeline that publishes to Google Cloud PubSub; I tinkered with the Beam Python APIs and found a working solution. However, during that exploration I ran into some interesting problems that made me curious.

Currently, my successful Beam pipeline for batch-publishing data from GCS looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class PublishFn(beam.DoFn):
    def __init__(self, topic_path):
        self.topic_path = topic_path
        super(self.__class__, self).__init__()

    def process(self, element, **kwargs):
        from google.cloud import pubsub_v1

        publisher = pubsub_v1.PublisherClient()
        future = publisher.publish(self.topic_path, data=element.encode("utf-8"))
        return future.result()
def run_gcs_to_pubsub(argv):
    options = PipelineOptions(flags=argv)

    from datapipes.common.dataflow_utils import CsvFileSource
    from datapipes.protos import proto_schemas_pb2
    from google.protobuf.json_format import MessageToJson

    with beam.Pipeline(options=options) as p:
        normalized_data = (
            p |
            "Read CSV from GCS" >> beam.io.Read(CsvFileSource(
                "gs://bucket/path/to/file.csv")) |
            "Normalize to Proto Schema" >> beam.Map(
                lambda data: MessageToJson(
                    proto_schemas_pb2(data, proto_schemas_pb2.MySchema()),
                    indent=0,
                    preserving_proto_field_name=True)
            )
        )

        (normalized_data |
            "Write to PubSub" >> beam.ParDo(
                PublishFn(topic_path="projects/my-gcp-project/topics/mytopic"))
        )
Here, I tried to make the publisher shared across DoFn invocations. I attempted the following approaches.
a. Initializing the publisher inside the DoFn
class PublishFn(beam.DoFn):
    def __init__(self, topic_path):
        from google.cloud import pubsub_v1

        batch_settings = pubsub_v1.types.BatchSettings(
            max_bytes=1024,  # One kilobyte
            max_latency=1,  # One second
        )
        self.publisher = pubsub_v1.PublisherClient(batch_settings)
        self.topic_path = topic_path
        super(self.__class__, self).__init__()

    def process(self, element, **kwargs):
        future = self.publisher.publish(self.topic_path, data=element.encode("utf-8"))
        return future.result()


def run_gcs_to_pubsub(argv):
    ...  ## same as 1
b. Initializing the publisher outside the DoFn and passing it to the DoFn
class PublishFn(beam.DoFn):
    def __init__(self, publisher, topic_path):
        self.publisher = publisher
        self.topic_path = topic_path
        super(self.__class__, self).__init__()

    def process(self, element, **kwargs):
        future = self.publisher.publish(self.topic_path, data=element.encode("utf-8"))
        return future.result()


def run_gcs_to_pubsub(argv):
    ....  ## same as 1

    batch_settings = pubsub_v1.types.BatchSettings(
        max_bytes=1024,  # One kilobyte
        max_latency=1,  # One second
    )
    publisher = pubsub_v1.PublisherClient(batch_settings)

    with beam.Pipeline(options=options) as p:
        ...  # same as 1

        (normalized_data |
            "Write to PubSub" >> beam.ParDo(
                PublishFn(publisher=publisher, topic_path="projects/my-gcp-project/topics/mytopic"))
        )
Both attempts to share the publisher across DoFn methods failed with the following error messages:
File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
and
File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
My questions are:

1. Would a shared publisher implementation improve Beam pipeline performance? If it would, then I would like to explore this solution.

2. Why do the errors occur on the failing pipelines? Is it because a custom class object is initialized outside the process function and passed to the DoFn? If that is the cause, how can I implement the pipeline so that I am able to reuse a custom object in a DoFn?

Thank you very much, your help would be greatly appreciated.
OK, so Ankur has explained why my problem occurs and discussed how serialization is done on a DoFn. Based on this knowledge, I now understand that there are two solutions for sharing/reusing a custom object in a DoFn:

1. Make the custom object serializable: this allows the object to be initialized/made available during creation of the DoFn object (under __init__). The object must be serializable because it will be serialized during pipeline submission, where the DoFn object is created (which calls __init__). How to achieve this is covered in my answer below. I also found that this requirement is actually associated with the Beam documentation under [1] [2].

2. Initialize the non-serializable object in a DoFn function other than __init__ to avoid serialization, since functions other than __init__ are not called during pipeline submission. How to accomplish this is explained in Ankur's answer; a sketch of one variant of this approach follows after this list.
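For completeness, here is a minimal sketch of option 2 using DoFn.setup as the initialization hook. This assumes a Beam Python SDK version that provides setup (2.14.0 or later); the class name PublishFnWithSetup is hypothetical and this is not code from the original post or answers:

import apache_beam as beam


class PublishFnWithSetup(beam.DoFn):  # hypothetical name, illustration only
    def __init__(self, topic_path):
        # Only picklable state is stored at pipeline-construction time.
        self.topic_path = topic_path
        self.publisher = None

    def setup(self):
        # setup() runs on the worker after the DoFn has been deserialized,
        # so the gRPC-backed client never needs to be pickled.
        from google.cloud import pubsub_v1
        self.publisher = pubsub_v1.PublisherClient()

    def process(self, element, **kwargs):
        future = self.publisher.publish(self.topic_path, data=element.encode("utf-8"))
        yield future.result()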
References:
[1] https://beam.apache.org/documentation/programming-guide/#core-beam-transforms
Answer 0 (score: 2)
The PublisherClient cannot be pickled correctly. More on pickling here.

Initializing the PublisherClient in the process method avoids pickling the PublisherClient.

If the intent is to reuse the PublisherClient, I would recommend initializing the PublisherClient in the process method and storing it in self, as in the following code.
class PublishFn(beam.DoFn):
    def __init__(self, topic_path):
        self.topic_path = topic_path
        super(self.__class__, self).__init__()

    def process(self, element, **kwargs):
        # Create the client lazily on the worker and keep it on self for reuse.
        if not hasattr(self, 'publisher'):
            from google.cloud import pubsub_v1
            self.publisher = pubsub_v1.PublisherClient()
        future = self.publisher.publish(self.topic_path, data=element.encode("utf-8"))
        return future.result()
Answer 1 (score: 0)
Thanks to Ankur, I discovered that this problem is due to a pickling issue in Python. I then tried to isolate the problem by first resolving the pickling issue of the PublisherClient on its own, and found a solution for sharing the PublisherClient across DoFns on Beam.

In Python, we can pickle custom objects with the dill package, and I realized that this package is already used in the Beam Python implementation to pickle objects. So I tried to troubleshoot the problem and found this error:
TypeError: no default __reduce__ due to non-trivial __cinit__
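For reference, a minimal standalone sketch (my own addition, assuming google-cloud-pubsub and dill are installed) that should reproduce this pickling failure outside of Beam:

import dill
from google.cloud import pubsub_v1

# dill is the same pickler the Beam Python SDK uses; pickling the raw client
# is expected to fail with the TypeError shown above.
client = pubsub_v1.PublisherClient()
dill.dumps(client)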
Then I tried to fix this error, and my pipeline now works!

Below is the workaround:
import apache_beam as beam
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1 import PublisherClient


class PubsubClient(PublisherClient):
    def __reduce__(self):
        # Recreate the client from its batch settings when it is unpickled.
        return self.__class__, (self.batch_settings,)


# The DoFn to perform on each element in the input PCollection.
class PublishFn(beam.DoFn):
    def __init__(self, topic_path):
        self.topic_path = topic_path

        batch_settings = pubsub_v1.types.BatchSettings(
            max_bytes=1024,  # One kilobyte
            max_latency=1,  # One second
        )
        self.publisher = PubsubClient(batch_settings=batch_settings)
        super(self.__class__, self).__init__()

    def process(self, element, **kwargs):
        future = self.publisher.publish(topic=self.topic_path, data=element.encode("utf-8"))
        return future.result()


# ...the run_gcs_to_pubsub is the same as my successful pipeline
This solution works as follows: first, I subclass PublisherClient and implement the __reduce__ function myself. Note that, because I only used the batch_settings attribute to initialize my PublisherClient, this attribute is sufficient for my __reduce__ function. Then I use this modified PublisherClient in my DoFn's __init__.
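As a quick sanity check (my own addition, not from the original post), reusing the PubsubClient class defined above, a round trip through dill should now succeed:

import dill
from google.cloud import pubsub_v1

settings = pubsub_v1.types.BatchSettings(max_bytes=1024, max_latency=1)
client = PubsubClient(batch_settings=settings)

# __reduce__ lets dill rebuild the client from its batch settings.
restored = dill.loads(dill.dumps(client))
print(type(restored).__name__)  # expected: PubsubClient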
Hopefully, with this new solution my pipeline will gain performance.