I am new to Apache Beam in Python with the Dataflow runner. I am interested in creating a batch pipeline that publishes to Google Cloud PubSub; I tinkered with the Beam Python APIs and found a working solution. However, during that exploration I ran into some interesting problems that made me curious.

Currently, my successful Beam pipeline for batch-publishing data from GCS looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class PublishFn(beam.DoFn):
    def __init__(self, topic_path):
        self.topic_path = topic_path
        super(self.__class__, self).__init__()

    def process(self, element, **kwargs):
        from google.cloud import pubsub_v1

        publisher = pubsub_v1.PublisherClient()
        future = publisher.publish(self.topic_path, data=element.encode("utf-8"))
        return future.result()
def run_gcs_to_pubsub(argv):
    options = PipelineOptions(flags=argv)

    from datapipes.common.dataflow_utils import CsvFileSource
    from datapipes.protos import proto_schemas_pb2
    from google.protobuf.json_format import MessageToJson

    with beam.Pipeline(options=options) as p:
        normalized_data = (
            p |
            "Read CSV from GCS" >> beam.io.Read(CsvFileSource(
                "gs://bucket/path/to/file.csv")) |
            "Normalize to Proto Schema" >> beam.Map(
                lambda data: MessageToJson(
                    proto_schemas_pb2(data, proto_schemas_pb2.MySchema()),
                    indent=0,
                    preserving_proto_field_name=True)
            )
        )

        (normalized_data |
            "Write to PubSub" >> beam.ParDo(
                PublishFn(topic_path="projects/my-gcp-project/topics/mytopic"))
        )
Here, I tried to make the publisher shared across DoFn invocations. I attempted the following approaches.
a. Initializing the publisher inside the DoFn
class PublishFn(beam.DoFn):
    def __init__(self, topic_path):
        from google.cloud import pubsub_v1

        batch_settings = pubsub_v1.types.BatchSettings(
            max_bytes=1024,  # One kilobyte
            max_latency=1,  # One second
        )
        self.publisher = pubsub_v1.PublisherClient(batch_settings)
        self.topic_path = topic_path
        super(self.__class__, self).__init__()

    def process(self, element, **kwargs):
        future = self.publisher.publish(self.topic_path, data=element.encode("utf-8"))
        return future.result()


def run_gcs_to_pubsub(argv):
    ...  ## same as 1
b. Initializing the publisher outside the DoFn and passing it to the DoFn
class PublishFn(beam.DoFn):
    def __init__(self, publisher, topic_path):
        self.publisher = publisher
        self.topic_path = topic_path
        super(self.__class__, self).__init__()

    def process(self, element, **kwargs):
        future = self.publisher.publish(self.topic_path, data=element.encode("utf-8"))
        return future.result()


def run_gcs_to_pubsub(argv):
    ....  ## same as 1

    batch_settings = pubsub_v1.types.BatchSettings(
        max_bytes=1024,  # One kilobyte
        max_latency=1,  # One second
    )
    publisher = pubsub_v1.PublisherClient(batch_settings)

    with beam.Pipeline(options=options) as p:
        ...  # same as 1

        (normalized_data |
            "Write to PubSub" >> beam.ParDo(
                PublishFn(publisher=publisher, topic_path="projects/my-gcp-project/topics/mytopic"))
        )
Both attempts to share the publisher across DoFn methods failed with the following error messages:
File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
and
File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
My questions are:

1. Would a shared publisher implementation improve Beam pipeline performance? If it would, then I would like to explore this solution.

2. Why do the errors occur on the failing pipelines? Is it because a custom class object is initialized outside the process function and passed to the DoFn? If that is the cause, how can I implement the pipeline so that I am able to reuse a custom object in a DoFn?

Thank you very much, your help would be greatly appreciated.
OK, so Ankur has explained why my problem occurs and discussed how serialization is done on a DoFn. Based on this knowledge, I now understand that there are two solutions for sharing/reusing a custom object in a DoFn:

1. Make the custom object serializable: this allows the object to be initialized/made available during creation of the DoFn object (under __init__). The object must be serializable because it will be serialized during pipeline submission, where the DoFn object is created (which calls __init__). How to achieve this is covered in my answer below. I also found that this requirement is actually associated with the Beam documentation under [1] [2].

2. Initialize the non-serializable object in a DoFn function other than __init__ to avoid serialization, since functions other than __init__ are not called during pipeline submission. How to accomplish this is explained in Ankur's answer; a sketch of one variant of this approach follows after this list.
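For completeness, here is a minimal sketch of option 2 using DoFn.setup as the initialization hook. This assumes a Beam Python SDK version that provides setup (2.14.0 or later); the class name PublishFnWithSetup is hypothetical and this is not code from the original post or answers:

import apache_beam as beam


class PublishFnWithSetup(beam.DoFn):  # hypothetical name, illustration only
    def __init__(self, topic_path):
        # Only picklable state is stored at pipeline-construction time.
        self.topic_path = topic_path
        self.publisher = None

    def setup(self):
        # setup() runs on the worker after the DoFn has been deserialized,
        # so the gRPC-backed client never needs to be pickled.
        from google.cloud import pubsub_v1
        self.publisher = pubsub_v1.PublisherClient()

    def process(self, element, **kwargs):
        future = self.publisher.publish(self.topic_path, data=element.encode("utf-8"))
        yield future.result()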
References:
[1] https://beam.apache.org/documentation/programming-guide/#core-beam-transforms
Answer 0 (score: 2)
The PublisherClient cannot be pickled correctly. More on pickling here.

Initializing the PublisherClient in the process method avoids pickling the PublisherClient.

If the intent is to reuse the PublisherClient, I would recommend initializing the PublisherClient in the process method and storing it in self, as in the following code.
class PublishFn(beam.DoFn):
    def __init__(self, topic_path):
        self.topic_path = topic_path
        super(self.__class__, self).__init__()

    def process(self, element, **kwargs):
        # Create the client lazily on the worker and keep it on self for reuse.
        if not hasattr(self, 'publisher'):
            from google.cloud import pubsub_v1
            self.publisher = pubsub_v1.PublisherClient()
        future = self.publisher.publish(self.topic_path, data=element.encode("utf-8"))
        return future.result()
Answer 1 (score: 0)
Thanks to Ankur, I discovered that this problem is due to a pickling issue in Python. I then tried to isolate the problem by first resolving the pickling issue of the PublisherClient on its own, and found a solution for sharing the PublisherClient across DoFns on Beam.

In Python, we can pickle custom objects with the dill package, and I realized that this package is already used in the Beam Python implementation to pickle objects. So I tried to troubleshoot the problem and found this error:
TypeError: no default __reduce__ due to non-trivial __cinit__
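For reference, a minimal standalone sketch (my own addition, assuming google-cloud-pubsub and dill are installed) that should reproduce this pickling failure outside of Beam:

import dill
from google.cloud import pubsub_v1

# dill is the same pickler the Beam Python SDK uses; pickling the raw client
# is expected to fail with the TypeError shown above.
client = pubsub_v1.PublisherClient()
dill.dumps(client)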
Then I tried to fix this error, and my pipeline now works!

Below is the workaround:
import apache_beam as beam
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1 import PublisherClient


class PubsubClient(PublisherClient):
    def __reduce__(self):
        # Recreate the client from its batch settings when it is unpickled.
        return self.__class__, (self.batch_settings,)


# The DoFn to perform on each element in the input PCollection.
class PublishFn(beam.DoFn):
    def __init__(self, topic_path):
        self.topic_path = topic_path

        batch_settings = pubsub_v1.types.BatchSettings(
            max_bytes=1024,  # One kilobyte
            max_latency=1,  # One second
        )
        self.publisher = PubsubClient(batch_settings=batch_settings)
        super(self.__class__, self).__init__()

    def process(self, element, **kwargs):
        future = self.publisher.publish(topic=self.topic_path, data=element.encode("utf-8"))
        return future.result()


# ...the run_gcs_to_pubsub is the same as my successful pipeline
This solution works as follows: first, I subclass PublisherClient and implement the __reduce__ function myself. Note that, because I only used the batch_settings attribute to initialize my PublisherClient, this attribute is sufficient for my __reduce__ function. Then I use this modified PublisherClient in my DoFn's __init__.
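As a quick sanity check (my own addition, not from the original post), reusing the PubsubClient class defined above, a round trip through dill should now succeed:

import dill
from google.cloud import pubsub_v1

settings = pubsub_v1.types.BatchSettings(max_bytes=1024, max_latency=1)
client = PubsubClient(batch_settings=settings)

# __reduce__ lets dill rebuild the client from its batch settings.
restored = dill.loads(dill.dumps(client))
print(type(restored).__name__)  # expected: PubsubClient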
Hopefully, with this new solution my pipeline will gain performance.