Question

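In my apache-beam job I call an external source, GCP Storage. For general purposes this can be thought of as an HTTP call; the important part is that it is an external call made to enrich the job's data.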

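For every piece of data I process, I call this API to get some information to enrich that data. Many of these API calls are repeated requests for the same data.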

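Is there a good way to cache or store the results so they can be reused for each piece of data processed, to limit the network traffic required? This is a huge bottleneck in processing.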

Answer 1

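There is no internal persistence layer in Beam. You have to download the data you want to process, and that download may happen on every worker in the fleet that needs access to the data.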

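However, you may want to consider accessing the data as a side input. You would have to preload all of the data, but you would then not need to query the external source for each element: https://beam.apache.org/documentation/programming-guide/#side-inputs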

For GCS, you may want to try one of the existing IOs, such as TextIO: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java

Answer 2

You could consider persisting the value as instance state on the DoFn. For example:

import apache_beam as beam

# some_api_call() below is a placeholder for the external lookup.
class MyDoFn(beam.DoFn):
    def __init__(self):
        # This will be called during construction and pickled to the workers.
        self.value1 = some_api_call()

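    def setup(self):
        # This will be called once for each DoFn instance (generally
        # once per worker), good for non-pickleable stuff that won't change.
        self.value2 = some_api_call()
        # The cache used in process() must exist before the first element;
        # a plain dict stands in here for a bounded LRU cache.
        self.some_lru_cache = {}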

    def start_bundle(self):
        # This will be called per-bundle, possibly many times on a worker.
        self.value3 = some_api_call()

    def process(self, element):
        # This is called on each element.
        key = ...
        if key not in self.some_lru_cache:
            self.some_lru_cache[key] = some_api_call()
        value4 = self.some_lru_cache[key]
        # Use self.value1, self.value2, self.value3 and/or value4 here.
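
The `some_lru_cache` in the snippet above is left unspecified. One possible shape for it, as a sketch only, is a small bounded cache built on the standard library's `OrderedDict`. `BoundedLRUCache`, `get_or_fetch`, and `max_size` are hypothetical names introduced for illustration; they are not part of Beam.

```python
from collections import OrderedDict


class BoundedLRUCache:
    """Keeps at most max_size entries, evicting the least recently used one."""

    def __init__(self, max_size=1024):
        self.max_size = max_size
        self._entries = OrderedDict()

    def get_or_fetch(self, key, fetch):
        # Cache hit: mark the key as most recently used and skip the external call.
        if key in self._entries:
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: make the (expensive) call once and remember the result.
        value = fetch(key)
        self._entries[key] = value
        if len(self._entries) > self.max_size:
            # Drop the oldest entry so memory use on the worker stays bounded.
            self._entries.popitem(last=False)
        return value
```

Creating the cache in `setup` rather than `__init__` means it is not pickled with the DoFn, and each DoFn instance gets its own copy. Entries are not shared between workers, so the same key may still be fetched once per instance.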

How can I persist externally fetched stateful data in apache-beam Python?
