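In my Apache Beam job I call an external source, GCP Storage. For general purposes this can be thought of as an HTTP call; the important part is that it is an external call made to enrich the data in the job.
For every piece of data I process, I call this API to get some information to enrich it. Many of these API calls are repeated requests for the same data.
Is there a good way to cache or store the results so they can be reused for each element being processed, limiting the network traffic required? This is a huge bottleneck in processing.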
Answer 0 (score: 0)
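There is no internal persistence layer in Beam. You have to download the data you want to process, and this may happen on every worker in the fleet that needs access to the data.
However, you may want to consider accessing the data as a side input. You would have to preload all of the data up front, but you would then not need to query the external source for each element: https://beam.apache.org/documentation/programming-guide/#side-inputs
For GCS, you may want to try one of the existing IOs, such as TextIO: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java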
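The side-input approach described above could look roughly like the following in the Python SDK. This is a minimal sketch, not a definitive implementation: the `enrich` helper, the `extra` field, and the in-memory sample data are all hypothetical, and in a real job the lookup collection would more likely be read from GCS (for example with `beam.io.ReadFromText`) and parsed into key/value pairs.

```python
def enrich(element, lookup):
    """Return a copy of element with the looked-up value for its key.

    `lookup` is a plain dict; inside the pipeline Beam materializes the
    side input into one, so no per-element network call is made.
    """
    return {**element, "extra": lookup.get(element["key"])}


def run():
    # Imported here so enrich() can be exercised without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        # Hypothetical reference data standing in for the external source.
        lookup = p | "Lookup" >> beam.Create([("a", 1), ("b", 2)])
        # Hypothetical main input.
        records = p | "Records" >> beam.Create([{"key": "a"}, {"key": "c"}])
        (
            records
            # AsDict turns the (key, value) PCollection into a dict side input.
            | "Enrich" >> beam.Map(enrich, lookup=beam.pvalue.AsDict(lookup))
            | "Print" >> beam.Map(print)
        )
```

Calling `run()` executes the pipeline on the default local runner. This only works well when the reference data is small enough to fit in each worker's memory.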
Answer 1 (score: 0)
You can consider persisting the value as instance state on the DoFn. For example:
import apache_beam as beam

# some_api_call() is a placeholder for the external lookup.
class MyDoFn(beam.DoFn):
    def __init__(self):
        # This will be called during construction and pickled to the workers.
        self.value1 = some_api_call()

    def setup(self):
        # This will be called once for each DoFn instance (generally
        # once per worker), good for non-pickleable stuff that won't change.
        self.value2 = some_api_call()
        # Assumption: a plain dict stands in for the LRU cache here;
        # bound its size in real use.
        self.some_lru_cache = {}

    def start_bundle(self):
        # This will be called per-bundle, possibly many times on a worker.
        self.value3 = some_api_call()

    def process(self, element):
        # This is called on each element.
        key = ...
        if key not in self.some_lru_cache:
            self.some_lru_cache[key] = some_api_call()
        value4 = self.some_lru_cache[key]
        # Use self.value1, self.value2, self.value3 and/or value4 here.
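The `some_lru_cache` used in `process` is not defined in the answer. One way to fill it in, as a rough sketch only, is a small bounded cache built on `collections.OrderedDict`. The `LruCache` class, its `get_or_load` method, and the `max_size` parameter are hypothetical names introduced for illustration, and the sketch assumes the API call can be expressed as a function of the key.

```python
from collections import OrderedDict


class LruCache:
    """Bounded cache that evicts the least recently used key."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._data = OrderedDict()

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader(key) on a miss."""
        if key in self._data:
            # Mark the key as most recently used.
            self._data.move_to_end(key)
            return self._data[key]
        value = loader(key)
        self._data[key] = value
        if len(self._data) > self.max_size:
            # Drop the oldest (least recently used) entry.
            self._data.popitem(last=False)
        return value
```

In the DoFn, `setup` could create `self.some_lru_cache = LruCache()` and `process` could call `self.some_lru_cache.get_or_load(key, fetch)`, where `fetch` wraps the external call. The cache is local to each DoFn instance, so the same key may still be fetched once per worker.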