Is it possible to have non-parallel steps in an Apache Beam / Dataflow job?

Asked: 2019-08-20 07:01:27

Tags: python google-cloud-dataflow apache-beam

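Suppose I have a Python Dataflow job on GCP that does the following two things:

  • Fetches some data from BigQuery

  • Calls an external API to get a specific value, and filters the BigQuery data based on that value

I was able to do this, but for the second step, the only way I figured out how to implement it was as a class that extends DoFn, which is then invoked in parallel: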

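import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class CallExternalServiceAndFilter(beam.DoFn):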
    def to_runner_api_parameter(self, unused_context):
        pass

    def process(self, element, **kwargs):
        # here I have to make the http call and figure out whether to yield the element or not,
        # however this happens for each element of the set, as expected.
        if element['property'] < response_body_parsed['some_other_property']:
            logging.info("Yielding element")
            yield element
        else:
            logging.info("Not yielding element")

with beam.Pipeline(options=PipelineOptions(), argv=argv) as p:
    rows = p | 'Read data' >> beam.io.Read(beam.io.BigQuerySource(
        dataset='test',
        project=PROJECT,
        query='Select * from test.table'
    ))

    rows = rows | 'Calling external service and filtering items' >> beam.ParDo(CallExternalServiceAndFilter())

    # ...

Is there a way to make the API call once and then use the result in the parallel filtering step?

1 Answer:

Answer 0: (score: 0)

Use the __init__ function.

class CallExternalServiceAndFilter(beam.DoFn):
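    def __init__(self):
        super().__init__()
        # __init__ runs once, where the pipeline is constructed; the stored
        # result is serialized together with the DoFn and sent to the workers.
        self.response_body_parsed = call_api()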

    def to_runner_api_parameter(self, unused_context):
        pass

    def process(self, element, **kwargs):
        # No HTTP call here: each element is compared against the value
        # that was fetched once in __init__.
        if element['property'] < self.response_body_parsed['some_other_property']:
            logging.info("Yielding element")
            yield element
        else:
            logging.info("Not yielding element")

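Or, even better, just call the API beforehand (on the local machine that constructs the pipeline) and assign the value in __init__.

response_body_parsed = call_api()

class CallExternalServiceAndFilter(beam.DoFn):
    def __init__(self):
        super().__init__()
        self.response_body_parsed = response_body_parsed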

    def to_runner_api_parameter(self, unused_context):
        pass

    def process(self, element, **kwargs):
        # No HTTP call here: each element is compared against the value
        # that was fetched once before the pipeline was built.
        if element['property'] < self.response_body_parsed['some_other_property']:
            logging.info("Yielding element")
            yield element
        else:
            logging.info("Not yielding element")
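
The reason a value stored in __init__ is fetched only once can be sketched without Beam at all. The following is a minimal illustration, not the runner's actual mechanism: `FilterByThreshold` is a hypothetical plain-Python stand-in for the DoFn, `call_api` is a fake that only counts invocations, and `copy.deepcopy` stands in for the serialize/deserialize round trip a runner performs when it ships the object to workers. Passing the value as a constructor argument, rather than reading a module-level global, is an assumption made here to keep the sketch self-contained.

```python
import copy

API_CALLS = 0  # counts how often the fake external service is hit


def call_api():
    """Hypothetical stand-in for the real HTTP call."""
    global API_CALLS
    API_CALLS += 1
    return {"some_other_property": 10}


class FilterByThreshold:
    """Plain-Python stand-in for the DoFn (no Beam import needed here)."""

    def __init__(self, response_body_parsed):
        # The already-fetched response is kept as ordinary instance state.
        self.response_body_parsed = response_body_parsed

    def process(self, element):
        if element["property"] < self.response_body_parsed["some_other_property"]:
            yield element


# One call, made where the pipeline would be built.
fn = FilterByThreshold(call_api())

# deepcopy stands in for shipping the object to three workers: like
# unpickling, it rebuilds the object from its state without re-running
# __init__, so call_api is not invoked again.
workers = [copy.deepcopy(fn) for _ in range(3)]

elements = [{"property": 5}, {"property": 15}]
results = [list(w.process(e)) for w in workers for e in elements]

print(API_CALLS)  # 1
```

Each simulated worker filters with the same threshold, yet the counter stays at one because only the stored dictionary travels with the object.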

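You said that using setup would still result in multiple calls. Is that also true of __init__ (if you make the API call inside the DoFn rather than beforehand)? I'm still unclear on the difference between __init__ and setup.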
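
As a rough illustration of the __init__ versus setup distinction: __init__ runs where the DoFn object is created (during pipeline construction), while setup runs on each worker-side copy of the DoFn before it processes elements. The sketch below only simulates that lifecycle in plain Python. `LifecycleDemo` is a hypothetical stand-in class, the four "workers" are an arbitrary assumption (a real runner decides how many instances exist), and `copy.deepcopy` again stands in for the serialization round trip.

```python
import copy


class LifecycleDemo:
    """Plain-Python stand-in mimicking a DoFn's __init__ and setup hooks."""

    init_calls = 0
    setup_calls = 0

    def __init__(self):
        # Runs once, where the pipeline is constructed.
        LifecycleDemo.init_calls += 1

    def setup(self):
        # Runs once per worker-side copy, before any elements are processed.
        LifecycleDemo.setup_calls += 1


template = LifecycleDemo()

# Pretend the runner hands a restored copy to each of four workers and
# invokes setup on every copy.
worker_copies = [copy.deepcopy(template) for _ in range(4)]
for fn in worker_copies:
    fn.setup()

print(LifecycleDemo.init_calls, LifecycleDemo.setup_calls)  # 1 4
```

Under these assumptions, an API call placed in setup would run once per copy, whereas one placed in __init__ would run once in total.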