Question

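I am new to Apache Beam and am exploring the Python version of Apache Beam on Dataflow. I want to execute my Dataflow tasks in a particular order, but all tasks are executed in parallel. How can I create task dependencies in Apache Beam Python?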

Sample code (the sample.json file used in the code below contains 5 rows):

import apache_beam as beam
import logging
from apache_beam.options.pipeline_options import PipelineOptions

class Sample(beam.PTransform):
    def __init__(self, index):
        self.index = index

    def expand(self, pcoll):
        logging.info(self.index)
        return pcoll

class LoadData(beam.DoFn):
    def process(self, context):
        logging.info("***")

if __name__ == '__main__':

    logging.getLogger().setLevel(logging.INFO)
    pipeline = beam.Pipeline(options=PipelineOptions())

    (pipeline
        | "one" >> Sample(1)
        | "two: Read" >> beam.io.ReadFromText('sample.json')
        | "three: show" >> beam.ParDo(LoadData())
        | "four: sample2" >> Sample(2)
    )
    pipeline.run().wait_until_finish()

I expected it to execute in the order one, two, three, four. Instead it runs in parallel.

Output of the above code:

INFO:root:Missing pipeline option (runner). Executing pipeline using the default runner: DirectRunner.
INFO:root:1
INFO:root:2
INFO:root:Running pipeline with DirectRunner.
INFO:root:***
INFO:root:***
INFO:root:***
INFO:root:***
INFO:root:***

Answer 1

According to Dataflow's documentation:

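When the pipeline runner builds your actual pipeline for distributed execution, the pipeline may be optimized. For example, it may be more computationally efficient to run certain transforms together, or in a different order. The Dataflow service fully manages this aspect of your pipeline's execution.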

Also, according to Apache Beam's documentation:

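The API emphasizes processing elements in parallel, which makes it difficult to express actions like "assign a sequence number to each element in a PCollection". This is intentional, as such algorithms are much more likely to suffer from scalability problems. Processing all elements in parallel also has some drawbacks. Specifically, it makes it impossible to batch any operations, such as writing elements to a sink or checkpointing progress during processing.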

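So Dataflow and Apache Beam are parallel by nature. They are designed for embarrassingly parallel use cases, and they may not be the best tool if you need operations to run in a specific order. As @jkff pointed out, Dataflow optimizes the pipeline by parallelizing its operations in the best way it can.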

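If you really need to execute each step in consecutive order, a workaround is to use blocking execution via the waitUntilFinish() method (wait_until_finish() in the Python SDK, as already called in the question's code), as described in this other Stack Overflow answer. However, my understanding is that such an implementation only works in batch pipelines, because a streaming pipeline consumes data continuously, so you cannot block execution in order to work in consecutive steps.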
