Question

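I created a pipeline on Dataflow (Apache Beam) that reads from and writes to Google BigQuery, but I am having trouble building a DAG the way I would with Airflow.

Here is an example from my code: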

RUN ["MSBuild.exe", "C:\\build\\Application.sln"]

I would like these tasks to run one after another, rather than having Dataflow execute them in parallel.

How can I make them run sequentially?

Answer 1

I assume you are reading from BigQuery like this:

count = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource(known_args.input_table)))

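I dug into the apache_beam source code a little, and it looks like their Source transforms ignore the input PCollection, which is why they end up being set up in parallel.

Look at the last line of def expand(self, pbegin):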

class Read(ptransform.PTransform):
  """A transform that reads a PCollection."""

  def __init__(self, source):
    """Initializes a Read transform.

    Args:
      source: Data source to read from.
    """
    super(Read, self).__init__()
    self.source = source

  def expand(self, pbegin):
    from apache_beam.options.pipeline_options import DebugOptions
    from apache_beam.transforms import util

    assert isinstance(pbegin, pvalue.PBegin)
    self.pipeline = pbegin.pipeline

    debug_options = self.pipeline._options.view_as(DebugOptions)
    if debug_options.experiments and 'beam_fn_api' in debug_options.experiments:
      source = self.source

      def split_source(unused_impulse):
        total_size = source.estimate_size()
        if total_size:
          # 1MB = 1 shard, 1GB = 32 shards, 1TB = 1000 shards, 1PB = 32k shards
          chunk_size = max(1 << 20, 1000 * int(math.sqrt(total_size)))
        else:
          chunk_size = 64 << 20  # 64mb
        return source.split(chunk_size)

      return (
          pbegin
          | core.Impulse()
          | 'Split' >> core.FlatMap(split_source)
          | util.Reshuffle()
          | 'ReadSplits' >> core.FlatMap(lambda split: split.source.read(
              split.source.get_range_tracker(
                  split.start_position, split.stop_position))))
    else:
      # Treat Read itself as a primitive.
      return pvalue.PCollection(self.pipeline)

# ... other methods

If you set the experimental beam_fn_api option in the pipeline's debug options, pbegin actually gets used, but I am not sure what other effects that option has.
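As a side note, the sharding comment inside split_source can be checked with a little arithmetic. The sketch below copies only the chunk-size rule from the snippet above; the helper names chunk_size_for and approx_shards are made up for illustration, and treating "shards" as total size divided by chunk size (rounded up) is an assumption about what source.split would roughly produce, not something taken from the Beam source.

```python
import math


def chunk_size_for(total_size):
    """Mirror the chunk-size rule from split_source in the quoted snippet."""
    if total_size:
        # At least 1 MiB per chunk, otherwise proportional to sqrt(total size).
        return max(1 << 20, 1000 * int(math.sqrt(total_size)))
    return 64 << 20  # 64 MiB fallback when the size estimate is missing


def approx_shards(total_size):
    """Assumed shard estimate: total size / chunk size, rounded up."""
    return math.ceil(total_size / chunk_size_for(total_size))


if __name__ == "__main__":
    sizes = [("1MB", 1 << 20), ("1GB", 1 << 30), ("1TB", 1 << 40), ("1PB", 1 << 50)]
    for label, size in sizes:
        print(label, chunk_size_for(size), approx_shards(size))
```

Under that assumption the estimate comes out at 1, 33, 1049 and 33555 shards, which is close to the 1 / 32 / 1000 / 32k figures in the source comment.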

Why do you need them to happen sequentially? It looks like you are writing to one table and then reading from another?

If you really do need this to happen in order, you might be able to subclass Read like this:

class SequentialRead(Read):
  def expand(self, pbegin):
      return pbegin

Answer 2

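Since you want both to write the intermediate step to BigQuery and to pass the data on between the two transforms, I think a branch will give you the result you want.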

PCollection_1 = (read from BQ).apply(Transform_1)

PCollection_1.apply(write to BQ)

PCollection_1.apply(Transform_2).apply(write to BQ)

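This lets you apply Transform_2 to the elements after they have gone through Transform_1, while also writing the intermediate step to BQ. Applying more than one ParDo to the same PCollection creates a separate branch in the DAG.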

Create DAG Dataflow (Apache Beam)