Question

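I have an Apache Beam pipeline written in Python. I read rows from a CSV file and then apply generic pipeline steps to all PCollections. This works fine. For PCollections that come from a specific file name, I want to perform a few additional steps, so I tag the PCollections from that file and run the additional steps on the tagged collections. When I run the pipeline on 'Dataflow', I get the error "Workflow failed. Causes: Expected custom source to have non-zero number of splits."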

I have tested it, and it runs fine on 'DirectRunner'.

lines = (p | beam.io.ReadFromText(input_file_path, skip_header_lines=1))

Generic = (lines | <"Do generic logic for all pCollections">)

tagged_lines = (lines | beam.ParDo(Tag(), input_file_path).with_outputs(Tag.TAG_OPTOUT, Tag.TAG_BOUNCE))

Optouts = (tagged_lines[Tag.TAG_OPTOUT] | <"Do logic 1">)

Bounces = (tagged_lines[Tag.TAG_BOUNCE] | <"Do logic 2">)

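import apache_beam as beam
from apache_beam import pvalue

class Tag(beam.DoFn):
    TAG_OPTOUT = 'OPTOUT'
    TAG_BOUNCE = 'BOUNCE'

    def process(self, element, input_file_path):
        # .get() implies input_file_path is a ValueProvider resolved at runtime
        input_file = input_file_path.get()
        # Route every element by the file name, not by the element's content
        if "optout" in input_file:
            yield pvalue.TaggedOutput(self.TAG_OPTOUT, element)
        elif "bounce" in input_file:
            yield pvalue.TaggedOutput(self.TAG_BOUNCE, element)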
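
The routing rule inside `Tag.process` looks only at the file path, never at the element, so every row from a given file gets the same tag. A minimal plain-Python sketch of that rule, with no Beam dependency (`tag_for_path` is a hypothetical helper name used only for illustration; it is not part of the pipeline above or of the Beam API, and the example paths in the comment are made up):

```python
TAG_OPTOUT = 'OPTOUT'
TAG_BOUNCE = 'BOUNCE'


def tag_for_path(input_file):
    """Return the tag implied by the file name, or None if neither substring matches."""
    # Same substring checks, in the same order, as Tag.process.
    if "optout" in input_file:
        return TAG_OPTOUT
    elif "bounce" in input_file:
        return TAG_BOUNCE
    return None


# Hypothetical paths, for illustration only:
#   tag_for_path("gs://some-bucket/optout_list.csv")
#   tag_for_path("gs://some-bucket/bounce_list.csv")
```

A file whose name contains neither substring gets no tag, which in the pipeline means both tagged outputs stay empty for that input.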

Error when splitting a PCollection on the Dataflow runner

0 answers: