Question

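I have a file with a million records, and some of them are bad records (I only find out which ones while processing each record in a ParDo). I want to write the bad records, together with the line number at which each one appears in the file, to one PCollection, and the remaining good records to a separate PCollection.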

Is there a way to maintain a global counter, shared across workers, of the lines read so far, so that I can use it to write out the line number?

Answer 1

You can use Apache Beam metrics to maintain global monitoring counters, which you can query from your own machine or from the runner's UI.

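If you want to keep a precise collection of all bad records, along with information about them (for example, the line number), then you need to add a transform that does this. Something like this: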

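import apache_beam as beam

# `p` is your pipeline; LoadRecords() stands in for the transform that reads your file.
original_records = p | LoadRecords()

class SplitRecords(beam.DoFn):
  BAD_RECORD_TAG = 'BadRecord'

  def process(self, record):
    # is_bad() is not shown here; supply your own validation logic.
    if self.is_bad(record):
      # Emit the record on the tagged 'BadRecord' output.
      yield beam.pvalue.TaggedOutput(self.BAD_RECORD_TAG, record)
    else:
      yield record   # Emit the record on the main output.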

record_collections = (original_records | 
                      beam.ParDo(SplitRecords()).with_outputs(
                          SplitRecords.BAD_RECORD_TAG,
                          main='GoodRecords'))

bad_records = record_collections[SplitRecords.BAD_RECORD_TAG]

good_records = record_collections['GoodRecords']

For a more detailed example, take a look at the Apache Beam cookbook directory, which contains an example with a multiple-output ParDo.
