Question

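I have a pipeline that parses records from AVRO files.

I need to split the incoming records into chunks of 500 items so that I can call an API that accepts multiple inputs at once.

Is there a way to do this with the Python SDK?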

Answer 1

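I think you're describing a batching use case. You have two options:

If your PCollection is large enough and you have some flexibility in the bundle size, you can use a GroupByKey transform after assigning keys to your elements in a random/round-robin fashion. For example: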

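from random import randint

import apache_beam as beam

my_collection = p | ReadRecordsFromAvro()

element_bundles = (my_collection
                   # Choose a number of keys that works for you (50 here).
                   # randint is inclusive at both ends, so 0..49 gives 50 keys.
                   | 'AddKeys' >> beam.Map(lambda x: (randint(0, 49), x))
                   | 'MakeBundles' >> beam.GroupByKey()
                   # Tuple-unpacking lambdas are Python 2 only; index the pair.
                   | 'DropKeys' >> beam.Map(lambda kv: kv[1])
                   | beam.ParDo(ProcessBundlesDoFn()))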

ProcessBundlesDoFn would look something like this:

from itertools import islice

class ProcessBundlesDoFn(beam.DoFn):
  def process(self, bundle):
    it = iter(bundle)  # the grouped values arrive as an iterable
    while True:
      # Fetch in batches of up to 500 until you're done
      batch = list(islice(it, 500))
      if not batch:
        break
      yield batch
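
To get a feel for what this approach does to batch sizes, here is a minimal plain-Python sketch that mimics it without Beam: elements get a random key, are grouped per key, and each group is cut into chunks of at most 500. The helper names (`chunk`, `simulate_bundles`) and the sizes used are illustrative assumptions, not part of the Beam API.

```python
import random
from collections import defaultdict
from itertools import islice


def chunk(iterable, size):
    """Yield lists of up to `size` items from `iterable` (hypothetical helper)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def simulate_bundles(elements, num_keys, batch_size, seed=0):
    """Mimic AddKeys -> GroupByKey -> DropKeys -> chunking, all in memory."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for element in elements:
        # Random key in 0..num_keys-1, like the 'AddKeys' step.
        groups[rng.randrange(num_keys)].append(element)
    batches = []
    for bundle in groups.values():
        batches.extend(chunk(bundle, batch_size))
    return batches


batches = simulate_bundles(range(100_000), num_keys=50, batch_size=500)
full = sum(1 for b in batches if len(b) == 500)
partial = len(batches) - full
print(f"{len(batches)} batches: {full} full, {partial} partial")
```

No batch exceeds 500 elements, but each key can leave one smaller trailing batch, so up to `num_keys` batches may come out short.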

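If you need every bundle to have exactly 500 elements, you may need to:

Count the number of elements in the PCollection.
Pass that count as a singleton side input to the 'AddKeys' ParDo to determine the number of keys you need.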
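
The arithmetic behind those two steps can be sketched in plain Python. This is only an illustration under assumptions: a target batch size of 500, and round-robin keys in place of random ones, which in a real pipeline would need a per-element index that Beam does not hand you for free. `keys_needed` and `assign_round_robin` are hypothetical names.

```python
import math
from collections import Counter


def keys_needed(total_count, batch_size=500):
    """Number of keys so that evenly spread groups hold at most batch_size items."""
    return max(1, math.ceil(total_count / batch_size))


def assign_round_robin(elements, num_keys):
    """Pair each element with a key in 0..num_keys-1, cycling through the keys."""
    return [(index % num_keys, element) for index, element in enumerate(elements)]


total = 1_200
num_keys = keys_needed(total)  # ceil(1200 / 500) = 3
keyed = assign_round_robin(range(total), num_keys)
group_sizes = Counter(key for key, _ in keyed)
print(num_keys, sorted(group_sizes.values()))
```

With 1,200 elements this gives 3 keys and groups of 400 each, so groups of exactly 500 are only reachable when the total count is a multiple of 500. With random keys the group sizes vary further, which is why the chunking step in the DoFn is still useful.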

Hope that helps.
