Question

Given a text file in ndjson format, the following code produces what I expect: an ndjson file with the quotes.USD dict unnested and the original quotes element deleted.

  def unnest_quotes(element):
      element['USDquotes'] = element['quotes']['USD']
      del element['quotes']
      return element

  p = beam.Pipeline(options=pipeline_options)
  ReadJson = p | ReadFromText(known_args.input,coder=JsonCoder())
  MapFormattedJson = ReadJson | 'Map Function' >> beam.Map(unnest_quotes)
  MapFormattedJson | 'Write Map Output' >> WriteToText(known_args.output,coder=JsonCoder())

However, when I try to achieve the same thing with a ParDo, I don't understand the behavior.

  class UnnestQuotes(beam.DoFn):
    def process(self,element):
      element['USDquotes'] = element['quotes']['USD']
      del element['quotes']
      return element

  p = beam.Pipeline(options=pipeline_options)
  ReadJson = p | ReadFromText(known_args.input,coder=JsonCoder())
  ClassFormattedJson = ReadJson | 'Pardo' >> beam.ParDo(UnnestQuotes())
  ClassFormattedJson | 'Write Class Output' >> WriteToText(known_args.output,coder=JsonCoder())

This produces a file in which each key of the dict is on its own line, with no values, as shown below.

"last_updated"
"name"
"symbol"
"rank"
"total_supply"
"max_supply"
"circulating_supply"
"website_slug"
"id"
"USDquotes"

It is as if the PCollection produced by the Map function contains the complete dict, while the ParDo produces a separate element for each key.

I know I could just use the Map function, but I need to understand this behavior for when I have to use a ParDo in the future.

Answer 1

I solved this with the help of this answer: apache beam flatmap vs map

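What I was seeing is the same as the difference between FlatMap and Map. To get the behavior I wanted, all I had to do was wrap the ParDo's return value in a list.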

  class UnnestQuotes(beam.DoFn):
    def process(self,element):
      element['USDquotes'] = element['quotes']['USD']
      del element['quotes']
      return [element]
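
To see why the list wrapper matters, here is a minimal sketch in plain Python with no Beam dependency. It assumes only that ParDo treats whatever process returns as an iterable and emits each item as its own output element. The emit_outputs helper is a hypothetical stand-in for that step, not Beam's actual implementation, and the sample record is made up for illustration.

```python
def emit_outputs(process_result):
    """Hypothetical stand-in for ParDo: iterate over what process()
    returned and treat every item as a separate output element."""
    return list(process_result)


def unnest(element):
    """Same transformation as in the question, applied to a copy."""
    element = dict(element)  # shallow copy so the input is not mutated
    element['USDquotes'] = element['quotes']['USD']
    del element['quotes']
    return element


def process_returning_dict(element):
    return unnest(element)    # iterating a dict yields its keys


def process_returning_list(element):
    return [unnest(element)]  # one-item list: the whole dict is emitted


def process_yielding(element):
    yield unnest(element)     # generator form: also emits the whole dict


# Made-up sample record, for illustration only.
record = {'name': 'Bitcoin', 'symbol': 'BTC',
          'quotes': {'USD': {'price': 1.0}}}

keys_only = emit_outputs(process_returning_dict(record))
wrapped = emit_outputs(process_returning_list(record))
yielded = emit_outputs(process_yielding(record))

print(keys_only)  # ['name', 'symbol', 'USDquotes']
print(wrapped)    # [{'name': 'Bitcoin', 'symbol': 'BTC', 'USDquotes': {'price': 1.0}}]
print(yielded == wrapped)  # True
```

Returning the bare dict reproduces the one-key-per-line output from the question, because iterating over a dict walks its keys. In a real DoFn, writing yield element in process is a common alternative to return [element] and has the same effect.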
