Question

I'm getting started with Apache Beam in Python, and I get stuck about every 30 minutes. I'm trying to flatten and then transform:

lines = messages | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
output = ( lines
           | 'process' >> beam.Map(process_xmls) # returns list
           | 'jsons' >> beam.Map(lambda x: [beam.Create(jsons.dump(model)) for model in x])
           | 'flatten' >> beam.Flatten()
           | beam.WindowInto(window.FixedWindows(1, 0)))

When I run this code, I get this error:

ValueError: Input to Flatten must be an iterable. Got a value of type <class 'apache_beam.pvalue.PCollection'> instead.

What should I do?

Answer 1

The beam.Flatten() operation takes an iterable of PCollections and returns a new PCollection containing the union of all elements in the input PCollections. It is not possible to have a PCollection of PCollections.

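I think what you are looking for here is the beam.FlatMap operation. It differs from beam.Map in that it can emit multiple elements for each input. For example, if you have a PCollection lines containing the elements {'two', 'words'}, then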

lines | beam.Map(list)

would be a PCollection consisting of two lists

{['t', 'w', 'o'], ['w', 'o', 'r', 'd', 's']}

whereas

lines | beam.FlatMap(list)

would result in a PCollection consisting of the individual letters

{'t', 'w', 'o', 'w', 'o', 'r', 'd', 's'}.

So your final program should look something like

lines = messages | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
output = ( lines
           | 'process' >> beam.FlatMap(process_xmls)  # emits each item of the lists returned by process_xmls as its own element
           | 'jsons' >> beam.Map(jsons.dumps)  # serialize each element to a JSON string
           | beam.WindowInto(window.FixedWindows(1, 0)))

(Also note that you probably want json.dumps, which returns a string, rather than json.dump, which takes a second argument: the file/stream to write to.)
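
The difference between the two functions can be seen with the standard-library json module. This is a small sketch with a made-up record; the third-party jsons package used in the question is a separate library and is not exercised here.

```python
import io
import json

# Hypothetical sample record, used only for illustration.
record = {'id': 1, 'name': 'example'}

# json.dumps returns the serialized JSON as a str.
as_string = json.dumps(record)

# json.dump writes to a file-like object passed as the second argument
# and returns None.
buffer = io.StringIO()
result = json.dump(record, buffer)
written = buffer.getvalue()
```

Since beam.Map uses the function's return value as the output element, the variant that returns a string is the one that fits inside a pipeline step.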

How do I pass input to beam.Flatten()?
