"FlatMap can be used only with callable objects..."

Time: 2018-08-28 22:44:25

Tags: python google-cloud-dataflow apache-beam

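My code is attached. I need to read in two CSVs. I read the first CSV and then want to pass that PCollection as a side input to the other CSV file, which I read line by line. I then want to yield the two elements joined together in a FlatMap function. The problem is that I cannot pass it into the function (I am using Python). I have looked at plenty of examples online where others did this in earlier versions. I know the pipeline is actually doing something, because I can at least write the left CSV out to a text file and see that each line has been turned into key-value pairs. Any help is greatly appreciated.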

from __future__ import absolute_import
import logging
import csv
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class append_lr(beam.DoFn):
    def __init__(self, lineup):
        self._lineup=(1,2)

    def process(self, left, right):
        bla=left
        burp=right
        both=left+right
        yield both


class MyCsvFileSource(beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        self._file = self.open_file(file_name)
        reader = csv.DictReader(self._file)
        for rec in reader:
            yield rec

def combine_lines():
    with beam.Pipeline(options=PipelineOptions()) as p:

        left_side = p | 'Read_Left_Side' >> beam.io.Read(MyCsvFileSource('/folder/left_side.csv'))
        left_and_right = (p | 'Read_Rght_Side' >> beam.io.Read(MyCsvFileSource('/folder/right_side.csv'))
                     | beam.FlatMap(append_lr, beam.pvalue.AsIter(left_side)))
        left_and_right | 'Write' >> beam.io.WriteToText('/folder/', file_name_suffix='test_output.csv')

def run(argv=None):
    combine_lines()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run(None)

1 answer:

Answer 0 (score: 0)

It turns out I needed to implement the DoFn as a "callable" object. The following snippet from the Apache Beam source documentation says:

def FlatMap(fn, *args, **kwargs):  # pylint: disable=invalid-name
  """:func:`FlatMap` is like :class:`ParDo` except it takes a callable to
  specify the transformation.

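So I changed the function from a class to a def, and it worked like a charm. A few other code changes also show how to build what is essentially a for loop over the side input from another PCollection. (Note: if you run this code locally, make sure the left and right test files are small, because this produces a very large output!)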

from __future__ import absolute_import
import logging
import csv
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyCsvFileSource(beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        self._file = self.open_file(file_name)
        reader = csv.DictReader(self._file)
        for rec in reader:
            yield rec

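def append_lr(left_, right_):
    # left_ is one element of the main input; right_ is the side input,
    # an iterable over every element of the right-hand PCollection.
    for thingy in right_:
        # Emit one (left, right) pair per right-hand element.
        yield (left_, thingy)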

def combine_sides():
    with beam.Pipeline(options=PipelineOptions()) as p:
        left_file = '/path/to/file/left.csv'
        right_file = '/path/to/file/right.csv'
        test_output = '/path/to/file/outputs/'

        left_side = p | 'Read_Left_Side' >> beam.io.Read(MyCsvFileSource(left_file))
        right_side = p | 'Read_Right_Side' >> beam.io.Read(MyCsvFileSource(right_file))
        all_combos = left_side | beam.FlatMap(append_lr, beam.pvalue.AsIter(right_side))
        all_combos | 'Write' >> beam.io.WriteToText(test_output, file_name_suffix='purple_nurple.csv')

def run(argv=None):
    combine_sides()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run(None)
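
To see why the output grows so quickly, the effect of FlatMap with an AsIter side input can be imitated in plain Python. This is only a sketch of the semantics, not Beam code: simulate_flat_map is a hypothetical stand-in, and the sample rows are made up for illustration.

```python
def append_lr(left_, right_):
    # Same function as in the answer: pair one left row with every right row.
    for thingy in right_:
        yield (left_, thingy)


def simulate_flat_map(fn, main_input, side_input):
    """Hypothetical plain-Python stand-in for FlatMap with a side input."""
    results = []
    for element in main_input:
        # fn returns an iterable; its items are flattened into the output.
        results.extend(fn(element, side_input))
    return results


# Made-up rows, shaped like the dicts csv.DictReader would yield.
left_rows = [{'id': '1'}, {'id': '2'}, {'id': '3'}]
right_rows = [{'color': 'red'}, {'color': 'blue'}]

all_combos = simulate_flat_map(append_lr, left_rows, right_rows)
print(len(all_combos))  # one output pair per (left, right) combination
```

The number of output elements is the product of the two input sizes, which is why small test files are advisable for a local run.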