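My code is attached. I need to read in two CSVs. I read the first CSV and then want to pass that PCollection as a side input to the other CSV file, which I read line by line. I then want to yield the two elements joined together from a FlatMap function. The problem is that I can't get the side input passed into the function (I'm using Python). I've looked at a lot of examples online where others have done this in earlier versions. I know the pipeline is actually doing something, because I can at least write the left CSV out to a text file and see that each row is turned into a set of key-value pairs. Any help is much appreciated.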

from __future__ import absolute_import

import logging
import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class append_lr(beam.DoFn):
    def __init__(self, lineup):
        self._lineup = (1, 2)

    def process(self, left, right):
        bla = left
        burp = right
        both = left + right
        yield both


class MyCsvFileSource(beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        self._file = self.open_file(file_name)
        reader = csv.DictReader(self._file)
        for rec in reader:
            yield rec


def combine_lines():
    with beam.Pipeline(options=PipelineOptions()) as p:
        left_side = p | 'Read_Left_Side' >> beam.io.Read(MyCsvFileSource('/folder/left_side.csv'))
        left_and_right = (p | 'Read_Rght_Side' >> beam.io.Read(MyCsvFileSource('/folder/right_side.csv'))
                          | beam.FlatMap(append_lr, beam.pvalue.AsIter(left_side)))
        left_and_right | 'Write' >> beam.io.WriteToText('/folder/', file_name_suffix='test_output.csv')


def run(argv=None):
    combine_lines()


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run(None)

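A side note on the key-value observation in the question: `csv.DictReader` yields one dict per row, keyed by the header line, which is why each written-out row looks like key-value pairs. Here is a minimal standard-library sketch of that behavior. The sample data and the `read_rows` helper are made up for illustration and are not part of the original code.

```python
import csv
import io

# Made-up sample data standing in for a small CSV file.
SAMPLE_CSV = "id,name\n1,alice\n2,bob\n"


def read_rows(text):
    """Parse CSV text and return one plain dict per data row."""
    reader = csv.DictReader(io.StringIO(text))
    # Convert each row to a plain dict so the result is easy to compare and print.
    return [dict(row) for row in reader]


if __name__ == "__main__":
    for row in read_rows(SAMPLE_CSV):
        print(row)
```

Every value comes back as a string, so adding two rows with `left + right`, as in `process` above, would not work on dicts even once the side input is passed in correctly.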
Answer 0 (score: 0)
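It turns out I should have implemented the DoFn as a "callable" object. This excerpt from the Apache Beam source documentation says:

def FlatMap(fn, *args, **kwargs):  # pylint: disable=invalid-name
  """:func:`FlatMap` is like :class:`ParDo` except it takes a callable to specify the transformation.
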
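So I changed my function from a class to a def, and it worked like a charm. A few other code changes also show how to essentially build a for loop over a side input that comes from another PCollection (note: if you run this locally, make sure the left and right test files are small, because this emits one element for every left/right pair and the output gets very large!):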

from __future__ import absolute_import

import logging
import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class MyCsvFileSource(beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        # Each CSV row is emitted as a dict keyed by the header line.
        self._file = self.open_file(file_name)
        reader = csv.DictReader(self._file)
        for rec in reader:
            yield rec


def append_lr(left_, right_):
    # left_ is one element of the main input; right_ is the whole side input as an iterable.
    for thingy in right_:
        yield (left_, thingy)


def combine_sides():
    with beam.Pipeline(options=PipelineOptions()) as p:
        left_file = '/path/to/file/left.csv'
        right_file = '/path/to/file/right.csv'
        test_output = '/path/to/file/outputs/'

        left_side = p | 'Read_Left_Side' >> beam.io.Read(MyCsvFileSource(left_file))
        right_side = p | 'Read_Right_Side' >> beam.io.Read(MyCsvFileSource(right_file))

        # FlatMap takes a plain callable; extra arguments such as the side input are passed after it.
        all_combos = left_side | beam.FlatMap(append_lr, beam.pvalue.AsIter(right_side))
        all_combos | 'Write' >> beam.io.WriteToText(test_output, file_name_suffix='purple_nurple.csv')


def run(argv=None):
    combine_sides()


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run(None)