What is a better way to do a left join on PCollections in Apache Beam?
pcoll1 = [('key1', [[('a', 1)],[('b', 2)], [('c', 3)], [('d', 4)],[('e', 5)], [('f', 6)]]), ('key2',[[('a', 12)],[('b', 21)], [('c', 13)]]), ('key3',[[('a', 21)],[('b', 23)], [('c', 31)]])]
pcoll2 = [('key1', [[('x', 10)]]), ('key2', [[('x', 20)]])]
The expected output is:
[('a', 1), ('x', 10)]
[('b', 2), ('x', 10)]
[('c', 3), ('x', 10)]
[('d', 4), ('x', 10)]
[('e', 5), ('x', 10)]
[('f', 6), ('x', 10)]
[('a', 12), ('x', 20)]
[('b', 21), ('x', 20)]
[('c', 13), ('x', 20)]
[('a', 21)]
[('b', 23)]
[('c', 31)]
I implemented a left joiner using CoGroupByKey() and ParDo(). Is there any other way to implement a left joiner in the Beam Python SDK?
left_joined = (
    {'left': pcoll1, 'right': pcoll2}
    | 'LeftJoiner: Combine' >> beam.CoGroupByKey()
    | 'LeftJoiner: ExtractValues' >> beam.Values()
    | 'LeftJoiner: JoinValues' >> beam.ParDo(LeftJoinerFn())
)

class LeftJoinerFn(beam.DoFn):
    def __init__(self):
        super(LeftJoinerFn, self).__init__()

    def process(self, row, **kwargs):
        left = row['left']
        right = row['right']
        if left and right:
            for each in left:
                yield each + right[0]
        elif left:
            for each in left:
                yield each
Answer 0 (score: 0)
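If the second collection is always the smaller one, an alternative is to use side inputs. This means making the right collection a side input that is broadcast to all workers, then writing a ParDo that processes elements from the left collection and looks up matches in the right collection.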
Answer 1 (score: 0)
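You can use the code below to pass the right side of the join as a side input, assuming the right side always maps a single element to each key, which means it is always much smaller than the left PCollection.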
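Also, if your PCollection is created by reading from an external source rather than from an in-memory array, you need to pass right_list=beam.pvalue.AsList(pcoll2) instead of right_list=pcoll2 to the ParDo. Check Cerberus for more information.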
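import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

LOG = logging.getLogger(__name__)

class LeftJoinerFn(beam.DoFn):
    def __init__(self):
        super(LeftJoinerFn, self).__init__()

    def process(self, row, **kwargs):
        # right_list is a list of (key, values) pairs; index it by key.
        right_dict = dict(kwargs['right_list'])
        left_key = row[0]
        if left_key in right_dict:
            # Right values are wrapped in an outer list, e.g. [[('x', 10)]],
            # so take the inner list before concatenating.
            for each in row[1]:
                yield each + right_dict[left_key][0]
        else:
            # No match on the right: emit the left rows unchanged.
            for each in row[1]:
                yield each

class Display(beam.DoFn):
    def process(self, element):
        LOG.info(str(element))
        yield element

pipeline_options = PipelineOptions()
p = beam.Pipeline(options=pipeline_options)
pcoll1 = [
    ('key1', [[('a', 1)], [('b', 2)], [('c', 3)], [('d', 4)], [('e', 5)], [('f', 6)]]),
    ('key2', [[('a', 12)], [('b', 21)], [('c', 13)]]),
    ('key3', [[('a', 21)], [('b', 23)], [('c', 31)]]),
]
pcoll2 = [('key1', [[('x', 10)]]), ('key2', [[('x', 20)]])]
left_joined = (
    p
    | 'CreateLeft' >> beam.Create(pcoll1)
    | 'LeftJoiner: JoinValues' >> beam.ParDo(LeftJoinerFn(), right_list=pcoll2)
    | 'Display' >> beam.ParDo(Display())
)
p.run().wait_until_finish()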