Left join in Apache Beam

Date: 2017-08-04 20:01:28

Tags: python apache-beam

What is a better way to left join two PCollections in Apache Beam?

pcoll1 = [('key1', [[('a', 1)],[('b', 2)], [('c', 3)], [('d', 4)],[('e', 5)], [('f', 6)]]), ('key2',[[('a', 12)],[('b', 21)], [('c', 13)]]), ('key3',[[('a', 21)],[('b', 23)], [('c', 31)]])]
pcoll2 = [('key1', [[('x', 10)]]), ('key2', [[('x', 20)]])]

The expected output is:

[('a', 1), ('x', 10)]
[('b', 2), ('x', 10)] 
[('c', 3), ('x', 10)] 
[('d', 4), ('x', 10)]
[('e', 5), ('x', 10)] 
[('f', 6), ('x', 10)]
[('a', 12), ('x', 20)]
[('b', 21), ('x', 20)] 
[('c', 13), ('x', 20)]
[('a', 21)]
[('b', 23)]
[('c', 31)]

I implemented a left join using CoGroupByKey() and ParDo(). Is there any other way to implement a left join in the Beam Python SDK?

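import apache_beam as beam


class LeftJoinerFn(beam.DoFn):

    def __init__(self):
        super(LeftJoinerFn, self).__init__()

    def process(self, row, **kwargs):
        # After CoGroupByKey() and Values(), each row is a dict that maps
        # the tags 'left' and 'right' to the values grouped under one key.
        left = row['left']
        right = row['right']

        if left and right:
            # Key found on both sides: append the first right value
            # to every left value.
            for each in left:
                yield each + right[0]

        elif left:
            # Key found only on the left: emit the left values unchanged.
            for each in left:
                yield each


# Defined after LeftJoinerFn so the class exists when the pipeline is built.
left_joined = (
    {'left': pcoll1, 'right': pcoll2}
    | 'LeftJoiner: Combine' >> beam.CoGroupByKey()
    | 'LeftJoiner: ExtractValues' >> beam.Values()
    | 'LeftJoiner: JoinValues' >> beam.ParDo(LeftJoinerFn())
)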
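To make the shape of the grouped data concrete, here is a plain Python sketch that mimics what CoGroupByKey() followed by Values() hands to the DoFn, then applies the same branching as LeftJoinerFn. It is an emulation, not Beam itself. The helper names `co_group_by_key` and `left_join` are hypothetical, and the sketch assumes every input element is a single (key, record) pair rather than the nested lists shown in the sample data.

```python
from collections import defaultdict


def co_group_by_key(tagged):
    """Mimic the per-key {tag: [values]} dict produced by beam.CoGroupByKey()."""
    grouped = defaultdict(lambda: {tag: [] for tag in tagged})
    for tag, pairs in tagged.items():
        for key, value in pairs:
            grouped[key][tag].append(value)
    return dict(grouped)


def left_join(row):
    """Same branching as LeftJoinerFn.process, written as a plain generator."""
    left, right = row['left'], row['right']
    for each in left:
        if right:
            # Key present on both sides: append the first right record.
            yield each + right[0]
        else:
            # Key present only on the left: keep the record as it is.
            yield each


left_records = [('key1', [('a', 1)]), ('key1', [('b', 2)]), ('key3', [('a', 21)])]
right_records = [('key1', [('x', 10)]), ('key2', [('x', 20)])]

rows = co_group_by_key({'left': left_records, 'right': right_records})
joined = [record for row in rows.values() for record in left_join(row)]
```

Keys that appear only on the right side ('key2' here) have an empty 'left' list, so the loop emits nothing for them, which is what makes this a left join rather than a full outer join.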

2 answers:

Answer 0 (score: 0)

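If the second collection is always the smaller one, another approach is to use side inputs. This means passing the right collection as a side input that is broadcast to all workers, then writing a ParDo that processes the elements of the left collection and looks up matches in the right collection.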
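A minimal sketch of that idea, assuming both sides are real PCollections and the right side fits in memory on each worker. It wraps the right PCollection in `beam.pvalue.AsDict` so the join function receives it as a plain dict. The names `left_join_row` and `run_side_input_join` are hypothetical, and the data is a shortened version of the sample from the question.

```python
def left_join_row(row, right_by_key):
    """Left-join one (key, records) row against a dict of right-side values."""
    key, left_records = row
    right_value = right_by_key.get(key)
    for record in left_records:
        if right_value:
            # Assumes one right-side record per key, wrapped in a list as in
            # the question's sample data: [[('x', 10)]].
            yield record + right_value[0]
        else:
            # No match on the right: pass the left record through unchanged.
            yield record


def run_side_input_join():
    # Imported here so left_join_row can be used and tested without Beam.
    import apache_beam as beam

    left = [
        ('key1', [[('a', 1)], [('b', 2)]]),
        ('key3', [[('a', 21)]]),
    ]
    right = [('key1', [[('x', 10)]]), ('key2', [[('x', 20)]])]

    with beam.Pipeline() as p:
        left_pc = p | 'CreateLeft' >> beam.Create(left)
        right_pc = p | 'CreateRight' >> beam.Create(right)
        (
            left_pc
            # AsDict turns the (key, value) PCollection into a dict side input.
            | 'LeftJoin' >> beam.FlatMap(
                left_join_row, right_by_key=beam.pvalue.AsDict(right_pc))
            | 'Print' >> beam.Map(print)
        )
```

Calling `run_side_input_join()` runs the pipeline on the default runner. Keeping the join logic in a plain function also lets it be checked without a pipeline, for example with `list(left_join_row(('key3', [[('a', 21)]]), {}))`.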

Answer 1 (score: 0)

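You can use the code below to pass the right side of the join as a side input, assuming the right side always maps a single element to each key, which means it is always much smaller than the left PCollection.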

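Also, if your PCollection is created by reading from an external source rather than from an in-memory list, you need to pass right_list=beam.pvalue.AsList(pcoll2) instead of right_list=pcoll2 to the ParDo. See the Apache Beam documentation on side inputs for more information.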

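import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)


class LeftJoinerFn(beam.DoFn):

    def __init__(self):
        super(LeftJoinerFn, self).__init__()

    def process(self, row, **kwargs):
        # right_list is the side input: a list of (key, value) pairs.
        right_dict = dict(kwargs['right_list'])
        left_key = row[0]

        if left_key in right_dict:
            # The right side holds one record per key, wrapped in a list.
            # Unwrap it so each result is flat, e.g. [('a', 1), ('x', 10)].
            for each in row[1]:
                yield each + right_dict[left_key][0]

        else:
            # No match on the right: emit the left records unchanged.
            for each in row[1]:
                yield each


class Display(beam.DoFn):
    def process(self, element):
        LOG.info(str(element))
        yield element


# Default options are an assumption; replace them with your own.
pipeline_options = PipelineOptions()
p = beam.Pipeline(options=pipeline_options)

pcoll1 = [
    ('key1', [[('a', 1)], [('b', 2)], [('c', 3)], [('d', 4)], [('e', 5)], [('f', 6)]]),
    ('key2', [[('a', 12)], [('b', 21)], [('c', 13)]]),
    ('key3', [[('a', 21)], [('b', 23)], [('c', 31)]]),
]
pcoll2 = [('key1', [[('x', 10)]]), ('key2', [[('x', 20)]])]

left_joined = (
    p
    # Turn the in-memory list into a PCollection attached to pipeline p.
    | 'CreateLeft' >> beam.Create(pcoll1)
    | 'LeftJoiner: JoinValues' >> beam.ParDo(LeftJoinerFn(), right_list=pcoll2)
    | 'Display' >> beam.ParDo(Display())
)
p.run().wait_until_finish()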