Question

I came across the following Scala example, which explains aggregateByKey. The Scala example:

val pairs=sc.parallelize(Array(("a",3),("a",1),("b",7),("a",5)))
import scala.collection.mutable.HashSet
// The initial value is an empty set. The first function (_+_) adds an element to a set;
// the second function (_++_) merges two sets.
val sets = pairs.aggregateByKey(new HashSet[Int])(_+_, _++_)
sets.collect

The output of the above Scala code is:

res5: Array[(String, scala.collection.mutable.HashSet[Int])] = Array((b,Set(7)), (a,Set(1, 5, 3)))

I rewrote the above Scala code in Python:

pair = sc.parallelize([("a",3),("a",1),("b",7),("a",5)])
sets=pair.aggregateByKey((set()),(lambda x,y:x.add(y)),(lambda x,y:x|y))
sets.collect()

I don't know what went wrong. The Python code returns the following error message:

AttributeError: 'NoneType' object has no attribute 'add'

Answer 1

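The set method add updates the set in place and returns None (it does not return the updated set). That None is then passed as the accumulator to the next call of your function, which produces the error. Your function should return the set: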

def my_add(x, y):
    x.add(y)
    return x
sets = pair.aggregateByKey(set(), my_add, lambda x, y: x|y)
sets.collect()

    [('b', {7}), ('a', {1, 3, 5})]
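
The same failure can be reproduced without Spark, which may make the cause easier to see. The sketch below is only an illustration: it uses functools.reduce as a stand-in for the per-key folding, and the names broken_add and values are made up for this example.

```python
from functools import reduce

values = [3, 1, 5]  # the values for key "a" in the question

# set.add mutates the set in place and returns None.
print(set().add(3))  # None

def broken_add(acc, v):
    # Same shape as lambda x, y: x.add(y): the accumulator becomes None.
    return acc.add(v)

def my_add(acc, v):
    # Mutate, then hand the set back so the next call receives a set.
    acc.add(v)
    return acc

try:
    reduce(broken_add, values, set())
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

fixed = reduce(my_add, values, set())
print(error_message)
print(fixed)
```

The first call to broken_add succeeds, but it returns None, so the second call fails when it tries to call add on None.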

Answer 2

An alternative solution: build a new set with the union operator instead of mutating the accumulator.

sets = pair.aggregateByKey(set(), lambda x,y:x|{y}, lambda x, y: x|y)
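
To see how the two functions divide the work, here is a rough local model of aggregateByKey in plain Python. It is a sketch under assumptions, not Spark's implementation: the helper local_aggregate_by_key and the way the data is split into two partitions are hypothetical, chosen only for illustration.

```python
from copy import deepcopy

def local_aggregate_by_key(partitions, zero, seq_func, comb_func):
    """Hypothetical local model: fold within partitions, then merge across them."""
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key not in acc:
                # Each key starts from its own copy of the zero value.
                acc[key] = deepcopy(zero)
            # seq_func folds one value into the accumulator for that key.
            acc[key] = seq_func(acc[key], value)
        per_partition.append(acc)

    merged = {}
    for acc in per_partition:
        for key, partial in acc.items():
            # comb_func merges accumulators built on different partitions.
            merged[key] = comb_func(merged[key], partial) if key in merged else partial
    return merged

# Assumed split of the question's data into two partitions.
partitions = [[("a", 3), ("a", 1)], [("b", 7), ("a", 5)]]
result = local_aggregate_by_key(
    partitions,
    set(),
    lambda x, y: x | {y},  # add one element, returning a new set
    lambda x, y: x | y,    # union of two partial sets
)
print(result)
```

Because x | {y} returns a new set on every call, nothing depends on the return value of a mutating method, at the cost of allocating a set per element.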
