Question

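I am trying to run this program in PySpark as an example of a custom accumulator. I get the error 'int is not iterable' and I have not been able to fix it. Can someone help me with this?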

import findspark
findspark.init()
from pyspark import AccumulatorParam, SparkContext
sc = SparkContext('local','local')

rdd = sc.parallelize(xrange(10))

class SAP(AccumulatorParam):
    def zero(self, initialValue):
        s=set()
        s.add(initialValue)
        return s
    def addInPlace(self, v1, v2):

        return v1.union(v2)



ids_seen = sc.accumulator(0, SAP())
def inc(x):
    global ids_seen
    ids_seen += x
    return x

rdd.foreach(inc)

Answer 1

In terms of types, addInPlace is (R, R) => R and zero is (R) => R.
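To see where the error comes from, here is a minimal plain-Python sketch with no Spark involved. It assumes that a task starts from `zero(0)`, which in the question's code is `{0}`, and that `ids_seen += x` ends up calling `addInPlace` with the current set and the integer `x`. The names `partial` and `error` are only for illustration.

```python
# Plain Python, no Spark needed: reproduce the failing call from the question.
partial = {0}  # what the question's zero() builds from the initial value 0

try:
    # What addInPlace does when an int is added: set.union expects an iterable.
    partial.union(3)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

The second argument is an int rather than a set, so it does not match the (R, R) => R shape.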

The initial value should have the same type as the value you expect in the accumulator, so you have to initialize the Accumulator with a set:

ids_seen = sc.accumulator(set(), SAP())

or

ids_seen = sc.accumulator({0}, SAP())

and zero should be:

def zero(self, initialValue):
    return initialValue.copy()

Finally, inc should add a set:

def inc(x):
    global ids_seen
    ids_seen += {x}
    return x
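Putting the pieces together, the following rough sketch checks the corrected methods in plain Python, without a SparkContext. `SetParam` is a hypothetical stand-in that has the same two methods but does not subclass AccumulatorParam. The split into two chunks is an assumption that imitates two tasks whose results are merged afterwards.

```python
from functools import reduce


class SetParam:
    """Stand-in with the same two methods as the corrected SAP (no Spark import)."""

    def zero(self, initial_value):
        # (R) => R: return a fresh set of the same type as the initial value.
        return initial_value.copy()

    def addInPlace(self, v1, v2):
        # (R, R) => R: both arguments are sets.
        return v1.union(v2)


param = SetParam()
start = set()

# Each imitated task starts from zero(start) and adds one-element sets, as inc does.
partials = []
for chunk in (range(0, 5), range(5, 10)):
    acc = param.zero(start)
    for x in chunk:
        acc = param.addInPlace(acc, {x})
    partials.append(acc)

# Merge the per-task results with the same addInPlace.
result = reduce(param.addInPlace, partials, start)
print(sorted(result))
```

Because every value passed to `addInPlace` is a set, the union never sees a bare int.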
