Question

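For learning purposes, I am trying to use a dictionary as a global variable through an accumulator. The add function works fine, but when I run the code and update the dictionary inside the map function, it always comes back empty.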

Similar code that uses a list as the global variable, however, works.

import re

from pyspark import AccumulatorParam


class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)


if __name__ == "__main__":
    # init_spark is the asker's own helper (not shown in the question)
    sc, sqlContext = init_spark("generate_score_summary", 40)
    rdd = sc.textFile('input')
    # print(rdd.take(5))

    dict1 = sc.accumulator({}, DictParam())

    def file_read(line):
        global dict1
        ls = re.split(',', line)
        dict1 += {ls[0]: ls[1]}
        return line

    rdd = rdd.map(lambda x: file_read(x)).cache()
    print(dict1)

Answer 1

I believe print(dict1) simply gets executed before rdd.map() does.

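In Spark, there are two types of operations:

transformations, which describe a future computation
and actions, which ask for a result and actually trigger execution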

Accumulators are updated only when some action is executed:

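Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action.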

If you look at the end of this section of the docs, there is an example exactly like yours:

accum = sc.accumulator(0)
def g(x):
    accum.add(x)
    return f(x)
data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.

So you need to add some action, for example:

rdd = rdd.map(lambda x: file_read(x)).cache() # transformation
foo = rdd.count() # action
print(dict1)

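Be sure to check the details of the various RDD functions and the peculiarities of accumulators, because they can affect the correctness of your result. (For example, rdd.take(n) by default only scans one partition, not the entire dataset.)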
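To make both points concrete without a cluster, here is a small plain-Python analogy (not Spark code). It assumes that Python's lazy built-in map can stand in for a transformation and that consuming the iterator stands in for an action; the names seen, record and lines are made up for illustration.

```python
from itertools import islice

seen = {}  # stands in for the accumulator's value


def record(line):
    key, value = line.split(",")
    seen[key] = value  # side effect, comparable to `dict1 += {...}`
    return line


lines = ["a,1", "b,2", "c,3", "d,4"]

lazy = map(record, lines)          # like a transformation: nothing runs yet
before = dict(seen)                # snapshot: still empty

first_two = list(islice(lazy, 2))  # like take(2): only part is evaluated
after_partial = dict(seen)         # snapshot: only the first two keys

rest = list(lazy)                  # like a full action such as count()
after_full = dict(seen)            # snapshot: every key is present
```

The side effect only happens for elements that are actually computed, which is why printing the accumulator before any action shows an empty dict, and why a partial action can leave it incomplete.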

Answer 2

For accumulator updates performed inside actions only, their value is only updated once that RDD is computed as part of an action.

Answer 3

For anyone who arrives at this thread looking for a dict accumulator for PySpark: the accepted solution does not solve the problem as posed.

The problem is actually in how DictParam is defined: as written, it does not update the original dictionary. This works:

class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1

The original code was missing the return value.
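Why the return value matters can be sketched in plain Python. This is an illustration under one assumption: that the accumulator stores whatever addInPlace returns as its new value. The fold helper and both class names below are hypothetical stand-ins, not PySpark APIs.

```python
class BrokenDictParam:
    """Same shape as the question's DictParam: addInPlace returns None."""

    def zero(self, value):
        return {}

    def addInPlace(self, value1, value2):
        value1.update(value2)  # mutates value1 but implicitly returns None


class FixedDictParam(BrokenDictParam):
    """Same as above, but hands the merged dict back."""

    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1


def fold(param, updates):
    # Hypothetical stand-in for the merge step (assumption: the result
    # of addInPlace replaces the stored value on every update).
    value = param.zero({})
    for update in updates:
        value = param.addInPlace(value, update)
    return value


broken_result = fold(BrokenDictParam(), [{"a": "1"}])
fixed_result = fold(FixedDictParam(), [{"a": "1"}, {"b": "2"}])
```

Under that assumption, the broken version loses the dictionary after the very first update, while the fixed version carries the merged dictionary forward.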

Using a dict as a global-variable accumulator in PySpark
