Importing a custom accumulator type in Spark

Asked: 2016-08-12 09:50:38

Tags: python apache-spark import pyspark accumulator

I am trying to use a custom accumulator class, following the Spark documentation. This works if I define the class locally, but when I define it in another module and ship the file with sc.addPyFile, I get an ImportError.
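
For context, here is a minimal sketch of the "local" version that does work for me (the class is defined right in the driver script; the app name is arbitrary and DIMENSIONS is just a stand-in for the real array shape):

import numpy
import pyspark


class ArrayAccumulatorParam(pyspark.AccumulatorParam):
    def zero(self, initialValue):
        return numpy.zeros(initialValue.shape)

    def addInPlace(self, a, b):
        a += b
        return a


if __name__ == '__main__':
    conf = pyspark.SparkConf().setAppName('LOCAL_ACCUMULATOR_TEST')
    sc = pyspark.SparkContext(conf=conf)

    DIMENSIONS = 3  # stand-in for the real vector size
    accum = sc.accumulator(numpy.zeros(DIMENSIONS), ArrayAccumulatorParam())

    # The closure ships the accumulator to the workers; partial sums are merged on the driver
    sc.parallelize([numpy.ones(DIMENSIONS)] * 5).foreach(lambda row: accum.add(row))
    print(accum.value)  # expected: array([ 5.,  5.,  5.])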

I hit the same problem when importing helper functions used inside rdd.foreach, and I could work around it by doing the import inside the foreach'd function itself, following this SO question. However, the same fix does not work for the custom accumulator (and I would really rather not have to do it that way anyway).
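
To be concrete, the helper-function workaround that does work looks roughly like this (a sketch; logs is the module I ship via sc.addPyFile below, rdd is the same RDD as in count.py, and the function name here is just illustrative):

def parse_with_local_import(row):
    # Importing inside the function runs the import on the worker,
    # after sc.addPyFile has shipped logs.py to each executor
    import logs
    logs.parse(row)

rdd.foreach(parse_with_local_import)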

tl;dr: What is the correct way to import a custom accumulator class?

extensions/accumulators.py:

import numpy
import pyspark


class ArrayAccumulatorParam(pyspark.AccumulatorParam):
    def zero(self, initialValue):
        # Start from an all-zeros array with the same shape as the initial value
        return numpy.zeros(initialValue.shape)

    def addInPlace(self, a, b):
        # Element-wise, in-place addition of two partial results
        a += b
        return a

run/count.py:

import numpy
import pyspark

from extensions.accumulators import ArrayAccumulatorParam

def main(sc):
    sc.addPyFile(LIBRARY_PATH + '/import_/logs.py')
    sc.addPyFile(LIBRARY_PATH + '/extensions/accumulators.py')

    rdd = sc.textFile(LOGS_PATH)
    accum = sc.accumulator(numpy.zeros(DIMENSIONS), ArrayAccumulatorParam())

    def count(row):
        import logs # This 'internal import' seems to be required to avoid ImportError for the 'logs' module
        from extensions.accumulators import ArrayAccumulatorParam # Error is thrown both with and without this line

        val = logs.parse(row)
        accum.add(val)

    rdd.foreach(count) # Throws ImportError: No module named extensions.accumulators

if __name__ == '__main__':
    conf = pyspark.SparkConf().setAppName('SOME_COUNT_JOB')
    sc = pyspark.SparkContext(conf=conf)
    main(sc)

Error:

ImportError: No module named extensions.accumulators

0 Answers
