I am trying to use a custom accumulator class following the Spark documentation. It works if I define the class locally, but when I try to define it in another module and ship the file with sc.addPyFile, I get an ImportError.
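For context, this is roughly what the working, locally-defined version looks like. It is only a minimal sketch: the app name, the parallelized dummy data, and the 3-element array are my own placeholders, not part of the real job.

import numpy
import pyspark

# Same accumulator param as in the real job, but defined in the driver script itself.
class ArrayAccumulatorParam(pyspark.AccumulatorParam):
    def zero(self, initialValue):
        return numpy.zeros(initialValue.shape)

    def addInPlace(self, a, b):
        a += b
        return a

if __name__ == '__main__':
    sc = pyspark.SparkContext(conf=pyspark.SparkConf().setAppName('LOCAL_ACCUM_DEMO'))
    accum = sc.accumulator(numpy.zeros(3), ArrayAccumulatorParam())
    sc.parallelize([numpy.ones(3)] * 4).foreach(lambda row: accum.add(row))
    print(accum.value)  # [4. 4. 4.] -- fine when the class lives in this file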
I ran into the same problem when importing a helper function inside rdd.foreach, and I was able to work around it by doing the import inside the foreach function, following this SO question. However, the same fix does not work for the custom accumulator (and I really don't want to have to do it that way anyway).
tl;dr: What is the proper way to import a custom accumulator class?
extensions/accumulators.py:
import numpy
import pyspark

class ArrayAccumulatorParam(pyspark.AccumulatorParam):
    def zero(self, initialValue):
        return numpy.zeros(initialValue.shape)

    def addInPlace(self, a, b):
        a += b
        return a
run/count.py:
import numpy
import pyspark

from extensions.accumulators import ArrayAccumulatorParam

def main(sc):
    sc.addPyFile(LIBRARY_PATH + '/import_/logs.py')
    sc.addPyFile(LIBRARY_PATH + '/extensions/accumulators.py')
    rdd = sc.textFile(LOGS_PATH)
    accum = sc.accumulator(numpy.zeros(DIMENSIONS), ArrayAccumulatorParam())

    def count(row):
        import logs  # This 'internal import' seems to be required to avoid ImportError for the 'logs' module
        from extensions.accumulators import ArrayAccumulatorParam  # Error is thrown both with and without this line
        val = logs.parse(row)
        accum.add(val)

    rdd.foreach(count)  # Throws ImportError: No module named extensions.accumulators

if __name__ == '__main__':
    conf = pyspark.SparkConf().setAppName('SOME_COUNT_JOB')
    sc = pyspark.SparkContext(conf=conf)
    main(sc)
Error:
ImportError: No module named extensions.accumulators