Question

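I have a complicated function that I run over a dataset in Spark using the map function. It lives in a different Python module. When map is called, the executor nodes do not have that code, so the map function fails.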

s_cobDates = sc.parallelize(getCobDates()) #getCobDates() returns a list of dates; parallelize turns it into an RDD so .map is available
sb_dataset = sc.broadcast(dataset) #fyi - it is not trivial to slice this into chunks per date

def sparkInnerLoop(n_cobDate):
   n_dataset = sb_dataset.value
   import someOtherModule
   return someOtherModule.myComplicatedCalc(n_dataset)

results = s_cobDates.map(sparkInnerLoop).collect()

Spark then fails because it cannot import someOtherModule.

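So far I have worked around this by creating a Python package that contains someOtherModule and deploying it to the cluster ahead of my Spark jobs, but that does not make for rapid prototyping.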

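How can I get Spark to send the complete code to the executor nodes without inlining all of it into "sparkInnerLoop"? That code is used elsewhere in my solution, and I don't want code duplication.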

I'm using an eight-node cluster in standalone mode, v1.6.2, and the driver runs on my workstation in PyCharm.

Answer 1

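The answer above works, but it fails if your modules are part of a package. Instead, you can zip your modules and add the zip file to your Spark context; the modules then keep their correct package names.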

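import os
import pathlib
import zipfile

from pyspark import SparkContext

# randstr is a helper module for random suffixes that is not shown in this answer

def ziplib():
    libpath = os.path.dirname(__file__)  # this should point to your packages directory
    zippath = r'c:\Temp\mylib-' + randstr.randstr(6) + '.zip'
    zippath = os.path.abspath(zippath)
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.debug = 3  # making it verbose, good for debugging
        zf.writepy(libpath)  # compiles and adds the modules/packages found under libpath
        return zippath  # return path to generated zip archive
    finally:
        zf.close()  # runs even though we return inside the try block

sc = SparkContext(conf=conf)  # conf is your existing SparkConf

zip_path = ziplib()  # generate zip archive containing your lib
zip_path = pathlib.Path(zip_path).as_uri()  # turn the local path into a file:// URI
sc.addPyFile(zip_path)  # add the entire archive to SparkContext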
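
If you want to check locally that a zip keeps the package name before involving the cluster, here is a minimal sketch that needs no Spark at all. It assumes that a file added with addPyFile ends up on the Python path of each executor, and it imitates that by putting the archive on sys.path. The names mylib_demo, calc, my_complicated_calc and zip_package are made up for the illustration, and it uses a plain zipfile.ZipFile rather than the PyZipFile used above.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def zip_package(package_dir, zip_path):
    """Zip package_dir so that its folder name is the top-level entry."""
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, mode="w") as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Store paths relative to the parent, e.g. "mylib_demo/calc.py"
                    zf.write(full, os.path.relpath(full, parent))
    return zip_path


# Build a throwaway package (hypothetical names) to demonstrate the idea
workdir = tempfile.mkdtemp()
pkg_dir = os.path.join(workdir, "mylib_demo")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(pkg_dir, "calc.py"), "w") as f:
    f.write("def my_complicated_calc(x):\n    return x * 2\n")

zip_path = zip_package(pkg_dir, os.path.join(workdir, "mylib_demo.zip"))

# Only the archive goes on sys.path, so the import below is served from the zip
sys.path.insert(0, zip_path)
calc = importlib.import_module("mylib_demo.calc")
result = calc.my_complicated_calc(21)
```

If the import by its full package name works here, the same archive passed to sc.addPyFile should let the function you hand to map do "import mylib_demo.calc" on the executors in the same way.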
