How to add a module folder / tar.gz to the nodes in PySpark

Time: 2017-02-10 15:50:18

Tags: apache-spark pyspark

After setting the following configuration, I run pyspark in an IPython notebook:

export PYSPARK_DRIVER_PYTHON=/usr/local/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"
export PYSPARK_PYTHON=/usr/bin/python

I have a custom UDF that uses a module called mzgeohash. However, I get a module-not-found error, so I suspect the module is missing on the workers/nodes. I have tried sc.addPyFile() and so on. In this situation, what is an effective way to add a cloned module folder or a tar.gz Python module from IPython?

1 answer:

Answer 0: (score: 0)

Here is how I did it. Basically, the idea is to create a zip of all the files in the module and pass it to sc.addPyFile():

import os
import random
import string
import zipfile

def rand_str(n):
    # random suffix so repeated runs don't collide in /tmp
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(n))

def ziplib():
    libpath = os.path.dirname(__file__)              # this should point to your packages directory
    zippath = '/tmp/mylib-' + rand_str(6) + '.zip'   # some random filename in a writable directory
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.debug = 3                                 # make it verbose, good for debugging
        zf.writepy(libpath)
        return zippath                               # return path to the generated zip archive
    finally:
        zf.close()

...
zip_path = ziplib()          # generate zip archive containing your lib
sc.addPyFile(zip_path)       # add the entire archive to SparkContext
...
os.remove(zip_path)          # don't forget to remove the temporary file, preferably in a "finally" clause
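The mechanism that makes this work is that Python can import packages directly from a zip archive on sys.path, which is effectively what sc.addPyFile() arranges on each worker. The following is a minimal local sketch of that mechanism, without Spark; the package name `demopkg` and its `encode` function are hypothetical names invented for illustration:

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny throwaway package, zip it the same way ziplib() does,
# and show that putting the zip on sys.path (roughly what addPyFile
# does on each worker) makes the package importable.
workdir = tempfile.mkdtemp()
pkgdir = os.path.join(workdir, 'demopkg')      # hypothetical package
os.makedirs(pkgdir)
with open(os.path.join(pkgdir, '__init__.py'), 'w') as f:
    f.write('def encode(x):\n    return "geo-" + str(x)\n')

zippath = os.path.join(workdir, 'demopkg.zip')
zf = zipfile.PyZipFile(zippath, mode='w')
try:
    zf.writepy(pkgdir)        # compiles the package's .py files into the zip
finally:
    zf.close()

sys.path.insert(0, zippath)   # Python imports straight from the zip
import demopkg
print(demopkg.encode(42))     # prints "geo-42"
```

On a real cluster you would skip the sys.path manipulation: after sc.addPyFile(zip_path), a plain `import mzgeohash` inside your UDF should resolve on every executor.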