After setting the following configuration, I run pyspark in IPython Notebook:
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"
export PYSPARK_PYTHON=/usr/bin/python
I have a custom UDF that uses a module called mzgeohash. However, I get a module-not-found error, and I suspect the module is missing on the workers/nodes. I tried sc.addPyFile and so on. In this case, what is an effective way to add a cloned folder or a tar.gz Python module from IPython?
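For context, the failure happens on the executors rather than the driver: a UDF like the sketch below imports mzgeohash when it runs on a worker, and that is where the "No module named mzgeohash" error is raised. (This is a minimal illustration; the df column names and the mzgeohash.encode call are assumptions, not code from the question.)

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_geohash(lon, lat):
    import mzgeohash                      # executed on each executor, not just the driver
    return mzgeohash.encode((lon, lat))   # assumed API, shown only for illustration

geohash_udf = udf(to_geohash, StringType())

# fails on the workers if mzgeohash is only installed on the driver node
df.withColumn('geohash', geohash_udf(df['lon'], df['lat'])).show()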
Answer 0 (score: 0)
Here is how I did it. The idea is basically to create a zip of all the files in the module and pass it to sc.addPyFile():
import os
import random
import string
import zipfile

def rand_str(n):
    # simple stand-in for the unspecified rand_str helper used below
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(n))

def ziplib():
    libpath = os.path.dirname(__file__)              # this should point to your package's directory
    zippath = '/tmp/mylib-' + rand_str(6) + '.zip'   # some random filename in a writable directory
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.debug = 3           # make it verbose, good for debugging
        zf.writepy(libpath)    # compile and add all .py files found under libpath
        return zippath         # return the path to the generated zip archive
    finally:
        zf.close()
...
zip_path = ziplib() # generate zip archive containing your lib
sc.addPyFile(zip_path) # add the entire archive to SparkContext
...
os.remove(zip_path) # don't forget to remove temporary file, preferably in "finally" clause
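A quick way to check that the archive actually reached the executors is to force the import inside a task (a small sanity check added here, not part of the original answer):

def check_import(_):
    import mzgeohash          # should now resolve from the zip shipped via sc.addPyFile
    return mzgeohash.__name__

# runs on the executors; raises ImportError there if the zip was not distributed
print(sc.parallelize(range(4), 4).map(check_import).collect())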