Running external Python dependencies with spark-submit?

Date: 2018-05-16 02:09:01

Tags: python apache-spark pyspark spark-submit

I have a test.py file:

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.externals import joblib
import tqdm
import time

print("Successful import")

I followed this approach to build a standalone zip of all the dependencies (a sketch of the requirements.txt itself follows below):

pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .

which creates this tree structure inside dependencies.zip:

dependencies.zip
     ->pandas
     ->numpy
     ->........
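The requirements.txt that drives the pip install above looks roughly like this (paraphrased here from the imports in test.py rather than copied verbatim, so take the exact package list and lack of version pins as an assumption):

pandas
numpy
tensorflow
scikit-learn
tqdm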

When I run

spark-submit --py-files /home/ion/Documents/dependencies.zip /home/ion/Documents/sentiment_analysis/test.py

I get the following error:

2018-05-16 07:36:21 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/home/ion/Documents/sentiment_analysis/test.py", line 2, in <module>
    from encoder import Model
  File "/home/ion/Documents/sentiment_analysis/encoder.py", line 2, in <module>
    import numpy as np
  File "/home/ion/Documents/dependencies.zip/numpy/__init__.py", line 142, in <module>
  File "/home/ion/Documents/dependencies.zip/numpy/add_newdocs.py", line 13, in <module>
  File "/home/ion/Documents/dependencies.zip/numpy/lib/__init__.py", line 8, in <module>
  File "/home/ion/Documents/dependencies.zip/numpy/lib/type_check.py", line 11, in <module>
  File "/home/ion/Documents/dependencies.zip/numpy/core/__init__.py", line 26, in <module>
ImportError: 
Importing the multiarray numpy extension module failed.  Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control).  Otherwise reinstall numpy.

Original error was: cannot import name multiarray

2018-05-16 07:36:21 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-05-16 07:36:21 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-a3c2ec75-6c12-4ac2-ae2c-b36412209889
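My guess, and it is only an assumption on my part, is that the zip itself is fine but zipimport is not: numpy ships compiled extension modules such as multiarray (.so files), and Python cannot load compiled extensions directly out of a zip archive. I would expect the same failure even without Spark, for example:

import sys

# Hypothetical reproduction outside Spark: put the zipped dependencies on the
# path and try the import; compiled .so modules inside the zip should not load.
sys.path.insert(0, "/home/ion/Documents/dependencies.zip")

import numpy as np  # expected to fail with "cannot import name multiarray"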

Is there a way to run this Python script as a Spark job without changing the code for PySpark, or with only minimal changes?

0 Answers:

There are no answers yet.