I have a pyspark job that has a lot of dependencies, so I did a pip install -t reqs -r requirements.txt and then zipped up the reqs directory. I then pass that zip to --py-files so it is available on all the nodes.
I have a file that imports pandas, and pandas is successfully found in the zip file. However, pandas in turn imports numpy, and numpy is not found even though it is also in that zip file:
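A minimal sketch of the packaging step using only the standard library (the dummy examplepkg module stands in for the real pip-installed tree): everything under reqs/ is rolled into a zip whose root contains the packages themselves (pandas/, numpy/, ...), which is the layout --py-files expects.

```python
# Sketch: zip the contents of reqs/ so packages sit at the archive root.
# "examplepkg" is a hypothetical stand-in for the pip-installed packages.
import os
import tempfile
import zipfile

base = tempfile.mkdtemp()
reqs = os.path.join(base, "reqs")
os.makedirs(os.path.join(reqs, "examplepkg"))
with open(os.path.join(reqs, "examplepkg", "__init__.py"), "w") as f:
    f.write("NAME = 'examplepkg'\n")

zpath = os.path.join(base, "reqs.zip")
with zipfile.ZipFile(zpath, "w") as zf:
    for root, _, files in os.walk(reqs):
        for name in files:
            full = os.path.join(root, name)
            # archive paths are relative to reqs/, so each package
            # directory lands at the top level of the zip
            zf.write(full, os.path.relpath(full, reqs))

print(zipfile.ZipFile(zpath).namelist())  # ['examplepkg/__init__.py']
```

The resulting reqs.zip is then handed to spark-submit via --py-files, which prepends it to sys.path on every executor.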
File "./spark-job.zip/model/transforms/parser.py", line 3, in <module>
import pandas as pd
File "./reqs.zip/pandas/__init__.py", line 19, in <module>
"Missing required dependencies {0}".format(missing_dependencies))
ImportError: Missing required dependencies ['numpy']
Note that the traceback shows pandas being loaded from reqs.zip, while numpy fails to import. I am pretty sure numpy is in the zip file:
$ unzip -l reqs.zip | grep pandas | head -10
0 2017-06-28 20:58 pandas-0.20.2.dist-info/
97277 2017-06-28 20:58 pandas-0.20.2.dist-info/RECORD
109 2017-06-28 20:58 pandas-0.20.2.dist-info/WHEEL
7 2017-06-28 20:58 pandas-0.20.2.dist-info/top_level.txt
1102 2017-06-28 20:58 pandas-0.20.2.dist-info/metadata.json
4 2017-06-28 20:58 pandas-0.20.2.dist-info/INSTALLER
3471 2017-06-28 20:58 pandas-0.20.2.dist-info/DESCRIPTION.rst
4452 2017-06-28 20:58 pandas-0.20.2.dist-info/METADATA
0 2017-06-28 20:58 pandas/
0 2017-06-28 20:58 pandas/formats/
$ unzip -l reqs.zip | grep numpy | head -40
0 2017-06-28 20:58 numpy-1.13.0.dist-info/
52437 2017-06-28 20:58 numpy-1.13.0.dist-info/RECORD
109 2017-06-28 20:58 numpy-1.13.0.dist-info/WHEEL
6 2017-06-28 20:58 numpy-1.13.0.dist-info/top_level.txt
1330 2017-06-28 20:58 numpy-1.13.0.dist-info/metadata.json
4 2017-06-28 20:58 numpy-1.13.0.dist-info/INSTALLER
884 2017-06-28 20:58 numpy-1.13.0.dist-info/DESCRIPTION.rst
2217 2017-06-28 20:58 numpy-1.13.0.dist-info/METADATA
22668 2017-06-28 20:58 sklearn/externals/joblib/numpy_pickle_utils.py
8440 2017-06-28 20:58 sklearn/externals/joblib/numpy_pickle_compat.py
23222 2017-06-28 20:58 sklearn/externals/joblib/numpy_pickle.py
7468 2017-06-28 20:58 sklearn/externals/joblib/__pycache__/numpy_pickle_compat.cpython-34.pyc
16545 2017-06-28 20:58 sklearn/externals/joblib/__pycache__/numpy_pickle_utils.cpython-34.pyc
15643 2017-06-28 20:58 sklearn/externals/joblib/__pycache__/numpy_pickle.cpython-34.pyc
10508 2017-06-28 20:58 scipy/_lib/__pycache__/_numpy_compat.cpython-34.pyc
11513 2017-06-28 20:58 scipy/_lib/_numpy_compat.py
0 2017-06-28 20:58 pandas/compat/numpy/
12344 2017-06-28 20:58 pandas/compat/numpy/function.py
0 2017-06-28 20:58 pandas/compat/numpy/__pycache__/
2399 2017-06-28 20:58 pandas/compat/numpy/__pycache__/__init__.cpython-34.pyc
10150 2017-06-28 20:58 pandas/compat/numpy/__pycache__/function.cpython-34.pyc
2213 2017-06-28 20:58 pandas/compat/numpy/__init__.py
0 2017-06-28 20:58 numpy/
0 2017-06-28 20:58 numpy/.libs/
38513408 2017-06-28 20:58 numpy/.libs/libopenblasp-r0-39a31c03.2.18.so
1023960 2017-06-28 20:58 numpy/.libs/libgfortran-ed201abd.so.3.0.0
0 2017-06-28 20:58 numpy/testing/
2705 2017-06-28 20:58 numpy/testing/print_coercion_tables.py
8036 2017-06-28 20:58 numpy/testing/decorators.py
75541 2017-06-28 20:58 numpy/testing/utils.py
19120 2017-06-28 20:58 numpy/testing/nosetester.py
13834 2017-06-28 20:58 numpy/testing/noseclasses.py
0 2017-06-28 20:58 numpy/testing/__pycache__/
713 2017-06-28 20:58 numpy/testing/__pycache__/__init__.cpython-34.pyc
9827 2017-06-28 20:58 numpy/testing/__pycache__/noseclasses.cpython-34.pyc
8572 2017-06-28 20:58 numpy/testing/__pycache__/decorators.cpython-34.pyc
67010 2017-06-28 20:58 numpy/testing/__pycache__/utils.cpython-34.pyc
2712 2017-06-28 20:58 numpy/testing/__pycache__/print_coercion_tables.cpython-34.pyc
15253 2017-06-28 20:58 numpy/testing/__pycache__/nosetester.cpython-34.pyc
805 2017-06-28 20:58 numpy/testing/__pycache__/setup.cpython-34.pyc
Why doesn't this work?
Answer 0 (score: 0)
I think I may have answered my own question. As noted in https://issues.apache.org/jira/browse/SPARK-6764, pyspark imports the contents of --py-files using zipimport, and zipimport only supports .py, .pyc, and .pyo files. Notably, it does not support modules that require native code, such as numpy (whose compiled .so files are visible in the listing above).
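The limitation can be demonstrated without Spark at all. The sketch below builds a zip containing a pure-Python module and a fake .so file (the module names are made up for illustration); adding the zip to sys.path, as --py-files does on the executors, makes the .py importable while the .so is simply invisible to the import machinery:

```python
# Sketch: zipimport finds .py files inside a zip on sys.path, but never
# extension modules. "purelib" and "nativelib" are hypothetical names.
import importlib
import os
import sys
import tempfile
import zipfile

tmp = tempfile.mkdtemp()
zpath = os.path.join(tmp, "deps.zip")
with zipfile.ZipFile(zpath, "w") as zf:
    zf.writestr("purelib.py", "VALUE = 42\n")          # pure Python: supported
    zf.writestr("nativelib.so", b"not a real shared object")  # native: ignored

sys.path.insert(0, zpath)  # the same mechanism --py-files relies on

purelib = importlib.import_module("purelib")
print(purelib.VALUE)  # 42

try:
    importlib.import_module("nativelib")
except ImportError:
    print("nativelib cannot be imported from inside a zip")
```

This is exactly the pandas/numpy situation: pandas' __init__.py loads fine from reqs.zip, but numpy's compiled core never can.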