PySpark: recursive import from a zip of py-files

Date: 2017-06-28 21:02:41

Tags: apache-spark pyspark

I have a PySpark job with a lot of dependencies, so I ran pip install -t reqs -r requirements.txt and then zipped up the reqs directory. I pass that zip to --py-files so it is available on all the nodes.

One of my files imports pandas, and that import succeeds from the zip file. However, pandas then imports numpy, which is not found, even though numpy is also in that zip file:

File "./spark-job.zip/model/transforms/parser.py", line 3, in <module>
    import pandas as pd
File "./reqs.zip/pandas/__init__.py", line 19, in <module>
    "Missing required dependencies {0}".format(missing_dependencies))
ImportError: Missing required dependencies ['numpy']

Note that pandas is found in reqs.zip but numpy is not, even though numpy is in the zip file too. I'm quite sure it's there:

$ unzip -l reqs.zip | grep pandas | head -10
        0  2017-06-28 20:58   pandas-0.20.2.dist-info/
    97277  2017-06-28 20:58   pandas-0.20.2.dist-info/RECORD
      109  2017-06-28 20:58   pandas-0.20.2.dist-info/WHEEL
        7  2017-06-28 20:58   pandas-0.20.2.dist-info/top_level.txt
     1102  2017-06-28 20:58   pandas-0.20.2.dist-info/metadata.json
        4  2017-06-28 20:58   pandas-0.20.2.dist-info/INSTALLER
     3471  2017-06-28 20:58   pandas-0.20.2.dist-info/DESCRIPTION.rst
     4452  2017-06-28 20:58   pandas-0.20.2.dist-info/METADATA
        0  2017-06-28 20:58   pandas/
        0  2017-06-28 20:58   pandas/formats/

$ unzip -l reqs.zip | grep numpy | head -40
        0  2017-06-28 20:58   numpy-1.13.0.dist-info/
    52437  2017-06-28 20:58   numpy-1.13.0.dist-info/RECORD
      109  2017-06-28 20:58   numpy-1.13.0.dist-info/WHEEL
        6  2017-06-28 20:58   numpy-1.13.0.dist-info/top_level.txt
     1330  2017-06-28 20:58   numpy-1.13.0.dist-info/metadata.json
        4  2017-06-28 20:58   numpy-1.13.0.dist-info/INSTALLER
      884  2017-06-28 20:58   numpy-1.13.0.dist-info/DESCRIPTION.rst
     2217  2017-06-28 20:58   numpy-1.13.0.dist-info/METADATA
    22668  2017-06-28 20:58   sklearn/externals/joblib/numpy_pickle_utils.py
     8440  2017-06-28 20:58   sklearn/externals/joblib/numpy_pickle_compat.py
    23222  2017-06-28 20:58   sklearn/externals/joblib/numpy_pickle.py
     7468  2017-06-28 20:58   sklearn/externals/joblib/__pycache__/numpy_pickle_compat.cpython-34.pyc
    16545  2017-06-28 20:58   sklearn/externals/joblib/__pycache__/numpy_pickle_utils.cpython-34.pyc
    15643  2017-06-28 20:58   sklearn/externals/joblib/__pycache__/numpy_pickle.cpython-34.pyc
    10508  2017-06-28 20:58   scipy/_lib/__pycache__/_numpy_compat.cpython-34.pyc
    11513  2017-06-28 20:58   scipy/_lib/_numpy_compat.py
        0  2017-06-28 20:58   pandas/compat/numpy/
    12344  2017-06-28 20:58   pandas/compat/numpy/function.py
        0  2017-06-28 20:58   pandas/compat/numpy/__pycache__/
     2399  2017-06-28 20:58   pandas/compat/numpy/__pycache__/__init__.cpython-34.pyc
    10150  2017-06-28 20:58   pandas/compat/numpy/__pycache__/function.cpython-34.pyc
     2213  2017-06-28 20:58   pandas/compat/numpy/__init__.py
        0  2017-06-28 20:58   numpy/
        0  2017-06-28 20:58   numpy/.libs/
 38513408  2017-06-28 20:58   numpy/.libs/libopenblasp-r0-39a31c03.2.18.so
  1023960  2017-06-28 20:58   numpy/.libs/libgfortran-ed201abd.so.3.0.0
        0  2017-06-28 20:58   numpy/testing/
     2705  2017-06-28 20:58   numpy/testing/print_coercion_tables.py
     8036  2017-06-28 20:58   numpy/testing/decorators.py
    75541  2017-06-28 20:58   numpy/testing/utils.py
    19120  2017-06-28 20:58   numpy/testing/nosetester.py
    13834  2017-06-28 20:58   numpy/testing/noseclasses.py
        0  2017-06-28 20:58   numpy/testing/__pycache__/
      713  2017-06-28 20:58   numpy/testing/__pycache__/__init__.cpython-34.pyc
     9827  2017-06-28 20:58   numpy/testing/__pycache__/noseclasses.cpython-34.pyc
     8572  2017-06-28 20:58   numpy/testing/__pycache__/decorators.cpython-34.pyc
    67010  2017-06-28 20:58   numpy/testing/__pycache__/utils.cpython-34.pyc
     2712  2017-06-28 20:58   numpy/testing/__pycache__/print_coercion_tables.cpython-34.pyc
    15253  2017-06-28 20:58   numpy/testing/__pycache__/nosetester.cpython-34.pyc
      805  2017-06-28 20:58   numpy/testing/__pycache__/setup.cpython-34.pyc

Why isn't this working?

1 answer:

Answer 0 (score: 0)

I think I may have answered my own question. As described in https://issues.apache.org/jira/browse/SPARK-6764, PySpark uses zipimport to load the contents of --py-files, and zipimport only supports .py, .pyc, and .pyo files. Notably, it does not support modules that require native code, such as numpy.
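The limitation can be seen outside of Spark, since zipimport kicks in whenever a zip is placed on sys.path. In this sketch (puremod is a made-up module name), a pure-Python module loads fine from a zip; a compiled extension like numpy's .so files in the same archive would not, because zipimport only reads Python source and bytecode entries:

```python
import os
import sys
import tempfile
import zipfile

# Build a zip containing a single pure-Python module.
zip_path = os.path.join(tempfile.mkdtemp(), "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("puremod.py", "ANSWER = 42\n")

# Putting the zip on sys.path makes zipimport handle imports from it.
sys.path.insert(0, zip_path)
import puremod

print(puremod.ANSWER)  # -> 42: pure-Python source inside a zip works.
# A native extension (e.g. numpy's .so files) stored in the same zip
# would raise ImportError: zipimport cannot load shared objects.
```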