spark-submit fails to detect modules installed with pip

Date: 2017-10-13 16:15:25

Tags: python-3.x apache-spark pip pyspark

I have a Python script with the following third-party dependencies:

import boto3
from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArchiveLoadFailed
import requests
import botocore
from requests_file import FileAdapter
....

I installed the dependencies with pip and confirmed via pip list that they were installed correctly. But when I submit the job to Spark, I get the following error:

ImportError: No module named 'boto3'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:395)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

The 'No module named' problem occurs not only for 'boto3' but for other modules as well.
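In practice this error usually means the Python interpreter that Spark's executors run is not the one pip installed the packages into. A quick diagnostic sketch (the interpreter name `python3` is an assumption; adjust to your setup):

```shell
# Which interpreter is on PATH, and where does it actually live?
command -v python3
python3 -c "import sys; print(sys.executable)"

# Check whether boto3 is visible to that interpreter; prints the module
# path on success, or a warning if it is not importable from there.
python3 - <<'EOF'
import importlib.util
spec = importlib.util.find_spec("boto3")
print(spec.origin if spec else "boto3 NOT visible to this interpreter")
EOF
```

If the path printed here differs from the interpreter Spark uses on the workers, the imports will fail on the cluster even though `pip list` looks fine locally.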

I have tried the following:

  1. Added SparkContext.addPyFile(".zip file")
  2. Used spark-submit --py-files
  3. Reinstalled pip
  4. Made sure the path environment variables were set with export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH and installed py4j with pip install py4j
  5. Used python instead of spark-submit

Software information:

    • Python version: 3.4.3
    • Spark version: 2.2.0
    • Running on AWS EMR: Linux release 2017.09
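Since all the listed dependencies are pure Python, one workaround is to ship them with the job rather than relying on each node's site-packages. A minimal sketch of the --py-files approach; the folder name `deps/` and script name `my_job.py` are illustrative, not from the question:

```shell
# Install the pure-Python dependencies into a local folder.
pip install -t deps boto3 warcio requests requests-file

# Zip the *contents* of deps/ so packages sit at the zip root,
# which is the layout zipimport (and hence --py-files) expects.
(cd deps && python3 -m zipfile -c ../deps.zip .)

# Ship the zip to the driver and every executor along with the job.
spark-submit --py-files deps.zip my_job.py
```

Note this only works for pure-Python packages; anything with compiled extensions must be installed on every worker node (e.g. via an EMR bootstrap action).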

2 Answers:

Answer 0 (score: 2)

Before running spark-submit, open a Python shell and try importing the modules there. Also check which Python shell (which Python path) opens by default.

If you can import these modules successfully in the Python shell (the same Python version that spark-submit uses), check the following:

In which mode are you submitting the application? Try standalone mode, or try client mode. Also try adding export PYSPARK_PYTHON=(your python path).
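The PYSPARK_PYTHON suggestion can be sketched as follows; `/usr/bin/python3` is a placeholder and should point at the interpreter whose site-packages actually contains boto3 (check with `pip show boto3`):

```shell
# Pin the interpreter used by the executors and by the driver.
# Replace /usr/bin/python3 with the interpreter pip installed into.
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3

spark-submit --deploy-mode client my_job.py
```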

Answer 1 (score: 2)

All the checks mentioned above were fine, but setting PYSPARK_PYTHON solved the issue for me.
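For reference, the same interpreter pinning can be passed per-submission instead of via shell exports; the `spark.pyspark.python` and `spark.pyspark.driver.python` properties are honored by Spark 2.1 and later, so they apply to the Spark 2.2.0 setup above (the path is again a placeholder):

```shell
# Config-only equivalent of exporting PYSPARK_PYTHON /
# PYSPARK_DRIVER_PYTHON; no environment variables needed.
spark-submit \
  --conf spark.pyspark.python=/usr/bin/python3 \
  --conf spark.pyspark.driver.python=/usr/bin/python3 \
  my_job.py
```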