I have a Python script with the following third-party dependencies:
import boto3
from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArchiveLoadFailed
import requests
import botocore
from requests_file import FileAdapter
....
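I installed the dependencies with pip and confirmed with pip list that they were installed correctly. Then, when I tried to submit the job to Spark, I got the following error: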
ImportError: No module named 'boto3'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:395)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The "No module named" error occurs not only for 'boto3' but for the other modules as well.
I tried the following:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
and also ran pip install py4j.
Answer 0: (score: 2)
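Before running spark-submit, try opening a Python shell and importing the modules there. Also check which Python interpreter opens by default (check the Python path).
If you can import these modules successfully in the Python shell (using the same Python version that spark-submit is trying to use), check the following:
In which mode are you submitting the application? Try standalone, or try client mode.
Also try adding export PYSPARK_PYTHON=(your python path), so that Spark runs its Python workers with the interpreter where the packages are installed.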
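The import check described above can be scripted rather than typed into a shell by hand. The following is only a minimal sketch: the helper name check_modules is made up for illustration, and the module list is taken from the imports in the question. Run it with the same interpreter you expect Spark to use; it prints the interpreter path and reports which modules fail to import.

```python
import importlib
import sys


def check_modules(names):
    """Return {module name: True/False} for importability in this interpreter.

    Hypothetical helper for illustration; not part of Spark or PySpark.
    """
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


if __name__ == "__main__":
    # The interpreter that is actually running this check.
    print("interpreter:", sys.executable)
    # Top-level packages behind the imports listed in the question.
    required = ["boto3", "warcio", "requests", "botocore", "requests_file"]
    for name, ok in check_modules(required).items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If everything reports ok here but the job still fails, the executors are probably running a different interpreter than the one pip installed into, which is what the PYSPARK_PYTHON suggestion addresses.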
Answer 1: (score: 2)
All the checks mentioned above passed, but setting PYSPARK_PYTHON solved the issue for me.
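For reference, PYSPARK_PYTHON can also be set from inside the driver script instead of with export. This is a rough sketch under some assumptions: the job runs in client mode, the same interpreter path exists on every worker node, and the variable is set before the SparkContext is created. The helper name pin_pyspark_python is hypothetical.

```python
import os
import sys


def pin_pyspark_python(python_path=None):
    """Set PYSPARK_PYTHON and return the chosen path.

    Hypothetical helper for illustration. Defaults to the interpreter
    running this script, i.e. the one whose pip list shows the packages.
    """
    chosen = python_path or sys.executable
    os.environ["PYSPARK_PYTHON"] = chosen
    return chosen


chosen = pin_pyspark_python()
print("PYSPARK_PYTHON =", chosen)
# Assumption: only after this point would the job create its SparkContext,
# so that the Python workers are launched with the interpreter chosen above.
```

Setting the variable in the shell with export, or in conf/spark-env.sh, has the same intent and does not depend on the order of statements in the script.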