Shipping Python dependencies with spark-submit

Asked: 2017-05-03 13:01:08

Tags: python python-3.x apache-spark amazon-emr spark-submit

I'm new to the Spark world, and I'm trying to run some tests on an Amazon EMR cluster with Spark 2.1.0 and Python 3.5.

To do that, I created a virtual environment with conda and zipped its site-packages, which contain all the dependencies my script needs, but I can't get the job to run on the cluster in YARN mode.
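For reference, the zipping step described above can be sketched with the standard library alone (the conda environment name and path below are assumptions; adjust them to your setup):

```python
import os
import zipfile

# Hypothetical site-packages path for a conda env named "sparkenv" (Python 3.5).
site_packages = os.path.expanduser(
    "~/miniconda3/envs/sparkenv/lib/python3.5/site-packages")

# Walk site-packages and store every file with a path relative to its root,
# so that "import foo" resolves to foo/__init__.py at the top of the archive.
with zipfile.ZipFile("dependencies.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk(site_packages):
        for name in files:
            full = os.path.join(root, name)
            zf.write(full, os.path.relpath(full, site_packages))
```

The important detail is that package directories must sit at the root of the zip, not nested under a `site-packages/` prefix, or the executors will not find them.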

I tried to launch the job with this spark-submit command:

PYSPARK_PYTHON=python3 spark-submit \
--master yarn \
--deploy-mode cluster \
--py-files filters.py,dependencies.zip \
sparker.py

but I got this import error:

ERROR Executor: Exception in task 0.3 in stage 1.0 (TID 9)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1493796156883_0020/container_1493796156883_0020_01_000002/pyspark.zip/pyspark/worker.py", line 174, in main process()
File "/mnt/yarn/usercache/hadoop/appcache/application_1493796156883_0020/container_1493796156883_0020_01_000002/pyspark.zip/pyspark/worker.py", line 169, in process serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1493796156883_0020/container_1493796156883_0020_01_000002/pyspark.zip/pyspark/serializers.py", line 138, in dump_stream for obj in iterator:
File "/mnt/yarn/usercache/hadoop/appcache/application_1493796156883_0020/container_1493796156883_0020_01_000001/pyspark.zip/pyspark/rdd.py", line 1541, in func
File "sparker.py", line 52, in applier
File "./dependencies.zip/cv2/__init__.py", line 7, in <module>
   from . import cv2
ImportError: cannot import name 'cv2'

I can see that Spark is looking in the right directory, but I don't understand why it fails to resolve the dependency.
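One likely explanation: `--py-files` adds the zip to `sys.path` on each executor, so imports are served by Python's `zipimport` machinery. That works for pure-Python modules, but `zipimport` cannot load compiled extension modules (`.so` files) from inside an archive, and `cv2` is exactly such a compiled extension. A minimal sketch of the pure-Python case that does work (the module name `puremod` is made up for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a zip containing a pure-Python module, then import it from the
# archive via sys.path, mirroring what --py-files does on the executors.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "dependencies.zip")

with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("puremod.py", "VALUE = 42\n")  # pure Python: importable from a zip

sys.path.insert(0, zip_path)
import puremod  # resolved by zipimport from inside the archive

print(puremod.VALUE)  # → 42
```

A package whose `__init__.py` does `from . import cv2` against a bundled `cv2.so` would fail at the same point as the traceback above, because the shared library cannot be loaded out of the zip.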

Any help is welcome! Simpler alternatives for launching a Python script (with its dependencies) on Spark are also very welcome!

Thanks!

0 Answers:

There are no answers yet.