Spark exception: Python in worker has a different version than in the driver (3.5)

Date: 2016-08-13 19:16:55

Tags: python apache-spark version cluster-computing

I am using Amazon EC2, with one instance serving as both the master and the development machine, and a separate instance for a single worker.

I am new to this, but I have managed to run Spark in standalone mode. Now I am trying to run it on the cluster. Both the master and the worker are active (I can see in the web UI that they are running).

I have Spark 2.0, and I have installed the latest Anaconda 4.1.1, which ships with Python 3.5.2. On both the worker and the master, if I open pyspark and check sys.version_info, I get 3.5.2. I have also set all the environment variables correctly (e.g., PYSPARK_PYTHON), as described in other posts on Stack Overflow and Google.
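For reference, a minimal way to compare the driver-side and worker-side interpreters from the pyspark shell (a sketch, assuming sc is already defined as it is in the shell): checking sys.version_info directly only shows the driver's Python, while running it inside a task shows the worker's. If the versions really do differ, the action below fails with the same mismatch exception, which at least confirms the worker-side interpreter is the odd one out.

import sys

# Python version of the driver process (the pyspark shell itself)
print("driver :", "%d.%d" % sys.version_info[:2])

# Python version of the executor processes: the lambda runs on the worker,
# so sys.version_info there reflects the worker-side interpreter
print("workers:", sc.parallelize([1]).map(lambda _: "%d.%d" % sys.version_info[:2]).collect())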

In any case, there is no Python 3.4 anywhere that I can see, so I am wondering how to fix this.

I get the error by running the following commands:

rdd = sc.parallelize([1,2,3])
rdd.count()    
The error occurs in the count() call.

Error:

16/08/13 18:44:31 ERROR Executor: Exception in task 1.0 in stage 2.0 (TID 17)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.4 than that in driver 3.5, PySpark cannot run with different minor versions

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/08/13 18:44:31 ERROR Executor: Exception in task 1.1 in stage 2.0 (TID 18)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.4 than that in driver 3.5, PySpark cannot run with different minor versions
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2 Answers:

Answer 0 (score: 4)

Since you are already using Anaconda, you can simply create an environment with the required Python version:

conda create --name foo python=3.4
source activate foo

python --version
## Python 3.4.5 :: Continuum Analytics, Inc

and use it as PYSPARK_DRIVER_PYTHON:

export PYSPARK_DRIVER_PYTHON=/path/to/anaconda/envs/foo/bin/python
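A quick sanity check after relaunching pyspark with this variable set (a sketch, assuming the worker really is running Python 3.4): the driver should now report a matching minor version, and the original failing job should run through.

import sys

# With PYSPARK_DRIVER_PYTHON pointing at the 3.4 env, the driver now matches the worker
print(sys.version_info[:2])   # expected: (3, 4)

# The job from the question should now complete without the version-mismatch error
rdd = sc.parallelize([1, 2, 3])
print(rdd.count())            # expected: 3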

Answer 1 (score: 0)

I had the same problem. There are several possible causes:

  • One of your workers has Python 3.4 and you did not notice it. Go to each worker and check its Python version. That is, if you set PYSPARK_PYTHON=python3, go to each worker, run python3 and check the version it reports (see the sketch after this list).

  • You are connecting to the wrong workers. Check your configuration and confirm which machines your workers actually are.
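A sketch for checking the first point from the pyspark shell (assuming sc exists there): it maps each worker host to the Python version its executors actually run. Note that a host still on a mismatched minor version makes the job fail with the same exception, so this is mainly useful to confirm the fix once the interpreters are aligned.

import socket
import sys

# Launch several small tasks so that, ideally, every worker host runs at least one,
# then record the host name and the Python version of the executor that ran it
per_host_versions = (
    sc.parallelize(range(100), 20)
      .map(lambda _: (socket.gethostname(), "%d.%d" % sys.version_info[:2]))
      .distinct()
      .collect()
)
print(per_host_versions)   # e.g. [('worker-1', '3.5'), ('worker-2', '3.5')]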

It took me more than ten hours to track down this same problem, and in my case the root cause turned out to be connecting to the wrong workers... TAT