Python worker failed to connect back in PySpark / Spark version 2.3.1

Date: 2019-05-20 03:25:34

Tags: apache-spark pyspark

After installing Anaconda3 and Spark (2.3.2), I tried to run a sample PySpark program.

It is just a small example that I run through Jupyter, and I get an error like:

Python worker failed to connect back.

In the following Stack Overflow question:

Python worker failed to connect back

I can see an answer along these lines: "I had the same error. I solved it by installing an older version of Spark (2.3 instead of 2.4). Now it works perfectly; maybe it is an issue with the latest version of pyspark."

However, I am already using Spark version 2.3.1 with Python 3.7,

and I am still facing this issue. Please help me resolve this error.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("mySparkApp").getOrCreate()
testData = spark.sparkContext.parallelize([3, 8, 2, 5])
testData.count()

The traceback is:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 6, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

1 answer:

Answer 0 (score: 0):

Set the environment variables as follows:

  • PYSPARK_DRIVER_PYTHON=jupyter
  • PYSPARK_DRIVER_PYTHON_OPTS=notebook
  • PYSPARK_PYTHON=python

The core of the issue is the connection between PySpark and the Python worker processes; it can be resolved by pointing these variables at the right interpreter.
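
As an illustration, here is a minimal sketch of applying equivalent settings from inside the notebook itself (this assumes local mode and that the variables are set before the first SparkSession is created; the appName and the sample RDD are taken from the question above):

import os
import sys
from pyspark.sql import SparkSession

# Point the Python worker (and driver) processes at the same interpreter
# that runs this notebook, so the worker can connect back to the JVM.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = SparkSession.builder.appName("mySparkApp").getOrCreate()
testData = spark.sparkContext.parallelize([3, 8, 2, 5])
print(testData.count())  # should print 4 once the workers start correctly

Alternatively, the same variables can be exported in the shell (or in the Windows system environment) before launching Jupyter, which is what the answer above describes.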