" Python工作者没有及时连接"从pycharm运行pyspark时出错

Time: 2018-03-22 06:22:18

Tags: pyspark pycharm

I am new to Python and am currently trying to write unit tests for Spark in Python. I have PyCharm installed on Windows and use Anaconda as the interpreter, with Python 3.5 and Spark 2.1.0. I am just writing a simple word-count program as a test. What I see is that the following code runs completely fine:

mylist= ["the" , "earth" , "revolves" , "arround" , "sun"]
rdd=sc.parallelize(mylist)
output=rdd.collect
print(output)

The problem starts as soon as I apply any transformation: even a basic one such as map or flatMap fails with the error below.
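For reference, a minimal sketch of the failing call, reconstructed from the WordCount.py traceback further down (so the exact code is an assumption, not a verbatim copy of my script):

    # Reconstructed from the traceback below (hypothetical): any transformation
    # that forces Spark to launch a Python worker, such as map or flatMap,
    # hits the accept timeout.
    output = rdd.map(lambda x: (x, 1)).collect()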

18/03/22 11:36:07 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): org.apache.spark.SparkException: Python worker did not connect back in time
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:138)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:67)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
    at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    at java.net.PlainSocketImpl.accept(Unknown Source)
    at java.net.ServerSocket.implAccept(Unknown Source)
    at java.net.ServerSocket.accept(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)
    ... 12 more
18/03/22 11:36:07 ERROR TaskSetManager: Task 3 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "C:\Users\<username>\PycharmProjects\PyUnittest\Testing\WordCount.py", line 30, in <module>
    main()
  File "C:\Users\<username>\PycharmProjects\PyUnittest\Testing\WordCount.py", line 26, in main
    output=rdd.map(lambda x: (x,1)).collect()
  File "C:\Users\<username>\spark\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 809, in collect
  File "C:\Users\<username>\spark\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "C:\Users\<username>\spark\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): org.apache.spark.SparkException: Python worker did not connect back in time
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:138)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:67)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
    at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    at java.net.PlainSocketImpl.accept(Unknown Source)
    at java.net.ServerSocket.implAccept(Unknown Source)
    at java.net.ServerSocket.accept(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)
    ... 12 more

Any help would be appreciated. Thanks.

1 Answer:

Answer 0 (score: 0):

If you are using Anaconda, try shutting down the kernel.
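Not part of the original answer, but a workaround commonly reported for this exact error on Windows is to pin the Python workers to the same interpreter as the driver before the SparkContext is created. A minimal sketch, assuming the script runs in local mode:

    import os
    import sys

    # Commonly reported workaround (not from the original answer): make the
    # Python workers use the same interpreter that PyCharm runs the driver with.
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

    from pyspark import SparkContext
    sc = SparkContext("local[*]", "WordCount")

The same variables can also be set in the PyCharm run configuration's environment settings, which leaves the script itself unchanged.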

Since you are using YARN, I would pay particular attention to the "Debugging your Application" section of spark.apache.org/docs/latest/running-on-yarn.html.

Read: SparkException: Python worker did not connect back in time