Error: Caused by: java.net.SocketTimeoutException: Accept timed out

Time: 2019-03-27 04:00:00

Tags: python python-3.x pyspark

I get an error when running pyspark in a Jupyter Notebook with Python 3.7, using the following code.

from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql import SQLContext
import pyspark as ps

conf = ps.SparkConf().setMaster("yarn-client").setAppName("sparK-mer")
conf.set("spark.executor.heartbeatInterval","3600s")
sc = SparkContext('local') 
sqlContext = SQLContext(sc)
from pyspark.mllib.linalg import Vector, Vectors
from nltk.stem.wordnet import WordNetLemmatizer
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec

I am reading a csv file with the following code:

datanew = sqlContext.read.format("csv") \
   .options(header='true', inferschema='true') \
   .load("C://Users//mypath//data.csv")

parts = datanew.rdd.map(lambda l: l.split(","))
datapysp = parts.map(lambda p: Row(uiid=p[0],title=(p[3].strip()),text=(p[4].strip())))
schemaString = "uiid title text"
fields = [StructField(field_name, StringType(), True) for  field_name in schemaString.split()]
schema = StructType(fields)
sqlContext.createDataFrame(datapysp, schema).show()

This is the error message I am receiving. The columns referred to in the code are uiid, title and text.

Py4JJavaError: An error occurred while calling o74.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
    at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
    at java.net.PlainSocketImpl.accept(Unknown Source)
    at java.net.ServerSocket.implAccept(Unknown Source)
    at java.net.ServerSocket.accept(Unknown Source)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)

I went through the answers provided here: Pyspark socket timeout exception after application running for a while. Based on those answers, I tried changing the code to this.

import pyspark as ps

conf = ps.SparkConf().setMaster("yarn-client").setAppName("sparK-mer")
conf.set("spark.executor.heartbeatInterval","3600s")
sc = ps.SparkContext('local[4]', '', conf=conf)

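On running the line sc = ps.SparkContext('local[4]', '', conf=conf), I get an error saying that the Java gateway process exited before sending its port number.

I also tried the following, but I still get the same Accept timed out error.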

parts = datanew.rdd.map(lambda l: l.split(","))
datapysp = parts.map(lambda p: Row(uiid=p[0], title=p[3].strip(), text=p[4].strip()))
schemaString = "uiid title text"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
sqlContext.createDataFrame(datapysp, schema).show().config("sqlContext.executor.heartbeatInterval", "10000s")
# added this, but the error is still not being resolved

I would appreciate it if someone could help me. I am using Windows 10 64-bit.

1 Answer:

Answer 0 (score: 0)

According to this website

  

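spark.executor.heartbeatInterval (default: 10s): Interval between each executor's heartbeats to the driver. Heartbeats let the driver know that the executor is still alive and update it with metrics for in-progress tasks.

Given this, I believe the problem lies in the spark.executor.heartbeatInterval part of your code. I suggest you increase spark.executor.heartbeatInterval.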
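
As a rough sketch of how the interval could be set so that it actually takes effect: in the question's first snippet, conf is built but never passed to SparkContext, so the setting is ignored there. The Spark documentation also says the heartbeat interval should be significantly less than spark.network.timeout, whose default is 120s, so the two are set together below. The helper names (to_seconds, timeout_settings, make_context) and the 60s/600s values are illustrative assumptions, the duration parser covers only a subset of the suffixes Spark accepts, and nothing here is confirmed to fix the Accept timed out error.

```python
import re

# Simplified subset of Spark's duration suffixes, in seconds.
# Assumption: the real parser accepts more forms (for example "us").
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "min": 60, "h": 3600, "d": 86400}


def to_seconds(duration):
    """Convert a duration string such as '10s' or '2min' to seconds."""
    match = re.fullmatch(r"(\d+)\s*(ms|s|min|m|h|d)", duration.strip())
    if match is None:
        raise ValueError("unsupported duration: %r" % duration)
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def timeout_settings(heartbeat="60s", network_timeout="600s"):
    """Return both settings, rejecting a heartbeat that is not below the timeout."""
    if to_seconds(heartbeat) >= to_seconds(network_timeout):
        raise ValueError(
            "spark.executor.heartbeatInterval must be smaller than "
            "spark.network.timeout"
        )
    return {
        "spark.executor.heartbeatInterval": heartbeat,
        "spark.network.timeout": network_timeout,
    }


def make_context(settings, master="local[4]", app_name="sparK-mer"):
    """Create a SparkContext that really receives the settings (needs pyspark)."""
    from pyspark import SparkConf, SparkContext  # imported lazily on purpose

    conf = SparkConf().setMaster(master).setAppName(app_name)
    for key, value in settings.items():
        conf.set(key, value)
    # Passing conf here is the step missing from the question's first snippet.
    return SparkContext(conf=conf)
```

Calling make_context(timeout_settings()) would then replace the separate SparkConf and SparkContext lines. With the question's value of 3600s and the default 120s network timeout, timeout_settings("3600s", "120s") raises a ValueError, which illustrates why a very large heartbeat on its own may not help.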