Exception when using Apache Spark on a large dataset

Date: 2016-09-29 19:46:42

Tags: hadoop apache-spark pyspark

When I run some maps and filters with PySpark 1.4 on a relatively large dataset (~1 MB) on my local Windows machine, I get the exception below. I have also tried Apache Spark 1.6 and 2.0 and get exactly the same exception.

However, if my dataset is small (~100 rows), the code runs fine.
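Roughly, the relevant part of my script (c:/sparklocal/py/test.py) looks like this; the filter/map bodies below are simplified placeholders, not my exact logic, but the shape is the same:

    from pyspark import SparkConf, SparkContext

    # Local run, same layout as in the traceback (spark-1.4.1-bin-hadoop2.6).
    conf = SparkConf().setMaster("local[*]").setAppName("test")
    sc = SparkContext(conf=conf)

    tv_platform_textslines = sc.textFile("c:/sparklocal/data/in/Emp/est12.txt")

    # Line 27 of test.py in the traceback below: this is where the job
    # aborts on the large file.
    header = tv_platform_textslines.first()

    # Placeholder transformations standing in for my real maps/filters.
    rows = (tv_platform_textslines
            .filter(lambda line: line != header)
            .map(lambda line: line.split(",")))
    print(rows.count())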

I have Apache Spark installed on my laptop, which has plenty of disk space and RAM. I tried increasing Spark's memory allocation, but it still throws the error. Please help!
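Concretely, this is the kind of memory bump I tried (the 4g value here is illustrative):

    # spark.driver.memory is read before the driver JVM starts, so in
    # local mode it has to be passed on the command line (or set in
    # conf/spark-defaults.conf), not in SparkConf inside the script:
    #
    #   spark-submit --driver-memory 4g c:/sparklocal/py/test.py
    #
    # conf/spark-defaults.conf equivalent:
    #   spark.driver.memory   4g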

   Input File: c:/sparklocal/data/in/Emp/est12.txt
16/09/29 15:32:58 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(Unknown Source)
        at java.net.SocketOutputStream.write(Unknown Source)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.flush(Unknown Source)
        at java.io.DataOutputStream.flush(Unknown Source)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:251)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
16/09/29 15:32:58 ERROR PythonRDD: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(Unknown Source)
        at java.net.SocketOutputStream.write(Unknown Source)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.flush(Unknown Source)
        at java.io.DataOutputStream.flush(Unknown Source)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:251)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
16/09/29 15:32:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(Unknown Source)
        at java.net.SocketOutputStream.write(Unknown Source)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.flush(Unknown Source)
        at java.io.DataOutputStream.flush(Unknown Source)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:251)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
16/09/29 15:32:58 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "c:/sparklocal/py/test.py", line 27, in <module>
    header = tv_platform_textslines.first()
  File "c:\sparklocal\spark\spark-1.4.1-bin-hadoop2.6\python\lib\pyspark.zip\pys
park\rdd.py", line 1295, in first
  File "c:\sparklocal\spark\spark-1.4.1-bin-hadoop2.6\python\lib\pyspark.zip\pys
park\rdd.py", line 1277, in take
  File "c:\sparklocal\spark\spark-1.4.1-bin-hadoop2.6\python\lib\pyspark.zip\pys
park\context.py", line 897, in runJob
  File "c:\sparklocal\spark\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-sr
c.zip\py4j\java_gateway.py", line 538, in __call__
  File "c:\sparklocal\spark\spark-1.4.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-sr
c.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.
api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in s
tage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0,
 localhost): java.net.SocketException: Connection reset by peer: socket write er
ror
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(Unknown Source)
        at java.net.SocketOutputStream.write(Unknown Source)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.flush(Unknown Source)
        at java.io.DataOutputStream.flush(Unknown Source)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.app
ly(PythonRDD.scala:251)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scal
a:208)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

0 Answers:

No answers