EOFError in pyspark

Time: 2018-10-24 22:05:55

Tags: python pyspark

I am trying to run the following commands from a Zeppelin notebook.

%livy.pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark import SparkConf

data = spark.createDataFrame([(1, 0, 10, 0, 1)], ["segment", "games", "time", "spend", "label"])
print(data.head())

But I get the following error:

18/10/24 21:55:46 INFO DAGScheduler: Missing parents: List()
18/10/24 21:55:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[8] at head at <stdin>:1), which has no missing parents
18/10/24 21:55:46 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 6.2 KB, free 3.0 GB)
18/10/24 21:55:46 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on node28-24-196.eadpdata.ddns.ea.com:45610 (size: 6.2 KB, free: 3.0 GB)
18/10/24 21:55:46 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1039
18/10/24 21:55:46 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[8] at head at <stdin>:1) (first 15 tasks are for partitions Vector(0))
18/10/24 21:55:46 INFO YarnClusterScheduler: Adding task set 0.0 with 1 tasks
18/10/24 21:55:46 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, node28-24-41.eadpdata.ddns.ea.com, executor 3, partition 0, PROCESS_LOCAL, 7828 bytes)
18/10/24 21:55:46 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on node28-24-41.eadpdata.ddns.ea.com:35873 (size: 6.2 KB, free: 3.0 GB)
18/10/24 21:55:48 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, node28-24-41.eadpdata.ddns.ea.com, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/yarn/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
 command = pickleSer._read_with_length(infile)
File "/home/yarn/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
 return self.loads(obj)
File "/home/yarn/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
 return pickle.loads(obj)
EOFError

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

I am using Python 2.7.5 and Spark 2.3.0. Does anyone know what is going wrong and how to fix it?
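For context (not part of the original question): an EOFError raised at pickle.loads inside pyspark/worker.py is often a symptom of the Python interpreter launched on the executors differing from the one used by the driver, or of the worker process dying before it can read the serialized task. Below is a minimal diagnostic sketch, assuming the Livy-provided spark session is available in the notebook, that prints the Python version on both sides.

# Diagnostic sketch (assumption, not from the original post): compare the driver's
# Python interpreter with the one the executors actually run.
# `spark` is assumed to be the SparkSession that Livy injects into the notebook.
import sys

# Python version on the driver side
print("driver python:", sys.version)

def executor_python(_):
    # Runs inside the PySpark worker process on an executor
    import sys
    return sys.version

# Python version on the executor side, collected via a trivial one-partition job
print("executor python:",
      spark.sparkContext.parallelize([0], numSlices=1).map(executor_python).collect())

If the two versions differ, pointing the executors at the same interpreter (for example via the PYSPARK_PYTHON environment variable in the Spark configuration) is the usual remedy; this is an assumption about the cluster setup, not something stated in the question.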

0 Answers:

There are no answers yet.