I am trying to iterate over RDD data locally in PySpark, using a loop like:

for source_row in rdd.toLocalIterator():
    print(source_row)

but I get this error:

19/03/21 17:01:36 ERROR PythonRDD: Error while sending iterator
java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:515)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:527)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:527)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:527)
at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:728)
at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:728)
at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:728)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1340)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:729)
When I use rdd.collect() it works fine on small tables, but for large datasets it uses too much memory. I have already tried increasing spark.driver.memory, spark.executor.memory, and spark.executor.heartbeatInterval.
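For reference, a sketch of how the executor options can be set when building the session; the values are placeholders, not my actual configuration. Note that spark.driver.memory normally has to be set before the driver JVM starts, e.g. via spark-submit --driver-memory or spark-defaults.conf.

from pyspark.sql import SparkSession

# Placeholder values, not the actual configuration.
# spark.driver.memory is omitted here because it generally must be set
# before the driver JVM launches (spark-submit --driver-memory, or
# spark-defaults.conf), not on an already-running session.
spark = (
    SparkSession.builder
    .appName("example")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.heartbeatInterval", "60s")
    .getOrCreate()
)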
Spark version: 2.2.1
How can I fix this?
Here is the minimal code I am using:
iterator = rdd.toLocalIterator()
next(iterator)
and I get the same error as above.
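For completeness, a self-contained stand-in for this repro; the parallelized range is only a placeholder, since the real rdd is backed by a much larger dataset:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toLocalIterator-repro").getOrCreate()

# Placeholder data standing in for the real, much larger dataset.
rdd = spark.sparkContext.parallelize(range(1000), 8)

# next() forces the first partition to be computed and streamed back
# to the driver over the JVM-to-Python socket.
iterator = rdd.toLocalIterator()
print(next(iterator))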