Pyspark: Connection reset when using toLocalIterator

Asked: 2019-03-21 16:30:13

Tags: python apache-spark pyspark

I am trying to iterate locally over RDD data in pyspark with a loop like this:

    for source_row in rdd.toLocalIterator():
        print(source_row)

but I get this error:

    19/03/21 17:01:36 ERROR PythonRDD: Error while sending iterator
    java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:515)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:527)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:527)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:527)
        at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:728)
        at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:728)
        at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:728)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1340)
        at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:729)

Using

    rdd.collect()

works well on small tables, but for large datasets it uses too much memory.

I have already tried increasing spark.driver.memory, spark.executor.memory, and spark.executor.heartbeatInterval.
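For reference, those settings can be raised either per job on spark-submit or in the SparkConf. A minimal sketch, with illustrative values only (the script name and the numbers are assumptions, not recommendations); note that spark.network.timeout must be larger than spark.executor.heartbeatInterval, or Spark will refuse the configuration:

```shell
# Illustrative values only -- tune to the cluster and dataset.
# "my_job.py" is a placeholder for the actual application script.
spark-submit \
  --conf spark.driver.memory=8g \
  --conf spark.executor.memory=8g \
  --conf spark.executor.heartbeatInterval=60s \
  --conf spark.network.timeout=600s \
  my_job.py
```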

Spark version: 2.2.1

How can I fix it?

Here is the minimal code I am using:

iterator = rdd.toLocalIterator()
next(iterator)

It fails with the same error as the stack trace above.
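One possible workaround, since toLocalIterator() keeps a long-lived socket open between the JVM and the Python driver, is to pull one partition at a time with collect() instead. This is only a sketch under assumptions: it presumes an already-created RDD named rdd, and iterate_partitions is a hypothetical helper, not part of the PySpark API. Driver memory is then bounded by the largest single partition rather than the whole dataset:

```python
# Sketch: iterate an RDD partition by partition instead of using
# toLocalIterator(). Assumes an existing RDD `rdd`; the helper name
# `iterate_partitions` is invented for this example.
def iterate_partitions(rdd):
    for part_id in range(rdd.getNumPartitions()):
        # Keep only the rows of the target partition; collect() then
        # ships just that partition to the driver.
        part = rdd.mapPartitionsWithIndex(
            lambda idx, it, target=part_id: it if idx == target else iter([])
        ).collect()
        for row in part:
            yield row

for source_row in iterate_partitions(rdd):
    print(source_row)
```

Each pass launches a separate Spark job, so this trades scheduling overhead for bounded memory and a short-lived connection per partition.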

0 Answers:

No answers yet.