I am trying to read a file (a ~600MB CSV) with PySpark, but I am getting the error below.
Surprisingly, the same code works fine in Scala.
I found this issue page, https://issues.apache.org/jira/browse/SPARK-12261, but what is suggested there did not work for me.
Code to read the file:
import os
from pyspark import SparkContext
from pyspark import SparkConf

datasetDir = 'D:\\Datasets\\movieLens\\ml-latest\\'
ratingFile = 'ratings.csv'

# Local mode with two worker threads
conf = SparkConf().setAppName("movie_recommendation-server").setMaster('local[2]')
sc = SparkContext(conf=conf)

# Read the ~600MB ratings file and print its first line
ratingRDD = sc.textFile(os.path.join(datasetDir, ratingFile))
print(ratingRDD.take(1)[0])
This is the error I get:
16/04/25 09:00:04 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
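Since the trace shows the Python worker dying while PythonRDD streams data to it, is this just a memory problem on the worker side? Below is a sketch of what I am considering trying next; the specific values (4g driver memory, 1g worker memory, 24 partitions) are guesses on my part, not settings I know to be correct:

import os

# Guess 1: give the driver JVM more heap before the context starts
# (in local mode the executors run inside the driver). The 4g value
# is an assumption, not a known-good number.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 4g pyspark-shell'

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("movie_recommendation-server")
        .setMaster('local[2]')
        # Guess 2: raise the Python worker memory limit; 1g is also a guess.
        .set('spark.python.worker.memory', '1g'))
sc = SparkContext(conf=conf)

# Guess 3: more partitions should shrink the amount of data streamed
# to each Python worker at once.
path = os.path.join('D:\\Datasets\\movieLens\\ml-latest\\', 'ratings.csv')
ratingRDD = sc.textFile(path, minPartitions=24)
print(ratingRDD.take(1)[0])

Would any of these address the underlying issue, or is this a Windows-specific bug in how the data is written to the Python worker's socket?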