When I run my Spark Python code as below:
import pyspark
conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)  # start the SparkContext with this conf
data = sc.textFile('/Users/tsangbosco/Downloads/transactions')
data = data.flatMap(lambda x: x.split()).take(all)
The file is about 20 GB and my computer has 8 GB of RAM. When I run the program in standalone mode, it raises an OutOfMemoryError:
Exception in thread "Local computation of job 12" java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:112)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at org.apache.spark.api.python.PythonRDD$$anon$1.to(PythonRDD.scala:112)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at org.apache.spark.api.python.PythonRDD$$anon$1.toBuffer(PythonRDD.scala:112)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at org.apache.spark.api.python.PythonRDD$$anon$1.toArray(PythonRDD.scala:112)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:681)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:666)
Is Spark unable to handle files larger than my RAM? Can you tell me how to fix this?
Answer 0 (score: 3)
Spark can handle this case. But you are using take, which forces Spark to fetch all of the data into an array in memory. Instead, you should store the results to a file, e.g. with saveAsTextFile, as sketched below.
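A minimal sketch of that approach, reusing the input path from the question; the output directory name (transactions_tokens) is made up for illustration:

import pyspark

conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)

data = sc.textFile('/Users/tsangbosco/Downloads/transactions')
tokens = data.flatMap(lambda x: x.split())
# saveAsTextFile writes each partition to its own part-NNNNN file,
# so only one partition at a time needs to fit in memory
tokens.saveAsTextFile('/Users/tsangbosco/Downloads/transactions_tokens')  # hypothetical output dir

Note that the output directory must not already exist, or the write will fail.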
If you are only interested in looking at part of the data, you can use sample or takeSample, for example:
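A minimal sketch of both, reusing the tokens RDD from the sketch above; the fraction, seed, and sample size here are arbitrary:

# sample() returns another RDD, so the result stays distributed and lazy
subset = tokens.sample(withReplacement=False, fraction=0.001, seed=42)
print(subset.take(10))  # pull only a handful of sampled elements to the driver

# takeSample() returns a plain Python list on the driver, so keep the count small
small_list = tokens.takeSample(False, 20, seed=42)
print(small_list)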