I have a very large file. For every bigram (pair of words) on every line of the file, I have to check the entire file. What I am doing in Scala is clearly wrong, but I don't know how to fix it.
This function returns all the lines of the file (about 3 million of them!):
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.util.ArrayList

def allSentences(): ArrayList[String] = {
  val res: ArrayList[String] = new ArrayList[String]()
  val filename = "/path/test.txt"
  val fstream: FileInputStream = new FileInputStream(filename)
  val br: BufferedReader = new BufferedReader(new InputStreamReader(fstream))
  var strLine: String = null
  // read every line of the file into memory
  while ({ strLine = br.readLine(); strLine != null }) {
    res.add(strLine)
  }
  br.close()
  res
}
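(For reference, the same read can be sketched more idiomatically with scala.io.Source; the name allSentencesIdiomatic is made up here, and it still pulls all ~3 million lines into memory, which is the real problem when the function is called repeatedly:)

import scala.io.Source

def allSentencesIdiomatic(): List[String] = {
  val src = Source.fromFile("/path/test.txt")
  try src.getLines().toList   // still materializes every line in memory
  finally src.close()
}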
Here is how I use allSentences() in my Spark job:
val p = sc.textFile("file:///path/test.txt")
val result11 = p
.flatMap(line => biTuple(line))
.map(word => (word, 1))
.reduceByKey(_ + _)
val result10 = result11
.flatMap { tuple => allSentences().map(tuple._1 -> _) }
.map(tuple => (tuple._1, count10(tuple._1,tuple._2)))
.reduceByKey(_ + _)
I'm almost certain the problem is in .flatMap { tuple => allSentences().map(tuple._1 -> _) }, but is there another way to do this?
P.S.: biTuple() returns an ArrayList of all the bigrams of a line. count10() returns 1 if the first word of the bigram is present in the line and the second word is not. result11 is an RDD of all bigrams with their counts, in the form ("word1 word2", count).
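(For concreteness, here are hypothetical sketches of what these two helpers could look like; this is not my actual code, and it uses Scala collections rather than ArrayList:)

// Hypothetical sketch: all bigrams of a line as "word1 word2" strings.
def biTuple(line: String): Seq[String] =
  line.split("\\s+").sliding(2).map(_.mkString(" ")).toSeq

// Hypothetical sketch: 1 if the bigram's first word is in the sentence
// and its second word is not, else 0.
def count10(bigram: String, sentence: String): Int = {
  val Array(w1, w2) = bigram.split(" ", 2)
  val words = sentence.split("\\s+").toSet
  if (words.contains(w1) && !words.contains(w2)) 1 else 0
}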
Here is the error output:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Note that I have SPARK_WORKER_MEMORY=90G and SPARK_DRIVER_MEMORY=90G.
Answer 0 (score: 2)
It looks like what you are trying to do is a cartesian product of result11 and p (your original list of sentences), but you are doing it by opening the file and reading every sentence into memory for each entry in result11. That is certainly going to put pressure on the garbage collector, although I can't say for sure that it is what is causing the GC problem. Spark has a cartesian method on RDDs that will probably serve you better, if my interpretation of what you are trying to do is correct. (It will, however, copy a lot of data across the network.)
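A minimal sketch of that idea, assuming result11 and p are the RDDs shown in the question and count10 is available on the executors:

// Pair every bigram with every sentence via Spark's cartesian product,
// instead of re-reading the whole file for each bigram.
val result10 = result11
  .keys                      // just the bigram strings
  .cartesian(p)              // (bigram, sentence) for every combination
  .map { case (bigram, sentence) => (bigram, count10(bigram, sentence)) }
  .reduceByKey(_ + _)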
You could also look into whether the count10 logic should be applied in a filter operation, which would reduce the number of entries the final reduceByKey has to process.
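For example, if count10 only ever returns 0 or 1, a sketch along these lines (same assumptions as above) would drop the zero entries before they reach reduceByKey:

val result10 = result11
  .keys
  .cartesian(p)
  .filter { case (bigram, sentence) => count10(bigram, sentence) == 1 }
  .map { case (bigram, _) => (bigram, 1) }   // keep only the matches
  .reduceByKey(_ + _)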