I am trying to generate a file containing very large test data in the form of a JSON array.
Since my test data is very large, I cannot use "mkString". That is also why I am using Streams and tail recursion.
But my program still throws an OutOfMemoryError.
package com.abhi
import java.io.FileWriter
import scala.annotation.tailrec
object GenerateTestFile {
  def jsonField(name: String, value: String): String = {
    s""""$name":"$value""""
  }
  def writeRecords(records: Stream[String], fw: FileWriter): Unit = {
    @tailrec
    def inner(records: Stream[String]): Unit = {
      records match {
        case head #:: Stream.Empty => fw.write(head)
        case head #:: tail => fw.write(head + ","); inner(tail)
        case Stream.Empty =>
      }
    }
    inner(records)
  }
  def main(args: Array[String]): Unit = {
    val fileWriter = new FileWriter("./sample.json", true)
    fileWriter.write("[")
    val lines = (1 to 10000000)
      .toStream
      .map(x => (
        jsonField("id", x.toString),
        jsonField("FieldA", "FieldA" + x),
        jsonField("FieldB", "FieldB" + x),
        jsonField("FieldC", "FieldC" + x),
        jsonField("FieldD", "FieldD" + x),
        jsonField("FieldE", "FieldE" + x)
      ))
      .map(t => t.productIterator.mkString("{", ",", "}"))
    writeRecords(lines, fileWriter)
    fileWriter.write("]")
    fileWriter.close()
  }
}
The exception:
[error] (run-main-0) java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at java.lang.StringBuilder.append(StringBuilder.java:128)
at scala.StringContext.standardInterpolator(StringContext.scala:125)
at scala.StringContext.s(StringContext.scala:95)
at com.abhi.GenerateTestFile$.jsonField(GenerateTestFile.scala:9)
at com.abhi.GenerateTestFile$$anonfun$1.apply(GenerateTestFile.scala:32)
Answer 0 (score: 4)
What you are looking for is not a Stream but an Iterator: something that realizes the values one by one and releases the memory for each one after you have processed it, i.e. written it to the file.

When you create your JSON this way, you keep the entire sequence of numbers in memory, and on top of that you generate a new large text blob for every element, which you only later write to the file. Memory-wise, the initial sequence is negligible compared to the size of that text.

What I did instead is use a for comprehension to produce the elements one by one. With foldLeft I make sure each element is mapped to a JSON string, written to disk, and then released (all references to the created objects are lost, so the GC can reclaim the memory). Unfortunately, this approach cannot use the parallelism features.
def main(args: Array[String]): Unit = {
  val fileWriter = new FileWriter("./sample.json", true)
  fileWriter.write("[")
  fileWriter.write(createObject(1).productIterator.mkString("{", ",", "}"))
  val lines = (for (v <- 2 to 10000000) yield v)
    .foldLeft(0)((_, x) => {
      if (x % 50000 == 0)
        println(s"We've reached $x element")
      fileWriter.write(createObject(x).productIterator.mkString(",{", ",", "}"))
      x
    })
  fileWriter.write("]")
  fileWriter.close()
}
def createObject(x: Int) =
  (jsonField("id", x.toString),
   jsonField("FieldA", "FieldA" + x),
   jsonField("FieldB", "FieldB" + x),
   jsonField("FieldC", "FieldC" + x),
   jsonField("FieldD", "FieldD" + x),
   jsonField("FieldE", "FieldE" + x))
Answer 1 (score: 2)
From the source of Stream:
* - One must be cautious of memoization; you can very quickly eat up large
* amounts of memory if you're not careful. The reason for this is that the
* memoization of the `Stream` creates a structure much like
* [[scala.collection.immutable.List]]. So long as something is holding on to
* the head, the head holds on to the tail, and so it continues recursively.
* If, on the other hand, there is nothing holding on to the head (e.g. we used
* `def` to define the `Stream`) then once it is no longer being used directly,
* it disappears.
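A tiny illustration of that warning (a hypothetical sketch, not taken from the Stream docs): holding the stream in a val pins the head, so every memoized cell accumulates, while a def leaves nothing holding on to the head:

// val: the stable reference keeps the head alive, so all ten million
// memoized cells stay in memory while foreach walks the stream.
val s = (1 to 10000000).toStream.map(_.toString)
s.foreach(_ => ())

// def: nothing retains the head, so cells become garbage
// as soon as foreach has moved past them.
def t = (1 to 10000000).toStream.map(_.toString)
t.foreach(_ => ())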
So, if you inline lines (or write it as a def), and rewrite your writeRecords function so that it does not hold a reference to the initial head (or declare the parameter as a 'call by name' value using an arrow, records: => Stream[String], which is essentially the same thing as def vs val), the elements should be garbage collected as the rest of the stream is processed:
@tailrec
def writeRecords(records: Stream[String], fw: FileWriter): Unit = {
  records match {
    case head #:: Stream.Empty => fw.write(head)
    case head #:: tail => fw.write(head + ","); writeRecords(tail, fw)
    case Stream.Empty =>
  }
}
writeRecords((1 to 10000000)
  .toStream
  .map(x => (
    jsonField("id", x.toString),
    jsonField("FieldA", "FieldA" + x),
    jsonField("FieldB", "FieldB" + x),
    jsonField("FieldC", "FieldC" + x),
    jsonField("FieldD", "FieldD" + x),
    jsonField("FieldE", "FieldE" + x)
  ))
  .map(t => t.productIterator.mkString("{", ",", "}")),
  fileWriter)
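For completeness, a minimal sketch of the call-by-name variant mentioned above (my own spelling of it, assuming the same jsonField helper as in the question): because records is passed by name, the caller never holds a val reference to the head, and the tail-recursive loop drops each cell as it advances:

def writeRecords(records: => Stream[String], fw: FileWriter): Unit = {
  @tailrec
  def inner(rest: Stream[String]): Unit = rest match {
    // The parameter is the only reference to the current cell, and each
    // tail call replaces it, so processed cells can be collected.
    case head #:: Stream.Empty => fw.write(head)
    case head #:: tail => fw.write(head + ","); inner(tail)
    case Stream.Empty =>
  }
  inner(records)
}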