I am trying to generate a file containing very large test data in the form of a JSON array.
Since my test data is very large, I cannot use "mkString". That is also why I am using Streams and tail recursion.
But my program still throws an OutOfMemoryError.
package com.abhi
import java.io.FileWriter
import scala.annotation.tailrec
object GenerateTestFile {
  def jsonField(name: String, value: String): String = {
    s""""$name":"$value""""
  }
  def writeRecords(records: Stream[String], fw: FileWriter): Unit = {
    @tailrec
    def inner(records: Stream[String]): Unit = {
      records match {
        case head #:: Stream.Empty => fw.write(head)
        case head #:: tail => fw.write(head + ","); inner(tail)
        case Stream.Empty =>
      }
    }
    inner(records)
  }
  def main(args: Array[String]): Unit = {
    val fileWriter = new FileWriter("./sample.json", true)
    fileWriter.write("[")
    val lines = (1 to 10000000)
      .toStream
      .map(x => (
        jsonField("id", x.toString),
        jsonField("FieldA", "FieldA" + x),
        jsonField("FieldB", "FieldB" + x),
        jsonField("FieldC", "FieldC" + x),
        jsonField("FieldD", "FieldD" + x),
        jsonField("FieldE", "FieldE" + x)
      ))
      .map(t => t.productIterator.mkString("{", ",", "}"))
    writeRecords(lines, fileWriter)
    fileWriter.write("]")
    fileWriter.close()
  }
}
The exception:
[error] (run-main-0) java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at java.lang.StringBuilder.append(StringBuilder.java:128)
at scala.StringContext.standardInterpolator(StringContext.scala:125)
at scala.StringContext.s(StringContext.scala:95)
at com.abhi.GenerateTestFile$.jsonField(GenerateTestFile.scala:9)
at com.abhi.GenerateTestFile$$anonfun$1.apply(GenerateTestFile.scala:32)
Answer 0 (score: 4)
What you are looking for is not a Stream but an Iterator: something that realizes the values one by one and releases the memory for each one after you have processed it, i.e. written it to the file.

When you create your JSON this way, you keep the entire sequence of numbers in memory, and on top of that you generate a new large text blob for every element, which you only later write to the file. Memory-wise, the initial sequence is negligible compared to the size of that text.

What I did instead is use a for comprehension to produce the elements one by one. With foldLeft I make sure each element is mapped to a JSON string, written to disk, and then released (all references to the created objects are lost, so the GC can reclaim the memory). Unfortunately, this approach cannot use the parallelism features.
def main(args: Array[String]): Unit = {
  val fileWriter = new FileWriter("./sample.json", true)
  fileWriter.write("[")
  fileWriter.write(createObject(1).productIterator.mkString("{", ",", "}"))
  val lines = (for (v <- 2 to 10000000) yield v)
    .foldLeft(0)((_, x) => {
      if (x % 50000 == 0)
        println(s"We've reached $x element")
      fileWriter.write(createObject(x).productIterator.mkString(",{", ",", "}"))
      x
    })
  fileWriter.write("]")
  fileWriter.close()
}
def createObject(x: Int) =
  (jsonField("id", x.toString),
   jsonField("FieldA", "FieldA" + x),
   jsonField("FieldB", "FieldB" + x),
   jsonField("FieldC", "FieldC" + x),
   jsonField("FieldD", "FieldD" + x),
   jsonField("FieldE", "FieldE" + x))
Answer 1 (score: 2)
From the source of Stream:
* - One must be cautious of memoization; you can very quickly eat up large
* amounts of memory if you're not careful. The reason for this is that the
* memoization of the `Stream` creates a structure much like
* [[scala.collection.immutable.List]]. So long as something is holding on to
* the head, the head holds on to the tail, and so it continues recursively.
* If, on the other hand, there is nothing holding on to the head (e.g. we used
* `def` to define the `Stream`) then once it is no longer being used directly,
* it disappears.
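A tiny illustration of that warning (a hypothetical sketch, not taken from the Stream docs): holding the stream in a val pins the head, so every memoized cell accumulates, while a def leaves nothing holding on to the head:

// val: the stable reference keeps the head alive, so all ten million
// memoized cells stay in memory while foreach walks the stream.
val s = (1 to 10000000).toStream.map(_.toString)
s.foreach(_ => ())

// def: nothing retains the head, so cells become garbage
// as soon as foreach has moved past them.
def t = (1 to 10000000).toStream.map(_.toString)
t.foreach(_ => ())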
So, if you inline lines (or write it as a def), and rewrite your writeRecords function so that it does not hold a reference to the initial head (or declare the parameter as a 'call by name' value using an arrow, records: => Stream[String], which is essentially the same thing as def vs val), the elements should be garbage collected as the rest of the stream is processed:
@tailrec
def writeRecords(records: Stream[String], fw: FileWriter): Unit = {
  records match {
    case head #:: Stream.Empty => fw.write(head)
    case head #:: tail => fw.write(head + ","); writeRecords(tail, fw)
    case Stream.Empty =>
  }
}
writeRecords((1 to 10000000)
  .toStream
  .map(x => (
    jsonField("id", x.toString),
    jsonField("FieldA", "FieldA" + x),
    jsonField("FieldB", "FieldB" + x),
    jsonField("FieldC", "FieldC" + x),
    jsonField("FieldD", "FieldD" + x),
    jsonField("FieldE", "FieldE" + x)
  ))
  .map(t => t.productIterator.mkString("{", ",", "}")),
  fileWriter)
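For completeness, a minimal sketch of the call-by-name variant mentioned above (my own spelling of it, assuming the same jsonField helper as in the question): because records is passed by name, the caller never holds a val reference to the head, and the tail-recursive loop drops each cell as it advances:

def writeRecords(records: => Stream[String], fw: FileWriter): Unit = {
  @tailrec
  def inner(rest: Stream[String]): Unit = rest match {
    // The parameter is the only reference to the current cell, and each
    // tail call replaces it, so processed cells can be collected.
    case head #:: Stream.Empty => fw.write(head)
    case head #:: tail => fw.write(head + ","); inner(tail)
    case Stream.Empty =>
  }
  inner(records)
}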