GC overhead limit exceeded when reading a CSV in Scala

Time: 2015-03-13 09:42:43

Tags: scala

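I'm having trouble reading a file in Scala; I'm still a bit of a Scala noob, I'm afraid. I have to read a file of about 500 MB, split each line on a delimiter, and add the results to a map for later lookup.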

My code looks like this:

val inF = args(0)
for(lines: String <- scala.io.Source.fromFile(inF).getLines) {
    val xs = lines.split(",")
    // do some work on the result
    // update a hashmap
}

Within a few seconds, I get this error:

java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.util.ArrayList.subList(ArrayList.java:955)
>         at java.lang.String.split(String.java:2311)
>         at java.lang.String.split(String.java:2355)
>         at Main$$anon$1$$anonfun$5.apply(cosine.scala:41)
>         at Main$$anon$1$$anonfun$5.apply(cosine.scala:37)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at Main$$anon$1.<init>(cosine.scala:37)
>         at Main$.main(cosine.scala:1)
>         at Main.main(cosine.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at scala.tools.nsc.util.ScalaClassLoader$$anonfun$run$1.apply(ScalaClassLoader.scala:71)
>         at scala.tools.nsc.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>         at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:139)
>         at scala.tools.nsc.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:71)
>         at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:139)
>         at scala.tools.nsc.CommonRunner$class.run(ObjectRunner.scala:28)
>         at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:45)
>         at scala.tools.nsc.CommonRunner$class.runAndCatch(ObjectRunner.scala:35)
>         at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:45)
>         at scala.tools.nsc.ScriptRunner.scala$tools$nsc$ScriptRunner$$runCompiled(ScriptRunner.scala:171)
>         at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
>         at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
>         at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply$mcZ$sp(ScriptRunner.scala:157)
>         at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
>         at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
>         at scala.tools.nsc.util.package$.trackingThreads(package.scala:51)
>         at scala.tools.nsc.util.package$.waitingForThreads(package.scala:35)
>         at scala.tools.nsc.ScriptRunner.withCompiledScript(ScriptRunner.scala:130)

Any help is greatly appreciated!

Update: some more information about the problem:

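For each variable1, I want to build a sparse vector of the form (variable2 -> value). I then need to compute the similarity between the sparse vectors of every pair of variable1 values, which could be people or unique IDs.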

My CSV looks like this:

variable1,variable2,value
"Alice","A",0.9
"Alice","B",0.8
"Alice","C",0.9
"Bob","A",0.5
"Bob","B",0.7
"Bob","D",0.9

My whole code looks like this (minus the similarity function):

val m = new scala.collection.mutable.HashMap[String, scala.collection.mutable.HashMap[String, Double]]

for(lines: String <- scala.io.Source.fromFile(inF).getLines) {
    lines match {
        case "variable1,variable2,rating" => println("header skipping")
        case _ =>
            val xs = lines.split(",")
            val var1 = xs(0)
            val var2 = xs(1)
            val rat = xs(2).toDouble
            val map = m.get(var1)
            map match {
                case Some(x) => x.update(var2, rat)
                                m.update(var1, x)
                case None    => val tmpMap = new scala.collection.mutable.HashMap[String, Double]
                                tmpMap.update(var2, rat)
                                m.update(var1, tmpMap)
            }
    }
}

val data = m.par

val results = for {
    (var1, xs) <- data
    (var2, ys) <- m
    if (var1 < var2)
} yield( (var1, var2, similarity(xs, ys)))

So I have to find and compare the (variable1, sparse vector) pairs and get the similarity between them.

1 Answer:

Answer 0: (score: 0)

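java.lang.OutOfMemoryError: GC overhead limit exceeded means that the JVM is spending almost all of its time in garbage collection while reclaiming very little memory (with HotSpot's default settings, more than 98% of the time spent in GC with less than 2% of the heap recovered). That usually indicates that you are creating a lot of garbage made up of many small objects. In your case, I expect you are running Java 7 or later and the heap is filling up with the strings produced by split, none of which are shared or reused. The // update a hashmap step also looks suspicious. What exactly do you do there?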
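
If the split strings are indeed what fills the heap, one thing to try is to keep a single shared instance of each repeated key. The following is only a minimal sketch under assumptions, not a confirmed fix: the names SparseVectors, build and fromFile are hypothetical, the first line is assumed to be a header, and String.intern() is assumed to pay off because variable1 and variable2 take relatively few distinct values. As in the question's code, the surrounding quotes are kept as part of the keys.

```scala
import scala.collection.mutable
import scala.io.Source

object SparseVectors {
  type Vectors = mutable.HashMap[String, mutable.HashMap[String, Double]]

  // Builds variable1 -> (variable2 -> value) from an iterator of CSV lines.
  // Assumption: the first line is a header and is skipped.
  def build(lines: Iterator[String]): Vectors = {
    val m: Vectors = mutable.HashMap.empty
    lines.drop(1).filter(_.nonEmpty).foreach { line =>
      val xs = line.split(",")
      // intern() makes equal keys share one String instance, so the same
      // variable2 stored in many inner maps is not duplicated on the heap.
      val var1 = xs(0).intern()
      val var2 = xs(1).intern()
      // getOrElseUpdate creates the inner map only the first time var1 is seen.
      m.getOrElseUpdate(var1, mutable.HashMap.empty[String, Double])
        .update(var2, xs(2).toDouble)
    }
    m
  }

  // Streams the file line by line and always closes the underlying source.
  def fromFile(path: String): Vectors = {
    val src = Source.fromFile(path)
    try build(src.getLines()) finally src.close()
  }
}
```

With this in place, val m = SparseVectors.fromFile(args(0)) would replace the loop in the question. Whether it is enough depends on how many distinct keys the file really contains; if nearly every key is unique, interning will not help and the map itself may simply be too large for the configured heap.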