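I'm having trouble reading a file in Scala. I'm still a bit of a Scala noob, I'm afraid. I have to read a file of about 500 MB, split each line on a delimiter, and add the pieces to a map for later lookup.
My code looks like this: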
val inF = args(0)
for(lines: String <- scala.io.Source.fromFile(inF).getLines) {
  val xs = lines.split(",")
  // do some work on the result
  // update a hashmap
}
Within a few seconds, I get this error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.util.ArrayList.subList(ArrayList.java:955)
> at java.lang.String.split(String.java:2311)
> at java.lang.String.split(String.java:2355)
> at Main$$anon$1$$anonfun$5.apply(cosine.scala:41)
> at Main$$anon$1$$anonfun$5.apply(cosine.scala:37)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at Main$$anon$1.<init>(cosine.scala:37)
> at Main$.main(cosine.scala:1)
> at Main.main(cosine.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at scala.tools.nsc.util.ScalaClassLoader$$anonfun$run$1.apply(ScalaClassLoader.scala:71)
> at scala.tools.nsc.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
> at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:139)
> at scala.tools.nsc.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:71)
> at scala.tools.nsc.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:139)
> at scala.tools.nsc.CommonRunner$class.run(ObjectRunner.scala:28)
> at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:45)
> at scala.tools.nsc.CommonRunner$class.runAndCatch(ObjectRunner.scala:35)
> at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:45)
> at scala.tools.nsc.ScriptRunner.scala$tools$nsc$ScriptRunner$$runCompiled(ScriptRunner.scala:171)
> at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
> at scala.tools.nsc.ScriptRunner$$anonfun$runScript$1.apply(ScriptRunner.scala:188)
> at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply$mcZ$sp(ScriptRunner.scala:157)
> at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
> at scala.tools.nsc.ScriptRunner$$anonfun$withCompiledScript$1.apply(ScriptRunner.scala:131)
> at scala.tools.nsc.util.package$.trackingThreads(package.scala:51)
> at scala.tools.nsc.util.package$.waitingForThreads(package.scala:35)
> at scala.tools.nsc.ScriptRunner.withCompiledScript(ScriptRunner.scala:130)
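Any help is greatly appreciated!
Update: more information about the problem:
For each variable1, I want to build a sparse vector of the form (variable2 -> value). I then need to compute the similarity between the sparse vectors of every pair of variable1 values, which may be people or unique IDs.
My CSV looks like this: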
variable1,variable2,value
"Alice","A",0.9
"Alice","B",0.8
"Alice","C",0.9
"Bob","A",0.5
"Bob","B",0.7
"Bob","D",0.9
My full code is this (minus the similarity function):
val m = new scala.collection.mutable.HashMap[String, scala.collection.mutable.HashMap[String, Double]]
for(lines: String <- scala.io.Source.fromFile(inF).getLines) {
  lines match {
    case "variable1,variable2,rating" => println("header skipping")
    case _ =>
      val xs = lines.split(",")
      val var1 = xs(0)
      val var2 = xs(1)
      val rat = xs(2).toDouble
      val map = m.get(var1)
      map match {
        case Some(x) =>
          x.update(var2, rat)
          m.update(var1, x)
        case None =>
          val tmpMap = new scala.collection.mutable.HashMap[String, Double]
          tmpMap.update(var2, rat)
          m.update(var1, tmpMap)
      }
  }
}
val data = m.par
val results = for {
  (var1, xs) <- data
  (var2, ys) <- m
  if (var1 < var2)
} yield( (var1, var2, similarity(xs, ys)))
So I have to find and compare (variable1, sparse vector) pairs and get the similarity between them.
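For illustration only: the similarity function is left out of the code above. Since the script in the stack trace is named cosine.scala, one plausible stand-in is cosine similarity over the sparse maps. The sketch below is an assumption, not the original function; the object name Similarity and the method name cosine are hypothetical.

```scala
import scala.collection.Map

object Similarity {
  // Hypothetical stand-in for the omitted similarity function:
  // cosine similarity between two sparse vectors stored as key -> value maps.
  // Returns 0.0 when either vector has zero norm.
  def cosine(xs: Map[String, Double], ys: Map[String, Double]): Double = {
    // Walk the smaller map; only keys present in both maps add to the dot product.
    val (small, large) = if (xs.size <= ys.size) (xs, ys) else (ys, xs)
    var dot = 0.0
    for ((k, v) <- small) dot += v * large.getOrElse(k, 0.0)
    val normX = math.sqrt(xs.values.map(v => v * v).sum)
    val normY = math.sqrt(ys.values.map(v => v * v).sum)
    if (normX == 0.0 || normY == 0.0) 0.0 else dot / (normX * normY)
  }
}
```

Because it accepts scala.collection.Map, the inner mutable.HashMap values from the code above could be passed in directly. For the sample rows, only keys A and B are shared by Alice and Bob, so only those two contribute to the dot product.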
Answer 0 (score: 0)
java.lang.OutOfMemoryError: GC overhead limit exceeded
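means the JVM is spending nearly all of its time in garbage collection while reclaiming almost no memory: by default the error is raised when more than 98% of total time goes to GC and less than 2% of the heap is recovered. This usually indicates that you are creating a lot of garbage made up of many small objects. In your case, I expect you are running Java 7 or later, and the heap is being overfilled with the many separate, unshared strings produced by split.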
// update a hashmap
is also suspicious. What exactly are you doing there?
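One way to reduce the number of retained strings is sketched below. It is a minimal example under stated assumptions, not a confirmed fix: rows are assumed to be well formed, and the names SparseLoader, StringPool, canonical, load and loadFile are hypothetical. Each distinct variable2 string is stored once and shared across all inner maps, and the file is streamed line by line and closed when loading finishes.

```scala
import scala.collection.mutable

object SparseLoader {
  type SparseVectors = mutable.HashMap[String, mutable.HashMap[String, Double]]

  // Hypothetical helper: returns one canonical instance per distinct string,
  // so equal strings share a single object instead of one copy per row.
  final class StringPool {
    private val pool = mutable.HashMap.empty[String, String]
    def canonical(s: String): String = pool.getOrElseUpdate(s, s)
  }

  // Builds variable1 -> (variable2 -> value) from an iterator of CSV lines.
  // Assumes three comma-separated fields per data row.
  def load(lines: Iterator[String]): SparseVectors = {
    val pool = new StringPool
    val m: SparseVectors = mutable.HashMap.empty
    for (line <- lines if line.nonEmpty && !line.startsWith("variable1,")) {
      val xs = line.split(",")
      // variable1 is kept once as the outer key, so only variable2 is pooled.
      val var2 = pool.canonical(xs(1))
      m.getOrElseUpdate(xs(0), mutable.HashMap.empty[String, Double])
        .update(var2, xs(2).toDouble)
    }
    m
  }

  // Streams the file lazily and always closes the underlying source.
  def loadFile(path: String): SparseVectors = {
    val src = scala.io.Source.fromFile(path)
    try load(src.getLines()) finally src.close()
  }
}
```

Calling SparseLoader.loadFile(args(0)) would take the place of the for loop in the question. getOrElseUpdate replaces the Some/None match, and the short-lived arrays and strings from split become garbage as soon as each row has been processed. If the resulting maps still do not fit in the heap, pooling alone will not help and a larger heap would be needed.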