使用scala处理某些xml时出现内存不足错误

时间:2015-08-16 10:02:59

标签: java xml scala lucene

我已经将wiki xml转储分解为1M的许多小部分并尝试清理它(在用别人的其他程序清理之后)

我得到一个内存不足错误,我不知道如何解决。谁能开导我?

我收到以下错误消息:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:212)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:235)
    at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
    at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:252)
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:292)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:645)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1541)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1256)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1237)
    at qa.main.ja.Indexing$$anonfun$5$$anonfun$apply$4.apply(SearchDocument.scala:234)
    at qa.main.ja.Indexing$$anonfun$5$$anonfun$apply$4.apply(SearchDocument.scala:224)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.Iterator$class.foreach(Iterator.scala:750)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at qa.main.ja.Indexing$$anonfun$5.apply(SearchDocument.scala:224)
    at qa.main.ja.Indexing$$anonfun$5.apply(SearchDocument.scala:220)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) 

第234行如下:

writer.addDocument(document)

它正在向Lucene添加一些文件

并且第224行如下:

for (doc <- target_xml \\ "doc") yield {

这是for循环的第一行,用于在索引中添加各种元素作为字段。

是代码问题,设置问题还是硬件问题?

修改

嗨,这是我的for循环:

for (knowledgeFile <- knowledgeFiles) yield {
System.err.println(s"processing file: ${knowledgeFile}")
val target_xml=XML.loadString("    <file>"+cleanFile(knowledgeFile).mkString+"</file>")
for (doc <- target_xml \\ "doc") yield {
val id = (doc \ "@id").text
val title = (doc \ "@title").text
val text = doc.text
val document = new Document()
document.add(new StringField("id", id, Store.YES))
document.add(new TextField("text", new StringReader(title + text)))
writer.addDocument(document)
val xml_doc = <page><title>{ title }</title><text>{ text }</text></page>
id -> xml_doc
}
}).flatten.toArray` 

内部循环只是循环遍历每个doc元素。外部循环遍历每个文件。是否嵌套了问题的根源?

以下是cleanFile函数供参考:

def cleanFile(fileName:String):Array[String] = {
val tagRe = """<\/?doc.*?>""".r
val lines = Source.fromFile(fileName).getLines.toArray
val outLines = new Array[String](lines.length)
for ((line,lineNo) <- lines.zipWithIndex) yield {
if (tagRe.findFirstIn(line)!=None)
{
outLines(lineNo) = line
}
else
{
outLines(lineNo) = StringEscapeUtils.escapeXml11(line)
}
}
outLines
}

再次感谢

1 个答案:

答案 0 :(得分:0)

看起来你想通过使用-xmx jvm参数来尝试增加堆大小?