I have split the wiki XML dump into many small pieces of about 1M each and am trying to index them (after first running them through someone else's cleaning program).
I get an out-of-memory error and I don't know how to resolve it. Can anyone enlighten me?
I receive the following error message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:212)
at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:235)
at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:252)
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:292)
at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:645)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1541)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1256)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1237)
at qa.main.ja.Indexing$$anonfun$5$$anonfun$apply$4.apply(SearchDocument.scala:234)
at qa.main.ja.Indexing$$anonfun$5$$anonfun$apply$4.apply(SearchDocument.scala:224)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.Iterator$class.foreach(Iterator.scala:750)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at qa.main.ja.Indexing$$anonfun$5.apply(SearchDocument.scala:224)
at qa.main.ja.Indexing$$anonfun$5.apply(SearchDocument.scala:220)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
Line 234 is the following:
writer.addDocument(document)
It adds documents to the Lucene index.
And line 224 is the following:
for (doc <- target_xml \\ "doc") yield {
This is the first line of a for loop that adds the various elements as fields in the index.
Is this a code problem, a configuration problem, or a hardware problem?
EDIT
Hi, here is my for loop:
(for (knowledgeFile <- knowledgeFiles) yield {
  System.err.println(s"processing file: ${knowledgeFile}")
  val target_xml = XML.loadString(" <file>" + cleanFile(knowledgeFile).mkString + "</file>")
  for (doc <- target_xml \\ "doc") yield {
    val id = (doc \ "@id").text
    val title = (doc \ "@title").text
    val text = doc.text
    val document = new Document()
    document.add(new StringField("id", id, Store.YES))
    document.add(new TextField("text", new StringReader(title + text)))
    writer.addDocument(document)
    val xml_doc = <page><title>{ title }</title><text>{ text }</text></page>
    id -> xml_doc
  }
}).flatten.toArray
The inner loop just iterates over each doc element, and the outer loop iterates over each file. Is the nesting the source of the problem?
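To make the shape of this concrete, here is a small self-contained toy version of the nested yields (the file names and ids below are made up, and I'm assuming knowledgeFiles is a strict collection): the outer yield produces one inner sequence per file, and every id -> xml_doc pair stays reachable until the final .flatten.toArray has run.
object NestedYieldShape {
  def main(args: Array[String]): Unit = {
    val files = Seq("part-0001.xml", "part-0002.xml")        // stand-ins for knowledgeFiles
    val perFile: Seq[Seq[(String, String)]] =
      for (file <- files) yield {
        for (docId <- Seq("1", "2", "3")) yield {             // stand-ins for the <doc> elements
          docId -> s"<page>from ${file}</page>"               // analogous to id -> xml_doc
        }
      }
    // Only here is the nested structure flattened; until this point every pair
    // built above is still reachable, so the whole result sits in memory at once.
    val all: Array[(String, String)] = perFile.flatten.toArray
    println(all.length)
  }
}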
Here is the cleanFile function for reference:
def cleanFile(fileName: String): Array[String] = {
  val tagRe = """<\/?doc.*?>""".r
  val lines = Source.fromFile(fileName).getLines.toArray
  val outLines = new Array[String](lines.length)
  for ((line, lineNo) <- lines.zipWithIndex) yield {
    if (tagRe.findFirstIn(line) != None) {
      outLines(lineNo) = line
    } else {
      outLines(lineNo) = StringEscapeUtils.escapeXml11(line)
    }
  }
  outLines
}
Thanks again.
Answer 0 (score: 0)
It looks like you should try increasing the heap size with the -Xmx JVM argument.
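For instance, if the indexer is launched through sbt, a build.sbt fragment along these lines (the 4g figure is only a guess; size it to the machine's RAM) runs it in a forked JVM with a larger heap:
fork := true                // run the application in a separate JVM so javaOptions apply
javaOptions += "-Xmx4g"     // maximum heap size; 4g is an assumption, adjust as needed
If you start the program directly with java instead, the same flag goes on the command line, e.g. java -Xmx4g -cp <classpath> <main class>.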