OOM when reading text content from a large docx file with POI

Time: 2013-11-28 15:17:45

Tags: java file apache-poi docx

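I am trying to find the length of the text content in a docx file. I can extract the content with the code below, but when the file is too large I get an OutOfMemoryError. Is there a better way?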

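    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    OPCPackage opcPackage = OPCPackage.open(file.getAbsolutePath());
    XWPFDocument doc = new XWPFDocument(opcPackage); // parses the whole document into memory
    XWPFWordExtractor we = new XWPFWordExtractor(doc);
    String text = we.getText();
    // Prints the extracted text length in units of 1024 characters, not a paragraph count
    System.out.println("Text length (K chars): " + text.length() / 1024);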

I get the error on the following line:

    XWPFDocument doc = new XWPFDocument(opcPackage);

Here is the exception:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
    at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:441)
    at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2922)
    at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3043)
    at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3060)
    at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3254)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1802)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
    at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3439)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
    at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:135)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:107)
    at ReadDocFileFromJava.readMyDocument(ReadDocFileFromJava.java:24)
    at ReadDocFileFromJava.main(ReadDocFileFromJava.java:15)
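
The trace shows the heap running out while XMLBeans parses the main document part inside the `XWPFDocument` constructor, so the whole document has to fit in memory before any text is extracted. If only the text length is needed, one possible workaround is to skip `XWPFDocument` and stream the XML directly: a .docx file is a ZIP archive, and the body text sits in `w:t` elements of `word/document.xml`. Below is a minimal sketch using only the JDK (`java.util.zip` and StAX). The class name `DocxTextLength` and the method `countTextChars` are made up for illustration. It also assumes that counting `w:t` characters is close enough: the result will not match `XWPFWordExtractor.getText().length()` exactly, since the extractor adds paragraph breaks and may also include text from other parts such as headers, footers, and footnotes.

```java
import java.io.File;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Apache POI.
public class DocxTextLength {
    // WordprocessingML main namespace; run text is stored in <w:t> elements.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Counts characters inside <w:t> elements of word/document.xml without building a tree. */
    public static long countTextChars(File docx) throws Exception {
        try (ZipFile zip = new ZipFile(docx)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("No word/document.xml in " + docx);
            }
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Untrusted input: switch off DTDs and external entities.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            try (InputStream in = zip.getInputStream(entry)) {
                XMLStreamReader reader = factory.createXMLStreamReader(in);
                try {
                    long total = 0;
                    boolean inText = false;
                    while (reader.hasNext()) {
                        switch (reader.next()) {
                            case XMLStreamConstants.START_ELEMENT:
                                if (isTextElement(reader)) inText = true;
                                break;
                            case XMLStreamConstants.END_ELEMENT:
                                if (isTextElement(reader)) inText = false;
                                break;
                            case XMLStreamConstants.CHARACTERS:
                            case XMLStreamConstants.CDATA:
                            case XMLStreamConstants.SPACE:
                                // Text may arrive in several chunks; sum their lengths.
                                if (inText) total += reader.getTextLength();
                                break;
                            default:
                                break;
                        }
                    }
                    return total;
                } finally {
                    reader.close();
                }
            }
        }
    }

    // True when the reader is positioned on a <w:t> start or end tag.
    private static boolean isTextElement(XMLStreamReader reader) {
        return "t".equals(reader.getLocalName()) && W_NS.equals(reader.getNamespaceURI());
    }
}
```

With this sketch, `long chars = DocxTextLength.countTextChars(file);` would take the place of the extractor call, and only one chunk of text is held in memory at a time. Raising the heap with `-Xmx` is the other obvious option, but it only moves the limit.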

0 answers:

No answers