How to avoid System.OutOfMemoryException in TikaOnDotnet.TextExtractor

Time: 2018-03-21 16:45:25

Tags: c# .net apache-tika

I am using TikaOnDotnet.TextExtractor to extract text from files of various types. It runs as a console application on Windows 10 (x64). However, for some files it throws a System.OutOfMemoryException.

Here is the sample code:

using System;
using TikaOnDotNet.TextExtraction;

namespace TikaRnD
{
    class Program
    {
        static void Main(string[] args)
        {
            // Referencing these types forces the IKVM.OpenJDK assemblies to be
            // loaded (and copied by the build) before Tika is used.
            Type IKVM_OpenJDK_A = typeof(com.sun.codemodel.@internal.ClassType);
            Type IKVM_OpenJDK_B = typeof(com.sun.org.apache.xalan.@internal.xsltc.trax.TransformerFactoryImpl);

            var textExtractor = new TextExtractor();
            try
            {
                // Extract() returns the whole document text as a single string.
                var teResult = textExtractor.Extract(@"c:\Temp\Largefile.docx");
                Console.WriteLine(teResult.Text.Length);
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
            }
        }
    }
}

Largefile.docx is a document of roughly 6 MB containing a lot of text and embedded images. When I run the program I can watch the process consume more and more system memory. 4 GB of RAM is not enough, and it ends with this exception:

TikaOnDotNet.TextExtraction.TextExtractionException: Extraction of text from the file 'c:\Temp\TestData\Largefile.docx' failed. ---> TikaOnDotNet.TextExtraction.TextExtractionException: Extraction failed. ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at org.apache.poi.extractor.ExtractorFactory.createExtractor(OPCPackage pkg)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(InputStream stream, ContentHandler baseHandler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.AutoDetectParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor.Extract(Func`2 streamFactory, Stream outputStream) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\Stream\StreamTextExtractor.cs:line 31
   --- End of inner exception stack trace ---
   at TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor.Extract(Func`2 streamFactory, Stream outputStream) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\Stream\StreamTextExtractor.cs:line 43
   at TikaOnDotNet.TextExtraction.TextExtractor.Extract(Func`2 streamFactory) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 53
   at TikaOnDotNet.TextExtraction.TextExtractor.Extract(String filePath) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 19
   --- End of inner exception stack trace ---
   at TikaOnDotNet.TextExtraction.TextExtractor.Extract(String filePath) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 28
   at TikaRnD.Program.Main(String[] args) in c:\users\norbert\source\repos\TikaRnD\TikaRnD\Program.cs:line 20
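Since 4 GB is also roughly what a 32-bit process can address, one thing worth ruling out first is whether the console app actually runs as a 64-bit process (for AnyCPU builds, "Prefer 32-bit" must be disabled). The small helper below is not part of the original code; it only uses standard .NET APIs to report the process bitness and current memory use:

using System;

namespace TikaRnD
{
    static class MemoryDiagnostics
    {
        // Diagnostic helper (not in the original question): confirms the process
        // really runs as 64-bit and shows memory usage at the point of the call.
        public static void Report(string label)
        {
            Console.WriteLine("[{0}] 64-bit process: {1}", label, Environment.Is64BitProcess);
            Console.WriteLine("[{0}] Managed heap:   {1:N0} bytes", label, GC.GetTotalMemory(false));
            Console.WriteLine("[{0}] Working set:    {1:N0} bytes", label, Environment.WorkingSet);
        }
    }
}

Calling MemoryDiagnostics.Report("before") and MemoryDiagnostics.Report("after") around textExtractor.Extract(...) makes it easy to confirm that the growth happens inside the call itself, as the stack trace (POI's ExtractorFactory.createExtractor) suggests.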

When I run the same sample code with the same file on a system with more memory, it consumes up to ~10 GB of RAM and the extraction completes successfully; the extracted content is about 50 MB.
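One possible mitigation for the size of the result itself, sketched under an assumption: the stack trace shows a TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor that writes to an output Stream, and the sketch below assumes it exposes a file-path overload comparable to the Func-based one visible in the trace (the exact signature is not confirmed here). Streaming the extracted text to a file avoids holding the ~50 MB result string in memory, although it will not reduce the memory POI consumes while parsing the .docx package itself:

using System.IO;
using TikaOnDotNet.TextExtraction.Stream;

namespace TikaRnD
{
    class StreamingExtraction
    {
        static void Run()
        {
            // Assumption: StreamTextExtractor has an Extract(string filePath, Stream outputStream)
            // overload; only the Func-based overload appears in the stack trace above.
            var streamExtractor = new StreamTextExtractor();
            using (var output = File.Create(@"c:\Temp\Largefile.txt"))
            {
                // Extracted text is written directly to the FileStream instead of
                // being accumulated into one large in-memory string.
                streamExtractor.Extract(@"c:\Temp\Largefile.docx", output);
            }
        }
    }
}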

Can anyone help me understand why such surprisingly high memory consumption occurs and, if possible, how to prevent it?

0 Answers:

No answers yet