I am using TikaOnDotnet.TextExtractor to extract text from various file types. It runs as a console application on Windows 10 (x64), but for some files it throws a System.OutOfMemoryException.
Here is the sample code:
using System;
using TikaOnDotNet.TextExtraction;

namespace TikaRnD
{
    class Program
    {
        static void Main(string[] args)
        {
            // Referencing these types forces the IKVM.OpenJDK assemblies to load.
            Type IKVM_OpenJDK_A = typeof(com.sun.codemodel.@internal.ClassType);
            Type IKVM_OpenJDK_B = typeof(com.sun.org.apache.xalan.@internal.xsltc.trax.TransformerFactoryImpl);

            var textExtractor = new TextExtractor();
            try
            {
                var teResult = textExtractor.Extract(@"c:\Temp\Largefile.docx");
                Console.WriteLine(teResult.Text.Length);
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
            }
        }
    }
}
Largefile.docx is a roughly 6 MB document containing a lot of text and embedded images. While the program runs, I can watch the process consume more and more system memory; 4 GB of RAM is not enough, and it ends with this exception:
TikaOnDotNet.TextExtraction.TextExtractionException: Extraction of text from the file 'c:\Temp\TestData\Largefile.docx' failed. ---> TikaOnDotNet.TextExtraction.TextExtractionException: Extraction failed. ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at org.apache.poi.extractor.ExtractorFactory.createExtractor(OPCPackage pkg)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(InputStream stream, ContentHandler baseHandler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.AutoDetectParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor.Extract(Func`2 streamFactory, Stream outputStream) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\Stream\StreamTextExtractor.cs:line 31
--- End of inner exception stack trace ---
at TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor.Extract(Func`2 streamFactory, Stream outputStream) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\Stream\StreamTextExtractor.cs:line 43
at TikaOnDotNet.TextExtraction.TextExtractor.Extract(Func`2 streamFactory) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 53
at TikaOnDotNet.TextExtraction.TextExtractor.Extract(String filePath) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 19
--- End of inner exception stack trace ---
at TikaOnDotNet.TextExtraction.TextExtractor.Extract(String filePath) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 28
at TikaRnD.Program.Main(String[] args) in c:\users\norbert\source\repos\TikaRnD\TikaRnD\Program.cs:line 20
When I run the same sample code with the same file on a machine with more memory, it consumes up to ~10 GB of RAM and the extraction completes successfully; the extracted text is about 50 MB.
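For completeness: since a 32-bit process is capped at a 2–4 GB address space, it could hit an OutOfMemoryException for that reason alone. To rule that out, the console project can be forced to run as a 64-bit process. A minimal csproj fragment for this (assuming a classic .NET Framework console project; these are standard MSBuild properties, and the equivalent can also be set in Visual Studio under Project Properties → Build by unchecking "Prefer 32-bit") would look like:

```xml
<PropertyGroup>
  <!-- Force a 64-bit process so memory is not limited by the 32-bit address space. -->
  <PlatformTarget>x64</PlatformTarget>
  <!-- "Prefer 32-bit" is enabled by default for AnyCPU console apps; disable it. -->
  <Prefer32Bit>false</Prefer32Bit>
</PropertyGroup>
```

Even with these settings (so the process really is 64-bit), the memory consumption described above still occurs, which is what I am trying to understand.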
Can anyone help me understand why this surprisingly high memory consumption happens, and how to prevent it if possible?