Question

Please help me understand whether my solution is correct.

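I'm trying to extract text from a PDF file using LocationTextExtractionStrategy, and I'm getting exceptions, apparently because the content-parsing method tries to parse inline images. The code is simple and looks like this: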

RenderFilter[] filter = { new RegionTextRenderFilter(cropBox) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);

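I realize the images are part of the content stream, but I have a PDF file whose text cannot be extracted because of its inline images. The parser first reports "UnsupportedPdfException: The filter /DCTDECODE is not supported" and then finally fails with "Could not find image data or EI", when all I really care about is the text. The BI/EI operators are present in my file, so I assume the failure is caused by the /DCTDECODE exception. But again, I don't care about the images; I'm only looking for the text.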

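My current solution is to add a filter handler in the InlineImageUtils class that assigns a Filter_DoNothing() filter to the DCTDECODE entry of the filter handler dictionary. That way I don't get exceptions when a page contains inline images that use DCTDECODE. Like this: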

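private static bool InlineImageStreamBytesAreComplete(byte[] samples, PdfDictionary imageDictionary) {
    try {
        // Start from a copy of the default handlers so the shared defaults stay untouched.
        IDictionary<PdfName, FilterHandlers.IFilterHandler> handlers = new Dictionary<PdfName, FilterHandlers.IFilterHandler>(FilterHandlers.GetDefaultFilterHandlers());
        // Register a pass-through handler so /DCTDecode no longer raises an unsupported-filter exception.
        handlers[PdfName.DCTDECODE] = new Filter_DoNothing();
        PdfReader.DecodeBytes(samples, imageDictionary, handlers);
        return true;
    } catch (IOException) {
        // Decoding failed, so treat the candidate image bytes as incomplete.
        return false;
    }
}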

// Pass-through filter handler: returns the encoded bytes unchanged instead of decoding them.
public class Filter_DoNothing : FilterHandlers.IFilterHandler
{
    public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
    {
        return b;
    }
}

My problem with this "fix" is that I had to modify the iTextSharp library itself. I'd rather not do that, so that I can stay compatible with future versions.

Here is the problematic PDF: https://app.box.com/s/7eaewzu4mnby9ogpl2frzjswgqxn9rz5

Text extraction, not image extraction

0 answers: