How can I search a PDF file by font size, or keep the footer out of the search?

Asked: 2015-12-12 10:03:45

Tags: c# pdf itextsharp

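I am working on a project in which I have to search for some text in a number of PDF files. The pages of these files have a footer section, and the footer text uses a different font size than the main content. I am using iTextSharp's PdfReader class, and I do not want the search to match text in the footer. I think the solution has to be either searching by font size or ignoring the footer. Any ideas?

Here is my code: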

private List<int> ReadPdfFile(string fileName, String searchText, int index)
{
    List<int> pages = new List<int>();
    if (File.Exists(fileName))
    {
        for (int page = 1; page <= pdfReaders[index].NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReaders[index], page, strategy);
            if (currentPageText.Contains(searchText))
            {
                pages.Add(page);
            }
        }
    }
    return pages;
}

1 answer:

Answer 0 (score: 1)

If one only wants to extract a certain part of the text of a page, e.g.

  • only text located in a given part of the page area, for example the left half page (in case of two columns), between given y values (to exclude headers and footers), or outside the crop box (to detect text hidden there), or

  • only text in a given style, for example only red text, only text of a given size range, ...

one can filter the information the text extraction strategy receives as input by using a FilteredTextRenderListener with matching RenderFilter instances:

RenderFilter filter = ...;
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
ITextExtractionStrategy filtered = new FilteredTextRenderListener(strategy, filter);
string filteredCurrentPageText = PdfTextExtractor.GetTextFromPage(pdfReaders[index], page, filtered);
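
For the footer case specifically, iTextSharp 5.x already ships a region-based filter, RegionTextRenderFilter, so a custom class may not be needed. The following is only a sketch: the class name BodyTextExtractor, its two methods, and the footerHeight parameter are made up for illustration, and the footer height has to be measured for your own documents. Depending on the iTextSharp version, the RegionTextRenderFilter constructor takes an iTextSharp.text.Rectangle (as assumed here) or a System.util.RectangleJ.

```csharp
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
public static class BodyTextExtractor
{
    // Page area above the footer; footerHeight is an assumed value in PDF user space units.
    public static Rectangle BodyRegion(Rectangle pageSize, float footerHeight)
    {
        return new Rectangle(pageSize.Left, pageSize.Bottom + footerHeight,
                             pageSize.Right, pageSize.Top);
    }

    // Extracts only the text that the region filter lets through for the given page.
    public static string GetBodyText(PdfReader reader, int page, float footerHeight)
    {
        Rectangle body = BodyRegion(reader.GetPageSize(page), footerHeight);
        RenderFilter filter = new RegionTextRenderFilter(body);
        ITextExtractionStrategy strategy =
            new FilteredTextRenderListener(new SimpleTextExtractionStrategy(), filter);
        return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }
}
```

In the question's loop this would replace the GetTextFromPage call, for example BodyTextExtractor.GetBodyText(pdfReaders[index], page, 60f), where 60 is only a placeholder footer height.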

Your filter class merely has to extend the abstract class RenderFilter and override the Allow* methods as desired:

public abstract class RenderFilter
{
    public virtual bool AllowText(TextRenderInfo renderInfo)
    {
        return true;
    }

    public virtual bool AllowImage(ImageRenderInfo renderInfo)
    {
        return true;
    }
}
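
As an illustration of overriding AllowText, here is a minimal sketch of a filter that rejects text in the footer by position. The class name FooterExclusionFilter, the helper IsAboveFooter, and the footerTop parameter are hypothetical; footerTop is assumed to be the y coordinate of the footer's upper edge in the page's coordinate system, and the page is assumed not to be rotated.

```csharp
using iTextSharp.text.pdf.parser;

// Hypothetical filter: lets through only text whose baseline starts above the footer.
public class FooterExclusionFilter : RenderFilter
{
    // Assumed y coordinate of the footer's upper edge, in PDF user space units.
    private readonly float footerTop;

    public FooterExclusionFilter(float footerTop)
    {
        this.footerTop = footerTop;
    }

    // PDF y coordinates grow upwards, so body text has a larger y than footer text.
    public static bool IsAboveFooter(float baselineY, float footerTop)
    {
        return baselineY > footerTop;
    }

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        // Start point of the chunk's baseline; Vector.I2 indexes the y component.
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        return IsAboveFooter(start[Vector.I2], footerTop);
    }
}
```

AllowImage is left at its default because only text matters for the search.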

TextRenderInfo makes many properties of the inflowing text chunks available to filter by.
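
For filtering by size, one possible approach is to compare the distance between a chunk's ascent line and descent line against a range. This is a sketch under assumptions: the class name TextHeightFilter and its parameters are invented here, the text is assumed to run horizontally, and the measured height is derived from the font's ascent and descent, so it is related to the nominal font size but not necessarily equal to it. The thresholds should be calibrated by logging the measured heights of body and footer text in your own files.

```csharp
using System;
using iTextSharp.text.pdf.parser;

// Hypothetical filter: lets through only text whose ascent-to-descent height is in a range.
public class TextHeightFilter : RenderFilter
{
    // Assumed bounds in PDF user space units; calibrate them against real documents.
    private readonly float minHeight;
    private readonly float maxHeight;

    public TextHeightFilter(float minHeight, float maxHeight)
    {
        this.minHeight = minHeight;
        this.maxHeight = maxHeight;
    }

    // Inclusive range check, kept separate so it can be tested without a PDF.
    public static bool InRange(float height, float min, float max)
    {
        return height >= min && height <= max;
    }

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        // Vertical distance between the ascent and descent lines (horizontal text assumed).
        float ascentY = renderInfo.GetAscentLine().GetStartPoint()[Vector.I2];
        float descentY = renderInfo.GetDescentLine().GetStartPoint()[Vector.I2];
        return InRange(Math.Abs(ascentY - descentY), minHeight, maxHeight);
    }
}
```

Such a filter plugs into the same place as any other RenderFilter, for example new FilteredTextRenderListener(strategy, new TextHeightFilter(9f, 14f)), where the two bounds are placeholders.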