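I'm working on a project in which I have to search for some text in a number of PDF files. The pages of these PDFs have a footer section, and the font size of the footer text differs from that of the main content. I'm using iTextSharp's PdfReader class, and I don't want the search to match text that occurs in the footer. I think the solution must be either to search by font size or to ignore the footer. Any ideas?
Here is my code: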
private List<int> ReadPdfFile(string fileName, String searchText, int index)
{
    List<int> pages = new List<int>();
    if (File.Exists(fileName))
    {
        for (int page = 1; page <= pdfReaders[index].NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReaders[index], page, strategy);
            if (currentPageText.Contains(searchText))
            {
                pages.Add(page);
            }
        }
    }
    return pages;
}
Answer 0 (score: 1)
If one only wants to extract a certain part of the text of a page, e.g.
- only text located in a given part of the page area, for example the left half of the page (in the case of two columns), between given y values (to exclude headers and footers), or outside the crop box (to detect text hidden there), or
- only text in a given style, for example only red text, only text of a given size range, ...
one can filter the information the text extraction strategy receives as input by using a FilteredTextRenderListener with matching RenderFilter instances:
RenderFilter filter = ...;
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
ITextExtractionStrategy filtered = new FilteredTextRenderListener(strategy, filter);
string filteredCurrentPageText = PdfTextExtractor.GetTextFromPage(pdfReaders[index], page, filtered);
Your filter class merely needs to extend the abstract class RenderFilter and override the Allow* methods as desired:
public abstract class RenderFilter
{
    public virtual bool AllowText(TextRenderInfo renderInfo)
    {
        return true;
    }
    public virtual bool AllowImage(ImageRenderInfo renderInfo)
    {
        return true;
    }
}
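For the footer case in the question, a filter along the following lines might be enough. This is only a sketch under assumptions: the class name FooterExcludingFilter and the threshold minY are made up for illustration, and it assumes unrotated pages whose footer lies entirely below a known y coordinate in default user space.

```csharp
using iTextSharp.text.pdf.parser;

// Hypothetical filter (not part of iTextSharp): rejects every text chunk
// whose baseline starts below a given y coordinate.
public class FooterExcludingFilter : RenderFilter
{
    // Assumed threshold: y coordinate of the upper edge of the footer area.
    private readonly float minY;

    public FooterExcludingFilter(float minY)
    {
        this.minY = minY;
    }

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        // Start point of the chunk's baseline; Vector.I2 indexes its y component.
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        return start[Vector.I2] >= minY;
    }
}
```

It would then be plugged in as the filter above, e.g. new FooterExcludingFilter(50f), where 50 is an arbitrary value that has to be replaced by the actual footer height of the documents in question.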
TextRenderInfo makes many properties of the incoming text chunks available to filter by.
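As a rough illustration of filtering by size, one could compare the distance between a chunk's ascent line and descent line against a range. Again a sketch only: TextHeightRangeFilter, minHeight and maxHeight are hypothetical names, the code assumes unrotated text, and the measured height is related to the font size but not identical to it, so the limits have to be calibrated against the actual documents.

```csharp
using iTextSharp.text.pdf.parser;

// Hypothetical filter (not part of iTextSharp): accepts only text chunks
// whose approximate glyph height lies within [minHeight, maxHeight].
public class TextHeightRangeFilter : RenderFilter
{
    private readonly float minHeight;
    private readonly float maxHeight;

    public TextHeightRangeFilter(float minHeight, float maxHeight)
    {
        this.minHeight = minHeight;
        this.maxHeight = maxHeight;
    }

    public override bool AllowText(TextRenderInfo renderInfo)
    {
        // y components of the ascent and descent lines of this chunk.
        float ascentY = renderInfo.GetAscentLine().GetStartPoint()[Vector.I2];
        float descentY = renderInfo.GetDescentLine().GetStartPoint()[Vector.I2];

        // Approximate text height; assumes the text is not rotated.
        float height = ascentY - descentY;
        return height >= minHeight && height <= maxHeight;
    }
}
```

If the footer uses a smaller font than the body, choosing minHeight just above the footer's measured height should keep the body text and drop the footer.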