I have a bunch of PDF documents, and I can generally read all of them using the method iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage.
Some of the documents contain a block of text that is not being read. For example, in the attached picture, I am unable to read the text in the region circled in yellow.
I guess this entity is not regular text, because I am unable to select and copy it with the mouse. It does not seem to be an image either: I am able to read images in the document by handling EventType.RENDER_IMAGE in a custom strategy object, and the encircled region does not get extracted as an image.
Any suggestions on how this could be read?
Answer 0 (score: 0)
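If you get neither a RENDER_TEXT nor a RENDER_IMAGE event for that content, it is most likely drawn using vector graphics instructions.
You can retrieve those instructions too, but what you get is a series of path definitions (move to, line to, curve to, ...) and path rendering information (stroke, fill, ...) delivered as RENDER_PATH events.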
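To check whether the region really is drawn as vector graphics, a listener along the following lines could dump the RENDER_PATH events of a page. This is only a minimal sketch against the iText 7 .NET parser API: the class name PathDumpListener, the file name input.pdf, and the choice of page 1 are placeholders, not anything from the question. The printed coordinates are in the coordinate system in which the path was defined; PathRenderInfo.GetCtm() would be needed to map them to page space.

```csharp
using System;
using System.Collections.Generic;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical listener: prints every path that the page content renders.
class PathDumpListener : IEventListener
{
    public void EventOccurred(IEventData data, EventType type)
    {
        if (type != EventType.RENDER_PATH) return;
        var info = (PathRenderInfo)data;

        // GetOperation() is NO_OP or a bitwise combination of STROKE and FILL.
        int op = info.GetOperation();
        bool stroke = (op & PathRenderInfo.STROKE) != 0;
        bool fill = (op & PathRenderInfo.FILL) != 0;
        Console.WriteLine($"Path: stroke={stroke}, fill={fill}");

        foreach (Subpath subpath in info.GetPath().GetSubpaths())
        {
            foreach (IShape segment in subpath.GetSegments())
            {
                // A straight line has 2 base points, a Bezier curve has 4.
                var points = new List<string>();
                foreach (Point p in segment.GetBasePoints())
                {
                    points.Add($"({p.GetX():F1}, {p.GetY():F1})");
                }
                Console.WriteLine($"  {segment.GetType().Name}: {string.Join(" ", points)}");
            }
        }
    }

    // Ask the processor for path events only.
    public ICollection<EventType> GetSupportedEvents()
    {
        return new HashSet<EventType> { EventType.RENDER_PATH };
    }
}

class Program
{
    static void Main(string[] args)
    {
        // "input.pdf" and page 1 are placeholders for the affected document.
        using (var pdf = new PdfDocument(new PdfReader("input.pdf")))
        {
            var processor = new PdfCanvasProcessor(new PathDumpListener());
            processor.ProcessPageContent(pdf.GetPage(1));
        }
    }
}
```

If the circled region produces a large number of small filled paths, that would be consistent with the text having been converted to outlines, in which case the characters themselves are no longer present in the page content as text.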