PdfTextExtractor.GetTextFromPage is not returning the correct text

Time: 2014-10-28 23:43:10

Tags: pdf itextsharp

Using iTextSharp, I have the following code, which successfully extracts the text from most of the PDFs I'm trying to read...

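// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
string text = "";
PdfReader reader = new PdfReader(fileName);
// Page numbers in iTextSharp are 1-based.
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    text += PdfTextExtractor.GetTextFromPage(reader, i);
}
reader.Close();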

However, some of my PDFs contain XFA forms (already filled out), and for those the code returns the following instead:

"Please wait... \n  \nIf this message is not eventually replaced by the proper contents of the document, your PDF \nviewer may not be able to display this type of document. \n  \nYou can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by \nvisiting  http://www.adobe.com/products/acrobat/readstep2.html. \n  \nFor more assistance with Adobe Reader visit  http://www.adobe.com/support/products/\nacrreader.html. \n  \nWindows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark \nof Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other \ncountries."

How can I work around this? I tried flattening the PDF with PdfStamper [1] from iTextSharp, but that didn't work - the resulting stream contained the same junk text.

[1] How to flatten already filled out PDF form using iTextSharp

1 Answer:

Answer 0: (score: 1)

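The PDF you are dealing with acts as a container for an XML stream. This XML stream is based on the XML Forms Architecture (XFA). The message you see is not garbage! It is the content of the PDF page itself, and it is what gets displayed when the document is opened in a viewer that reads the file as if it were an ordinary PDF.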

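For example: if you open the document in Apple Preview, you will see exactly the same message, because Apple Preview cannot render XFA forms. It shouldn't surprise you that you get this message when you parse the PDF content of the file with iText. That is exactly the PDF content present in the file. What you see when you open the document in Adobe Reader is not stored in PDF syntax; it is stored as an XML stream.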
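To make the "stored as an XML stream" point concrete, here is a minimal sketch that pulls field values out of an XFA-style data document using only the JDK's DOM parser. The sample XML, the class name `XfaDataSketch`, and the helper `leafValues` are all hypothetical and exist only for illustration; a real form's XML would be much larger. How you obtain the XML from the PDF is not shown. In iText 5 for Java I believe `reader.getAcroFields().getXfa().getDomDocument()` exposes it, but treat that as an assumption to check against the version you use.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class XfaDataSketch {

    // Hypothetical, hand-written stand-in for the data XML of a filled-out XFA form.
    static final String SAMPLE_XML =
        "<xfa:datasets xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
      + "<xfa:data><form1><name>Jane Doe</name><city>Ghent</city></form1></xfa:data>"
      + "</xfa:datasets>";

    // Parses the XML and returns "elementName=value" for every non-empty leaf element.
    static List<String> leafValues(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> out = new ArrayList<>();
        collect(doc.getDocumentElement(), out);
        return out;
    }

    // Depth-first walk: an element with no element children is treated as a field.
    private static void collect(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collect(child, out);
            }
        }
        if (!hasElementChild) {
            String value = node.getTextContent().trim();
            if (!value.isEmpty()) {
                out.add(node.getLocalName() + "=" + value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        for (String line : leafValues(SAMPLE_XML)) {
            System.out.println(line);
        }
    }
}
```

This only recovers the raw field data, not the rendered page layout that Adobe Reader shows.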

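You say you tried flattening the PDF as described in the answer to the question How to flatten already filled out PDF form using iTextSharp. However, that question is about flattening forms based on AcroForm technology. It isn't supposed to work with XFA forms. If you want to flatten an XFA form, you need to use XFA Worker on top of iText: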

[JAVA]

Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
XFAFlattener xfaf = new XFAFlattener(document, writer);
xfaf.flatten(new PdfReader(baos.toByteArray()));
document.close();

[C#]

Document document = new Document();
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(dest, FileMode.Create));
XFAFlattener xfaf = new XFAFlattener(document, writer);
ms.Position = 0;
xfaf.Flatten(new PdfReader(ms));
document.Close();

The result of this flattening process is an ordinary PDF that can be parsed by your original code.