Question

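I have written a program that fetches metadata values from a PDF using iTextSharp and iTextSharp.pdfa. I want to fetch the value "First Name" from the PDF. The details of my program are below; can somebody help me search for the value in the PDF?

An object reference error is thrown on this line:

string abc = document.CustomValues["Your First Name:"].ToString();

If I want to find "Your First Name:" in the PDF, how do I do it?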

using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using System.Windows.Forms;

namespace WinPdfSP
{
    class PdfDocuments
    {
        // CompatiblePdfReader (with its Open method) is assumed to be defined elsewhere in the project.
        public static void ExtractMetadata(string sourcePath = "C:\\Users\\UserName1\\Desktop\\SampleData.pdf")
        {
            PdfDocument document = CompatiblePdfReader.Open(sourcePath);

            // The object reference error is thrown here: the indexer yields null
            // when the document has no custom value with this key.
            string abc = document.CustomValues["Your First Name:"].ToString();

            string docdet =
                document.Info.Author.ToString() + Environment.NewLine +
                document.Info.CreationDate.ToString() + Environment.NewLine +
                document.Info.Creator.ToString() + Environment.NewLine +
                document.Info.Keywords.ToString() + Environment.NewLine +
                document.Info.ModificationDate.ToString() + Environment.NewLine +
                document.Info.Producer.ToString() + Environment.NewLine +
                document.Info.Subject.ToString() + Environment.NewLine +
                document.Info.Title.ToString() + Environment.NewLine +
                document.FileSize.ToString() + Environment.NewLine +
                document.FullPath.ToString() + Environment.NewLine +
                document.Guid.ToString() + Environment.NewLine +
                document.Language.ToString() + Environment.NewLine +
                document.PageCount.ToString() + Environment.NewLine +
                document.Version.ToString();

            document.Tag.ToString();
        }
    }
}

Answer 1

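If your PDF really does still contain form fields (AFAIK PDF/A does not allow AcroFields), you can access them in iTextSharp through the AcroFields object.

[Update: which version of iTextSharp are you using? This code snippet was written for Java iText 2.1.7 (the LGPL version), but it should point you in the right direction anyway. If you have XFA fields, iText support is limited in some respects. I suggest you look at the code examples for chapter 6 of iText in Action: http://itextpdf.com/examples/iia.php?id=121]

To access a field, use the following snippet (this is Java, but iTextSharp should be similar):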

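// Assumes `reader` is an already-opened PdfReader for the source PDF.
AcroFields fields = reader.getAcroFields();
if (fields != null) {
    // getField returns the field's current value as a String.
    String value = fields.getField("My Field Name");
    // ... do something with the value ...
}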

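Hope this gets you going. AcroFields is a beast and sometimes behaves strangely. As long as you are dealing with text fields you should be fine. For radio buttons or checkboxes, you should look at the appearance state description in the PDF Reference.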

Answer 2

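If I understand your comments correctly, the field labels ("Your First Name" and "Your Last Name") do not exist in the PDF as anything other than text ("not actually a value in the metadata"). If that is the case, there may not be a very good way to do this, because you really have no guarantee about how the text is stored in the PDF.

So unless you want to dig really deep into the PDF format, you might be out of luck. But you might also be lucky, and "Your First Name: John Dough" is actually stored together as a single string (rather than, say, as two different objects: "Your First Name:" and a separate "John Dough").

If it is a single object, you can use any of the methods mentioned here to extract all text from the PDF. One of the solutions uses iTextSharp, which you are already using. I have personally used PDFBox (also mentioned in the link) with success. Once the PDF is converted to text, you can look through the text, see where the name is stored relative to the field label, and build a regular expression to extract it.

For this to work, the input PDFs must be similar enough that the "to string" conversion produces a consistent, findable pattern for where the name sits relative to the label. As said above, if you are lucky, the two will be right next to each other. Less lucky: lots of other text in between. Unlucky: the text string is just characters in the PDF in seemingly random order.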
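The regular-expression step could look roughly like the sketch below. It is written in Java to match the snippet in Answer 1 and works on text that has already been extracted from the PDF by whichever tool you choose. The class name `LabelValueExtractor`, the method `valueAfterLabel`, and the sample text are all made up for illustration; the pattern assumes the "lucky" case where the value follows the label directly, on the same line or the next one.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelValueExtractor {

    // Hypothetical helper: returns the text that follows `label` in
    // already-extracted PDF text, or an empty Optional if nothing is found.
    static Optional<String> valueAfterLabel(String text, String label) {
        // Pattern.quote makes the label match literally (':' and '.' are not special).
        // \s* tolerates spaces or a line break between the label and the value.
        // The value is captured up to the end of its line.
        Pattern pattern = Pattern.compile(Pattern.quote(label) + "\\s*([^\\r\\n]+)");
        Matcher matcher = pattern.matcher(text);
        if (matcher.find()) {
            String value = matcher.group(1).trim();
            return value.isEmpty() ? Optional.empty() : Optional.of(value);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample of what a text extraction might produce.
        String text = "Application Form\nYour First Name: John\nYour Last Name: Dough\n";
        // prints "John"
        System.out.println(valueAfterLabel(text, "Your First Name:").orElse("(not found)"));
    }
}
```

Note the limitation: if the field is left blank, this pattern picks up whatever text comes next, so it only behaves well when the extracted layout is consistent across your input files.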

Good luck,

/Adam

How to fetch a value from a PDF - SharePoint 2010
