Question

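A PDFBox content stream is created per page, but the fields come from the form, which comes from the catalog, which comes from the PDF document itself. So I'm not sure which fields are on which pages, and this causes text to be written to the wrong locations/pages.

That is, I'm processing fields per page, but I don't know which fields are on which pages.

Is there a way to tell which field is on which page? Or is there a way to get just the fields on the current page?

Thank you!

Mark

Code snippet: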

PDDocument pdfDoc = PDDocument.load(file);
PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();

// Get field names
List<PDField> fieldList = acroForm.getFields();
List<PDPage> pages = pdfDoc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
  PDPageContentStream contentStream = new PDPageContentStream(pdfDoc, page, true, true, true);
  processFields(acroForm, fieldList, contentStream, page);
  contentStream.close();
}

Answer 1

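"A PDFBox content stream is created per page, but the fields come from the form, which comes from the catalog, which comes from the PDF document itself. So I'm not sure which fields are on which pages"

The reason is that a PDF contains a global object structure defining the form. A form field in this structure may have 0, 1, or more visualizations on 0, 1, or more actual PDF pages. Furthermore, if there is only one visualization, the field object and the visualization object are allowed to be merged into one.

PDFBox 1.8.x

Unfortunately, PDFBox in its PDAcroForm and PDField objects represents only this object structure and does not provide easy access to the associated pages. By accessing the underlying structures, though, you can build the connection yourself.

The following code should make clear how to do that: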

@SuppressWarnings("unchecked")
public void printFormFields(PDDocument pdfDoc) throws IOException {
    PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();

    List<PDPage> pages = docCatalog.getAllPages();
    Map<COSDictionary, Integer> pageNrByAnnotDict = new HashMap<COSDictionary, Integer>();
    for (int i = 0; i < pages.size(); i++) {
        PDPage page = pages.get(i);
        for (PDAnnotation annotation : page.getAnnotations())
            pageNrByAnnotDict.put(annotation.getDictionary(), i + 1);
    }

    PDAcroForm acroForm = docCatalog.getAcroForm();

    for (PDField field : (List<PDField>)acroForm.getFields()) {
        COSDictionary fieldDict = field.getDictionary();

        List<Integer> annotationPages = new ArrayList<Integer>();
        List<COSObjectable> kids = field.getKids();
        if (kids != null) {
            for (COSObjectable kid : kids) {
                COSBase kidObject = kid.getCOSObject();
                if (kidObject instanceof COSDictionary)
                    annotationPages.add(pageNrByAnnotDict.get(kidObject));
            }
        }

        Integer mergedPage = pageNrByAnnotDict.get(fieldDict);

        if (mergedPage == null)
            if (annotationPages.isEmpty())
                System.out.printf("i Field '%s' not referenced (invisible).\n", field.getFullyQualifiedName());
            else
                System.out.printf("a Field '%s' referenced by separate annotation on %s.\n", field.getFullyQualifiedName(), annotationPages);
        else
            if (annotationPages.isEmpty())
                System.out.printf("m Field '%s' referenced as merged on %s.\n", field.getFullyQualifiedName(), mergedPage);
            else
                System.out.printf("x Field '%s' referenced as merged on %s and by separate annotation on %s. (Not allowed!)\n", field.getFullyQualifiedName(), mergedPage, annotationPages);
    }
}

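Beware, there are two shortcomings in the PDFBox PDAcroForm form field handling:

1. The PDF specification allows the global object structure defining the form to be a deep tree, i.e. the actual fields do not have to be direct children of the root but may be organized by means of inner tree nodes. PDFBox ignores this and expects the fields to be direct children of the root.
2. Some PDFs in the wild, foremost older ones, do not contain the field tree at all but only reference the field objects from the pages via the visualizing widget annotations. PDFBox does not see these fields in its PDAcroForm.getFields list.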
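To illustrate the first shortcoming, here is a minimal sketch of a depth-first walk that reaches fields below inner tree nodes. It deliberately uses no PDFBox classes: Node, FieldTreeWalk and terminalFieldNames are hypothetical names introduced for this illustration, and a node without kids is simply treated as a terminal field (in a real PDF, the kids of a terminal field are its widgets, so an actual implementation would need to tell fields and widgets apart).

```java
import java.util.ArrayList;
import java.util.List;

class FieldTreeWalk {

    /** Hypothetical stand-in for a form field node; not a PDFBox class. */
    static final class Node {
        final String partialName;
        final List<Node> kids = new ArrayList<>();

        Node(String partialName) { this.partialName = partialName; }

        /** Adds a child and returns this node, so calls can be chained. */
        Node add(Node kid) { kids.add(kid); return this; }
    }

    /** Collects the fully qualified names of all terminal fields, depth-first. */
    static List<String> terminalFieldNames(List<Node> roots) {
        List<String> names = new ArrayList<>();
        for (Node root : roots) {
            collect(root, root.partialName, names);
        }
        return names;
    }

    private static void collect(Node node, String qualifiedName, List<String> out) {
        if (node.kids.isEmpty()) {
            // Simplification: a node without kids counts as a terminal field.
            out.add(qualifiedName);
            return;
        }
        for (Node kid : node.kids) {
            // Qualified names are built by joining partial names with periods.
            collect(kid, qualifiedName + "." + kid.partialName, out);
        }
    }

    public static void main(String[] args) {
        Node address = new Node("address")
                .add(new Node("street"))
                .add(new Node("city"));
        Node form = new Node("form").add(address).add(new Node("name"));
        // Only looking at the root's direct children would miss street and city.
        System.out.println(terminalFieldNames(List.of(form, new Node("signature"))));
    }
}
```

With real PDFBox objects the same recursion would run over the kids of each field instead of the hypothetical Node class.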

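PS: @mikhailvs correctly shows in his answer that you can retrieve a page object from a field widget using PDField.getWidget().getPage() and determine its page number using catalog.getAllPages().indexOf. While fast, this getPage() method has a drawback: it retrieves the page reference from an optional entry of the widget annotation dictionary. Thus, if the PDF you process was created by software that fills that entry, all is well; but if the PDF creator did not fill that entry, all you get is a null page.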

PDFBox 2.0.x

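In 2.0.x some of the methods for accessing the elements in question have changed, but the overall situation has not: to safely retrieve the page of a widget you still have to iterate through the pages and find the page that references the annotation.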

The safe method:

int determineSafe(PDDocument document, PDAnnotationWidget widget) throws IOException
{
    COSDictionary widgetObject = widget.getCOSObject();
    PDPageTree pages = document.getPages();
    for (int i = 0; i < pages.getCount(); i++)
    {
        for (PDAnnotation annotation : pages.get(i).getAnnotations())
        {
            COSDictionary annotationObject = annotation.getCOSObject();
            if (annotationObject.equals(widgetObject))
                return i;
        }
    }
    return -1;
}

The fast method:

int determineFast(PDDocument document, PDAnnotationWidget widget)
{
    PDPage page = widget.getPage();
    return page != null ? document.getPages().indexOf(page) : -1;
}

Usage:

PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
    for (PDField field : acroForm.getFieldTree())
    {
        System.out.println(field.getFullyQualifiedName());
        for (PDAnnotationWidget widget : field.getWidgets())
        {
            System.out.print(widget.getAnnotationName() != null ? widget.getAnnotationName() : "(NN)");
            System.out.printf(" - fast: %s", determineFast(document, widget));
            System.out.printf(" - safe: %s\n", determineSafe(document, widget));
        }
    }
}

(DetermineWidgetPage.java)

(In contrast to the 1.8.x code, the safe method here simply searches for the page of a single widget. If your code has to determine the pages of many widgets, you should create a lookup Map as in the 1.8.x case.)
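A lookup Map of that kind could be sketched roughly as follows. This is a library-independent illustration, not code from DetermineWidgetPage.java: WidgetPageLookup, indexByAnnotation and pageOf are hypothetical names, and plain Object instances stand in for the annotation dictionaries. Keys are compared by identity, on the assumption that a widget and the page annotation referring to it resolve to the same object.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class WidgetPageLookup {

    /**
     * Builds a reverse index from annotation object to zero-based page index
     * in a single pass over all pages. Keys are compared by identity.
     */
    static <P, A> Map<A, Integer> indexByAnnotation(List<P> pages,
                                                    Function<P, List<A>> annotationsOf) {
        Map<A, Integer> pageIndexByAnnotation = new IdentityHashMap<>();
        for (int i = 0; i < pages.size(); i++) {
            for (A annotation : annotationsOf.apply(pages.get(i))) {
                // Keep the first page that lists the annotation.
                pageIndexByAnnotation.putIfAbsent(annotation, i);
            }
        }
        return pageIndexByAnnotation;
    }

    /** Returns the page index recorded for the widget, or -1 if no page lists it. */
    static <A> int pageOf(Map<A, Integer> index, A widget) {
        Integer page = index.get(widget);
        return page != null ? page : -1;
    }

    public static void main(String[] args) {
        // Stand-ins for annotation dictionaries (hypothetical test data).
        Object widgetA = new Object();
        Object widgetB = new Object();
        Object orphan = new Object();
        List<Object> page0 = List.of(widgetA);
        List<Object> page1 = List.of();
        List<Object> page2 = List.of(widgetB);

        Map<Object, Integer> index =
                indexByAnnotation(List.of(page0, page1, page2), page -> page);
        System.out.println(pageOf(index, widgetA)); // 0
        System.out.println(pageOf(index, widgetB)); // 2
        System.out.println(pageOf(index, orphan));  // -1
    }
}
```

With PDFBox 2.0.x the pages would come from document.getPages() and the keys would be the objects returned by getCOSObject() for each entry of page.getAnnotations(). Since getAnnotations() declares IOException, the lambda passed as annotationsOf would have to catch or wrap it. The index is built once and each later lookup is a single map access, instead of one scan of all pages per widget as in determineSafe.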

Example documents

A document for which the fast method fails: aFieldTwice.pdf

A document for which the fast method works: test_duplicate_field2.pdf

Answer 2

Granted, this answer may not help the OP (a year later), but if someone else runs into the same problem, here is the solution:

PDDocumentCatalog catalog = doc.getDocumentCatalog();

int pageNumber = catalog.getAllPages().indexOf(yourField.getWidget().getPage());

Answer 3

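This example uses Lucee (CFML).

Many thanks to mkl: his answer above was invaluable, and I could not have built this function without his help.

Call the function pageForSignature(doc, fieldName); it returns the number of the page on which the field with that name resides. If fieldName is not found, it returns -1.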

  <cfscript>
  try{

  /*
  java is used by using CreateObject()
  */

  variables.File = CreateObject("java", "java.io.File");

  //references lucee bundle directory - typically on tomcat: /usr/local/tomcat/lucee-server/bundles
  variables.PDDocument = CreateObject("java", "org.apache.pdfbox.pdmodel.PDDocument", "org.apache.pdfbox.app", "2.0.18")

  function determineSafe(doc, widget){

    var i = '';
    var widgetObject = widget.getCOSObject();
    var pages = doc.getPages();
    var annotation = '';
    var annotationObject = '';

    for (i = 0; i < pages.getCount(); i=i+1){

    for (annotation in pages.get(i).getAnnotations()){
        if(annotation.getSubtype() eq 'widget'){
            annotationObject = annotation.getCOSObject();
            if (annotationObject.equals(widgetObject)){
                return i;
            }
        }
    }

    }
    return -1;
  }

  function pageForSignature(doc, fieldName){
    try{
    var acroForm = doc.getDocumentCatalog().getAcroForm();
    var field = '';
    var widget = '';
    var annotation = '';
    var pageNo = '';

    for(field in acroForm.getFields()){

    if(field.getPartialName() == fieldName){

        for(widget in field.getWidgets()){

           for(annotation in widget.getPage().getAnnotations()){

             if(annotation.getSubtype() == 'widget'){

                pageNo = determineSafe(doc, widget);
                doc.close();
                return pageNo;
             }
           }

        }
    }
  }
return -1;  
}catch(e){
    doc.close();
writeDump(label="catch error",var='#e#');
  }
} 

doc = PDDocument.init().load(File.init('/**********/myfile.pdf'));

//returns no,  page numbers start at 0
pageNo = pageForSignature(doc, 'twtzceuxvx');

writeDump(label="pageForSignature(doc, fieldName)", var="#pageNo#");
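  } catch (any e) {
    // closes the outer try opened at the top of the script
    writeDump(label="catch error", var=e);
  }
</cfscript>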

Answer 4

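A generic solution for single or multiple widgets (duplicate qualified names on a single page):

List<PDAnnotationWidget> widgets = field.getWidgets();
PDDocumentCatalog catalog = doc.getDocumentCatalog();
for (int i = 0; i < widgets.size(); i++) {
    // one-based page number of the i-th widget of the field
    int pageNumber = 1 + catalog.getPages().indexOf(widgets.get(i).getPage());

    /* The field coordinates can also be obtained here; this works for single and multiple widgets. */
    // PDRectangle r = widgets.get(i).getRectangle();
}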

How to know if a field is on a particular page?
