Question

我对具有父字段的文本字段进行了一些acrofield操作。这个工作到目前为止，但表单还包含一些复选框，不会更改。但是当我将操纵的pdf存储到磁盘并检查复选框的值时，我可以看到cb_a.0的值已从ÄÖÜ？ß更改为?????

由于这种意外的改变，我的进一步处理失败了，任何想法如何防止这种情况？

我的测试用例

@Test
public void changeBoxedFieldsToOne() throws IOException {

    File encodingPdfFile = new File(classLoader.getResource("./prefill/TestFormEncoding.pdf").getFile());
    byte[] encodingPdfByte = Files.readAllBytes(encodingPdfFile.toPath());

    PdfAcrofieldManipulator pdfMani = new PdfAcrofieldManipulator(encodingPdfByte);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() > 0);
    pdfMani.changeBoxedFieldsToOne();
    byte[] changedPdf = pdfMani.savePdf();
    Files.write(Paths.get("./build/changeBoxedFieldsToOne.pdf"), changedPdf);

    pdfMani = new PdfAcrofieldManipulator(changedPdf);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() == 0);

}

public void changeBoxedFieldsToOne() {
    PDDocumentCatalog docCatalog = pdDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();

    List<PDNonTerminalField> textFieldWithMoreThan2Childrens = getTextFieldsWithMoreThan2Children();

    for (PDField field : textFieldWithMoreThan2Childrens) {
            int amountOfChilds = ((PDNonTerminalField) field).getChildren().size();
            String currentFieldName = field.getPartialName();
            LOG.info("merging fields of fieldnam {0} to one field", currentFieldName);

            PDField firstChild = getChildWithPartialName((PDNonTerminalField) field, "0");
            if (firstChild == null ) {
                LOG.debug("found field which has a dot but starts not with 0, skipping this field");
                continue;
            }
            PDField lastChild = getChildWithPartialName((PDNonTerminalField) field, Integer.toString(amountOfChilds - 1));
            PDPage pageWhichContainsField = firstChild.getWidgets().get(0).getPage();

            try {
                removeField(pdDocument, currentFieldName);
            } catch (IOException e) {
                LOG.error("Error while removing field {0}", currentFieldName, e);
            }
            PDField newField = creatNewField(acroForm, field, firstChild, lastChild, pageWhichContainsField);
            acroForm.getFields().add(newField);
            PDAnnotationWidget newFieldWidget = createWidgetForField(newField, pageWhichContainsField, firstChild, lastChild);
            try {
                pageWhichContainsField.getAnnotations().add(newFieldWidget);
            } catch (IOException e) {
                LOG.error("error while adding new field to page");
        }
    }
}

public byte[] savePdf() throws IOException {
    try (final ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        //pdDocument.saveIncremental(out);
        pdDocument.save(out);
        pdDocument.close();
        return out.toByteArray();
    }
}

我正在使用PDFBox 2.0.8

以下是源PDF：https://ufile.io/gr01f或此处https://www.file-upload.net/download-12928052/TestFormEncoding.pdf.html

此处输出：https://ufile.io/k8cr3或此处https://www.file-upload.net/download-12928049/changeBoxedFieldsToOne.pdf.html

Answer 1

这确实是PDFBox中的一个错误：PDFBox无法正确处理包含值超出US_ASCII范围的字节的PDF名称对象（特别是在0..127范围之外，而你的变音符号在外面）。

PDF名称处理中的第一个错误是PDFBox在混合UTF-8 / CP-1252解码策略之后在内部将它们表示为字符串。这是错误的，根据PDF规范，名称对象是由除null（字符代码0）之外的任何字符（8位值）序列唯一定义的原子符号。 [...]

通常，构成名称的字节永远不会被视为要呈现给人类用户或PDF处理器外部应用程序的文本。但是，偶尔需要将名称对象视为文本，例如表示字体名称[...]，分色或DeviceN颜色空间中的着色剂名称或结构类型[...]

在这种情况下，构成名称对象的字节序列应根据UTF-8进行解释，UTF-8是一种可变长度的字节编码表示。

因此，将名称视为字节序列以外的其他内容通常没有意义。只有在某些上下文中使用的名称才有意义，因为UTF-8编码的字符串。

此外，混合UTF-8 / CP-1252解码策略，即首先尝试使用UTF-8进行解码并且在失败的情况下再次使用CP-1252进行解码，可以为不同的名称实体创建相同的字符串表示形式，所以这可以通过使不平等的名字相等来伪造。

在您的情况下，这不是问题，但您可以解释您使用的名称。

第二个错误是，在序列化PDF时，它只对表示来自US_ASCII的名称的字符串中的字符进行了正确编码，所有其他字符都替换为“？”：

public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        [...]
    }
}

（来自org.apache.pdfbox.cos.COSName.writePDF(OutputStream)）

这是您的复选框值（内部由PDF名称对象表示）损坏无法修复的地方......

显示问题的一个更简单的例子是：

PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"), "äöüß");
document.save(new File(RESULT_FOLDER, "non-ascii-name.pdf"));
document.close();

在结果中，带有自定义条目的目录如下所示：

1 0 obj
<<
/Type /Catalog
/Version /1.4
/Pages 2 0 R
/#3F#3F#3F#3F <E4F6FCDF>
>>

在名称键中，所有字符都替换为“？”以十六进制编码形式（#3F），而在字符串值中，字符被适当地编码。

经过一番搜索后，我偶然发现了answer关于这个话题，差不多两年前。那时，PDF Name对象字节总是被解释为UTF-8编码，这导致了该问题的问题。

因此创建了问题PDFBOX-3347。为了解决这个问题，引入了混合UTF-8 / CP-1252解码策略。如上所述，我不是那种策略的朋友。

在该堆栈溢出答案中，我还讨论了与PDF序列化期间US_ASCII的使用相关的问题，但该方面尚未得到解决。

另一个相关问题是PDFBOX-3519，但它的解决方案也减少了，试图修复PDF名称的解析，忽略了它的序列化。

另一个相关问题是PDFBOX-2836。

使用pdfbox操作acrofields会更改onValue

1 个答案: