PDFBox: Removing a single field from a PDF

Time: 2016-12-10 01:08:44

Tags: java pdf pdfbox

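The simplest way I can describe the problem: we are using PDFBox to remove just one field (for example, a credit card number) from a PDF that HelloSign sends us.

  1. The data in question will always be on the last page, and always at the same coordinates on that page.
  2. The data needs to be removed from the PDF completely. We can't simply change the font to white or draw a box on top, because the text could still be selected and therefore copied.
  3. Only the one field may be removed. We still need the other fields and the signatures.
  4. I created a sample document and uploaded it to Dropbox: input.pdf
  5. For the purposes of this question, assume the field to remove is the street address in the file I uploaded. Not the city, state, zip, signatures, or dates. (In real life it would be a sensitive field such as a credit card number or SSN.)
  6. I put a lengthy explanation of the problem, and of what I have tried so far, in the first comment below.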

1 answer:

Answer 0: (score: 0)

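The code in this answer may look fairly generic, as it first determines a map of the fields in the document and then allows clearing any combination of the text fields. Please be aware, though, that it was developed against only the single example PDF from this question. I therefore cannot be sure that I correctly understood how HelloSign marks fields, and in particular how HelloSign fills them in.

This answer presents two classes: one analyzes the HelloSign form, the other manipulates it by clearing selected fields. The latter relies on the information gathered by the former. Both classes are built on top of the PDFBox PDFTextStripper utility class.

The code was developed against the current PDFBox development version, 2.1.0-SNAPSHOT. Most likely it also works with all 2.0.x versions.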

HelloSignAnalyzer

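This class analyzes the given PDDocument, looking for the sequences

  • [$varname ], which appear to define placeholders for the form field contents, and
  • [def:$varname|type|req|signer|display|label], which appear to define the properties of those placeholders.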
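To make the two marker formats concrete, here is a minimal sketch that pulls them out of a plain string with regular expressions. It is independent of PDFBox and is not how the class below does it (that one scans for brackets by hand inside writeString); the class name `HelloSignMarkerSketch` and the sample string are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the answer's classes.
class HelloSignMarkerSketch
{
    // [$name   ] : the name ends at the first whitespace or at ']'
    static final Pattern PLACEHOLDER = Pattern.compile("\\[\\$([^\\s\\]]+)[^\\]]*\\]");
    // [def:$name|type|req|signer|display|label]
    static final Pattern DEFINITION = Pattern.compile("\\[def:\\$([^\\]]*)\\]");

    // Returns the names of all placeholder markers found in the text.
    static List<String> placeholderNames(String text)
    {
        List<String> names = new ArrayList<>();
        Matcher matcher = PLACEHOLDER.matcher(text);
        while (matcher.find())
            names.add(matcher.group(1));
        return names;
    }

    // Returns, per definition marker, its pieces split at '|'.
    static List<String[]> definitions(String text)
    {
        List<String[]> result = new ArrayList<>();
        Matcher matcher = DEFINITION.matcher(text);
        while (matcher.find())
            result.add(matcher.group(1).split("\\|"));
        return result;
    }

    public static void main(String[] args)
    {
        // Assumed sample text, modelled on the markers described above.
        String sample = "[$var1001      ] [def:$var1001|text|req|signer1|Address|address1]";
        System.out.println(placeholderNames(sample));
        for (String[] pieces : definitions(sample))
            System.out.println(String.join(" / ", pieces));
    }
}
```

The first piece of a definition is the variable name, which is what ties a definition to its placeholder.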

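It creates a collection of HelloSignField instances, each of which describes one such placeholder. If text is found located over the placeholder, the instance also contains the value of the respective field.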
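The value is assembled glyph by glyph, inserting a space whenever the horizontal gap to the previous glyph exceeds half the width of a space. A stripped-down sketch of that heuristic, with a hypothetical `Glyph` type standing in for the few TextPosition properties involved (both names are assumptions, not PDFBox API):

```java
import java.util.Arrays;
import java.util.List;

class GlyphJoinSketch
{
    // Hypothetical stand-in for a PDFBox TextPosition.
    static final class Glyph
    {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width)
        {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyphs left to right; a gap wider than half a space becomes " ".
    static String join(List<Glyph> glyphs, float widthOfSpace)
    {
        StringBuilder value = new StringBuilder();
        float lastX = 0;
        for (Glyph glyph : glyphs)
        {
            if (glyph.x > lastX + widthOfSpace / 2 && value.length() > 0)
                value.append(' ');
            value.append(glyph.text);
            lastX = glyph.x + glyph.width; // right edge of this glyph
        }
        return value.toString();
    }

    public static void main(String[] args)
    {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("1", 90, 5), new Glyph("2", 95, 5),
                new Glyph("3", 100, 5), new Glyph("M", 110, 8));
        System.out.println(join(glyphs, 3)); // prints "123 M"
    }
}
```

Like the original, this assumes the glyphs arrive in reading order on a single line.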

Furthermore, it stores the name of the last xobject drawn on the page, which in the sample document is where HelloSign draws its field contents.

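import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class HelloSignAnalyzer extends PDFTextStripper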
{
    public class HelloSignField
    {
        public String getName()
        { return name; }
        public String getValue()
        { return value; }
        public float getX()
        { return x; }
        public float getY()
        { return y; }
        public float getWidth()
        { return width; }
        public String getType()
        { return type; }
        public boolean isOptional()
        { return optional; }
        public String getSigner()
        { return signer; }
        public String getDisplay()
        { return display; }
        public String getLabel()
        { return label; }
        public float getLastX()
        { return lastX; }

        String name = null;
        String value = "";
        float x = 0, y = 0, width = 0;
        String type = null;
        boolean optional = false;
        String signer = null;
        String display = null;
        String label = null;

        float lastX = 0;

        @Override
        public String toString()
        {
            return String.format("[Name: '%s'; Value: `%s` Position: %s, %s; Width: %s; Type: '%s'; Optional: %s; Signer: '%s'; Display: '%s', Label: '%s']",
                    name, value, x, y, width, type, optional, signer, display, label);
        }

        void checkForValue(List<TextPosition> textPositions)
        {
            for (TextPosition textPosition : textPositions)
            {
                if (inField(textPosition))
                {
                    float textX = textPosition.getTextMatrix().getTranslateX();
                    if (textX > lastX + textPosition.getWidthOfSpace() / 2 && value.length() > 0)
                        value += " ";
                    value += textPosition.getUnicode();
                    lastX = textX + textPosition.getWidth();
                }
            }
        }

        boolean inField(TextPosition textPosition)
        {
            float yPos = textPosition.getTextMatrix().getTranslateY();
            float xPos = textPosition.getTextMatrix().getTranslateX();

            return inField(xPos, yPos);
        }

        boolean inField(float xPos, float yPos)
        {
            if (yPos < y - 3 || yPos > y + 3)
                return false;

            if (xPos < x - 1 || xPos > x + width + 1)
                return false;

            return true;
        }
    }

    public HelloSignAnalyzer(PDDocument pdDocument) throws IOException
    {
        super();
        this.pdDocument = pdDocument;
    }

    public Map<String, HelloSignField> analyze() throws IOException
    {
        if (!analyzed)
        {
            fields = new HashMap<>();

            setStartPage(pdDocument.getNumberOfPages());
            getText(pdDocument);

            analyzed = true;
        }
        return Collections.unmodifiableMap(fields);
    }

    public String getLastFormName()
    {
        return lastFormName;
    }

    //
    // PDFTextStripper overrides
    //
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        // let every field found so far pick up text drawn over its placeholder
        for (HelloSignField field : fields.values())
        {
            field.checkForValue(textPositions);
        }

        int position = -1;
        while ((position = text.indexOf('[', position + 1)) >= 0)
        {
            int endPosition = text.indexOf(']', position);
            if (endPosition < 0)
                continue;
            if (endPosition > position + 1 && text.charAt(position + 1) == '$')
            {
                String fieldName = text.substring(position + 2, endPosition);
                int spacePosition = fieldName.indexOf(' ');
                if (spacePosition >= 0)
                    fieldName = fieldName.substring(0, spacePosition);
                HelloSignField field = getOrCreateField(fieldName);

                TextPosition start = textPositions.get(position);
                field.x = start.getTextMatrix().getTranslateX();
                field.y = start.getTextMatrix().getTranslateY();
                TextPosition end = textPositions.get(endPosition);
                field.width = end.getTextMatrix().getTranslateX() + end.getWidth() - field.x;
            }
            else if (endPosition > position + 5 && "def:$".equals(text.substring(position + 1, position + 6)))
            {
                String definition = text.substring(position + 6, endPosition);
                String[] pieces = definition.split("\\|");
                if (pieces.length == 0)
                    continue;
                HelloSignField field = getOrCreateField(pieces[0]);

                if (pieces.length > 1)
                    field.type = pieces[1];
                if (pieces.length > 2)
                    field.optional = !"req".equals(pieces[2]);
                if (pieces.length > 3)
                    field.signer = pieces[3];
                if (pieces.length > 4)
                    field.display = pieces[4];
                if (pieces.length > 5)
                    field.label = pieces[5];
            }
        }

        super.writeString(text, textPositions);
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException
    {
        String currentFormName = formName; 
        if (operator != null && "Do".equals(operator.getName()) && operands != null && operands.size() > 0)
        {
            COSBase base0 = operands.get(0);
            if (base0 instanceof COSName)
            {
                formName = ((COSName)base0).getName();
                if (currentFormName == null)
                    lastFormName = formName;
            }
        }
        try
        {
            super.processOperator(operator, operands);
        }
        finally
        {
            formName = currentFormName;
        }
    }

    //
    // helper methods
    //
    HelloSignField getOrCreateField(String name)
    {
        HelloSignField field = fields.get(name);
        if (field == null)
        {
            field = new HelloSignField();
            field.name = name;
            fields.put(name, field);
        }
        return field;
    }

    //
    // inner member variables
    //
    final PDDocument pdDocument;
    boolean analyzed = false;
    Map<String, HelloSignField> fields = null;
    String formName = null;
    String lastFormName = null;
}

HelloSignAnalyzer.java

Usage

The HelloSignAnalyzer can be applied to a document like this:

PDDocument pdDocument = PDDocument.load(...);

HelloSignAnalyzer helloSignAnalyzer = new HelloSignAnalyzer(pdDocument);

Map<String, HelloSignField> fields = helloSignAnalyzer.analyze();

System.out.printf("Found %s fields:\n\n", fields.size());

for (Map.Entry<String, HelloSignField> entry : fields.entrySet())
{
    System.out.printf("%s -> %s\n", entry.getKey(), entry.getValue());
}

System.out.printf("\nLast form name: %s\n", helloSignAnalyzer.getLastFormName());

(PlayWithHelloSign.java, test method testAnalyzeInput)

For the OP's sample document the output is

Found 8 fields:

var1001 -> [Name: 'var1001'; Value: `123 Main St.` Position: 90.0, 580.0; Width: 165.53601; Type: 'text'; Optional: false; Signer: 'signer1'; Display: 'Address', Label: 'address1']
var1004 -> [Name: 'var1004'; Value: `12345` Position: 210.0, 564.0; Width: 45.53601; Type: 'text'; Optional: false; Signer: 'signer1'; Display: 'Postal Code', Label: 'zip']
var1002 -> [Name: 'var1002'; Value: `TestCity` Position: 90.0, 564.0; Width: 65.53601; Type: 'text'; Optional: false; Signer: 'signer1'; Display: 'City', Label: 'city']
var1003 -> [Name: 'var1003'; Value: `AA` Position: 161.0, 564.0; Width: 45.53601; Type: 'text'; Optional: false; Signer: 'signer1'; Display: 'State', Label: 'state']
date2 -> [Name: 'date2'; Value: `2016/12/09` Position: 397.0, 407.0; Width: 124.63202; Type: 'date'; Optional: false; Signer: 'signer2'; Display: 'null', Label: 'null']
signature1 -> [Name: 'signature1'; Value: `` Position: 88.0, 489.0; Width: 236.624; Type: 'sig'; Optional: false; Signer: 'signer1'; Display: 'null', Label: 'null']
date1 -> [Name: 'date1'; Value: `2016/12/09` Position: 397.0, 489.0; Width: 124.63202; Type: 'date'; Optional: false; Signer: 'signer1'; Display: 'null', Label: 'null']
signature2 -> [Name: 'signature2'; Value: `` Position: 88.0, 407.0; Width: 236.624; Type: 'sig'; Optional: false; Signer: 'signer2'; Display: 'null', Label: 'null']

Last form name: Xi0

HelloSignManipulator

This class uses the information gathered by a HelloSignAnalyzer to clear the contents of the text fields given by their names.

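import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDTransparencyGroup;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.util.Matrix;

// HelloSignField is the nested class HelloSignAnalyzer.HelloSignField;
// import it from whichever package holds HelloSignAnalyzer.

public class HelloSignManipulator extends PDFTextStripper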
{
    public HelloSignManipulator(HelloSignAnalyzer helloSignAnalyzer) throws IOException
    {
        super();
        this.helloSignAnalyzer = helloSignAnalyzer;
        addOperator(new SelectiveDrawObject());
    }

    public void clearFields(Iterable<String> fieldNames) throws IOException
    {
        try
        {
            Map<String, HelloSignField> fieldMap = helloSignAnalyzer.analyze();
            List<HelloSignField> selectedFields = new ArrayList<>();
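            for (String fieldName : fieldNames)
            {
                HelloSignField field = fieldMap.get(fieldName);
                // skip names the analyzer did not find; a null entry would
                // cause a NullPointerException in processOperator
                if (field != null)
                    selectedFields.add(field);
            }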
            fields = selectedFields;

            PDDocument pdDocument = helloSignAnalyzer.pdDocument;
            setStartPage(pdDocument.getNumberOfPages());
            getText(pdDocument);
        }
        finally
        {
            fields = null;
        }
    }

    class SelectiveDrawObject extends OperatorProcessor
    {
        @Override
        public void process(Operator operator, List<COSBase> arguments) throws IOException
        {
            if (arguments.size() < 1)
            {
                throw new MissingOperandException(operator, arguments);
            }
            COSBase base0 = arguments.get(0);
            if (!(base0 instanceof COSName))
            {
                return;
            }
            COSName name = (COSName) base0;

            // getLastFormName() is null if no xobject was drawn, so compare from the non-null side
            if (replacement != null || !name.getName().equals(helloSignAnalyzer.getLastFormName()))
            {
                return;
            }

            if (context.getResources().isImageXObject(name))
            {
                throw new IllegalArgumentException("The form xobject to edit turned out to be an image.");
            }

            PDXObject xobject = context.getResources().getXObject(name);

            if (xobject instanceof PDTransparencyGroup)
            {
                throw new IllegalArgumentException("The form xobject to edit turned out to be a transparency group.");
            }
            else if (xobject instanceof PDFormXObject)
            {
                PDFormXObject form = (PDFormXObject) xobject;
                PDFormXObject formReplacement = new PDFormXObject(helloSignAnalyzer.pdDocument);
                formReplacement.setBBox(form.getBBox());
                formReplacement.setFormType(form.getFormType());
                formReplacement.setMatrix(form.getMatrix().createAffineTransform());
                formReplacement.setResources(form.getResources());
                OutputStream outputStream = formReplacement.getContentStream().createOutputStream(COSName.FLATE_DECODE);
                replacement = new ContentStreamWriter(outputStream);

                context.showForm(form);

                outputStream.close();
                getResources().put(name, formReplacement);
                replacement = null;
            }
        }

        @Override
        public String getName()
        {
            return "Do";
        }
    }

    //
    // PDFTextStripper overrides
    //
    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException
    {
        if (replacement != null)
        {
            boolean copy = true;
            if (TjTJ.contains(operator.getName()))
            {
                Matrix transformation = getTextMatrix().multiply(getGraphicsState().getCurrentTransformationMatrix());
                float xPos = transformation.getTranslateX();
                float yPos = transformation.getTranslateY();
                for (HelloSignField field : fields)
                {
                    if (field.inField(xPos, yPos))
                    {
                        copy = false;
                    }
                }
            }

            if (copy)
            {
                replacement.writeTokens(operands);
                replacement.writeToken(operator);
            }
        }
        super.processOperator(operator, operands);
    }

    //
    // helper methods
    //
    final HelloSignAnalyzer helloSignAnalyzer;
    final Collection<String> TjTJ = Arrays.asList("Tj", "TJ");
    Iterable<HelloSignField> fields;
    ContentStreamWriter replacement = null;
}

HelloSignManipulator.java
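The core idea of processOperator above is a selective copy: every instruction of the form xobject is written to the replacement stream unless it is a text-showing operator (Tj or TJ) whose current text position falls inside one of the selected fields. A PDFBox-free sketch of just that idea follows. It is heavily simplified and rests on assumptions: instructions are modelled as plain string arrays, and only an absolute Tm is tracked, whereas a real content stream can also move the text position with Td, TD, T* and the current transformation matrix, which is why the real class lets PDFBox track the state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SelectiveCopySketch
{
    // Each instruction is a non-empty array: its operands, then the operator name.
    // Copies everything except Tj/TJ instructions positioned inside the field box.
    static List<String[]> filter(List<String[]> instructions, float x, float y, float width)
    {
        List<String[]> copy = new ArrayList<>();
        float textX = 0, textY = 0;
        for (String[] instruction : instructions)
        {
            String operator = instruction[instruction.length - 1];
            if ("Tm".equals(operator) && instruction.length == 7)
            {
                // a b c d e f Tm : e and f are the translation
                textX = Float.parseFloat(instruction[4]);
                textY = Float.parseFloat(instruction[5]);
            }
            boolean showsText = "Tj".equals(operator) || "TJ".equals(operator);
            // same tolerances as HelloSignField.inField
            boolean inField = textY >= y - 3 && textY <= y + 3
                    && textX >= x - 1 && textX <= x + width + 1;
            if (!(showsText && inField))
                copy.add(instruction);
        }
        return copy;
    }

    public static void main(String[] args)
    {
        List<String[]> stream = Arrays.asList(
                new String[] {"BT"},
                new String[] {"1", "0", "0", "1", "90", "580", "Tm"},
                new String[] {"(123 Main St.)", "Tj"},
                new String[] {"1", "0", "0", "1", "90", "564", "Tm"},
                new String[] {"(TestCity)", "Tj"},
                new String[] {"ET"});
        for (String[] instruction : filter(stream, 90, 580, 165.5f))
            System.out.println(String.join(" ", instruction));
    }
}
```

Note that only the text-showing instruction is dropped. The positioning instructions stay in place, so the remaining text is drawn where it was before.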

Usage: clearing a single field

The HelloSignManipulator can be applied to a document like this to clear a single field:

PDDocument pdDocument = PDDocument.load(...);

HelloSignAnalyzer helloSignAnalyzer = new HelloSignAnalyzer(pdDocument);

HelloSignManipulator helloSignManipulator = new HelloSignManipulator(helloSignAnalyzer);

helloSignManipulator.clearFields(Collections.singleton("var1001"));

pdDocument.save(...);

(PlayWithHelloSign.java, test method testClearAddress1Input)

Usage: clearing multiple fields at once

The HelloSignManipulator can be applied to a document like this to clear multiple fields at once:

PDDocument pdDocument = PDDocument.load(...);

HelloSignAnalyzer helloSignAnalyzer = new HelloSignAnalyzer(pdDocument);

HelloSignManipulator helloSignManipulator = new HelloSignManipulator(helloSignAnalyzer);

helloSignManipulator.clearFields(Arrays.asList("var1004", "var1003", "date2"));

pdDocument.save(...);

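(PlayWithHelloSign.java, test method testClearZipStateDate2Input)

Usage: clearing multiple fields successively

The HelloSignManipulator can be applied to a document like this to clear multiple fields one after another: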

PDDocument pdDocument = PDDocument.load(...);

HelloSignAnalyzer helloSignAnalyzer = new HelloSignAnalyzer(pdDocument);

HelloSignManipulator helloSignManipulator = new HelloSignManipulator(helloSignAnalyzer);

helloSignManipulator.clearFields(Collections.singleton("var1004"));
helloSignManipulator.clearFields(Collections.singleton("var1003"));
helloSignManipulator.clearFields(Collections.singleton("date2"));

pdDocument.save(...);

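(PlayWithHelloSign.java, test method testClearZipStateDate2SuccessivelyInput)

Caveat

These classes are mere proofs of concept. For one thing, they were built from a single example HelloSign file, so there is a good chance that important details were missed. For another, there are some built-in assumptions, for example in the HelloSignField method inField.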
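To illustrate the kind of assumption hidden in inField: the method accepts text within 3 units vertically and 1 unit horizontally of the placeholder. That is fine for the sample document, where the rows are 16 units apart, but in a layout with rows closer than 6 units the same text would match two fields. The check is copied here into a standalone class; the tighter layout is hypothetical.

```java
class FieldBoxSketch
{
    // Same test as HelloSignField.inField, with the field passed in explicitly.
    static boolean inField(float x, float y, float width, float xPos, float yPos)
    {
        if (yPos < y - 3 || yPos > y + 3)
            return false;
        if (xPos < x - 1 || xPos > x + width + 1)
            return false;
        return true;
    }

    public static void main(String[] args)
    {
        // Sample layout: address row at y=580, city row at y=564.
        System.out.println(inField(90, 580, 165.5f, 90, 580)); // true
        System.out.println(inField(90, 564, 65.5f, 90, 580));  // false

        // Hypothetical tight layout: rows at y=580 and y=576, text at y=578.
        System.out.println(inField(90, 580, 165.5f, 90, 578)); // true
        System.out.println(inField(90, 576, 165.5f, 90, 578)); // true as well, so ambiguous
    }
}
```

With real input the tolerances would probably have to be derived from the font size or the row spacing rather than hard-coded.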

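Furthermore, manipulating signed HelloSign files may in general not be a good idea. If I understand their concept correctly, they store a hash of each signed document to allow verification of its content, and if a document is manipulated as shown above, the hash will no longer match.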
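Why any change breaks such a check can be shown with a plain digest comparison. This is only an illustration: I do not know which hash HelloSign uses or over which bytes it is computed. SHA-256 over the whole file, the class name and the byte strings are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DocumentHashSketch
{
    // SHA-256 is an assumption; the actual scheme is not known here.
    static byte[] sha256(byte[] data)
    {
        try
        {
            return MessageDigest.getInstance("SHA-256").digest(data);
        }
        catch (NoSuchAlgorithmException e)
        {
            // every Java platform is required to provide SHA-256
            throw new IllegalStateException(e);
        }
    }

    // True if the document still hashes to the stored value.
    static boolean matches(byte[] document, byte[] storedHash)
    {
        return MessageDigest.isEqual(sha256(document), storedHash);
    }

    public static void main(String[] args)
    {
        // Stand-ins for the file contents before and after clearing a field.
        byte[] signed = "(123 Main St.) Tj (TestCity) Tj".getBytes(StandardCharsets.US_ASCII);
        byte[] cleared = "(TestCity) Tj".getBytes(StandardCharsets.US_ASCII);

        byte[] storedHash = sha256(signed);
        System.out.println(matches(signed, storedHash));
        System.out.println(matches(cleared, storedHash));
    }
}
```

So if the hash matters for your process, the redacted copy has to be treated as a new, unverified document.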