Reading legacy Word form checkboxes converted to PDF

Date: 2016-11-11 14:22:45

Tags: c# pdf checkbox itext

Our customers send us orders as PDF files. The form is generated from a Word document built with legacy form fields.

Currently, people in our customer center type the orders into our system by hand, but we have decided to try to automate this task.

I am able to read the text content of the PDF page by page with a simple PdfReader:

    public static string GetPdfText(string path)
    { 
        var text = string.Empty;
        using (var reader = new PdfReader(path))
        {
            for (var page = 1; page <= reader.NumberOfPages; page++)
            {
                text += PdfTextExtractor.GetTextFromPage(reader, page);
            }
        }
        return text;
    }

But not the checkboxes...

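When I loop through every object in the PDF, I can see the checkboxes as dictionaries, but I cannot distinguish them from other objects or read their values...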

    public static IEnumerable<PdfDictionary> ReadCheckboxes(string path)
    {
        using (var reader = new PdfReader(path))
        {
            var checkboxes = new List<PdfDictionary>();
            for (var i = 0; i < reader.XrefSize; i++)
            {
                var pdfObject = reader.GetPdfObject(i);
                checkboxes.Add((PdfDictionary) pdfObject);
            }
            return checkboxes;
        }
    }

What am I missing? I have also tried reading the AcroFields, but they are empty...

I have uploaded a sample PDF containing the legacy checkboxes here

For the moment, we cannot integrate our systems directly, nor can we make any changes to the underlying PDF or Word document.

1 Answer:

Answer 0: (score: 2)

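The OP clarified in comments that a solution returning output like "checkbox at position x0, y0, checked; checkbox at position x1, y1, unchecked; ..." is sufficient, i.e. his "forms" are static enough that these positions allow the meaning of each checkbox to be identified. Thus, here is an implementation of this variant.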

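I just saw that the question is tagged c#, whereas I implemented the search in Java. This should not be much of a problem, as the code should be easy to port. If porting turns out to be a problem, I will add a C# version here.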

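Because the checkboxes are drawn using vector graphics, the text extraction the OP already uses does not find them. Fortunately, the iText parser framework can also be used to find vector graphics.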

Thus, we first need an ExtRenderListener (IExtRenderListener in iTextSharp) that collects the boxes. It only has non-trivial implementations of the interface methods modifyPath and renderPath:

@Override
public void modifyPath(PathConstructionRenderInfo renderInfo)
{
    switch (renderInfo.getOperation())
    {
    case PathConstructionRenderInfo.RECT:
    {
        float x = renderInfo.getSegmentData().get(0);
        float y = renderInfo.getSegmentData().get(1);
        float w = renderInfo.getSegmentData().get(2);
        float h = renderInfo.getSegmentData().get(3);
        rectangle = new Rectangle(x, y, x+w, y+h);
        break; // do not fall through into the MOVETO case
    }
    case PathConstructionRenderInfo.MOVETO:
    {
        float x = renderInfo.getSegmentData().get(0);
        float y = renderInfo.getSegmentData().get(1);
        moveToVector = new Vector(x, y, 1);
        lineToVector = null;
        break;
    }
    case PathConstructionRenderInfo.LINETO:
    {
        if (moveToVector != null)
        {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            lineToVector = new Vector(x, y, 1);
        }
        break;
    }
    default:
        moveToVector = null;
        lineToVector = null;
    }
}

@Override
public Path renderPath(PathPaintingRenderInfo renderInfo)
{
    if (renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
    {
        if (rectangle != null)
        {
            Vector a = new Vector(rectangle.getLeft(), rectangle.getBottom(), 1).cross(renderInfo.getCtm());
            Vector b = new Vector(rectangle.getRight(), rectangle.getBottom(), 1).cross(renderInfo.getCtm());
            Vector c = new Vector(rectangle.getRight(), rectangle.getTop(), 1).cross(renderInfo.getCtm());
            Vector d = new Vector(rectangle.getLeft(), rectangle.getTop(), 1).cross(renderInfo.getCtm());

            Box box = new Box(new LineSegment(a, c), new LineSegment(b, d));
            boxes.add(box);

        }
        if (moveToVector != null && lineToVector != null)
        {
            if (!boxes.isEmpty())
            {
                Vector from = moveToVector.cross(renderInfo.getCtm());
                Vector to = lineToVector.cross(renderInfo.getCtm());

                boxes.get(boxes.size() - 1).selectDiagonal(new LineSegment(from, to));
            }
        }
    }

    moveToVector = null;
    lineToVector = null;
    rectangle = null;
    return null;
}

Vector moveToVector = null;
Vector lineToVector = null;
Rectangle rectangle = null;

public Iterable<Box> getBoxes()
{
    return boxes;
}

final List<Box> boxes = new ArrayList<Box>();

(From CheckBoxExtractionStrategy.java)
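
The calls of the form `new Vector(x, y, 1).cross(renderInfo.getCtm())` above map a point from the coordinate system of the path into page coordinates. As far as I understand it, this is a row vector multiplied by the 3x3 current transformation matrix. A minimal stand-alone sketch of that arithmetic, without iText (the class `CtmDemo` and the example matrix are made up for illustration):

```java
// Hypothetical helper for illustration only; not part of iText.
final class CtmDemo {
    /** Multiplies the row vector (x, y, 1) by a 3x3 matrix given as m[row][col]. */
    static double[] transform(double x, double y, double[][] m) {
        double[] in = {x, y, 1.0};
        double[] out = new double[3];
        for (int col = 0; col < 3; col++) {
            for (int row = 0; row < 3; row++) {
                out[col] += in[row] * m[row][col];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // PDF matrix [a b c d e f] = [2 0 0 2 10 20]:
        // scale by 2, then translate by (10, 20).
        double[][] ctm = {
            {2, 0, 0},
            {0, 2, 0},
            {10, 20, 1}
        };
        double[] p = transform(3, 4, ctm);
        System.out.println(p[0] + ", " + p[1]); // prints "16.0, 28.0"
    }
}
```

Without this step, boxes drawn inside a scaled or shifted coordinate system would be reported at the wrong page positions.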

It uses the helper class Box, which models a checkbox by its two diagonals:

public class Box
{
    public LineSegment getDiagonal()
    {
        return diagonalA;
    }

    public boolean isChecked()
    {
        return selectedA && selectedB;
    }

    Box(LineSegment diagonalA, LineSegment diagonalB)
    {
        this.diagonalA = diagonalA;
        this.diagonalB = diagonalB;
    }

    void selectDiagonal(LineSegment diagonal)
    {
        if (approximatelyEquals(diagonal, diagonalA))
            selectedA = true;
        else if (approximatelyEquals(diagonal, diagonalB))
            selectedB = true;
    }

    boolean approximatelyEquals(LineSegment a, LineSegment b)
    {
        float permissiveness = a.getLength() / 10.0f;
        if (approximatelyEquals(a.getStartPoint(), b.getStartPoint(), permissiveness) &&
                approximatelyEquals(a.getEndPoint(), b.getEndPoint(), permissiveness))
            return true;
        if (approximatelyEquals(a.getStartPoint(), b.getEndPoint(), permissiveness) &&
                approximatelyEquals(a.getEndPoint(), b.getStartPoint(), permissiveness))
            return true;
        return false;
    }

    boolean approximatelyEquals(Vector a, Vector b, float permissiveness)
    {
        return a.subtract(b).length() < permissiveness;
    }

    boolean selectedA = false;
    boolean selectedB = false;
    final LineSegment diagonalA, diagonalB;
}

(Inner class in CheckBoxExtractionStrategy.java)
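
To illustrate the tolerance check in Box, here is a small stand-alone sketch of the same comparison on plain coordinates. It assumes segments are given as `{x1, y1, x2, y2}`; the class name `DiagonalMatchDemo` and the sample numbers are invented. A stroke counts as a diagonal if both end points lie within 10% of the stroke's length of the diagonal's end points, in either direction:

```java
// Hypothetical stand-alone version of the comparison; not part of iText.
final class DiagonalMatchDemo {
    /** True if the points (ax, ay) and (bx, by) are closer than the tolerance. */
    static boolean near(double ax, double ay, double bx, double by, double tolerance) {
        return Math.hypot(ax - bx, ay - by) < tolerance;
    }

    /** Compares segments {x1, y1, x2, y2}, ignoring direction; tolerance is 10% of a's length. */
    static boolean approximatelyEquals(double[] a, double[] b) {
        double tolerance = Math.hypot(a[2] - a[0], a[3] - a[1]) / 10.0;
        boolean sameDirection = near(a[0], a[1], b[0], b[1], tolerance)
                && near(a[2], a[3], b[2], b[3], tolerance);
        boolean reversed = near(a[0], a[1], b[2], b[3], tolerance)
                && near(a[2], a[3], b[0], b[1], tolerance);
        return sameDirection || reversed;
    }

    public static void main(String[] args) {
        double[] diagonal = {0, 0, 10, 10};     // one diagonal of a made-up 10x10 box
        double[] other = {0, 10, 10, 0};        // the other diagonal
        double[] stroke = {9.6, 9.7, 0.3, 0.2}; // drawn reversed and slightly inset
        System.out.println(approximatelyEquals(stroke, diagonal)); // prints "true"
        System.out.println(approximatelyEquals(stroke, other));    // prints "false"
    }
}
```

The tolerance matters because the cross in a checkbox is usually drawn slightly inside the frame rather than exactly from corner to corner.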

Applying this to the OP's sample document like this:

for (int page = 1; page <= pdfReader.getNumberOfPages(); page++)
{
    System.out.printf("\nPage %s\n====\n", page);

    CheckBoxExtractionStrategy strategy = new CheckBoxExtractionStrategy();
    PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
    parser.processContent(page, strategy);

    for (Box box : strategy.getBoxes())
    {
        Vector basePoint = box.getDiagonal().getStartPoint();
        System.out.printf("at %s, %s - %s\n", basePoint.get(Vector.I1), basePoint.get(Vector.I2),
                box.isChecked() ? "checked" : "unchecked");
    }
}

one gets the output

Page 1
====
at 73.104, 757.8 - checked
at 86.544, 757.8 - checked
at 99.984, 757.8 - unchecked

for the OP's file:

Screenshot of Doc1.pdf
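
Since the form layout is static, the reported positions can then be matched against a table of known checkbox positions. A minimal sketch, assuming such a table is maintained by hand: the coordinates are taken from the output above, while the class `CheckboxFieldMapper`, the field names, and the tolerance of one unit are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup helper; field names and tolerance are assumptions.
final class CheckboxFieldMapper {
    /** Returns the name of the known position within tolerance of (x, y), or null if none matches. */
    static String fieldAt(Map<String, double[]> knownPositions, double x, double y, double tolerance) {
        for (Map.Entry<String, double[]> entry : knownPositions.entrySet()) {
            double[] p = entry.getValue();
            if (Math.hypot(p[0] - x, p[1] - y) <= tolerance) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Positions from the sample output; the names are made up.
        Map<String, double[]> known = new LinkedHashMap<>();
        known.put("optionA", new double[] {73.104, 757.8});
        known.put("optionB", new double[] {86.544, 757.8});
        known.put("optionC", new double[] {99.984, 757.8});

        System.out.println(fieldAt(known, 86.5, 757.9, 1.0));  // prints "optionB"
        System.out.println(fieldAt(known, 200.0, 100.0, 1.0)); // prints "null"
    }
}
```

Matching with a tolerance rather than exact equality leaves room for small floating-point differences between documents generated from the same template.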