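Our customers send us orders as PDFs. The form is generated from a Word document built with legacy form fields.
Currently, the staff in our customer center type the orders into our system, but we have decided to try to automate this task.
I am able to read the text content of the PDF with a simple PdfReader, page by page: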
public static string GetPdfText(string path)
{
var text = string.Empty;
using (var reader = new PdfReader(path))
{
for (var page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader, page);
}
}
return text;
}
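But not the checkboxes...
I can detect the checkboxes as dictionaries when walking through every object in the PDF, but I cannot distinguish them from other objects or read their values...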
public static IEnumerable<PdfDictionary> ReadCheckboxes(string path)
{
using (var reader = new PdfReader(path))
{
var checkboxes = new List<PdfDictionary>();
for (var i = 0; i < reader.XrefSize; i++)
{
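var pdfObject = reader.GetPdfObject(i);
// Skip free xref entries and non-dictionary objects; a blind cast would throw.
if (pdfObject == null || !pdfObject.IsDictionary()) continue;
checkboxes.Add((PdfDictionary) pdfObject);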
}
return checkboxes;
}
}
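What am I missing? I also tried reading the AcroFields, but they are empty...
I have uploaded a sample PDF with legacy checkboxes here.
For the time being, we cannot integrate our systems, nor can we make any changes to the underlying PDF or Word documents.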
Answer 0 (score: 2)
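The OP indicated in comments that a solution returning output like "checkbox at position x0, y0, checked; checkbox at position x1, y1, unchecked; ..." would suffice, i.e. his "form" is static enough that these positions allow identifying the meaning of each checkbox. So here is an implementation of that variant.
I just saw that the question is tagged c#, while I implemented the search in Java. This should not be too big a problem, as the code should be easy to port. If there are problems porting it, I will add a C# version here.
Because the checkboxes are drawn using vector graphics, the text extraction the OP already uses does not find them. Fortunately, the iText parsing framework can also be used to find vector graphics.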
Thus, we first need an ExtRenderListener (IExtRenderListener in iTextSharp) that collects the boxes. Only its interface methods modifyPath and renderPath do any real work:
@Override
public void modifyPath(PathConstructionRenderInfo renderInfo)
{
switch (renderInfo.getOperation())
{
case PathConstructionRenderInfo.RECT:
{
float x = renderInfo.getSegmentData().get(0);
float y = renderInfo.getSegmentData().get(1);
float w = renderInfo.getSegmentData().get(2);
float h = renderInfo.getSegmentData().get(3);
rectangle = new Rectangle(x, y, x+w, y+h);
// Without this break the RECT case falls through into MOVETO.
break;
}
case PathConstructionRenderInfo.MOVETO:
{
float x = renderInfo.getSegmentData().get(0);
float y = renderInfo.getSegmentData().get(1);
moveToVector = new Vector(x, y, 1);
lineToVector = null;
break;
}
case PathConstructionRenderInfo.LINETO:
{
if (moveToVector != null)
{
float x = renderInfo.getSegmentData().get(0);
float y = renderInfo.getSegmentData().get(1);
lineToVector = new Vector(x, y, 1);
}
break;
}
default:
moveToVector = null;
lineToVector = null;
}
}
@Override
public Path renderPath(PathPaintingRenderInfo renderInfo)
{
if (renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
{
if (rectangle != null)
{
Vector a = new Vector(rectangle.getLeft(), rectangle.getBottom(), 1).cross(renderInfo.getCtm());
Vector b = new Vector(rectangle.getRight(), rectangle.getBottom(), 1).cross(renderInfo.getCtm());
Vector c = new Vector(rectangle.getRight(), rectangle.getTop(), 1).cross(renderInfo.getCtm());
Vector d = new Vector(rectangle.getLeft(), rectangle.getTop(), 1).cross(renderInfo.getCtm());
Box box = new Box(new LineSegment(a, c), new LineSegment(b, d));
boxes.add(box);
}
if (moveToVector != null && lineToVector != null)
{
if (!boxes.isEmpty())
{
Vector from = moveToVector.cross(renderInfo.getCtm());
Vector to = lineToVector.cross(renderInfo.getCtm());
boxes.get(boxes.size() - 1).selectDiagonal(new LineSegment(from, to));
}
}
}
moveToVector = null;
lineToVector = null;
rectangle = null;
return null;
}
Vector moveToVector = null;
Vector lineToVector = null;
Rectangle rectangle = null;
public Iterable<Box> getBoxes()
{
return boxes;
}
final List<Box> boxes = new ArrayList<Box>();
(from CheckBoxExtractionStrategy.java)
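The cross(renderInfo.getCtm()) calls in renderPath map each rectangle corner from the coordinate system in which it was drawn into page space. As a rough illustration of that arithmetic, here is a plain-Java sketch without iText. It assumes the usual PDF convention that a matrix [a b c d e f] maps (x, y) to (a*x + c*y + e, b*x + d*y + f). The names CtmSketch, transform and diagonals are hypothetical, and the matrix values are made up.

```java
import java.util.Arrays;

// Plain-Java sketch, no iText required. All names here are hypothetical.
final class CtmSketch {

    /** Maps (x, y) through the PDF matrix m = [a b c d e f]. */
    static double[] transform(double[] m, double x, double y) {
        return new double[] {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }

    /** Transformed diagonals of the rectangle (x, y, w, h): index 0 is a->c, index 1 is b->d. */
    static double[][][] diagonals(double[] m, double x, double y, double w, double h) {
        double[] a = transform(m, x, y);         // lower left
        double[] b = transform(m, x + w, y);     // lower right
        double[] c = transform(m, x + w, y + h); // upper right
        double[] d = transform(m, x, y + h);     // upper left
        return new double[][][] { { a, c }, { b, d } };
    }

    public static void main(String[] args) {
        // Made-up matrix: scale by 2, then translate by (10, 20).
        double[] ctm = { 2, 0, 0, 2, 10, 20 };
        double[][][] d = diagonals(ctm, 1, 1, 4, 4);
        System.out.println(Arrays.toString(d[0][0]) + " -> " + Arrays.toString(d[0][1]));
        // prints [12.0, 22.0] -> [20.0, 30.0]
        System.out.println(Arrays.toString(d[1][0]) + " -> " + Arrays.toString(d[1][1]));
        // prints [20.0, 22.0] -> [12.0, 30.0]
    }
}
```

Comparing transformed coordinates matters because the rectangle and the strokes of the cross may be drawn under different transformation matrices.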
The strategy uses the helper class Box, which models a checkbox by its two diagonals:
public class Box
{
public LineSegment getDiagonal()
{
return diagonalA;
}
public boolean isChecked()
{
return selectedA && selectedB;
}
Box(LineSegment diagonalA, LineSegment diagonalB)
{
this.diagonalA = diagonalA;
this.diagonalB = diagonalB;
}
void selectDiagonal(LineSegment diagonal)
{
if (approximatelyEquals(diagonal, diagonalA))
selectedA = true;
else if (approximatelyEquals(diagonal, diagonalB))
selectedB = true;
}
boolean approximatelyEquals(LineSegment a, LineSegment b)
{
float permissiveness = a.getLength() / 10.0f;
if (approximatelyEquals(a.getStartPoint(), b.getStartPoint(), permissiveness) &&
approximatelyEquals(a.getEndPoint(), b.getEndPoint(), permissiveness))
return true;
if (approximatelyEquals(a.getStartPoint(), b.getEndPoint(), permissiveness) &&
approximatelyEquals(a.getEndPoint(), b.getStartPoint(), permissiveness))
return true;
return false;
}
boolean approximatelyEquals(Vector a, Vector b, float permissiveness)
{
return a.subtract(b).length() < permissiveness;
}
boolean selectedA = false;
boolean selectedB = false;
final LineSegment diagonalA, diagonalB;
}
(inner class in CheckBoxExtractionStrategy.java)
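To see the matching rule of Box in isolation, the following sketch re-implements it on plain coordinates, without iText. It is only an approximation of the class above: DiagonalMatchSketch, Segment, near and the sample coordinates are hypothetical. The tolerance of one tenth of the stroke length and the "both diagonals must be seen" rule are taken from the code above.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the diagonal matching; names and sample data are hypothetical.
final class DiagonalMatchSketch {

    /** A line segment from (x1, y1) to (x2, y2). */
    static final class Segment {
        final double x1, y1, x2, y2;
        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        double length() { return Math.hypot(x2 - x1, y2 - y1); }
    }

    static boolean near(double ax, double ay, double bx, double by, double tolerance) {
        return Math.hypot(ax - bx, ay - by) < tolerance;
    }

    /** True if the end points coincide within a tenth of a's length, in either direction. */
    static boolean approximatelyEquals(Segment a, Segment b) {
        double t = a.length() / 10.0;
        boolean same = near(a.x1, a.y1, b.x1, b.y1, t) && near(a.x2, a.y2, b.x2, b.y2, t);
        boolean reversed = near(a.x1, a.y1, b.x2, b.y2, t) && near(a.x2, a.y2, b.x1, b.y1, t);
        return same || reversed;
    }

    /** A box counts as checked only if strokes matching both diagonals were seen. */
    static boolean isChecked(Segment diagonalA, Segment diagonalB, List<Segment> strokes) {
        boolean selectedA = false, selectedB = false;
        for (Segment stroke : strokes) {
            if (approximatelyEquals(stroke, diagonalA)) selectedA = true;
            else if (approximatelyEquals(stroke, diagonalB)) selectedB = true;
        }
        return selectedA && selectedB;
    }

    public static void main(String[] args) {
        Segment a = new Segment(0, 0, 10, 10);
        Segment b = new Segment(10, 0, 0, 10);
        // A cross drawn slightly inset, with one stroke running in reverse direction.
        List<Segment> cross = Arrays.asList(
            new Segment(9.8, 9.8, 0.3, 0.2), new Segment(0.2, 9.7, 9.9, 0.1));
        List<Segment> singleStroke = Arrays.asList(new Segment(0, 0, 10, 10));
        System.out.println(isChecked(a, b, cross));        // prints true
        System.out.println(isChecked(a, b, singleStroke)); // prints false
    }
}
```

The tolerance makes the check robust against crosses that are drawn slightly smaller than the box, while a single diagonal stroke is not counted as a check mark.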
Applying the CheckBoxExtractionStrategy class to the sample document like this:
for (int page = 1; page <= pdfReader.getNumberOfPages(); page++)
{
System.out.printf("\nPage %s\n====\n", page);
CheckBoxExtractionStrategy strategy = new CheckBoxExtractionStrategy();
PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
parser.processContent(page, strategy);
for (Box box : strategy.getBoxes())
{
Vector basePoint = box.getDiagonal().getStartPoint();
System.out.printf("at %s, %s - %s\n", basePoint.get(Vector.I1), basePoint.get(Vector.I2),
box.isChecked() ? "checked" : "unchecked");
}
}
For the OP's sample file, this produces the output:
Page 1
====
at 73.104, 757.8 - checked
at 86.544, 757.8 - checked
at 99.984, 757.8 - unchecked
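Since the form is static, the reported positions can then be mapped to the meaning of each checkbox. A minimal sketch of such a lookup follows. The coordinates are the ones from the output above, but the field names optionA to optionC, the 0.5 unit tolerance, and all class and method names are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a position-to-name lookup; field names and tolerance are hypothetical.
final class CheckBoxLookupSketch {

    /** Known base point of a checkbox on the static form and the name assigned to it. */
    static final class KnownBox {
        final String name;
        final double x, y;
        KnownBox(String name, double x, double y) {
            this.name = name; this.x = x; this.y = y;
        }
    }

    /** Returns the name of the first known box within tolerance of (x, y), or null if none. */
    static String nameFor(List<KnownBox> known, double x, double y, double tolerance) {
        for (KnownBox box : known) {
            if (Math.abs(box.x - x) <= tolerance && Math.abs(box.y - y) <= tolerance) {
                return box.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<KnownBox> known = Arrays.asList(
            new KnownBox("optionA", 73.104, 757.8),
            new KnownBox("optionB", 86.544, 757.8),
            new KnownBox("optionC", 99.984, 757.8));
        System.out.println(nameFor(known, 86.5, 757.9, 0.5)); // prints optionB
        System.out.println(nameFor(known, 200.0, 100.0, 0.5)); // prints null
    }
}
```

A small tolerance is used instead of exact equality because extracted coordinates are floating point values and may differ slightly between documents.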