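I have a PDF of more than 1,500 pages containing some 'random' text, and I have to extract certain blocks of text from it. I can identify such a block like this: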
bla bla bla bla bla
...
...
...
-------------------------- (separator blue image)
XXX: TEXT TEXT TEXT
TEXT TEXT TEXT TEXT
...
-------------------------- (separator blue image)
bla bla bla bla
...
...
-------------------------- (separator blue image)
XXX: TEXT2 TEXT2 TEXT2
TEXT2 TEXT2 TEXT TEXT2
...
-------------------------- (separator blue image)
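I need to extract all text between the separators (all blocks). 'XXX' is present at the start of every block, but I have no way to detect the end of a block. Is it possible to use the image separator in the parser? How?
Is there any other possible way?
Edit, more information: there is no background, and the text can be copied and pasted.
Sample PDF: 1
See page 320 of the sample.
Thanks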
Answer 0 (score: 5)
In your sample PDF the dividers are created using vector graphics:
0.58 0.17 0 0.47 K
q 1 0 0 1 56.6929 772.726 cm
0 0 m
249.118 0 l
S
Q
q 1 0 0 1 56.6929 690.9113 cm
0 0 m
249.118 0 l
S
etc.
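To see where such a divider ends up on the page: "q 1 0 0 1 56.6929 772.726 cm" sets a pure translation, so the path "0 0 m 249.118 0 l" becomes a horizontal line starting at (56.6929, 772.726) in user space. The following is a minimal sketch of that arithmetic in plain Java. The class and method names are made up for illustration and are not part of iText.

```java
// Hypothetical helper, not part of iText: applies a PDF "a b c d e f cm" matrix to a point.
class CtmSketch {
    /** Returns {x', y'} with x' = a*x + c*y + e and y' = b*x + d*y + f. */
    static double[] transform(double[] cm, double x, double y) {
        return new double[] {
            cm[0] * x + cm[2] * y + cm[4],
            cm[1] * x + cm[3] * y + cm[5]
        };
    }

    public static void main(String[] args) {
        double[] cm = {1, 0, 0, 1, 56.6929, 772.726};   // first divider in the sample
        double[] from = transform(cm, 0, 0);            // "0 0 m"
        double[] to = transform(cm, 249.118, 0);        // "249.118 0 l"
        System.out.println(from[0] + "," + from[1] + " -> " + to[0] + "," + to[1]);
    }
}
```

Both end points share the same y coordinate, which is what makes the line usable as a horizontal divider.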
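Parsing vector graphics is a fairly new addition to iText(Sharp), and in this regard the API is still subject to change. Currently (version 5.5.6) you can parse vector graphics using an implementation of the interface ExtRenderListener (Java) / IExtRenderListener (.Net).
You now have several ways to approach the task. For example, you can parse the page with a LocationTextExtractionStrategy and then request the text for each rectangle with an appropriate ITextChunkFilter, using the GetResultantText(ITextChunkFilter) overload.
(Because I am more fluent in Java than in C#, I implemented this sample in Java for iText. It should be easy to port to C# and iTextSharp.)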
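This implementation tries to extract the sections of text separated by dividers as in the sample PDF.
It is a one-pass solution which re-uses the existing LocationTextExtractionStrategy functionality by deriving from that strategy.
In one and the same pass the strategy collects the text chunks (thanks to its parent class) and the divider lines (thanks to its implementation of the additional ExtRenderListener methods).
Having parsed a page, the strategy offers a list of Section instances via the method getSections(), each representing a part of the page delimited by a divider line above and/or below. The topmost and bottommost sections of each text column are open at the top or bottom and are implicitly delimited by the matching margin line.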
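The pairing of dividers into sections can be sketched without any iText types. This is only an illustration under my own assumptions: a divider is modelled as a {startX, y} array, null stands for a page margin, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, iText-free model: a divider is {startX, y}; null stands for a page margin.
class SectionPairingSketch {
    /** Pairs consecutive dividers of the same column; a column change closes and reopens. */
    static List<double[][]> pair(List<double[]> dividers, double sameColumnTolerance) {
        List<double[][]> sections = new ArrayList<>();
        double[] previous = null;
        for (double[] line : dividers) {
            if (previous == null) {
                sections.add(new double[][] {null, line});        // open at the top
            } else if (Math.abs(previous[0] - line[0]) < sameColumnTolerance) {
                sections.add(new double[][] {previous, line});    // between two dividers
            } else {
                sections.add(new double[][] {previous, null});    // open at the bottom
                sections.add(new double[][] {null, line});        // next column, open at the top
            }
            previous = line;
        }
        return sections;
    }
}
```

Note that this loop, like the one in getSections() in the class below, does not emit a section beneath the very last divider. If text there matters, the caller would have to add such a section after the loop.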
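Section implements the TextChunkFilter interface and can therefore be used with the parent class method getResultantText(TextChunkFilter) to retrieve the text in the respective part of the page.
This is a mere POC. It is designed to extract sections from documents that use dividers exactly like the sample document does, i.e. horizontal lines drawn using moveTo-lineTo-stroke that span the section and appear in the content stream sorted column by column. There may be further implicit assumptions that happen to hold for the sample PDF.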
import java.util.ArrayList;
import java.util.List;

import com.itextpdf.text.pdf.parser.ExtRenderListener;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.Path;
import com.itextpdf.text.pdf.parser.PathConstructionRenderInfo;
import com.itextpdf.text.pdf.parser.PathPaintingRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class DividerAwareTextExtrationStrategy extends LocationTextExtractionStrategy implements ExtRenderListener
{
    //
    // constructor
    //
    /**
     * The constructor accepts top and bottom margin lines in user space y coordinates
     * and left and right margin lines in user space x coordinates.
     * Text outside those margin lines is ignored.
     */
    public DividerAwareTextExtrationStrategy(float topMargin, float bottomMargin, float leftMargin, float rightMargin)
    {
        this.topMargin = topMargin;
        this.bottomMargin = bottomMargin;
        this.leftMargin = leftMargin;
        this.rightMargin = rightMargin;
    }

    //
    // Divider derived section support
    //
    public List<Section> getSections()
    {
        List<Section> result = new ArrayList<Section>();
        // TODO: Sort the array columnwise. In case of the OP's document, the lines already appear in the
        // correct order, so there was no need for sorting in the POC.
        LineSegment previous = null;
        for (LineSegment line : lines)
        {
            if (previous == null)
            {
                result.add(new Section(null, line));
            }
            else if (Math.abs(previous.getStartPoint().get(Vector.I1) - line.getStartPoint().get(Vector.I1)) < 2) // 2 is a magic number...
            {
                result.add(new Section(previous, line));
            }
            else
            {
                result.add(new Section(previous, null));
                result.add(new Section(null, line));
            }
            previous = line;
        }
        return result;
    }

    public class Section implements TextChunkFilter
    {
        LineSegment topLine;
        LineSegment bottomLine;
        final float left, right, top, bottom;

        Section(LineSegment topLine, LineSegment bottomLine)
        {
            float left, right, top, bottom;
            if (topLine != null)
            {
                this.topLine = topLine;
                top = Math.max(topLine.getStartPoint().get(Vector.I2), topLine.getEndPoint().get(Vector.I2));
                right = Math.max(topLine.getStartPoint().get(Vector.I1), topLine.getEndPoint().get(Vector.I1));
                left = Math.min(topLine.getStartPoint().get(Vector.I1), topLine.getEndPoint().get(Vector.I1));
            }
            else
            {
                top = topMargin;
                left = leftMargin;
                right = rightMargin;
            }
            if (bottomLine != null)
            {
                this.bottomLine = bottomLine;
                bottom = Math.min(bottomLine.getStartPoint().get(Vector.I2), bottomLine.getEndPoint().get(Vector.I2));
                right = Math.max(bottomLine.getStartPoint().get(Vector.I1), bottomLine.getEndPoint().get(Vector.I1));
                left = Math.min(bottomLine.getStartPoint().get(Vector.I1), bottomLine.getEndPoint().get(Vector.I1));
            }
            else
            {
                bottom = bottomMargin;
            }
            this.top = top;
            this.bottom = bottom;
            this.left = left;
            this.right = right;
        }

        //
        // TextChunkFilter
        //
        @Override
        public boolean accept(TextChunk textChunk)
        {
            // TODO: This code only checks the text chunk starting point. One should take the
            // whole chunk into consideration
            Vector startlocation = textChunk.getStartLocation();
            float x = startlocation.get(Vector.I1);
            float y = startlocation.get(Vector.I2);
            return (left <= x) && (x <= right) && (bottom <= y) && (y <= top);
        }
    }

    //
    // ExtRenderListener implementation
    //
    /**
     * <p>
     * This method stores targets of <code>moveTo</code> in {@link #moveToVector}
     * and targets of <code>lineTo</code> in {@link #lineToVector}. Any unexpected
     * contents or operations result in clearing of the member variables.
     * </p>
     * <p>
     * So this method is implemented for files with divider lines exactly like in
     * the OP's sample file.
     * </p>
     *
     * @see ExtRenderListener#modifyPath(PathConstructionRenderInfo)
     */
    @Override
    public void modifyPath(PathConstructionRenderInfo renderInfo)
    {
        switch (renderInfo.getOperation())
        {
        case PathConstructionRenderInfo.MOVETO:
        {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            moveToVector = new Vector(x, y, 1);
            lineToVector = null;
            break;
        }
        case PathConstructionRenderInfo.LINETO:
        {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            if (moveToVector != null)
            {
                lineToVector = new Vector(x, y, 1);
            }
            break;
        }
        default:
            moveToVector = null;
            lineToVector = null;
        }
    }

    /**
     * This method adds the current path to {@link #lines} if it consists
     * of a single line, the operation is not a no-op, and the line is
     * approximately horizontal.
     *
     * @see ExtRenderListener#renderPath(PathPaintingRenderInfo)
     */
    @Override
    public Path renderPath(PathPaintingRenderInfo renderInfo)
    {
        if (moveToVector != null && lineToVector != null &&
            renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
        {
            // Transform both end points into user space.
            Vector from = moveToVector.cross(renderInfo.getCtm());
            Vector to = lineToVector.cross(renderInfo.getCtm());
            Vector extent = to.subtract(from);
            // Accept only lines whose slope is below 1/20, i.e. approximately horizontal ones.
            if (Math.abs(20 * extent.get(Vector.I2)) < Math.abs(extent.get(Vector.I1)))
            {
                // Store the line running left to right.
                LineSegment line;
                if (extent.get(Vector.I1) >= 0)
                    line = new LineSegment(from, to);
                else
                    line = new LineSegment(to, from);
                lines.add(line);
            }
        }
        moveToVector = null;
        lineToVector = null;
        return null;
    }

    /* (non-Javadoc)
     * @see com.itextpdf.text.pdf.parser.ExtRenderListener#clipPath(int)
     */
    @Override
    public void clipPath(int rule)
    {
    }

    //
    // inner members
    //
    final float topMargin, bottomMargin, leftMargin, rightMargin;
    Vector moveToVector = null;
    Vector lineToVector = null;
    final List<LineSegment> lines = new ArrayList<LineSegment>();
}
(DividerAwareTextExtrationStrategy.java)
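As the TODO in accept says, the POC only looks at the starting point of a text chunk. One possible refinement, sketched here in plain Java without iText types, is to require both the start and the end point of a chunk to lie inside the section. The names Box and containsChunk are hypothetical, and I assume the end point can be obtained from the chunk's end location, which I have not checked against the TextChunk API.

```java
// Hypothetical stand-in for Section, using plain doubles instead of iText vectors.
class Box {
    final double left, right, bottom, top;

    Box(double left, double right, double bottom, double top) {
        this.left = left;
        this.right = right;
        this.bottom = bottom;
        this.top = top;
    }

    /** True if the point lies inside the box, borders included. */
    boolean contains(double x, double y) {
        return left <= x && x <= right && bottom <= y && y <= top;
    }

    /** Accepts a chunk only if both its start and its end point lie inside the box. */
    boolean containsChunk(double startX, double startY, double endX, double endY) {
        return contains(startX, startY) && contains(endX, endY);
    }
}
```

Whether a chunk that straddles a section border should be dropped, kept, or split is a design decision the POC leaves open. This sketch simply drops it.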
It can be used like this:
String extractAndStore(PdfReader reader, String format, int from, int to) throws IOException
{
    StringBuilder builder = new StringBuilder();
    for (int page = from; page <= to; page++)
    {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        DividerAwareTextExtrationStrategy strategy = parser.processContent(page, new DividerAwareTextExtrationStrategy(810, 30, 20, 575));
        List<Section> sections = strategy.getSections();
        int i = 0;
        for (Section section : sections)
        {
            // Each section doubles as a filter selecting the text chunks it contains.
            String sectionText = strategy.getResultantText(section);
            Files.write(Paths.get(String.format(format, page, i)), sectionText.getBytes("UTF8"));
            builder.append("--\n")
                   .append(sectionText)
                   .append('\n');
            i++;
        }
        builder.append("\n\n");
    }
    return builder.toString();
}
(DividerAwareTextExtraction.java, method extractAndStore)
Applying this method to pages 319 and 320 of the sample PDF
PdfReader reader = new PdfReader("20150211600.PDF");
String content = extractAndStore(reader, new File(RESULT_FOLDER, "20150211600.%s.%s.txt").toString(), 319, 320);
(DividerAwareTextExtraction.java, test test20150211600_320)
results in
--
do(s) bem (ns) exceder o seu crédito, depositará, no prazo de 3 (três)
dias, a diferença, sob pena de ser tornada sem efeito a arrematação
[...]
EDITAL DE INTIMAÇÃO DE ADVOGADOS
RELAÇÃO Nº 0041/2015
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0033473-16.2010.8.24.0023 (023.10.033473-6) - Ação Penal
Militar - Procedimento Ordinário - Militar - Autor: Ministério Público
do Estado de Santa Catarina - Réu: João Gabriel Adler - Publicada a
sentença neste ato, lida às partes e intimados os presentes. Registre-se.
A defesa manifesta o interesse em recorrer da sentença.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), CARLOS ROBERTO PEREIRA (OAB 29179/SC), ROBSON
LUIZ CERON (OAB 22475/SC)
Processo 0025622-86.2011.8.24.0023 (023.11.025622-3) - Ação
[...]
1, NIVAEL MARTINS PADILHA, Mat. 928313-7, ANDERSON
VOGEL e ANTÔNIO VALDEMAR FORTES, no ato deprecado.
--
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0006958-36.2013.8.24.0023 (023.13.006958-5) - Ação Penal
Militar - Procedimento Ordinário - Crimes Militares - Autor: Ministério
Público do Estado de Santa Catarina - Réu: Pedro Conceição Bungarten
- Ficam intimadas as partes, da decisão de fls. 289/290, no prazo de
05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), ROBSON LUIZ CERON (OAB 22475/SC)
Processo 0006967-95.2013.8.24.0023 (023.13.006967-4) - Ação Penal
[...]
a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0016809-02.2013.8.24.0023 - Ação Penal Militar -
[...]
prazo de 05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), ELIAS NOVAIS PEREIRA (OAB 30513/SC), ROBSON LUIZ
CERON (OAB 22475/SC)
Processo 0021741-33.2013.8.24.0023 - Ação Penal Militar -
[...]
a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0024568-17.2013.8.24.0023 - Ação Penal Militar -
[...]
do CPPM
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0034522-87.2013.8.24.0023 - Ação Penal Militar -
[...]
diligências, consoante o art. 427 do CPPM
--
ADV: SANDRO MARCELO PEROTTI (OAB 8949/SC), NOEL
ANTÔNIO BARATIERI (OAB 16462/SC), RODRIGO TADEU
PIMENTA DE OLIVEIRA (OAB 16752/SC)
Processo 0041634-10.2013.8.24.0023 - Ação Penal Militar -
Procedimento Ordinário - Crimes Militares - Autor: M. P. E. - Réu: J. P.
D. - Defiro a juntada dos documentos de pp. 3214-3217. Oficie-se com
urgência à Comarca de Porto União (ref. Carta Precatória n. 0000463-
--
15.2015.8.24.0052), informando a habilitação dos procuradores. Intime-
se, inclusive os novos constituídos da designação do ato.
--
ADV: SANDRO MARCELO PEROTTI (OAB 8949/SC), NOEL
ANTÔNIO BARATIERI (OAB 16462/SC), RODRIGO TADEU
PIMENTA DE OLIVEIRA (OAB 16752/SC)
Processo 0041634-10.2013.8.24.0023 - Ação Penal Militar -
[...]
imprescindível a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0043998-52.2013.8.24.0023 - Ação Penal Militar -
[...]
de parcelas para desconto remuneratório. Intimem-se.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0049304-02.2013.8.24.0023 - Ação Penal Militar -
[...]
Rel. Ângela Maria Silveira).
--
ADV: ROBSON LUIZ CERON (OAB 22475/SC)
Processo 0000421-87.2014.8.24.0023 - Ação Penal Militar -
[...]
prazo de 05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0003198-45.2014.8.24.0023 - Ação Penal Militar -
[...]
de 05 (cinco) dias.
--
ADV: ISAEL MARCELINO COELHO (OAB 13878/SC), ROBSON
LUIZ CERON (OAB 22475/SC)
Processo 0010380-82.2014.8.24.0023 - Ação Penal Militar -
Procedimento Ordinário - Crimes Militares - Autor: Ministério Público
Estadual - Réu: Vilson Diocimar Antunes - HOMOLOGO o pedido
de desistência. Intime-se a defesa para o que preceitua o artigo 417,
§2º, do Código de Processo Penal Militar.
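(shortened a bit for obvious reasons)
The OP wrote in a comment:
One more thing: how can I identify font size/color changes inside a section? I need this in some cases where there is no divider (only a larger title) (see sample page 346, where "Armazém" should end the section).
As an example I extended the DividerAwareTextExtrationStrategy above to add the ascender lines of text in a given color to the divider lines already found: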
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import com.itextpdf.text.BaseColor;
import com.itextpdf.text.pdf.CMYKColor;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class DividerAndColorAwareTextExtractionStrategy extends DividerAwareTextExtrationStrategy
{
    //
    // constructor
    //
    public DividerAndColorAwareTextExtractionStrategy(float topMargin, float bottomMargin, float leftMargin, float rightMargin, BaseColor headerColor)
    {
        super(topMargin, bottomMargin, leftMargin, rightMargin);
        this.headerColor = headerColor;
    }

    //
    // DividerAwareTextExtrationStrategy overrides
    //
    /**
     * As the {@link DividerAwareTextExtrationStrategy#lines} are not
     * properly sorted anymore (the additional lines come after all
     * divider lines of the same column), we have to sort that {@link List}
     * first.
     */
    @Override
    public List<Section> getSections()
    {
        Collections.sort(lines, new Comparator<LineSegment>()
        {
            @Override
            public int compare(LineSegment o1, LineSegment o2)
            {
                Vector start1 = o1.getStartPoint();
                Vector start2 = o2.getStartPoint();
                float v1 = start1.get(Vector.I1), v2 = start2.get(Vector.I1);
                // Same column: compare y instead, swapped so that higher lines come first.
                if (Math.abs(v1 - v2) < 2)
                {
                    v1 = start2.get(Vector.I2);
                    v2 = start1.get(Vector.I2);
                }
                return Float.compare(v1, v2);
            }
        });
        return super.getSections();
    }

    /**
     * The ascender lines of text rendered using a fill color approximately
     * like the given header color are added to the divider lines.
     */
    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        if (approximates(renderInfo.getFillColor(), headerColor))
        {
            lines.add(renderInfo.getAscentLine());
        }
        super.renderText(renderInfo);
    }

    /**
     * This method checks whether two colors are approximately equal. As the
     * sample document only uses CMYK colors, only this comparison has been
     * implemented yet.
     */
    boolean approximates(BaseColor colorA, BaseColor colorB)
    {
        if (colorA == null || colorB == null)
            return colorA == colorB;
        if (colorA instanceof CMYKColor && colorB instanceof CMYKColor)
        {
            CMYKColor cmykA = (CMYKColor) colorA;
            CMYKColor cmykB = (CMYKColor) colorB;
            float c = Math.abs(cmykA.getCyan() - cmykB.getCyan());
            float m = Math.abs(cmykA.getMagenta() - cmykB.getMagenta());
            float y = Math.abs(cmykA.getYellow() - cmykB.getYellow());
            float k = Math.abs(cmykA.getBlack() - cmykB.getBlack());
            return c+m+y+k < 0.01;
        }
        // TODO: Implement comparison for other color types
        return false;
    }

    final BaseColor headerColor;
}
(DividerAndColorAwareTextExtractionStrategy.java)
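In renderText we recognize text in the headerColor and add its ascender (top) line to the lines list.
Beware: we add the ascender line of each individual chunk in the given color. Actually we should join the ascender lines of all text chunks that together form one header line. As the blue header lines in the sample document each consist of a single chunk, this sample code does not need to. A generic solution would have to be extended accordingly.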
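Joining the ascender lines of several chunks could look roughly like the following plain-Java sketch. It is not part of the POC and makes its own assumptions: a segment is a {left, right, y} array, two segments belong to the same header if their y values differ by at most yTolerance and the horizontal gap is at most maxGap, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, iText-free sketch: a segment is {left, right, y}.
class HeaderLineMerger {
    /** Joins segments at (nearly) the same height whose horizontal gap is at most maxGap. */
    static List<double[]> merge(List<double[]> segments, double yTolerance, double maxGap) {
        List<double[]> sorted = new ArrayList<>(segments);
        sorted.sort(Comparator.comparingDouble((double[] s) -> s[0]));   // left to right

        List<double[]> merged = new ArrayList<>();
        for (double[] s : sorted) {
            double[] target = null;
            for (double[] m : merged) {
                // Same height and close enough to the right end of an existing line?
                if (Math.abs(m[2] - s[2]) <= yTolerance && s[0] - m[1] <= maxGap) {
                    target = m;
                    break;
                }
            }
            if (target == null) {
                merged.add(s.clone());                      // starts a new header line
            } else {
                target[1] = Math.max(target[1], s[1]);      // extend the line to the right
            }
        }
        return merged;
    }
}
```

In the strategy one would collect the ascender lines of matching chunks in a separate list during renderText and add only the merged lines to the divider list once the page has been parsed. Suitable tolerance values depend on the document.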
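As the lines are no longer properly sorted (the additional ascender lines come after all divider lines of the same column), we have to sort that list first.
Please note that the Comparator used here is not really proper: it ignores a certain difference in the x coordinate, which means it is not truly transitive. It only works if the lines of one column have approximately the same starting x coordinate and that coordinate differs clearly from those of the other columns.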
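One way to get a transitive ordering is to assign every line to a column first and then compare exact column numbers. The following plain-Java sketch illustrates the idea under my own assumptions: a line is a {startX, y} array, a jump of at least columnGap between neighbouring start x values begins a new column, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, iText-free sketch: a line is {startX, y}.
class ColumnwiseSorter {
    /** Sorts lines column by column (left to right) and top to bottom inside each column. */
    static List<double[]> sort(List<double[]> lines, double columnGap) {
        List<double[]> byX = new ArrayList<>(lines);
        byX.sort(Comparator.comparingDouble((double[] l) -> l[0]));

        // Walk left to right; a jump of at least columnGap in x starts a new column.
        List<double[]> keyed = new ArrayList<>();   // entries are {column, startX, y}
        int column = 0;
        for (int i = 0; i < byX.size(); i++) {
            if (i > 0 && byX.get(i)[0] - byX.get(i - 1)[0] >= columnGap) {
                column++;
            }
            keyed.add(new double[] {column, byX.get(i)[0], byX.get(i)[1]});
        }

        // Exact column numbers first, then y descending (y grows upwards in PDF user space).
        keyed.sort(Comparator.comparingDouble((double[] k) -> k[0])
                .thenComparingDouble(k -> -k[2]));

        List<double[]> result = new ArrayList<>();
        for (double[] k : keyed) {
            result.add(new double[] {k[1], k[2]});
        }
        return result;
    }
}
```

Because the comparison no longer depends on a tolerance between two arbitrary lines, the sort cannot be confused by chains of slightly shifted start points. The column detection itself still assumes that columns are clearly separated in x.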
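In a test run (cf. DividerAndColorAwareTextExtraction.java, method test20150211600_346) the sections found are additionally split at the blue headers "Armazém" and "Balneário Camboriú".
Please be aware of the restrictions I pointed out above. If, for example, you want to split at the gray headers in your sample document, you will have to improve the methods above, because those headers do not come in a single chunk.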