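Extract text from a PDF between two dividers with ITextSharp

Asked: 2015-07-30 17:22:38

Tags: c# pdf itextsharp

I have a PDF with 1500+ pages containing some 'random' text, and I have to extract some of that text... I can identify the blocks like this: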

bla bla bla bla bla 
...
...
...
-------------------------- (separator blue image)
XXX: TEXT TEXT TEXT
TEXT TEXT TEXT TEXT
...
-------------------------- (separator blue image)
bla bla bla bla
...
...
-------------------------- (separator blue image)
XXX: TEXT2 TEXT2 TEXT2
TEXT2 TEXT2 TEXT TEXT2
...
-------------------------- (separator blue image)

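I need to extract all the text between the separators (all blocks). 'XXX' is present at the start of every block, but I don't have any way to detect the end of a block. Is it possible to use the image separators in the parser? How?

Is there any other possible approach?

Edit, more information: there is no background, and the text can be copied and pasted.

Sample PDF: 1

See page 320 of the sample.

Thanks

1 Answer:

Answer 0 (score: 5)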

Theory

In the case of your sample PDF, the dividers are created using vector graphics:

0.58 0.17 0 0.47 K
q 1 0 0 1 56.6929 772.726 cm
0 0 m
249.118 0 l
S
Q
q 1 0 0 1 56.6929 690.9113 cm
0 0 m
249.118 0 l
S 
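To see where such a divider ends up on the page, the `cm` operands can be applied to the path coordinates by hand. The following is a minimal sketch of that arithmetic; `CtmSketch` and `transform` are hypothetical names, not part of iText, and the sketch assumes no other transformation is in effect.

```java
// Hypothetical helper, not part of iText: applies a PDF transformation
// matrix [a b c d e f] (the six operands of "cm") to a point.
final class CtmSketch {
    /** Returns {x', y'} with x' = a*x + c*y + e and y' = b*x + d*y + f. */
    static double[] transform(double[] m, double x, double y) {
        return new double[] {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }
}
```

For the first divider, `transform(new double[] {1, 0, 0, 1, 56.6929, 772.726}, 0, 0)` gives the start point (56.6929, 772.726), and the same matrix applied to (249.118, 0) gives the end point (305.8109, 772.726): a horizontal line of width 249.118 at y = 772.726.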

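Parsing vector graphics is a fairly recent addition to iText(Sharp), and the API in this area may still change. Currently (version 5.5.6) you can parse vector graphics using an implementation of the interface ExtRenderListener (Java) / IExtRenderListener (.Net).

You now have several approaches to your task:

  • (Multi-pass) You can implement the above interface in a way that merely collects the lines. From these lines you can derive the rectangles containing each section, and for each of these rectangles you can extract the text by applying region text filtering.
  • (Two-pass) As above, you can implement the above interface in a way that merely collects the lines, and derive from these lines the rectangles containing each section. Then you parse the page using the LocationTextExtractionStrategy and request the text of each rectangle with an appropriate ITextChunkFilter using the GetResultantText(ITextChunkFilter) overload.
  • (One-pass) You can implement the above interface in a way that collects lines, collects text pieces, derives rectangles from the lines, and arranges the text pieces located in those rectangles.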
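The step common to all three approaches is "given a rectangle, keep the text located in it". Below is a minimal, library-free sketch of that step. `Chunk`, `Region` and `textIn` are stand-ins invented for illustration, not iText classes, and the sketch assumes each text fragment is represented by its start point only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Simplified stand-ins for illustration only; these are not iText classes.
final class RegionFilterSketch {
    /** A text fragment with the user space coordinates of its start point. */
    static final class Chunk {
        final double x, y;
        final String text;
        Chunk(double x, double y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    /** An axis-parallel rectangle spanned by two divider lines. */
    static final class Region {
        final double left, right, bottom, top;
        Region(double left, double right, double bottom, double top) {
            this.left = left; this.right = right; this.bottom = bottom; this.top = top;
        }
        boolean contains(Chunk c) {
            return left <= c.x && c.x <= right && bottom <= c.y && c.y <= top;
        }
    }

    /** Keeps the chunks inside the region, ordered top to bottom, then left to right. */
    static String textIn(Region region, List<Chunk> chunks) {
        List<Chunk> inside = new ArrayList<Chunk>();
        for (Chunk c : chunks) {
            if (region.contains(c)) inside.add(c);
        }
        Collections.sort(inside, new Comparator<Chunk>() {
            @Override public int compare(Chunk a, Chunk b) {
                int byY = Double.compare(b.y, a.y); // larger y is higher on the page
                return byY != 0 ? byY : Double.compare(a.x, b.x);
            }
        });
        StringBuilder sb = new StringBuilder();
        for (Chunk c : inside) {
            if (sb.length() > 0) sb.append('\n');
            sb.append(c.text);
        }
        return sb.toString();
    }
}
```

In the multi-pass and two-pass variants this filter runs once per rectangle after the lines are known; in the one-pass variant the chunks and lines are collected together and the filter is applied at the end.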

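Sample implementation

(As I'm more fluent in Java than in C#, I implemented this sample in Java for iText. It should be easy to port to C# and iTextSharp.)

This implementation attempts to extract text sections separated by dividers as in the sample PDF.

It is a one-pass solution which at the same time re-uses the existing LocationTextExtractionStrategy capabilities by deriving from that strategy.

In the same pass this strategy collects the text chunks (thanks to its parent class) and the divider lines (thanks to its implementation of the extra ExtRenderListener methods).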
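The line-collecting half of that pass is essentially a tiny state machine: remember the last moveTo and lineTo targets, and when the path is painted, keep it if it is nearly horizontal. The following library-free sketch models that idea. The names (`DividerCollectorSketch`, `Op`) are illustrative and not the iText API, and the sketch deliberately leaves out the transformation into user space that a real listener has to apply.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the callbacks; the names are illustrative, not the iText API.
final class DividerCollectorSketch {
    enum Op { MOVETO, LINETO, OTHER }

    private double[] moveTo; // target of the last moveTo, or null
    private double[] lineTo; // target of the latest lineTo after it, or null
    final List<double[]> lines = new ArrayList<double[]>(); // {x1, y1, x2, y2}, left to right

    /** Called for each path construction operator. */
    void modifyPath(Op op, double x, double y) {
        switch (op) {
        case MOVETO:
            moveTo = new double[] {x, y};
            lineTo = null;
            break;
        case LINETO:
            if (moveTo != null) lineTo = new double[] {x, y};
            break;
        default: // curves, rectangles, ...: not a plain divider line
            moveTo = null;
            lineTo = null;
        }
    }

    /** Called when the current path is painted; painted == false models a no-op. */
    void renderPath(boolean painted) {
        if (painted && moveTo != null && lineTo != null) {
            double dx = lineTo[0] - moveTo[0];
            double dy = lineTo[1] - moveTo[1];
            if (Math.abs(20 * dy) < Math.abs(dx)) { // rise is below 1/20 of the run
                lines.add(dx >= 0
                        ? new double[] {moveTo[0], moveTo[1], lineTo[0], lineTo[1]}
                        : new double[] {lineTo[0], lineTo[1], moveTo[0], moveTo[1]});
            }
        }
        moveTo = null; // the path is consumed either way
        lineTo = null;
    }
}
```

The factor 20 means a line counts as horizontal as long as its slope stays below 1/20, which is just under 3 degrees.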

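Having parsed a page, the strategy offers a list of Section instances via the method getSections(), each representing a section of the page delimited by a divider line above and/or below. The topmost and bottommost sections of each text column are open at the top or the bottom and are implicitly delimited by the matching margin line.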
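The grouping of lines into sections can be modelled without any PDF library. The sketch below reduces a divider to its start x and its y coordinate; `Line`, `Section` and `group` are illustrative names, and the input is assumed to arrive column by column, top to bottom. As far as I can tell from the loop in getSections(), no open-bottomed section is emitted below the very last line, and the sketch keeps that behaviour rather than changing it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: a divider is reduced to its start x and its y coordinate.
final class SectionGroupingSketch {
    static final class Line {
        final double x, y;
        Line(double x, double y) { this.x = x; this.y = y; }
    }

    /** A section between two dividers; null marks a side bounded by the page margin. */
    static final class Section {
        final Line top, bottom;
        Section(Line top, Line bottom) { this.top = top; this.bottom = bottom; }
    }

    /** Expects the lines column by column, top to bottom within each column. */
    static List<Section> group(List<Line> lines, double columnTolerance) {
        List<Section> result = new ArrayList<Section>();
        Line previous = null;
        for (Line line : lines) {
            if (previous == null) {
                result.add(new Section(null, line));     // top of the first column
            } else if (Math.abs(previous.x - line.x) < columnTolerance) {
                result.add(new Section(previous, line)); // same column
            } else {
                result.add(new Section(previous, null)); // bottom of the old column
                result.add(new Section(null, line));     // top of the new column
            }
            previous = line;
        }
        return result;
    }
}
```

Two dividers in a left column followed by one divider in a right column yield four sections: margin to first line, first to second line, second line to margin, and margin to the right column's line.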

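Section implements the TextChunkFilter interface and can therefore be used with the parent class method getResultantText(TextChunkFilter) to retrieve the text in the respective part of the page.

This is merely a POC. It is designed to extract sections from documents that use dividers exactly like the sample document does, i.e. horizontal lines drawn using moveTo-lineTo-stroke, as wide as the section is, and appearing in the content stream sorted column by column. There may be further implicit assumptions that hold for the sample PDF.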

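import java.util.ArrayList;
import java.util.List;

import com.itextpdf.text.pdf.parser.ExtRenderListener;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.Path;
import com.itextpdf.text.pdf.parser.PathConstructionRenderInfo;
import com.itextpdf.text.pdf.parser.PathPaintingRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class DividerAwareTextExtrationStrategy extends LocationTextExtractionStrategy implements ExtRenderListener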
{
    //
    // constructor
    //
    /**
     * The constructor accepts top and bottom margin lines in user space y coordinates
     * and left and right margin lines in user space x coordinates.
     * Text outside those margin lines is ignored. 
     */
    public DividerAwareTextExtrationStrategy(float topMargin, float bottomMargin, float leftMargin, float rightMargin)
    {
        this.topMargin = topMargin;
        this.bottomMargin = bottomMargin;
        this.leftMargin = leftMargin;
        this.rightMargin = rightMargin;
    }

    //
    // Divider derived section support
    //
    public List<Section> getSections()
    {
        List<Section> result = new ArrayList<Section>();
        // TODO: Sort the array columnwise. In case of the OP's document, the lines already appear in the
        // correct order, so there was no need for sorting in the POC. 

        LineSegment previous = null;
        for (LineSegment line : lines)
        {
            if (previous == null)
            {
                result.add(new Section(null, line));
            }
            else if (Math.abs(previous.getStartPoint().get(Vector.I1) - line.getStartPoint().get(Vector.I1)) < 2) // 2 is a magic number... 
            {
                result.add(new Section(previous, line));
            }
            else
            {
                result.add(new Section(previous, null));
                result.add(new Section(null, line));
            }
            previous = line;
        }

        return result;
    }

    public class Section implements TextChunkFilter
    {
        LineSegment topLine;
        LineSegment bottomLine;

        final float left, right, top, bottom;

        Section(LineSegment topLine, LineSegment bottomLine)
        {
            float left, right, top, bottom;
            if (topLine != null)
            {
                this.topLine = topLine;
                top = Math.max(topLine.getStartPoint().get(Vector.I2), topLine.getEndPoint().get(Vector.I2));
                right = Math.max(topLine.getStartPoint().get(Vector.I1), topLine.getEndPoint().get(Vector.I1));
                left = Math.min(topLine.getStartPoint().get(Vector.I1), topLine.getEndPoint().get(Vector.I1));
            }
            else
            {
                top = topMargin;
                left = leftMargin;
                right = rightMargin;
            }

            if (bottomLine != null)
            {
                this.bottomLine = bottomLine;
                bottom = Math.min(bottomLine.getStartPoint().get(Vector.I2), bottomLine.getEndPoint().get(Vector.I2));
                right = Math.max(bottomLine.getStartPoint().get(Vector.I1), bottomLine.getEndPoint().get(Vector.I1));
                left = Math.min(bottomLine.getStartPoint().get(Vector.I1), bottomLine.getEndPoint().get(Vector.I1));
            }
            else
            {
                bottom = bottomMargin;
            }

            this.top = top;
            this.bottom = bottom;
            this.left = left;
            this.right = right;
        }

        //
        // TextChunkFilter
        //
        @Override
        public boolean accept(TextChunk textChunk)
        {
            // TODO: This code only checks the text chunk starting point. One should take the 
            // whole chunk into consideration
            Vector startlocation = textChunk.getStartLocation();
            float x = startlocation.get(Vector.I1);
            float y = startlocation.get(Vector.I2);

            return (left <= x) && (x <= right) && (bottom <= y) && (y <= top);
        }
    }

    //
    // ExtRenderListener implementation
    //
    /**
     * <p>
     * This method stores targets of <code>moveTo</code> in {@link #moveToVector}
     * and targets of <code>lineTo</code> in {@link #lineToVector}. Any unexpected
     * contents or operations result in clearing of the member variables.
     * </p>
     * <p>
     * So this method is implemented for files with divider lines exactly like in
     * the OP's sample file.
     * </p>
     *  
     * @see ExtRenderListener#modifyPath(PathConstructionRenderInfo)
     */
    @Override
    public void modifyPath(PathConstructionRenderInfo renderInfo)
    {
        switch (renderInfo.getOperation())
        {
        case PathConstructionRenderInfo.MOVETO:
        {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            moveToVector = new Vector(x, y, 1);
            lineToVector = null;
            break;
        }
        case PathConstructionRenderInfo.LINETO:
        {
            float x = renderInfo.getSegmentData().get(0);
            float y = renderInfo.getSegmentData().get(1);
            if (moveToVector != null)
            {
                lineToVector = new Vector(x, y, 1);
            }
            break;
        }
        default:
            moveToVector = null;
            lineToVector = null;
        }
    }

    /**
     * This method adds the current path to {@link #lines} if it consists
     * of a single line, the operation is no no-op, and the line is
     * approximately horizontal.
     *  
     * @see ExtRenderListener#renderPath(PathPaintingRenderInfo)
     */
    @Override
    public Path renderPath(PathPaintingRenderInfo renderInfo)
    {
        if (moveToVector != null && lineToVector != null &&
            renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
        {
            Vector from = moveToVector.cross(renderInfo.getCtm());
            Vector to = lineToVector.cross(renderInfo.getCtm());
            Vector extent = to.subtract(from);

            if (Math.abs(20 * extent.get(Vector.I2)) < Math.abs(extent.get(Vector.I1)))
            {
                LineSegment line;
                if (extent.get(Vector.I1) >= 0)
                    line = new LineSegment(from, to);
                else
                    line = new LineSegment(to, from);
                lines.add(line);
            }
        }

        moveToVector = null;
        lineToVector = null;
        return null;
    }

    /* (non-Javadoc)
     * @see com.itextpdf.text.pdf.parser.ExtRenderListener#clipPath(int)
     */
    @Override
    public void clipPath(int rule)
    {
    }

    //
    // inner members
    //
    final float topMargin, bottomMargin, leftMargin, rightMargin;
    Vector moveToVector = null;
    Vector lineToVector = null;
    final List<LineSegment> lines = new ArrayList<LineSegment>();
}

DividerAwareTextExtrationStrategy.java
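The TODO in accept points out that only the start point of a chunk is tested. A hedged sketch of two alternatives follows, using stand-in classes (`Box`, `Chunk`) rather than iText types. To my knowledge the real TextChunk in iText 5.5.x also exposes an end location via getEndLocation(), which is what the sketch assumes would feed `endX` and `endY`.

```java
// Illustrative model only; Box and Chunk are stand-ins, not iText classes.
final class WholeChunkFilterSketch {
    /** A chunk reduced to the start and end points of its baseline. */
    static final class Chunk {
        final double startX, startY, endX, endY;
        Chunk(double startX, double startY, double endX, double endY) {
            this.startX = startX; this.startY = startY; this.endX = endX; this.endY = endY;
        }
    }

    static final class Box {
        final double left, right, bottom, top;
        Box(double left, double right, double bottom, double top) {
            this.left = left; this.right = right; this.bottom = bottom; this.top = top;
        }

        boolean containsPoint(double x, double y) {
            return left <= x && x <= right && bottom <= y && y <= top;
        }

        /** Stricter variant: both ends of the baseline must lie inside the box. */
        boolean containsWhole(Chunk c) {
            return containsPoint(c.startX, c.startY) && containsPoint(c.endX, c.endY);
        }

        /** Looser variant: the midpoint of the baseline decides. */
        boolean containsMidpoint(Chunk c) {
            return containsPoint((c.startX + c.endX) / 2, (c.startY + c.endY) / 2);
        }
    }
}
```

Which variant fits depends on the document: the strict test drops chunks that stick out over a section border, while the midpoint test assigns each chunk to the section holding most of it.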

It can be used like this:

String extractAndStore(PdfReader reader, String format, int from, int to) throws IOException
{
    StringBuilder builder = new StringBuilder();

    for (int page = from; page <= to; page++)
    {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        DividerAwareTextExtrationStrategy strategy = parser.processContent(page, new DividerAwareTextExtrationStrategy(810, 30, 20, 575));

        List<Section> sections = strategy.getSections();
        int i = 0;
        for (Section section : sections)
        {
            String sectionText = strategy.getResultantText(section);
            Files.write(Paths.get(String.format(format, page, i)), sectionText.getBytes("UTF8"));

            builder.append("--\n")
                   .append(sectionText)
                   .append('\n');
            i++;
        }
        builder.append("\n\n");
    }

    return builder.toString();
}

(DividerAwareTextExtraction.java, method extractAndStore)

Applying this method to pages 319 and 320 of the sample PDF

PdfReader reader = new PdfReader("20150211600.PDF");
String content = extractAndStore(reader, new File(RESULT_FOLDER, "20150211600.%s.%s.txt").toString(), 319, 320);

DividerAwareTextExtraction.java test test20150211600_320

results in

--
do(s) bem (ns) exceder o seu crédito, depositará, no prazo de 3 (três) 
dias, a diferença, sob pena de ser tornada sem efeito a arrematação 
[...]
EDITAL DE INTIMAÇÃO DE ADVOGADOS
RELAÇÃO Nº 0041/2015
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0033473-16.2010.8.24.0023 (023.10.033473-6) - Ação Penal
Militar - Procedimento Ordinário - Militar - Autor: Ministério Público 
do Estado de Santa Catarina - Réu: João Gabriel Adler - Publicada a 
sentença neste ato, lida às partes e intimados os presentes. Registre-se.
A defesa manifesta o interesse em recorrer da sentença.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), CARLOS ROBERTO PEREIRA (OAB 29179/SC), ROBSON 
LUIZ CERON (OAB 22475/SC)
Processo 0025622-86.2011.8.24.0023 (023.11.025622-3) - Ação
[...]
1, NIVAEL MARTINS PADILHA, Mat. 928313-7, ANDERSON
VOGEL e ANTÔNIO VALDEMAR FORTES, no ato deprecado.


--

--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0006958-36.2013.8.24.0023 (023.13.006958-5) - Ação Penal
Militar - Procedimento Ordinário - Crimes Militares - Autor: Ministério
Público do Estado de Santa Catarina - Réu: Pedro Conceição Bungarten
- Ficam intimadas as partes, da decisão de fls. 289/290, no prazo de 
05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), ROBSON LUIZ CERON (OAB 22475/SC)
Processo 0006967-95.2013.8.24.0023 (023.13.006967-4) - Ação Penal
[...]
a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0016809-02.2013.8.24.0023 - Ação Penal Militar -
[...]
prazo de 05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC), ELIAS NOVAIS PEREIRA (OAB 30513/SC), ROBSON LUIZ 
CERON (OAB 22475/SC)
Processo 0021741-33.2013.8.24.0023 - Ação Penal Militar -
[...]
a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0024568-17.2013.8.24.0023 - Ação Penal Militar -
[...]
do CPPM
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0034522-87.2013.8.24.0023 - Ação Penal Militar -
[...]
diligências, consoante o art. 427 do CPPM
--
ADV: SANDRO MARCELO PEROTTI (OAB 8949/SC), NOEL 
ANTÔNIO BARATIERI (OAB 16462/SC), RODRIGO TADEU 
PIMENTA DE OLIVEIRA (OAB 16752/SC)
Processo 0041634-10.2013.8.24.0023 - Ação Penal Militar -
Procedimento Ordinário - Crimes Militares - Autor: M. P. E. - Réu: J. P. 
D. - Defiro a juntada dos documentos de pp. 3214-3217. Oficie-se com
urgência à Comarca de Porto União (ref. Carta Precatória n. 0000463-
--
15.2015.8.24.0052), informando a habilitação dos procuradores. Intime-
se, inclusive os novos constituídos da designação do ato.
--
ADV: SANDRO MARCELO PEROTTI (OAB 8949/SC), NOEL 
ANTÔNIO BARATIERI (OAB 16462/SC), RODRIGO TADEU 
PIMENTA DE OLIVEIRA (OAB 16752/SC)
Processo 0041634-10.2013.8.24.0023 - Ação Penal Militar -
[...]
imprescindível a presença dos réus no ato deprecado.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0043998-52.2013.8.24.0023 - Ação Penal Militar -
[...]
de parcelas para desconto remuneratório. Intimem-se.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0049304-02.2013.8.24.0023 - Ação Penal Militar -
[...]
Rel. Ângela Maria Silveira).
--
ADV: ROBSON LUIZ CERON (OAB 22475/SC)
Processo 0000421-87.2014.8.24.0023 - Ação Penal Militar -
[...]
prazo de 05 (cinco) dias.
--
ADV: RODRIGO TADEU PIMENTA DE OLIVEIRA (OAB 16752/
SC)
Processo 0003198-45.2014.8.24.0023 - Ação Penal Militar -
[...]
de 05 (cinco) dias.
--
ADV: ISAEL MARCELINO COELHO (OAB 13878/SC), ROBSON 
LUIZ CERON (OAB 22475/SC)
Processo 0010380-82.2014.8.24.0023 - Ação Penal Militar -
Procedimento Ordinário - Crimes Militares - Autor: Ministério Público
Estadual - Réu: Vilson Diocimar Antunes - HOMOLOGO o pedido 
de desistência. Intime-se a defesa para o que preceitua o artigo 417, 
§2º, do Código de Processo Penal Militar.

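(shortened a bit for obvious reasons)

Dividing by colored headers

The OP wrote in a comment:

One more thing: how can I identify font size/color changes inside a section? I need that in some cases where there is no divider (only a bigger title) (sample page 346, "Armazém" should end the section)

As an example, I extended the DividerAwareTextExtrationStrategy above to add the ascender lines of text in a given color to the divider lines already found: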

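import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import com.itextpdf.text.BaseColor;
import com.itextpdf.text.pdf.CMYKColor;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class DividerAndColorAwareTextExtractionStrategy extends DividerAwareTextExtrationStrategy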
{
    //
    // constructor
    //
    public DividerAndColorAwareTextExtractionStrategy(float topMargin, float bottomMargin, float leftMargin, float rightMargin, BaseColor headerColor)
    {
        super(topMargin, bottomMargin, leftMargin, rightMargin);
        this.headerColor = headerColor;
    }

    //
    // DividerAwareTextExtrationStrategy overrides
    //
    /**
     * As the {@link DividerAwareTextExtrationStrategy#lines} are not
     * properly sorted anymore (the additional lines come after all
     * divider lines of the same column), we have to sort that {@link List}
     * first.
     */
    @Override
    public List<Section> getSections()
    {
        Collections.sort(lines, new Comparator<LineSegment>()
        {
            @Override
            public int compare(LineSegment o1, LineSegment o2)
            {
                Vector start1 = o1.getStartPoint();
                Vector start2 = o2.getStartPoint();

                float v1 = start1.get(Vector.I1), v2 = start2.get(Vector.I1);
                if (Math.abs(v1 - v2) < 2)
                {
                    v1 = start2.get(Vector.I2);
                    v2 = start1.get(Vector.I2);
                }

                return Float.compare(v1, v2);
            }
        });

        return super.getSections();
    }

    /**
     * The ascender lines of text rendered using a fill color approximately
     * like the given header color are added to the divider lines.
     */
    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        if (approximates(renderInfo.getFillColor(), headerColor))
        {
            lines.add(renderInfo.getAscentLine());
        }

        super.renderText(renderInfo);
    }

    /**
     * This method checks whether two colors are approximately equal. As the
     * sample document only uses CMYK colors, only this comparison has been
     * implemented yet.
     */
    boolean approximates(BaseColor colorA, BaseColor colorB)
    {
        if (colorA == null || colorB == null)
            return colorA == colorB;
        if (colorA instanceof CMYKColor && colorB instanceof CMYKColor)
        {
            CMYKColor cmykA = (CMYKColor) colorA;
            CMYKColor cmykB = (CMYKColor) colorB;
            float c = Math.abs(cmykA.getCyan() - cmykB.getCyan());
            float m = Math.abs(cmykA.getMagenta() - cmykB.getMagenta());
            float y = Math.abs(cmykA.getYellow() - cmykB.getYellow());
            float k = Math.abs(cmykA.getBlack() - cmykB.getBlack());
            return c+m+y+k < 0.01;
        }
        // TODO: Implement comparison for other color types
        return false;
    }

    final BaseColor headerColor;
}

DividerAndColorAwareTextExtractionStrategy.java

In renderText we recognize text in the headerColor and add its respective ascender (top) line to the lines list.
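The approximates method leaves other color types as a TODO. For RGB colors an analogous test could look like the following sketch. It works on plain 0..255 components (which, as far as I know, is what BaseColor's getRed(), getGreen() and getBlue() return) and re-uses the 0.01 threshold from the CMYK branch as an assumption, not as a tested value.

```java
// Illustrative only: an RGB counterpart to the CMYK comparison, on plain ints.
final class RgbApproxSketch {
    /** Components are expected in the range 0..255. */
    static boolean approximates(int r1, int g1, int b1, int r2, int g2, int b2) {
        // Sum of the per-component differences, scaled to the 0..1 range
        // used by the CMYK comparison.
        double diff = (Math.abs(r1 - r2) + Math.abs(g1 - g2) + Math.abs(b1 - b2)) / 255.0;
        return diff < 0.01;
    }
}
```

With 8-bit components this threshold tolerates a total deviation of at most two steps, so it is close to an exact match; a real document may need a wider tolerance.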

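Beware: We add the ascender line of each chunk in the given color. Actually we should join the ascender lines of all text chunks forming a single header line. As the blue header lines in the sample document consist of merely a single chunk each, we don't need to do so in this sample code. A generic solution would have to be extended appropriately.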
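One way such a join might look is sketched below on a stand-in `Seg` class (not an iText type): segments at nearly the same height whose horizontal gap is small are merged into one line. The tolerances are hypothetical parameters, and the greedy pass relies on near-equal heights ending up adjacent after sorting, which may not hold for tightly interleaved lines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative only: joins the ascender lines of chunks that form one header line.
final class AscentLineJoinSketch {
    /** A horizontal segment from x1 to x2 (x1 <= x2) at height y. */
    static final class Seg {
        final double x1, x2, y;
        Seg(double x1, double x2, double y) { this.x1 = x1; this.x2 = x2; this.y = y; }
    }

    /** Merges segments whose heights differ by less than yTol and whose gap is below gapTol. */
    static List<Seg> join(List<Seg> segs, double yTol, double gapTol) {
        List<Seg> sorted = new ArrayList<Seg>(segs);
        Collections.sort(sorted, new Comparator<Seg>() {
            @Override public int compare(Seg a, Seg b) {
                int byY = Double.compare(b.y, a.y); // top first
                return byY != 0 ? byY : Double.compare(a.x1, b.x1);
            }
        });
        List<Seg> result = new ArrayList<Seg>();
        Seg current = null;
        for (Seg s : sorted) {
            if (current != null && Math.abs(current.y - s.y) < yTol && gap(current, s) < gapTol) {
                current = new Seg(Math.min(current.x1, s.x1), Math.max(current.x2, s.x2),
                        Math.max(current.y, s.y));
            } else {
                if (current != null) result.add(current);
                current = s;
            }
        }
        if (current != null) result.add(current);
        return result;
    }

    /** Horizontal distance between two segments; negative if they overlap. */
    private static double gap(Seg a, Seg b) {
        return Math.max(b.x1 - a.x2, a.x1 - b.x2);
    }
}
```

Headers at the same height in different columns stay separate because their horizontal gap exceeds the tolerance.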

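As lines is not properly sorted anymore (the additional ascender lines come after all divider lines of the same column), we have to sort that list first.

Please be aware that the Comparator used here is not really proper: it ignores a certain difference in the x coordinate, which makes it not truly transitive. It only works if the individual lines of the same column have approximately the same starting x coordinates, distinctly different from those of other columns.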
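A transitive alternative is to assign every line to a column first and then compare by column index and height. The sketch below does this on a stand-in `Line` class; `columnStarts`, `columnOf` and the gap-based column detection are assumptions made for illustration, not something the original code does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative only: column-wise ordering with a consistent comparator.
final class ColumnSortSketch {
    static final class Line {
        final double x, y;
        Line(double x, double y) { this.x = x; this.y = y; }
    }

    /** Sorts the start x values and opens a new column wherever the gap reaches the tolerance. */
    static double[] columnStarts(List<Line> lines, double tolerance) {
        double[] xs = new double[lines.size()];
        for (int i = 0; i < xs.length; i++) xs[i] = lines.get(i).x;
        Arrays.sort(xs);
        List<Double> starts = new ArrayList<Double>();
        for (int i = 0; i < xs.length; i++) {
            if (i == 0 || xs[i] - xs[i - 1] >= tolerance) starts.add(xs[i]);
        }
        double[] result = new double[starts.size()];
        for (int i = 0; i < result.length; i++) result[i] = starts.get(i);
        return result;
    }

    /** Index of the last column start that is not greater than x. */
    static int columnOf(double[] starts, double x) {
        int column = 0;
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] <= x) column = i;
        }
        return column;
    }

    /** Sorts in place: columns left to right, lines top to bottom within a column. */
    static void sortColumnWise(List<Line> lines, double tolerance) {
        final double[] starts = columnStarts(lines, tolerance);
        Collections.sort(lines, new Comparator<Line>() {
            @Override public int compare(Line a, Line b) {
                int byColumn = Integer.compare(columnOf(starts, a.x), columnOf(starts, b.x));
                return byColumn != 0 ? byColumn : Double.compare(b.y, a.y);
            }
        });
    }
}
```

Because every line maps to exactly one column index, the comparison is a plain lexicographic order on (column, height) and therefore consistent. The gap-based column detection can still chain nearby x values into one column, so it carries the same "columns must be clearly separated" assumption.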

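In a test run (cf. DividerAndColorAwareTextExtraction.java, method test20150211600_346) the sections found are additionally split at the blue headers "Armazém" and "Balneário Camboriú".

Please be aware of the restrictions I pointed out above. If, for example, you want to split at the grey headers in your sample document, you'll have to improve the methods above, as those headers do not come in a single chunk.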