Question

How can I extract data from a table in a PDF using PDFBox?

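The PDContentStream and PageStripper classes can be used to find the index and content of the text. I need to find the index of each row in the table. Can anyone explain which class to extend and which method to implement?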

I tried the following to extract the starting index of the text:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
// PDFBox 2.x package names; in 1.8.x these two classes live in org.apache.pdfbox.util.
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class Tables {
    public static void main(String[] args) throws IOException {
        File input = new File("test.pdf");
        File output = new File("SampleText.txt");

        //      PDFTextStripper pds=new PDFTextStripper();
        //      String text=pds.getText(pd);
        // Prefix the first chunk of every line with its Y coordinate.
        PDFTextStripper stripper = new PDFTextStripper()
        {
            boolean startOfLine = true;

            @Override
            protected void startPage(PDPage page) throws IOException
            {
                startOfLine = true;
                super.startPage(page);
            }

            @Override
            protected void writeLineSeparator() throws IOException
            {
                startOfLine = true;
                super.writeLineSeparator();
            }

            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException
            {
                if (startOfLine)
                {
                    TextPosition firstPosition = textPositions.get(0);
                    writeString(String.format("[%s]", firstPosition.getYDirAdj()));
                    startOfLine = false;
                }
                super.writeString(text, textPositions);
            }
        };

        // try-with-resources closes the document and closes (and so flushes) the writer.
        try (PDDocument pd = PDDocument.load(input);
             BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)))) {
            stripper.writeText(pd, wr);
        }
    }
}

Answer 1

I recently worked on a similar project in which I had to extract data from tables.

There are two options:

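1) You can use Tabula, an open-source tool for extracting tables from PDFs. http://tabula.technology/ https://github.com/tabulapdf/tabula You can call the Tabula command-line tool from your code and extract data from a specific area.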

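2) You can design your own algorithm to extract the table data. If you go with this option, you also need to extract the coordinates of the text. You can override the writeString method of the PDFTextStripper class (you can Google this). Then you need to work out how to use that information to get the details you need; the coordinates can be very helpful.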
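As a rough illustration of option 2, here is a minimal sketch of one way the coordinates could be turned into rows. It is not part of PDFBox: the class TableRows, the Chunk type and the yTolerance parameter are all hypothetical names made up for this example. It assumes you have already collected each piece of text together with its x and y values (for instance from getXDirAdj() and getYDirAdj() on the first TextPosition passed to writeString), and that text on the same table row has nearly the same y value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: groups positioned text chunks into table rows. */
class TableRows {

    /** One piece of text plus the coordinates it was drawn at. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Chunks whose y value lies within yTolerance of the first chunk of a row
     * are put on that row; cells inside a row are ordered left to right by x.
     */
    static List<List<String>> groupIntoRows(List<Chunk> chunks, float yTolerance) {
        // Sort top to bottom so that chunks of the same row end up adjacent.
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y));

        List<List<Chunk>> rows = new ArrayList<>();
        List<Chunk> current = new ArrayList<>();
        for (Chunk c : sorted) {
            // A jump in y larger than the tolerance starts a new row.
            if (!current.isEmpty() && Math.abs(c.y - current.get(0).y) > yTolerance) {
                rows.add(current);
                current = new ArrayList<>();
            }
            current.add(c);
        }
        if (!current.isEmpty()) {
            rows.add(current);
        }

        // Order the cells of each row by x and keep only the text.
        List<List<String>> result = new ArrayList<>();
        for (List<Chunk> row : rows) {
            row.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            List<String> cells = new ArrayList<>();
            for (Chunk c : row) {
                cells.add(c.text);
            }
            result.add(cells);
        }
        return result;
    }
}
```

Calling TableRows.groupIntoRows(chunks, 2f) on the collected chunks returns one list of cell strings per row. The tolerance is a guess that has to be tuned per document, and this sketch does not deal with cells that wrap over several lines or with merged cells.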

If your PDFs have a standard format, I would recommend Tabula, because there is not much work to do.
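If you go the Tabula route, calling the command-line tool from Java could look roughly like the sketch below. The class TabulaCli and its methods are hypothetical, and the jar path, file names and area are placeholders you would replace with your own. It assumes the tabula-java jar, whose command line accepts -p for the page, -a for the area (top,left,bottom,right in points) and -o for the output file.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper that shells out to the tabula-java command-line jar. */
class TabulaCli {

    /** Builds the command line; jarPath, pdfPath and outCsv are placeholders chosen by the caller. */
    static List<String> buildCommand(String jarPath, String pdfPath, String outCsv,
                                     int page, String area) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-p");
        cmd.add(String.valueOf(page)); // page number to read
        cmd.add("-a");
        cmd.add(area);                 // top,left,bottom,right in points
        cmd.add("-o");
        cmd.add(outCsv);               // where the extracted table is written
        cmd.add(pdfPath);
        return cmd;
    }

    /** Runs the command and returns the exit code of the process. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

For example, TabulaCli.run(TabulaCli.buildCommand("tabula.jar", "test.pdf", "table.csv", 1, "80,30,500,560")) would try to extract the table from that area of page 1 into table.csv. The area values here are made up; you would measure them for your own document.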
