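Tabula looks like a great tool for extracting tabular data from PDFs. There are plenty of examples of how to call it from the command line or use it from Python, but there does not seem to be any documentation for Java. Does anyone have a working example?
Note that Tabula does provide its source code, but there seems to be some confusion between versions. For example, the example on GitHub references a TableExtractor class that does not seem to exist in the JAR.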
Answer 0 (score: 6)
You can call Tabula from Java with the following code. I hope this helps.
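import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

public class TabulaExample {
    public static void main(String[] args) throws IOException {
        final String FILENAME = "../test.pdf";
        // PDDocument.load is the PDFBox 2.x API; try-with-resources closes the document
        try (PDDocument pd = PDDocument.load(new File(FILENAME))) {
            int totalPages = pd.getNumberOfPages();
            System.out.println("Total Pages in Document: " + totalPages);
            ObjectExtractor oe = new ObjectExtractor(pd);
            SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
            Page page = oe.extract(1); // page numbers are 1-based
            // detect the tables on the page, then print their text
            List<Table> tables = sea.extract(page);
            for (Table table : tables) {
                List<List<RectangularTextContainer>> rows = table.getRows();
                for (List<RectangularTextContainer> cells : rows) {
                    for (RectangularTextContainer cell : cells) {
                        System.out.print(cell.getText() + "|");
                    }
                    System.out.println(); // end the line after each row
                }
            }
        }
    }
}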
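If the pipe-separated output needs to be reused (written to a file, compared in a test) rather than just printed, it can help to build one string per row. The following is a minimal sketch, independent of Tabula: it assumes the cell text has already been pulled out with getText() into plain strings, and the names RowFormatter, toLine and toLines are hypothetical, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tabula: formats rows of already-extracted cell text.
final class RowFormatter {

    // Joins the cells of one row with the given delimiter; null cells become empty strings.
    static String toLine(List<String> cells, String delimiter) {
        List<String> cleaned = new ArrayList<>();
        for (String cell : cells) {
            // Collapse any whitespace runs, including line breaks, into single spaces.
            cleaned.add(cell == null ? "" : cell.replaceAll("\\s+", " ").trim());
        }
        return String.join(delimiter, cleaned);
    }

    // Formats every row of a table as one delimited line.
    static List<String> toLines(List<List<String>> rows, String delimiter) {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(toLine(row, delimiter));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Item", "Qty"),
                Arrays.asList("Widget\nA", " 3 "));
        for (String line : toLines(rows, "|")) {
            System.out.println(line);
        }
    }
}
```

With Tabula, the input lists would be filled by calling getText() on each RectangularTextContainer in a row. Note that this sketch does not escape delimiters that occur inside a cell.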
Answer 1 (score: 0)
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// ****** Extract text from the detected tables and transfer it to XLSX ******
// sea and page are the SpreadsheetExtractionAlgorithm and Page from the answer above
try (XSSFWorkbook wb = new XSSFWorkbook()) {
    Sheet sheet = wb.createSheet("Barang Baik");
    List<Table> tables = sea.extract(page);
    int rowNumber = 0; // next free row in the sheet, carried across tables
    for (Table t : tables) {
        List<List<RectangularTextContainer>> rows = t.getRows();
        for (List<RectangularTextContainer> cells : rows) {
            Row row = sheet.createRow(rowNumber++);
            for (int j = 0; j < cells.size(); j++) {
                Cell cell = row.createCell(j);
                cell.setCellValue(cells.get(j).getText());
            }
        }
    }
    // write the workbook once, after all tables have been added
    try (FileOutputStream fos = new FileOutputStream("C:\\your\\file.xlsx")) {
        wb.write(fos);
    }
}
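When several tables from a page go into the same sheet, each table has to start at the first row the previous tables left free. A minimal sketch of that bookkeeping, using plain row counts instead of POI or Tabula types; SheetLayout and startRows are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: computes where each table starts when tables are stacked in one sheet.
final class SheetLayout {

    // Given the number of rows in each table, returns the 0-based sheet row where each table begins.
    static List<Integer> startRows(List<Integer> rowCounts) {
        List<Integer> starts = new ArrayList<>();
        int next = 0; // next free row in the sheet
        for (int count : rowCounts) {
            starts.add(next);
            next += count;
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three tables with 3, 0 and 2 rows
        System.out.println(startRows(Arrays.asList(3, 0, 2))); // prints [0, 3, 3]
    }
}
```

Keeping a running counter like this avoids probing the sheet for the first empty row and relying on a caught exception to stop the search.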