I am using tabula-0.9.2 with Python 3.6.1 and Java version "1.8.0_45" to extract tables from some PDFs as follows:
from tabula import read_pdf_table
read_pdf_table(pdf_file, pages=1, silent=True)
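This works in most cases, but for a few of the PDFs I get the exception shown below. Does anyone know how to find the root cause? Is there a read_pdf_table parameter I have missed that could be behind this? I believe all my dependency versions are correct, unless I have overlooked something. Please advise. Thanks.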
Jul 13, 2017 3:52:31 PM org.apache.pdfbox.pdfviewer.PageDrawer processTextPosition
SEVERE: java.io.IOException: Problem reading font data.
java.io.IOException: Problem reading font data.
at java.awt.Font.createFont0(Font.java:1000)
at java.awt.Font.createFont(Font.java:877)
at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getawtFont(PDTrueTypeFont.java:471)
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.drawString(PDSimpleFont.java:110)
at org.apache.pdfbox.pdfviewer.PageDrawer.processTextPosition(PageDrawer.java:260)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:504)
at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:56)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:269)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
at org.apache.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:139)
at org.apache.pdfbox.pdmodel.PDPage.convertToImage(PDPage.java:801)
at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:93)
at technology.tabula.CommandLineApp$TableExtractor.extractTablesBasic(CommandLineApp.java:372)
at technology.tabula.CommandLineApp$TableExtractor.extractTables(CommandLineApp.java:359)
at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:166)
at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:123)
at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:104)
at technology.tabula.CommandLineApp.main(CommandLineApp.java:74)
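
To narrow down which PDFs are affected, a minimal sketch like the one below could run the extraction file by file and record the ones that fail. The helper name `collect_failures` and its `extract` argument are hypothetical, introduced only for illustration; in my case `extract` would wrap the `read_pdf_table(pdf_file, pages=1, silent=True)` call shown above.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def collect_failures(
    pdf_files: Iterable[str],
    extract: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run `extract` on every file and return (results, failures).

    `extract` is a hypothetical callable supplied by the caller,
    e.g. lambda p: read_pdf_table(p, pages=1, silent=True).
    """
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for path in pdf_files:
        try:
            results[path] = extract(path)
        except Exception as exc:
            # Keep going so that every problematic file gets listed.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

This only catches problems that surface as Python exceptions. The SEVERE message above is logged on the Java side, so it may not raise in Python at all; in that case the Java output for each file would have to be inspected instead.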