Question

I'm trying to extract tables from a PDF using Python 3.6. [pyPDF2][1] seems to fail, and [pdfminer][2] is not compatible with 3.x. I found a Python wrapper for [tabula][3].

import tabula
file_list = get_pdf_list()

text = tabula.read_pdf(file_list[0])
print(text)

tabula.convert_into(file_list[0], "test.json", output_format="json")

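Both read_pdf and convert_into return empty results. PyPDF2 has the same problem. No errors are raised at runtime.

I'm starting to think it has something to do with the format of my PDF. Does anyone have more experience with this? I'm trying to extract the values from the tables in the PDF.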

Answer 1

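Hopefully you've already found an answer, but here is my code anyway. I'd say tabula is one of the PDF table extractors worth using, although I ran into plenty of problems along the way.

Install the latest version of the tabula package:

pip install tabula-py

Here is the code: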

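import os
import tabula  # tabula-py; it needs a Java runtime installed

# Build the full path to the PDF. Calling os.path.abspath() on its own
# does not change the working directory, so pass the path explicitly.
pdf_path = os.path.join("E:/Documents/myPy", "MyPDF.pdf")

# With multiple_tables=True, read_pdf returns a list of DataFrames, one per table.
# (Older releases also exposed this function as tabula.wrapper.read_pdf.)
tables = tabula.read_pdf(pdf_path, multiple_tables=True, pages='all')

# Write each table to its own Excel file (to_excel needs a writer such as openpyxl).
for i, table in enumerate(tables, start=1):
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    print(i)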

Give it a try!
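
If read_pdf still comes back empty, one thing that may be worth trying is switching between tabula's two extraction modes: lattice (for tables drawn with ruling lines) and stream (for tables laid out with whitespace). The sketch below is only an illustration of that idea, not a guaranteed fix. The helper name extract_tables and its read_pdf parameter are hypothetical and were added here so the logic can be exercised without a real PDF. Also keep in mind that tabula works on text-based PDFs, so a scanned, image-only PDF would likely return nothing in either mode.

```python
def extract_tables(pdf_path, read_pdf=None):
    """Try lattice mode, then stream mode; return (mode, tables) for the
    first mode that yields at least one non-empty table."""
    if read_pdf is None:
        # Imported lazily: tabula-py needs a Java runtime to be installed.
        import tabula
        read_pdf = tabula.read_pdf

    for mode in ("lattice", "stream"):
        # lattice=True / stream=True are tabula-py read_pdf options.
        tables = read_pdf(pdf_path, pages="all", multiple_tables=True,
                          **{mode: True})
        # Drop tables with no rows (len() of a DataFrame is its row count).
        tables = [t for t in tables if len(t) > 0]
        if tables:
            return mode, tables

    # Nothing was found in either mode.
    return None, []
```

Usage would look like mode, tables = extract_tables("MyPDF.pdf"); if mode comes back as None, the problem is more likely the PDF itself than the extraction options.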

