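I'm working on a project that extracts tabular data from PDFs into Excel. First I merged all the PDFs into a single PDF, then I tried to extract the tables with the tabula package, but I ran into an error.
I think the error is caused by the number of columns: perhaps some tables have 8 columns and others have 9.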
This is the code I used to merge all the PDFs into a single PDF:
import os
from PyPDF2 import PdfFileMerger

folder = 'C:/Users/User.LAPTOP-2TC2V5HI/Documents/WOD PDF/'
# Full path of every PDF file in the folder
x = [folder + fn for fn in os.listdir(folder) if fn.endswith('.pdf')]

merger = PdfFileMerger()
for pdf in x:
    # append() accepts a file path as well as an open file object
    merger.append(pdf)
with open("result.pdf", "wb") as fout:
    merger.write(fout)
merger.close()
Then I used the following code to extract the tables:
from tabula import read_pdf
from tabulate import tabulate
df = read_pdf('result.pdf', pages='all', mulitple_tables=True, names = ('col1','col2','col3','col4','col5','col6','col7','col8','col9'), error_bad_lines=False)
df
But I get this error:
'CSVParseError: Error failed to create DataFrame with different column tables. Try to set `multiple_tables=True` or set `names` option for `pandas_options`. , caused by ParserError('Error tokenizing data. C error: Expected 8 fields in line 169, saw 9\n',)'
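Two things may be worth checking here. The keyword in the call above is spelled `mulitple_tables`, while the error message names the option `multiple_tables`, so the option may never have taken effect. The message also suggests that `names` belongs inside `pandas_options` rather than being passed directly. Below is a minimal sketch of one possible workaround, not a confirmed fix: read each table separately, pad the narrower tables to a common width, and stack them. The helpers `pad_rows` and `extract_tables` are hypothetical names introduced for illustration, and the sketch assumes that `read_pdf` returns a list of DataFrames when `multiple_tables=True` and that `pandas_options={"header": None}` keeps header rows as ordinary data.

```python
def pad_rows(rows, width, fill=""):
    """Return a copy of `rows` in which every row has exactly `width` cells."""
    padded = []
    for row in rows:
        row = list(row)
        if len(row) > width:
            # Refuse to drop data silently when a table is wider than expected.
            raise ValueError(f"row has {len(row)} cells, expected at most {width}")
        padded.append(row + [fill] * (width - len(row)))
    return padded


def extract_tables(pdf_path, names):
    """Read every table in `pdf_path` and stack them under the column names `names`."""
    # Imported lazily because tabula-py needs a Java runtime to be installed.
    import pandas as pd
    from tabula import read_pdf

    # Assumption: with multiple_tables=True this returns a list of DataFrames,
    # one per detected table, each with its own number of columns.
    tables = read_pdf(pdf_path, pages="all", multiple_tables=True,
                      pandas_options={"header": None})

    # Pad each table to len(names) columns so that they can be concatenated.
    frames = [pd.DataFrame(pad_rows(t.values.tolist(), len(names)), columns=names)
              for t in tables]
    # pd.concat raises ValueError if no tables were found at all.
    return pd.concat(frames, ignore_index=True)
```

With this in place, something like `extract_tables('result.pdf', ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9'])` would return a single DataFrame, which could then be written out with `to_excel`. Padding on the right assumes the missing column is the last one; if the 8-column tables lack a column in the middle, the cells would need to be realigned by hand.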