I am using pdfminer to extract text from a PDF document. I need to skip the table content before the text is written to a text file.
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

pdf_file = 'input.pdf'  # placeholder path; the original snippet does not define it

# Set up the layout-analysis pipeline.
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

total_text = []
with open(pdf_file, 'rb') as pdf_file_instance:
    for page in PDFPage.get_pages(pdf_file_instance):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            # Keep every horizontal text box; at the moment this also picks up table text.
            if isinstance(element, LTTextBoxHorizontal):
                data = element.get_text().strip()
                total_text.append(data)
I use the code above to convert the PDF document to text. Any suggestions would be appreciated. Thanks in advance.
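
One direction that might work is sketched below. It is a rough heuristic that has not been tested against real PDFs. It assumes the tables are drawn with ruling lines or rectangles, which pdfminer exposes as `LTLine` and `LTRect` objects with a `bbox` of `(x0, y0, x1, y1)`. Touching rulings are merged into candidate table regions, and any text box whose centre falls inside a region is skipped. The helper names (`overlaps`, `merge_regions`, `center_in`, `skip_table_text`) and the `tol` parameter are made up for illustration and are not part of pdfminer. The sketch works on plain bbox tuples so it can be tried without a PDF.

```python
def overlaps(a, b, tol=2.0):
    """True if bboxes a and b (x0, y0, x1, y1) touch or overlap within tol points."""
    return (a[0] <= b[2] + tol and b[0] <= a[2] + tol and
            a[1] <= b[3] + tol and b[1] <= a[3] + tol)


def merge_regions(boxes, tol=2.0):
    """Greedily merge touching ruling boxes into larger candidate table regions."""
    regions = []
    for box in boxes:
        box = tuple(box)
        merged = True
        while merged:
            merged = False
            for other in regions:
                if overlaps(box, other, tol):
                    # Absorb the existing region and re-check against the rest.
                    regions.remove(other)
                    box = (min(box[0], other[0]), min(box[1], other[1]),
                           max(box[2], other[2]), max(box[3], other[3]))
                    merged = True
                    break
        regions.append(box)
    return regions


def center_in(box, region):
    """True if the centre point of box lies inside region."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]


def skip_table_text(text_items, ruling_boxes, tol=2.0):
    """Return the texts from (bbox, text) pairs that lie outside every table region."""
    regions = merge_regions(ruling_boxes, tol)
    return [text for bbox, text in text_items
            if not any(center_in(bbox, r) for r in regions)]


# Hypothetical page: four rulings forming a table frame, one cell, one paragraph.
# With pdfminer, ruling_boxes would come from element.bbox of LTRect/LTLine items
# and text_items from (element.bbox, element.get_text().strip()) of text boxes.
rulings = [(50, 100, 300, 101), (50, 200, 300, 201),
           (50, 100, 51, 200), (299, 100, 300, 200)]
items = [((60, 120, 120, 135), "cell"),
         ((50, 400, 300, 420), "Body paragraph")]
print(skip_table_text(items, rulings))  # ['Body paragraph']
```

This would not catch tables laid out with whitespace only, and a page border or background rectangle would be mistaken for a table, so the regions probably need extra filtering, for example by size.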