I am trying to extract text from a merged PDF file and convert it to a txt file using PDFMiner, but I get a PDFInterpreter error: Unknown operator 'QQ'. Here is the code:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def get_pdf_file_content(path_to_pdf):
    resource_manager = PDFResourceManager(caching=True)
    out_text = StringIO()
    la_params = LAParams()
    # The converter writes the text of each processed page into out_text.
    text_converter = TextConverter(resource_manager, out_text,
                                   laparams=la_params)
    interpreter = PDFPageInterpreter(resource_manager, text_converter)
    # Open the PDF in binary mode; the with-block closes it even on error.
    with open(path_to_pdf, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0,
                                      password="", caching=True,
                                      check_extractable=True):
            interpreter.process_page(page)
    text = out_text.getvalue()
    text_converter.close()
    out_text.close()
    return text


path_to_pdf = 'merged.pdf'
print(get_pdf_file_content(path_to_pdf))
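The code above only prints the extracted text; the step of writing it to a txt file is not shown. A minimal sketch of that step, assuming a hypothetical helper named `save_text` and an output path such as `merged.txt` (neither is part of PDFMiner), might look like this:

```python
from pathlib import Path


def save_text(text, path_to_txt):
    """Write the extracted text to a UTF-8 file (hypothetical helper).

    Returns the number of characters in the text that was written.
    """
    # Explicit UTF-8 avoids depending on the platform default encoding,
    # which differs between Windows and other systems.
    Path(path_to_txt).write_text(text, encoding="utf-8")
    return len(text)
```

It would be called as `save_text(get_pdf_file_content(path_to_pdf), 'merged.txt')`. This does not address the 'QQ' operator error itself; it only covers the output step once extraction succeeds.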
Answer 0: (score: 0)
I am a Windows user and not used to working in a shell, so I don't know PDFMiner, but you could try this online converter: https://pdftotext.com/ It works fine for me.