I am using pdfminer to convert the text in a PDF to plain text. This is the code I am currently using:
    # -*- coding: utf-8 -*-
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import BytesIO
    import re
    import os
    import os.path
    import glob
    from collections import Counter

    def pdf_to_text(path):
        manager = PDFResourceManager()
        retstr = BytesIO()
        layout = LAParams(all_texts=True)
        device = TextConverter(manager, retstr, laparams=layout)
        filepath = open(path, 'rb')
        interpreter = PDFPageInterpreter(manager, device)
        for page in PDFPage.get_pages(filepath, check_extractable=True):
            interpreter.process_page(page)
        text = retstr.getvalue()
        filepath.close()
        device.close()
        retstr.close()
        return text

    if __name__ == "__main__":
        path = 'C:/folder/document.pdf'
        text = str(pdf_to_text(path))
        print(text)
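I also need to insert page numbers into the text, that is, a line break carrying the page number after the text of every page. I have been scratching my head for days trying to figure out how to make this work. Can anyone help?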
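
One possible direction, as a minimal sketch rather than a confirmed solution: extract each page into its own buffer, then join the pages with a page-number line after each one. It assumes Python 3 with pdfminer.six, where `TextConverter` can write to a `StringIO` so no byte decoding is needed. The names `add_page_markers`, `iter_page_texts`, `pdf_to_text_with_page_numbers` and the marker format `--- Page N ---` are made up for illustration.

```python
from io import StringIO


def add_page_markers(page_texts, template="\n--- Page {n} ---\n"):
    """Join per-page texts, appending a page-number line after each page."""
    parts = []
    for n, page_text in enumerate(page_texts, start=1):
        parts.append(page_text)
        parts.append(template.format(n=n))
    return "".join(parts)


def iter_page_texts(path):
    """Yield the extracted text of each page in the PDF, one page at a time."""
    # Imported here so add_page_markers can be used without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager()
    layout = LAParams(all_texts=True)
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp, check_extractable=True):
            # A fresh text buffer per page keeps page boundaries explicit.
            buffer = StringIO()
            device = TextConverter(manager, buffer, laparams=layout)
            PDFPageInterpreter(manager, device).process_page(page)
            page_text = buffer.getvalue()
            device.close()
            yield page_text


def pdf_to_text_with_page_numbers(path):
    """Return the whole document as text with a marker line after each page."""
    return add_page_markers(iter_page_texts(path))
```

With this layout, `print(pdf_to_text_with_page_numbers('C:/folder/document.pdf'))` would take the place of the `str(pdf_to_text(path))` call. Keeping the joining step separate from the extraction makes the marker format easy to change without touching the pdfminer code.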