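I have 5 PDF files that I want to convert to txt files. 3 of them convert fine. The other 2 return only (cid:number) tokens, for example:
(cid:47)(cid:57)(cid:3)(cid:69)(cid:72)
I wrote the code with pdfminer. Does anyone know how to fix this problem, or how to adjust my code?
By the way: the text is in German, with no CJK characters. I also tried converting the files to text on https://www.pdf2go.com, and that worked.
Here is my code: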
import sys
import io
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
# import the regex module
import re
import os

filename = 'test.pdf'
page_start_input = 24
# pdfminer page numbers are zero-based, so start at page_start_input - 1
pages = list(range(page_start_input - 1, 500))


def pdfparser(path):
    resource_manager = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    pagenos = set(pages)
    laparams = LAParams()
    device = TextConverter(resource_manager, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)
    # Run every selected page through the interpreter; the text accumulates in retstr
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos):
            interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()
    # print(text)
    # Write the extracted text out as UTF-8
    with open("test_out.txt", "w", encoding='utf-8') as out_file:
        out_file.write(text)


pdfparser(filename)
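
To tell the two failing files apart from the three working ones, the (cid:number) tokens in the extracted text can be pulled out and counted. As far as I understand, pdfminer writes such a placeholder when it cannot map a character ID of a font to Unicode. The following is only a minimal sketch for diagnosing the output, not a fix; `CID_PATTERN` and `extract_cids` are hypothetical names of my own and not part of pdfminer.

```python
import re

# Matches the "(cid:<number>)" placeholders seen in the failing output.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def extract_cids(text):
    """Return the numeric IDs of all (cid:N) placeholders found in text."""
    return [int(number) for number in CID_PATTERN.findall(text)]


# The sample output from the question.
sample = "(cid:47)(cid:57)(cid:3)(cid:69)(cid:72)"
print(extract_cids(sample))        # [47, 57, 3, 69, 72]
print(extract_cids("Hallo Welt"))  # []
```

Running `extract_cids` on the contents of test_out.txt would show whether a file came out as placeholders only or as a mix of real text and placeholders.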