How do I extract text from a merged PDF file and convert it to a txt file?

Time: 2020-10-29 07:30:05

Tags: text-extraction

I am trying to extract text from a merged PDF file and convert it to a txt file using PDFMiner, but I am running into a PDFInterpreter error: unknown operator "QQ". Here is the code:

    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage


    def get_pdf_file_content(path_to_pdf):
        """Return the text of every page in the PDF as one string."""
        resource_manager = PDFResourceManager(caching=True)
        out_text = StringIO()
        text_converter = TextConverter(resource_manager, out_text,
                                       laparams=LAParams())
        interpreter = PDFPageInterpreter(resource_manager, text_converter)
        # Open in binary mode; the with-block closes the file even on error.
        with open(path_to_pdf, 'rb') as fp:
            # An empty pagenos set and maxpages=0 mean "process all pages".
            for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0,
                                          password="", caching=True,
                                          check_extractable=True):
                interpreter.process_page(page)
        text = out_text.getvalue()
        text_converter.close()
        out_text.close()
        return text


    path_to_pdf = 'merged.pdf'
    print(get_pdf_file_content(path_to_pdf))

1 answer:

Answer 0: (score: 0)

I am a Windows user, so I don't know PDFMiner and I am not used to working in a shell, but you could try this online converter: https://pdftotext.com/. It worked fine for me.
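If you would rather stay in Python, the "convert to a txt file" step could look roughly like the sketch below. It assumes the pdfminer.six package is installed and uses its `pdfminer.high_level.extract_text` function. The `pdf_to_txt` name and the `extract` parameter are hypothetical helpers made up for this example and are not part of PDFMiner. This has not been checked against the file from the question, so it may or may not get past the unknown-operator error.

```python
from pathlib import Path


def pdf_to_txt(pdf_path, txt_path, extract=None):
    """Extract text from pdf_path and write it to txt_path as UTF-8.

    `extract` is a hypothetical hook for this sketch: any callable that
    takes a file path and returns the extracted text as a string.
    When omitted, pdfminer.six's extract_text is used (assumed installed).
    Returns the number of characters written.
    """
    if extract is None:
        # Imported lazily so a different extractor can be swapped in
        # without pdfminer.six being present.
        from pdfminer.high_level import extract_text as extract
    text = extract(str(pdf_path))
    Path(txt_path).write_text(text, encoding="utf-8")
    return len(text)
```

Calling `pdf_to_txt("merged.pdf", "merged.txt")` would then write the extracted text next to the PDF. Keeping the extractor as a parameter is only a convenience, so that another library can be tried if PDFMiner keeps failing on the merged file.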