Question

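I am trying to extract text from a PDF file using Python. My main goal is to write a program that reads a bank statement and extracts its text to update an Excel file, so that I can easily record my monthly spending. Right now I am focusing only on extracting the text from the PDF file, but I don't know how to do that.

What is currently the best and easiest way to extract the text of a PDF file into a string? Which library is best to use today, and how should I go about it?

I tried using PyPDF2, but every time I call extractText() on any page, it returns an empty string. I also tried installing textract, but I got errors because it needs additional libraries.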

import PyPDF2

pdfFileObj = open("January2019.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

This prints an empty string when it should print the contents of the page.

Answer 1

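I tried many approaches, including PyPDF2 and Tika, and they all failed. I finally found a module that worked for me, pdfplumber, which you could try as well.

I hope this helps.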

import pdfplumber
pdf = pdfplumber.open('pdffile.pdf')
page = pdf.pages[0]
text = page.extract_text()
print(text)
pdf.close()
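
Several answers in this thread report that one library returns empty text where another succeeds, so it may help to try extractors in order and keep the first non-empty result. The sketch below is only an illustration: `first_nonempty_text` and `extract_with_pdfplumber` are hypothetical helper names, not part of pdfplumber or any other library.

```python
from typing import Callable, Iterable, Optional


def first_nonempty_text(path: str,
                        extractors: Iterable[Callable[[str], Optional[str]]]) -> str:
    """Return the first non-blank result among the given extractor callables.

    Each extractor takes a file path and returns text (or None). Extractors
    that raise or return blank text are skipped. Hypothetical helper.
    """
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A failing backend should not stop us from trying the next one.
            continue
        if text and text.strip():
            return text
    return ""


def extract_with_pdfplumber(path: str) -> str:
    """Extract the text of every page with pdfplumber (imported lazily)."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None for a page without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

It could be called as `first_nonempty_text("January2019.pdf", [extract_with_pdfplumber])`. If every backend returns nothing, the PDF may consist of scanned images, in which case text extraction alone will not help and OCR would be needed.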

Answer 2

Using Tika worked for me!

from tika import parser

rawText = parser.from_file('January2019.pdf')

rawList = rawText['content'].splitlines()

This makes it very easy to extract every line of the bank statement separately into a list.
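
Once the statement is a list of lines, each line can be turned into a record and written out in a form Excel can open. The sketch below uses only the standard library and assumes, purely for illustration, that a transaction line looks like `2019-01-05 COFFEE SHOP -4.50`. Real statements differ, so the pattern `LINE_RE` and the helper names are hypothetical and would need adjusting.

```python
import csv
import io
import re
from decimal import Decimal

# Assumed line layout: ISO date, free-text description, signed amount.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+(-?\d[\d,]*\.\d{2})$")


def parse_statement_lines(lines):
    """Return (date, description, amount) tuples for lines matching LINE_RE."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            date, description, amount = match.groups()
            # Drop thousands separators before converting to Decimal.
            rows.append((date, description, Decimal(amount.replace(",", ""))))
    return rows


def rows_to_csv(rows):
    """Serialize parsed rows as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "description", "amount"])
    writer.writerows(rows)
    return buffer.getvalue()
```

With the list from the answer, this would be `rows_to_csv(parse_statement_lines(rawList))`. Lines that do not match the assumed layout (headers, blank lines, totals) are skipped silently.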

Answer 3

If you are looking for a more actively maintained project, take a look at PyMuPDF. Install it with pip install pymupdf and use it like this:

import fitz

def get_text(filepath: str) -> str:
    with fitz.open(filepath) as doc:
        text = ""
        for page in doc:
            # get_text() is the current method name; getText() was the older camelCase spelling
            text += page.get_text().strip()
        return text

Answer 4

PyPDF2 cannot read the whole PDF correctly. Use this code instead.

import pdftotext

# pdftotext.PDF expects a file object opened in binary mode
with open("January2019.pdf", "rb") as pdf_file:
    pdf = pdftotext.PDF(pdf_file)

# Iterate over all the pages
for page in pdf:
    print(page)

Answer 5

import PyPDF2
# A hyphen is not allowed in a Python identifier, so use an underscore
pdf_file = open('January2019.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdf_file)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())

Answer 6

import pdftables_api
import os

c = pdftables_api.Client('MY-API-KEY')

file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"

for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        c.xlsx(os.path.join(file_path,file), file+'.xlsx')

Go to https://pdftables.com to get an API key.

CSV: format=csv

XML: format=xml

HTML: format=html

XLSX: format=xlsx-single or format=xlsx-multiple

Answer 7

Try pdfreader. It can extract either plain text or decoded text that still contains the PDF markup (what the answer calls "pdf markdown").
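
As a rough illustration of the two kinds of output mentioned for pdfreader, here is a minimal sketch. It assumes pdfreader's `SimplePDFViewer` interface (`render()`, `next()`, `canvas.strings`, `canvas.text_content`, and the `PageDoesNotExist` exception); the function name `extract_with_pdfreader` is hypothetical, and the calls should be checked against the installed version before relying on them.

```python
def extract_with_pdfreader(path):
    """Return (plain_text, pdf_markup) for the PDF at `path`.

    Assumes pdfreader's SimplePDFViewer API; hypothetical helper.
    """
    # Imported lazily so that defining this helper does not require pdfreader.
    from pdfreader import SimplePDFViewer, PageDoesNotExist

    plain_parts, markup_parts = [], []
    with open(path, "rb") as fd:
        viewer = SimplePDFViewer(fd)
        try:
            while True:
                viewer.render()
                # canvas.strings: the decoded text strings found on the page
                plain_parts.append("".join(viewer.canvas.strings))
                # canvas.text_content: the text still wrapped in PDF operators
                markup_parts.append(viewer.canvas.text_content)
                viewer.next()
        except PageDoesNotExist:
            # Raised when next() moves past the last page.
            pass
    return "\n".join(plain_parts), "\n".join(markup_parts)
```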

Answer 8

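PyPDF2 is very unreliable for extracting text from PDFs. This is also pointed out here. It says:

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.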

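You can instead install pdfminer with pip install pdfminer and use that, or you can use another open-source utility called pdftotext, created by xpdfreader. Instructions for using the utility are provided on its page.

You can download the command-line tool from here and call the pdftotext.exe utility using subprocess. How to use subprocess is explained in detail here.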
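
A minimal sketch of the subprocess approach, assuming the pdftotext executable is on the PATH. The helper names `build_pdftotext_command` and `run_pdftotext` are hypothetical. The `-layout` and `-enc` options and the use of `-` as the output file to write to standard output are standard pdftotext usage, but it is worth confirming them against the help output of the installed build.

```python
import subprocess


def build_pdftotext_command(pdf_path, layout=True, executable="pdftotext"):
    """Build the argument list for the pdftotext command-line tool."""
    command = [executable]
    if layout:
        # -layout asks pdftotext to keep the physical layout of the page,
        # which tends to keep statement columns aligned.
        command.append("-layout")
    # Request UTF-8 output, then pass the input file; "-" sends text to stdout.
    command += ["-enc", "UTF-8", pdf_path, "-"]
    return command


def run_pdftotext(pdf_path, layout=True, executable="pdftotext"):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout, executable),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

On Windows, `executable` could be the full path to pdftotext.exe. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.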

Answer 9

Here is an alternative solution for Python 3.8 on Windows 10.

Sample test PDF: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing

#pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert the PDF at the given file path to text.

    :param path: path to the PDF file
    :return: the extracted text as a string
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
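
For simple cases, pdfminer.six also ships a high-level helper, `pdfminer.high_level.extract_text`, that wraps the same resource manager, converter, and interpreter steps. The sketch below shows that shorter route. The wrapper name `pdf_to_text` is hypothetical, and the default layout parameters may produce slightly different spacing than the explicit version.

```python
def pdf_to_text(path, password=""):
    """Return all text in the PDF at `path` using pdfminer.six's high-level API."""
    # Imported lazily so this module loads even if pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    return extract_text(path, password=password)
```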

Answer 10

Try this.

In a terminal: pip install PyPDF2

import PyPDF2
pdfFileObject = open('mypdf.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())

Answer 11

I think this code is exactly what you are looking for:

import glob
import pdfplumber

for filename in glob.glob("*.pdf"):
    output_file = filename.replace('.pdf', '.txt')
    # Context managers close both files even if extraction fails
    with pdfplumber.open(filename) as pdf, open(output_file, "a+") as out:
        # Iterate over the real pages instead of probing up to 10000 indexes
        for page in pdf.pages:
            # extract_text() may return None for a page without text
            text = page.extract_text() or ""
            print(text)
            out.write(text)

How to extract text from a PDF in Python 3.7.3
