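I am trying to extract text from a PDF file that contains text, tables, and images, and I want to save the extracted text to a file on my local system. This is the code I am working on.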
from PyPDF2 import PdfFileReader

# Load the pdf to the PdfFileReader object with default settings
with open("SHKelkar.pdf", "rb") as pdf_file:
    pdf_reader = PdfFileReader(pdf_file)
    total_pages = pdf_reader.numPages
    print(total_pages)
    print(f"The total number of pages in the pdf document is {pdf_reader.numPages}")
    for i in range(total_pages):
        page = pdf_file.page[i]
        textdata = page.extract_text()
        print(textdata)
Answer 0 (score: 0)
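You are reading the pages from pdf_file (the raw file handle) instead of pdf_reader (the PdfFileReader object). A file handle has no page attribute, so the lookup fails.
Compare with the working code below.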
from PyPDF2 import PdfFileReader

# Load the pdf to the PdfFileReader object with default settings
with open("sample.pdf", "rb") as pdf_file:
    pdf_reader = PdfFileReader(pdf_file)
    total_pages = pdf_reader.getNumPages()
    print(total_pages)
    print(f"The total number of pages in the pdf document is {pdf_reader.numPages}")
    for i in range(total_pages):
        # Fetch each page from the reader object, not from the file handle
        page = pdf_reader.getPage(i)
        textdata = page.extractText()
        print(textdata)
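
The code above only prints the text; the question also asks how to save it locally. Here is a minimal sketch of one way to do that, assuming plain UTF-8 text output is acceptable. The function names save_pages_text and extract_pdf_text, the page marker format, and the output file name are hypothetical and not part of PyPDF2. The reader calls are the same legacy ones used above (PdfFileReader, getNumPages, getPage, extractText); PyPDF2 3.x replaced them with PdfReader, .pages, and .extract_text().

```python
from pathlib import Path


def save_pages_text(page_texts, out_path):
    """Write each page's text to out_path, one marked section per page.

    page_texts is any iterable of strings (hypothetical helper, not PyPDF2).
    Returns the number of pages written.
    """
    out_path = Path(out_path)
    count = 0
    with out_path.open("w", encoding="utf-8") as out_file:
        for number, text in enumerate(page_texts, start=1):
            out_file.write(f"--- Page {number} ---\n")
            # Treat a missing value (None) as empty text so one page
            # without extractable text does not abort the write.
            out_file.write((text or "") + "\n")
            count += 1
    return count


def extract_pdf_text(pdf_path):
    """Yield the text of each page using the legacy PyPDF2 calls shown above."""
    # Imported here so save_pages_text can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PdfFileReader(pdf_file)
        for i in range(pdf_reader.getNumPages()):
            yield pdf_reader.getPage(i).extractText()
```

With these two helpers, save_pages_text(extract_pdf_text("sample.pdf"), "sample.txt") would write the text to sample.txt next to the script. Pages that consist only of images may come out empty, and table layout is generally not preserved in the extracted text.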