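I am trying to read specific words from a PDF file. I use PDFMiner to convert the PDF to text (line by line) and a "for" loop to find the words.
When I print "text", I get the PDF content I want, line by line.
But when I feed that same "text" into the "for" loop to find the actual words, the loop does not recognize it.
In addition to the code below, I also tried writing the text to a test file ('PDF_PlaceHolder.txt') and reading it back, but the "for" loop does not recognize that either: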
file = open("PDF_PlaceHolder.txt", "w")
file.write(PDFtext)
fname = 'PDF_PlaceHolder.txt'
fh = open(fname)
for line in fh:
    line = line.rstrip()
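A possible reason the file-based attempt also finds nothing: the file opened for writing is never closed, so the text may still sit in the write buffer when the file is reopened for reading. The sketch below is only an illustration under assumptions: `pdf_text` is a made-up stand-in for the output of pdf_to_text(), `find_prefixed_lines` is a hypothetical helper, and a temporary directory is used instead of the working directory.

```python
import os
import tempfile

# Hypothetical stand-in for the text returned by pdf_to_text();
# the real PDF content is not known here.
pdf_text = "Header line\nTax Company Name: XY\nTax Authority: XZ\n"


def find_prefixed_lines(path, prefix):
    """Return the lines of the file at `path` that start with `prefix`."""
    matches = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(prefix):
                matches.append(line)
    return matches


tmp_dir = tempfile.mkdtemp()
placeholder = os.path.join(tmp_dir, "PDF_PlaceHolder.txt")

# The with-block closes (and therefore flushes) the file
# before it is opened again for reading.
with open(placeholder, "w", encoding="utf-8") as out:
    out.write(pdf_text)

company_lines = find_prefixed_lines(placeholder, "Tax Company Name:")
print(company_lines)
```

Whether this explains the blank output depends on how much text is written; it is a guess, not a confirmed diagnosis.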
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter  # process_pdf
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
# from cStringIO import StringIO
# from io import BytesIO
from io import StringIO
Tax_Co = list()
Tax_Au = list()
line = list()
def pdf_to_text(pdfname):
    # PDFMiner boilerplate
    rsrcmgr = PDFResourceManager()
    sio = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Extract text
    fp = open(pdfname, 'rb')
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()
    # Get text from StringIO
    text = sio.getvalue()
    # Cleanup
    device.close()
    sio.close()
    return text
# Local PDF File reading
pdfname ='Summary 5.pdf'
# Passing to function
PDFtext = pdf_to_text(pdfname)
fh = PDFtext
# Read line by line to find the content and the next index word
for line in fh:
    line = line.rstrip()
    # if line.startswith('From' ):
    if line.startswith('Tax Company Name:'):
        Tax_Company = line.split()
        print("Tax Company Name:", Tax_Company[3])
    if line.startswith('Tax Authority:'):
        Tax_Aut = line.split()
        print("Tax Authority:", Tax_Aut[2])
Expected result:
Tax Company : XY
Tax Authority : XZ
Currently nothing is printed (the output is blank).
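One likely explanation for the blank output: pdf_to_text() returns a single str, and a "for" loop over a str yields one character at a time rather than one line, so startswith() never sees a whole line. The following is a minimal sketch of splitting the text into lines first with str.splitlines(). The names `find_field` and `sample_text` are hypothetical, and the sample assumes each label and its value sit on the same line, which may not hold for every PDF layout.

```python
def find_field(text, label):
    """Return the text after `label` on the first line starting with it, or None."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(label):
            return line[len(label):].strip()
    return None


# Hypothetical sample; the real output of pdf_to_text() may be laid out differently.
sample_text = "Summary\nTax Company Name: XY\nTax Authority: XZ\n"

# Iterating over the str directly yields single characters,
# so no element can start with the full label.
char_matches = [c for c in sample_text if c.startswith("Tax Company Name:")]

print("Tax Company Name:", find_field(sample_text, "Tax Company Name:"))
print("Tax Authority:", find_field(sample_text, "Tax Authority:"))
```

Taking everything after the label, instead of a fixed index into split(), also keeps values that contain spaces intact.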