text = sio.getvalue(); returns "text" that is not recognized in a "for" loop

Time: 2019-05-21 18:48:49

Tags: python pdf pdfminer

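I am trying to read specific words from a PDF file. I use PDFMiner to convert the PDF to text (line by line) and a "for" loop to find the words.

When I print "text", I get the PDF content I need, line by line.

But when I feed the same "text" into a "for" loop to find the actual words, the loop does not recognize it.

Besides the code below, I also tried writing the text to a placeholder file ('PDF_PlaceHolder.txt') and reading it back, and the "for" loop does not recognize that either: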

file = open("PDF_PlaceHolder.txt", "w")
file.write(PDFtext)
fname ='PDF_PlaceHolder.txt'
fh = open(fname)
for line in fh:
    line = line.rstrip()
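
A possible reason the file-based attempt above reads nothing: the file opened for writing is never closed or flushed before it is reopened, so the text may still be sitting in the write buffer. Below is a minimal sketch of the same round trip using `with` blocks, assuming `PDFtext` is a plain `str`. The sample value and the temporary directory are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample standing in for the text returned by pdf_to_text().
PDFtext = "Tax Company Name: XY\nTax Authority: XZ\n"

# Write into a temporary directory so the sketch does not touch real files.
fname = os.path.join(tempfile.mkdtemp(), "PDF_PlaceHolder.txt")

# Leaving the with-block closes the file, which flushes the buffered text.
with open(fname, "w", encoding="utf-8") as out:
    out.write(PDFtext)

# Iterating over a file object yields one line at a time.
with open(fname, encoding="utf-8") as fh:
    lines = [line.rstrip() for line in fh]

print(lines)
```

If the placeholder file is only a workaround, it can probably be dropped altogether, since the text is already in memory.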

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter#process_pdf
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

# from cStringIO import StringIO
# from io import BytesIO
from io import StringIO

Tax_Co = list()
Tax_Au = list()
line = list()

def pdf_to_text(pdfname):
    # PDFMiner boilerplate
    rsrcmgr = PDFResourceManager()
    sio = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Extract text page by page
    fp = open(pdfname, 'rb')
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()

    # Get text from StringIO
    text = sio.getvalue()

    # Cleanup
    device.close()
    sio.close()
    return text

# Local PDF file to read
pdfname = 'Summary 5.pdf'
# Pass it to the function
PDFtext = pdf_to_text(pdfname)
fh = PDFtext

# Read line by line to find the label and the word that follows it
for line in fh:
    line = line.rstrip()
    # if line.startswith('From'):
    if line.startswith('Tax Company Name:'):
        Tax_Company = line.split()
        print("Tax Company Name:", Tax_Company[3])
    if line.startswith('Tax Authority:'):
        Tax_Aut = line.split()
        print("Tax Authority:", Tax_Aut[2])

Expected result:

Tax Company : XY
Tax Authority : XZ

Right now nothing is printed (blank output).
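
A likely cause of the blank output: `pdf_to_text` returns a single `str`, and a `for` loop over a `str` yields one character at a time rather than one line, so `line.startswith('Tax Company Name:')` can never match. Below is a minimal sketch of iterating by line with `str.splitlines()` instead. The sample text and the `find_fields` helper are hypothetical, and the text PDFMiner extracts from a real PDF may place labels and values differently (for example on separate lines).

```python
def find_fields(text, labels):
    """Return {label: value} for lines shaped like 'Label: value'."""
    found = {}
    # splitlines() breaks the string on line boundaries.
    for line in text.splitlines():
        line = line.strip()
        for label in labels:
            prefix = label + ":"
            if line.startswith(prefix):
                # Keep everything after the label, not just one word.
                found[label] = line[len(prefix):].strip()
    return found


# Hypothetical sample standing in for the text returned by pdf_to_text().
sample = "Summary\nTax Company Name: XY\n\nTax Authority: XZ\n"

# Iterating over the string directly yields characters, not lines.
first_items = [ch for ch in sample][:3]

fields = find_fields(sample, ["Tax Company Name", "Tax Authority"])
for label, value in fields.items():
    print(label + ":", value)
```

Slicing off the label instead of indexing into `line.split()` also keeps multi-word values intact, which may matter if a company name contains spaces.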

0 answers:

No answers