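I'm currently converting the PDFs in a giant folder to text and then writing certain keywords out to an Excel file. Everything works, except that even though the folder contains multiple PDFs, they all overwrite each other starting at cell A1.
How can I iterate so that each subsequent dictionary goes to the following rows?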

custData = {}

def data_grabbing(pdf):
    row = 0
    col = 0
    string = convert_pdf_to_txt(pdf)
    lines = list(filter(bool, string.split('\n')))
    for i in range(len(lines)):
        if 'Lead:' in lines[i]:
            custData['Name'] = lines[i+2]
        elif 'Date:Date:Date:Date:' in lines[i]:
            custData['Fund Manager'] = lines[i+2]
        elif 'Priority:' in lines[i]:
            custData['Industry'] = lines[i+2]
            custData['Date'] = lines[i+1]
            custData['Deal Size'] = lines[i+3]
        elif 'DEAL QUALIFYING MEMORANDUM' in lines[i]:
            custData['Owner'] = lines[i+2]
        elif 'Fund Manager' in lines[i]:
            custData['Investment Type'] = lines[i+2]
    print custData
    for item, descrip in custData.iteritems():
        worksheet.write(row, col, item)
        worksheet.write(row+1, col, descrip)
        col += 1
    row += 2

for myFile in os.listdir(directory):
    if myFile.endswith(".pdf"):
        data_grabbing(os.path.join(directory, myFile))
workbook.close()

Answer 0 (score: 1)
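The problem is that row is reset to 0 on every call to data_grabbing, so each PDF is written to the same two rows. Some of your options are:

1. Make row a global and instantiate it outside the function (@StevenRumbalski's suggestion).
2. Make data_grabbing a method of a class, and make row an instance variable.
3. Pass row into data_grabbing as an argument and keep track of the current row in the calling loop.

I'll show option #3 (but would probably prefer #2):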

custData = {}

def data_grabbing(pdf, row):
    # row is supplied by the caller, so it is no longer reset to 0 on each call
    col = 0
    string = convert_pdf_to_txt(pdf)
    lines = list(filter(bool, string.split('\n')))
    for i in range(len(lines)):
        if 'Lead:' in lines[i]:
            custData['Name'] = lines[i+2]
        elif 'Date:Date:Date:Date:' in lines[i]:
            custData['Fund Manager'] = lines[i+2]
        elif 'Priority:' in lines[i]:
            custData['Industry'] = lines[i+2]
            custData['Date'] = lines[i+1]
            custData['Deal Size'] = lines[i+3]
        elif 'DEAL QUALIFYING MEMORANDUM' in lines[i]:
            custData['Owner'] = lines[i+2]
        elif 'Fund Manager' in lines[i]:
            custData['Investment Type'] = lines[i+2]
    print custData
    # keys go on the given row, values on the row directly below
    for item, descrip in custData.iteritems():
        worksheet.write(row, col, item)
        worksheet.write(row+1, col, descrip)
        col += 1

cur_row = 0
for myFile in os.listdir(directory):
    if myFile.endswith(".pdf"):
        data_grabbing(os.path.join(directory, myFile), cur_row)
        # each PDF occupies two rows (keys and values), so move down by two
        cur_row += 2
workbook.close()
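
For comparison, option #2 might look roughly like the sketch below. It is not code from the question: DataGrabber, write_record and RecordingWorksheet are hypothetical names, and RecordingWorksheet is only an in-memory stand-in that exposes the same write(row, col, value) call used above, so that the sketch runs without an Excel library. The sample dictionaries are placeholder data. The PDF parsing is left out; the point is only that the row counter lives on the instance and survives between calls.

```python
class RecordingWorksheet(object):
    """Hypothetical in-memory stand-in for a worksheet; records each write."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        self.cells[(row, col)] = value


class DataGrabber(object):
    """Writes one dictionary per call and remembers the next free row."""

    def __init__(self, worksheet):
        self.worksheet = worksheet
        self.row = 0  # instance variable: keeps its value between calls

    def write_record(self, cust_data):
        # Keys go on self.row, values on the row directly below.
        for col, (item, descrip) in enumerate(cust_data.items()):
            self.worksheet.write(self.row, col, item)
            self.worksheet.write(self.row + 1, col, descrip)
        self.row += 2  # advance so the next record does not overwrite this one


# Placeholder records standing in for the dictionaries parsed from two PDFs.
sheet = RecordingWorksheet()
grabber = DataGrabber(sheet)
grabber.write_record({'Name': 'Alice'})
grabber.write_record({'Name': 'Bob'})
```

With a real worksheet object, you would pass it to DataGrabber in place of the stand-in and call write_record once per PDF with a freshly built dictionary. Building a new dictionary per file also avoids carrying stale values over from the previous PDF, which the shared module-level custData can do.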