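I have a list of files that I want to read and print information from. It keeps giving me the error list index out of range. I'm not sure what the problem is. On line 2, if I add matches[:10], it works for the first 10 files, but I need it to work for all of the files. I've looked through some older posts but still can't get my code to work.
re.findall was working before I wrote this code, and I'm not sure why it no longer works. Thanks.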
import re, os
topdir = r'E:\Grad\LIS\LIS590 Text mining\Part1\Part1' # topdir is a raw string (r'...'), so the backslashes in the Windows path are not treated as escape sequences.
matches = []
for root, dirnames, filenames in os.walk(topdir):
    for filename in filenames:
        if filename.endswith(('.txt','.pdf')):
            matches.append(os.path.join(root, filename))
capturedorgs = []
capturedfiles = []
capturedabstracts = []
orgAwards={}
for filepath in matches:
    with open (filepath,'rt') as mytext:
        mytext=mytext.read()
        matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]
        capturedorgs.append(matchOrg)
        # code to capture files
        matchFile=re.findall(r'File\s+\:\s+(\w\d{7})',mytext)[0]
        capturedfiles.append(matchFile)
        # code to capture abstracts
        matchAbs=re.findall(r'Abstract\s+\:\s+(\w.+)',mytext)[0]
        capturedabstracts.append(matchAbs)
        # total awarded money
        matchAmt=re.findall(r'Total\s+Amt\.\s+\:\s+\$(\d+)',mytext)[0]
        if matchOrg not in orgAwards:
            orgAwards[matchOrg]=[]
        orgAwards[matchOrg].append(int(matchAmt))
for each in capturedorgs:
    print(each,"\n")
for each in capturedfiles:
    print(each,"\n")
for each in capturedabstracts:
    print (each,"\n")
# add code to print what is in your other two lists
from collections import Counter
countOrg=Counter(capturedorgs)
print (countOrg)
for each in orgAwards:
    print(each,sum(orgAwards[each]))
Error message:
Traceback (most recent call last):
  File "C:\Python32\Assignment1.py", line 17, in <module>
    matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]
IndexError: list index out of range
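Answer 0 (score: 2)
If findall finds no match, it returns an empty list []. The error occurs when you try to take the first item from that empty list, which raises an exception: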
>>> import re
>>> i = 'hello'
>>> re.findall('abc', i)
[]
>>> re.findall('abc', i)[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
To make sure your code does not stop when no match is found, you need to catch the exception that is raised:
try:
    matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]
    capturedorgs.append(matchOrg)
except IndexError:
    print('No organization match for {}'.format(filepath))
You will have to do this for each re.findall statement.
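Since the same guard is needed for every lookup, one option is to wrap it in a small helper so the try/except is not repeated four times. This is only a sketch: first_match is a hypothetical name (it is not part of the re module), and the sample text is made up to resemble the fields the question's patterns look for.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first captured group for pattern in text, or default if nothing matches."""
    found = re.findall(pattern, text)
    return found[0] if found else default

# Made-up sample text (assumption): it has an org and an amount but no File line.
sample = "NSF Org     : DMS\nTotal Amt.  : $12345\n"

org = first_match(r'NSF\s+Org\s+\:\s+(\w+)', sample)
fileid = first_match(r'File\s+\:\s+(\w\d{7})', sample)
amount = first_match(r'Total\s+Amt\.\s+\:\s+\$(\d+)', sample)

if fileid is None:
    print('No file match in sample')
print(org, amount)
```

In the original loop you would then check each result for None before appending it, which also tells you which file is missing which field.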
Answer 1 (score: 0)
The problem is here:
matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]
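Apparently, you have a file that does not contain this at all. So when you reference item [0], it does not exist.
You need to handle that case.
One way is simply not to include it if it isn't found: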
for filepath in matches:
    with open (filepath,'rt') as mytext:
        mytext=mytext.read()
        matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)
        if len(matchOrg) > 0:
            capturedorgs.append(matchOrg[0])
Also, if a file contains more than one match and you want to capture all of them, you may want to use extend(matchOrg) instead.
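As a rough illustration of the extend variant: extending with an empty list adds nothing, so no length check is needed. The three sample strings below are invented stand-ins for file contents, not data from the question.

```python
import re

# Invented stand-ins for the contents of three files (assumption):
# one with two org lines, one with none, one with a single org line.
texts = [
    "NSF Org : DMS\nNSF Org : CHE\n",
    "no org line here\n",
    "NSF Org : DMS\n",
]

capturedorgs = []
for text in texts:
    # findall returns [] when nothing matches; extend([]) leaves the list unchanged.
    capturedorgs.extend(re.findall(r'NSF\s+Org\s+\:\s+(\w+)', text))

print(capturedorgs)
```

Note that with extend the list no longer lines up one-to-one with the files, so it suits counting orgs better than pairing each org with its file.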