Question

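I have a list of files that I want to read and print information from. It keeps giving me the error "list index out of range", and I'm not sure what is wrong. For line 2, if I add matches[:10], it works for the first 10 files, but I need it to process all of them. I've looked through some old posts but still can't get my code to work.

re.findall worked earlier, while I was writing this code. I'm not sure why it no longer works. Thanks.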

import re, os
topdir = r'E:\Grad\LIS\LIS590 Text mining\Part1\Part1' # Topdir has to be an object rather than a string, which means that there is no parenthesis.
matches = []
for root, dirnames, filenames in os.walk(topdir):
    for filename in filenames:
        if filename.endswith(('.txt','.pdf')):
            matches.append(os.path.join(root, filename))

capturedorgs = []
capturedfiles = []
capturedabstracts = []
orgAwards={}
for filepath in matches:
    with open(filepath,'rt') as mytext:
        mytext=mytext.read()

    matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]
    capturedorgs.append(matchOrg)

    # code to capture files
    matchFile=re.findall(r'File\s+\:\s+(\w\d{7})',mytext)[0]
    capturedfiles.append(matchFile)

    # code to capture abstracts
    matchAbs=re.findall(r'Abstract\s+\:\s+(\w.+)',mytext)[0]
    capturedabstracts.append(matchAbs)

    # total awarded money
    matchAmt=re.findall(r'Total\s+Amt\.\s+\:\s+\$(\d+)',mytext)[0]

    if matchOrg not in orgAwards:
        orgAwards[matchOrg]=[]
    orgAwards[matchOrg].append(int(matchAmt))

for each in capturedorgs:
    print(each,"\n")
for each in capturedfiles:
    print(each,"\n")
for each in capturedabstracts:
    print (each,"\n")

# add code to print what is in your other two lists
from collections import Counter
countOrg=Counter(capturedorgs)
print (countOrg)

for each in orgAwards:
    print(each,sum(orgAwards[each]))

Error message:

Traceback (most recent call last):
  File "C:\Python32\Assignment1.py", line 17, in <module>
    matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]
IndexError: list index out of range

Answer 1

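If findall finds no match, it returns an empty list ([]). The error occurs when you try to take the first item of that empty list, which raises an exception: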

>>> import re
>>> i = 'hello'
>>> re.findall('abc', i)
[]
>>> re.findall('abc', i)[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

To make sure your code doesn't stop when no match is found, you need to catch the exception that is raised:

try:
    matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]
    capturedorgs.append(matchOrg)
except IndexError:
    print('No organization match for {}'.format(filepath))

You will have to do this for each of your re.findall statements.
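
Since the same guard is needed around every re.findall call, one option is to wrap it in a small helper. This is only a sketch: first_match and its default parameter are hypothetical names, not part of the re module, and the sample text is made up to resemble the asker's files.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first re.findall result, or `default` when nothing matches."""
    found = re.findall(pattern, text)
    return found[0] if found else default

# Made-up sample text: it has an "NSF Org" line but no "File" line.
sample = "NSF Org     : DMS\nTotal Amt.  : $12000\n"

org = first_match(r'NSF\s+Org\s+\:\s+(\w+)', sample)
missing = first_match(r'File\s+\:\s+(\w\d{7})', sample)

print(org)
print(missing)
```

In the loop you could then test each result against None before appending it, instead of repeating a try/except block for every field.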

Answer 2

The problem is here:

matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)[0]

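Evidently, you have a file that doesn't contain this text at all. So when you reference item [0], it doesn't exist.

You need to handle that case.

One way is simply not to include the result when nothing is found: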

for filepath in matches:
    with open (filepath,'rt') as mytext:
        mytext=mytext.read()

        matchOrg=re.findall(r'NSF\s+Org\s+\:\s+(\w+)',mytext)
        if len(matchOrg) > 0:
            capturedorgs.append(matchOrg[0])

Also, if a file can contain more than one match and you want to capture all of them, you may want to use extend(matchOrg) instead.
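
To illustrate the difference between the two approaches, here is a minimal sketch. The sample text and the names first_only and all_orgs are made up for the example.

```python
import re

# Made-up text containing two "NSF Org" entries.
text = "NSF Org : DMS\nNSF Org : CCR\n"
found = re.findall(r'NSF\s+Org\s+\:\s+(\w+)', text)

# Keep only the first match, guarding against an empty result.
first_only = []
if found:
    first_only.append(found[0])

# Keep every match; extending with an empty list is a no-op,
# so no length check is needed here.
all_orgs = []
all_orgs.extend(found)

print(first_only)
print(all_orgs)
```

Note that with extend the list of organizations may end up longer than the number of files, so it no longer lines up one-to-one with the other captured lists.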
