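I wrote some code to parse a large number of emails (640,000 files). The output is a list of the filenames of emails sent on specific dates. The code is as follows: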
import glob

def createListOfFilesByDate():
    searchDates = ["12 Mar 2012", "13 Mar 2012"]
    outfile = "EmailList.txt"
    sent = "Sent:"
    # Build the full list of email files up front
    fileList = glob.glob("./Emails/*.txt")
    foundDate = False
    fout = open(outfile, 'w')
    for filename in fileList:
        foundDate = False
        # Read only the first 10 lines (the header) of each email
        with open(filename) as f:
            header = [next(f) for x in xrange(10)]
            f.close()  # redundant: the with block already closes f
        for line in header:
            if sent in line:
                for searchDate in searchDates:
                    if searchDate in line:
                        foundDate = True
                        break
                # Record the file once and stop scanning its header
                if foundDate == True:
                    fout.write(filename + '\n')
                    break
    fout.close()
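The problem is that the code processes the first 10,000 emails very quickly, but then slows down significantly and takes a very long time to get through the remaining emails. I have looked into many possible causes but have not found one. I am wondering whether I am doing something inefficiently.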
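To help pin down where the slowdown begins, here is a minimal sketch of the same scan with per-batch timing added. It is only an illustration under stated assumptions: the names `header_matches` and `list_files_by_date` and the `batch` parameter are hypothetical and not part of the code above. It also swaps in `glob.iglob` and `itertools.islice` so that paths are produced lazily and short files do not raise `StopIteration`, and it uses `io.open` with `errors="replace"` on the assumption that some emails may contain bytes that do not decode cleanly.

```python
import glob
import io
import itertools
import time

SEARCH_DATES = ("12 Mar 2012", "13 Mar 2012")


def header_matches(path, search_dates=SEARCH_DATES, max_lines=10):
    """Return True if a 'Sent:' line among the first max_lines lines
    contains one of search_dates (hypothetical helper)."""
    with io.open(path, errors="replace") as f:
        # islice stops early on short files instead of raising StopIteration
        for line in itertools.islice(f, max_lines):
            if "Sent:" in line and any(d in line for d in search_dates):
                return True
    return False


def list_files_by_date(pattern, outfile, batch=10000):
    """Write matching filenames to outfile and return a list with the
    elapsed seconds for each completed batch of `batch` files."""
    timings = []
    start = time.time()
    with open(outfile, "w") as fout:
        # iglob yields paths one at a time rather than building a full list
        for count, path in enumerate(glob.iglob(pattern), 1):
            if header_matches(path):
                fout.write(path + "\n")
            if count % batch == 0:
                now = time.time()
                timings.append(now - start)
                start = now
    return timings


# Example call, assuming the same layout as in the question:
# timings = list_files_by_date("./Emails/*.txt", "EmailList.txt")
```

If the per-batch times climb even though the matching logic is identical for every file, that would suggest the cost lies in opening and reading the files rather than in the string comparisons, though this sketch alone cannot confirm the cause.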