Question

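I want to run text analysis on a large number of text files (> 50,000 files), some of which are HTML scripts. My program (below) iterates over the files, opens each one in turn, analyzes its content with the NLTK module, writes the output to a CSV file, and then moves on to the next file.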

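The program runs fine for individual files, but the loop almost grinds to a halt after the 8th run, even though the 9th file to analyze is no larger than the 8th. For example, the first 8 iterations took 10 minutes in total, whereas the 9th took 45 minutes. The 10th took even longer than 45 minutes (and that file is much smaller than the first).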

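I am sure the program can be optimized further, as I am still relatively new to Python, but I don't understand why it becomes so slow after the 8th run. Any help would be greatly appreciated. Thanks!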

#import necessary modules
import urllib, csv, re, nltk
from string import punctuation
from bs4 import BeautifulSoup
import glob

#Define bags of words (There are more variable, ie word counts, that are calculated)
adaptability=['adaptability', 'flexibility']

csvfile=open("test.csv", "w", newline='', encoding='cp850', errors='replace')
writer=csv.writer(csvfile)
for filename in glob.glob('*.txt'):

    ###Open files and arrange them so that they are ready for pre-processing
    review=open(filename, encoding='utf-8', errors='ignore').read()
    soup=BeautifulSoup(review)
    text=soup.get_text()

    from nltk.stem import WordNetLemmatizer
    wnl=WordNetLemmatizer()

    adaptability_counts=[]
    adaptability_counter=0
    review_processed=text.lower().replace('\r',' ').replace('\t',' ').replace('\n',' ').replace('. ', ' ').replace(';',' ').replace(', ',' ')
    words=review_processed.split(' ')
    word_l1=[word for word in words if word not in stopset]
    word_l=[x for x in word_l1 if x != ""]
    word_count=len(word_l)
    for word in words:
        wnl.lemmatize(word)
        if word in adaptability:
            adaptability_counter=adaptability_counter+1
    adaptability_counts.append(adaptability_counter)

    #I then repeat the analysis with 2 subsections of the text files
    #(eg. calculate adaptability_counts for Part I only)

    output=zip(adaptability_counts)
    writer=csv.writer(open('test_10.csv','a',newline='', encoding='cp850', errors='replace'))
    writer.writerows(output)
    csvfile.flush()

Answer 1

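You never close your files after opening them. My guess is that you are running out of memory, and that it takes so long because your machine has to swap data to and from the page file (on disk). Instead of just calling open(), either call close() on the file when you are done with it, or use the with open construct, which closes the file automatically when you are done. See this page for details: http://effbot.org/zone/python-with-statement.htm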

If it were me, I would change this line:

review=open(filename, encoding='utf-8', errors='ignore').read()

to this:

with open(filename, encoding='utf-8', errors='ignore') as f:
    review = f.read()
    ...

and make sure the indentation is right. The code that runs while the file is open needs to be indented inside the with block.
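The same idea applies to the CSV output, which the question's loop reopens on every iteration without closing. Here is a minimal sketch, assuming a simple whitespace tokenizer in place of the NLTK pipeline; the function name count_terms_per_file and its parameters are made up for illustration and are not part of the original program:

```python
import csv
import glob


def count_terms_per_file(pattern, out_path, terms):
    """Write one CSV row (filename, hits) per input file matching pattern."""
    # The output file is opened once; the with block closes it on exit,
    # even if an exception is raised part-way through the loop.
    with open(out_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        for filename in sorted(glob.glob(pattern)):
            # Each input file is closed as soon as its with block ends,
            # so at most one input handle is open at any time.
            with open(filename, encoding="utf-8", errors="ignore") as f:
                review = f.read()
            # Stand-in for the real analysis: count tokens found in terms.
            hits = sum(1 for word in review.lower().split() if word in terms)
            writer.writerow([filename, hits])
```

Calling count_terms_per_file('*.txt', 'test.csv', ['adaptability', 'flexibility']) would then produce one row per file while keeping only two file handles open at a time.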

Answer 2

Since the accepted answer did not fully solve your problem, here is a follow-up:

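You have a list, adaptability, in which you look up every word of your input. Never look up words in a list! Replace the list with a set and you should see a huge improvement. (If you are using the list to count individual words, replace it with collections.Counter or nltk's FreqDist.) If your adaptability list grows with every file you read (does it? should it?), that is definitely enough to cause your problem.
But there may be more than one culprit. You left out a lot of your code, so there is no telling which other data structures grow with every file you see. It is clear that your code is "quadratic" and slows down as the data gets bigger, not because of memory size but because you need more steps.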
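To make the list-versus-set point concrete: a membership test on a list scans its elements one by one, while a set uses a hash lookup. The sketch below uses a made-up sample sentence, and the helper names count_bag_hits and word_frequencies are illustrative only:

```python
from collections import Counter

# Bag of words stored as a set, so each "word in bag" test is a hash
# lookup instead of a scan over every element of a list.
ADAPTABILITY = {"adaptability", "flexibility"}


def count_bag_hits(words, bag=ADAPTABILITY):
    """Return how many tokens in words belong to the bag."""
    return sum(1 for word in words if word in bag)


def word_frequencies(words):
    """Count every distinct token in a single pass with collections.Counter."""
    return Counter(words)


words = "flexibility beats rigidity and adaptability beats both".split()
print(count_bag_hits(words))             # prints 2
print(word_frequencies(words)["beats"])  # prints 2
```

With a two-word bag the difference is negligible, but it matters as soon as the collection being searched grows with the input.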

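Don't bother switching to arrays and CountVectorizer; you would only postpone the problem a little. Figure out how to process each file in constant time. If your algorithm does not need to collect words from multiple files, the quickest solution is to run it on each file separately (that is not hard to automate).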
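One way to check whether the per-file cost really stays constant is to keep all state local to a single-document function and time each call. This is only a sketch under assumptions: analyze_text and analyze_all are hypothetical names, and plain whitespace splitting stands in for the real preprocessing:

```python
import time


def analyze_text(text, bag):
    """Return (word_count, bag_hits) for one document.

    All state is local, so nothing carries over to the next file.
    """
    words = text.lower().split()
    hits = sum(1 for w in words if w in bag)
    return len(words), hits


def analyze_all(texts, bag):
    """Yield (word_count, hits, seconds) for each document in turn."""
    bag = frozenset(bag)  # built once, never grows
    for text in texts:
        start = time.perf_counter()
        word_count, hits = analyze_text(text, bag)
        elapsed = time.perf_counter() - start
        # Yielding instead of collecting keeps memory use flat.
        yield word_count, hits, elapsed
```

If the elapsed time climbs from file to file even though the files are of similar size, some structure outside analyze_text is probably still growing.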

Text analysis with Python - program stalls after 8 runs
