Question

I have a corpus of 70,429 files (296.5 MB). I am trying to find bigrams using the whole corpus. I wrote the following code:

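import codecs
import os

import nltk
from nltk.collocations import BigramCollocationFinder

allFiles = ""
for dirName in os.listdir(rootDirectory):
    dirPath = os.path.join(rootDirectory, dirName)
    for subDir in os.listdir(dirPath):
        subPath = os.path.join(dirPath, subDir)
        for fileN in os.listdir(subPath):
            # os.listdir returns bare names, so join them into a full path
            FText = codecs.open(os.path.join(subPath, fileN), encoding="iso8859-9")
            PText = FText.read()
            FText.close()
            allFiles += PText
tokens = allFiles.split()
finder = BigramCollocationFinder.from_words(tokens, window_size=3)
finder.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
for k, v in finder.ngram_fd.most_common(100):
    print(k, v)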

There is a root directory. It contains subdirectories, and each subdirectory contains many files. What I have done is:

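I read all of the files one by one and append their contents to a string named allFiles. Finally, I split the string into tokens and call the relevant bigram functions. The problem is: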

I ran the program for a day and could not get any results. Is there a more efficient way to find bigrams in a corpus that contains a lot of files?

Any advice and suggestions will be greatly appreciated. Thanks in advance.

Answer 1

By trying to read a huge corpus into memory all at once, you are exhausting your memory, forcing heavy swap use, and slowing everything down.

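NLTK provides various "corpus readers" that can return your words one by one, so that the entire corpus is never stored in memory at the same time. If I understand your corpus layout correctly, this might work: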

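from nltk.collocations import BigramCollocationFinder
from nltk.corpus.reader import PlaintextCorpusReader

# fileids is a regular expression, not a glob: match dir/subdir/file under the root
reader = PlaintextCorpusReader(rootDirectory, r".*/.*/.*", encoding="iso8859-9")
finder = BigramCollocationFinder.from_words(reader.words(), window_size=3)
finder.apply_freq_filter(2)  # Continue processing as before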
...
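
To make the "one word at a time" idea concrete, here is a minimal sketch of the same lazy pattern using only the standard library. The function name iter_tokens is hypothetical, and it is an illustration of streaming, not how NLTK's readers are implemented. It assumes whitespace tokenization, as in the question.

```python
import os


def iter_tokens(root_directory, encoding="iso8859-9"):
    """Yield whitespace-separated tokens, reading one file at a time."""
    for dir_path, _dir_names, file_names in os.walk(root_directory):
        for file_name in sorted(file_names):
            path = os.path.join(dir_path, file_name)
            with open(path, encoding=encoding) as handle:
                # Only the current line is held in memory, never the whole corpus.
                for line in handle:
                    yield from line.split()
```

Because it is a generator, nothing is read until a consumer asks for the next token, so memory use stays roughly constant no matter how many files sit under the root.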

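Addendum: Your approach has a bug. You are taking three-word windows that span from the end of one document to the beginning of the next, which is nonsense you want to get rid of. I recommend the following variant, which collects the windows from each document separately.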

document_streams = (reader.words(fname) for fname in reader.fileids())
BigramCollocationFinder.default_ws = 3
finder = BigramCollocationFinder.from_documents(document_streams)
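
To see why the document boundary matters, here is a small hand-rolled sketch. The helper window_pairs is hypothetical and only approximates windowed pair counting; it is not NLTK's code. It compares counting over one concatenated stream with counting each document on its own.

```python
from collections import Counter


def window_pairs(tokens, window_size=3):
    """Count pairs (w1, w2) where w2 follows w1 within `window_size` tokens."""
    counts = Counter()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + window_size]:
            counts[(first, second)] += 1
    return counts


doc_a = ["the", "end"]
doc_b = ["new", "start"]

# One concatenated stream: windows run across the document boundary.
joined = window_pairs(doc_a + doc_b)

# Counted per document and merged afterwards: no pair spans two documents.
separate = window_pairs(doc_a) + window_pairs(doc_b)

spurious = set(joined) - set(separate)
print(sorted(spurious))
# [('end', 'new'), ('end', 'start'), ('the', 'new')]
```

With two-word documents, three of the five pairs in the concatenated stream are boundary artifacts. In a corpus of tens of thousands of files, such pairs add noise at every file boundary.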

Answer 2

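Consider parallelizing your process with a worker pool from Python's "multiprocessing" module (https://docs.python.org/2/library/multiprocessing.html), emitting a {word: count} dictionary for each file in the corpus into some shared list. After the worker pool completes, merge the dictionaries, then filter by the number of word occurrences.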
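
A minimal sketch of that idea, assuming Python 3 and whitespace tokenization. The function names (count_words, merge_counts, count_corpus), the min_count threshold, and the chunksize value are all illustrative choices, not part of the original suggestion. Instead of a shared list, the per-file dictionaries are returned to the parent process and merged there.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(path, encoding="iso8859-9"):
    """Return a {word: count} Counter for a single file."""
    with open(path, encoding=encoding) as handle:
        return Counter(handle.read().split())


def merge_counts(partial_counts, min_count=2):
    """Sum per-file counters, then drop words seen fewer than `min_count` times."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return {word: n for word, n in total.items() if n >= min_count}


def count_corpus(paths, processes=None):
    """Count words across `paths` in a worker pool and merge the results."""
    with Pool(processes) as pool:
        # imap_unordered hands results back as workers finish, in any order.
        partial = pool.imap_unordered(count_words, paths, chunksize=64)
        return merge_counts(partial)
```

On platforms that start workers by spawning a fresh interpreter, call count_corpus from under an `if __name__ == "__main__":` guard. Note that this counts single words, as the answer describes; counting word pairs per file would follow the same map-then-merge shape.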
