I have a large .txt file with more than 24,000,000 lines. Now I'd like to do a word count, that is, count each word and its number of occurrences and record them in a new file. Here is the code I tried to run:
import gensim

class Corpus(gensim.corpora.TextCorpus):
    def count_tokens(self):
        word_count = 0
        for text in self.get_texts():
            word_count += len(text)
        return word_count

    def get_texts(self):
        for filename in self.input:
            yield open(filename).read().split()

def main():
    corpus = Corpus(['somefile.txt'])
    word_count = corpus.count_tokens()
    text = open('somefile.txt').read().split()
    with open('path_to_output', 'w') as f:
        for word, _ in corpus.dictionary.token2id.items():
            num_occur = text.count(word)
            f.write('%s %d\n' % (word, num_occur))

if __name__ == '__main__':
    main()
And the server hangs... I wonder if there is a more efficient way to do this, or any improvement I can make. How do you read and write really large files with Python?
Answer 0 (score: 2)
Your get_texts() method reads an entire file into memory at a time. That's fine for corpora with lots of small files, but if you have one enormous file, you need to read it line by line:
from collections import Counter

wordcounts = Counter()
with open("file.txt") as fp:
    for line in fp:
        wordcounts.update(line.split())
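If you'd rather keep your gensim Corpus subclass, the same fix applies inside get_texts(): yield tokens line by line instead of reading each file whole. A minimal sketch (note the assumption that each line can be treated as a separate document, which is fine for plain word counting but changes what gensim considers a "document"):

import gensim

class Corpus(gensim.corpora.TextCorpus):
    def get_texts(self):
        # Stream tokens: yield one token list per line, so no file
        # is ever loaded into memory in its entirety.
        for filename in self.input:
            with open(filename) as f:
                for line in f:
                    yield line.split()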
Answer 1 (score: 1)
Your code has a number of problems. I put together a simple example using collections.Counter over several files, without your custom class and all of that; text_file_list holds the list of file paths.
import collections

c = collections.Counter()
for text_file in text_file_list:
    with open(text_file) as f:
        c.update(word for line in f for word in line.split())
This loops over the files and updates a single dedicated Counter dictionary with each one. The files are read line by line and never loaded in full, so it takes some time but not much memory.
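To record the counts in a new file, as the question asks, you can then dump the Counter once at the end; a small sketch, reusing the question's 'path_to_output' placeholder:

with open('path_to_output', 'w') as f:
    for word, count in c.most_common():  # all words, most frequent first
        f.write('%s %d\n' % (word, count))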
Answer 2 (score: 0)
I'd do something like this:
words = {}
with open('somefile.txt', 'r') as textf:
    for line in textf:  # iterate lazily; readlines() would load the whole file
        for word in line.split():
            words[word] = words.get(word, 0) + 1  # dict has get(), not getdefault()
Not very Pythonic, but that's the idea.
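For a slightly more Pythonic sketch of the same idea, collections.defaultdict(int) makes the missing-key case disappear (again assuming the question's somefile.txt):

from collections import defaultdict

words = defaultdict(int)
with open('somefile.txt') as textf:
    for line in textf:  # lazy line-by-line iteration
        for word in line.split():
            words[word] += 1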