Read and write large files with Python

Time: 2017-06-15 09:27:01

Tags: python file text io python-3.5

I have a large .txt file with more than 24,000,000 lines. I'd like to do a word count, that is, count each word's occurrences and record them in a new file. Here is the code I tried to run:

import gensim
class Corpus(gensim.corpora.TextCorpus): 
    def count_tokens(self):
        word_count = 0
        for text in self.get_texts():
            word_count += len(text)
        return word_count
    def get_texts(self): 
        for filename in self.input: 
            yield open(filename).read().split()

def main():
    corpus = Corpus(['somefile.txt'])
    word_count = corpus.count_tokens()
    text = open('somefile.txt').read().split()
    with open('path_to_output', 'w') as f:
        for word, _ in corpus.dictionary.token2id.items():
            num_occur = text.count(word)
            f.write('%s %d\n' % (word, num_occur))

if __name__  == '__main__':
    main()

And the server hangs... Is there a more efficient way to do this, or any improvement I can make? How do you read and write really large files with Python?

3 Answers:

Answer 0: (score: 2)

Your get_texts() method reads an entire file into memory at a time. That's fine for corpora with many small files, but if you have one enormous file, you need to read it line by line.

from collections import Counter
wordcounts = Counter()

with open("file.txt") as fp:
    for line in fp:
        wordcounts.update(line.split())
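To record the counts to a new file, as the question asks, the finished Counter can simply be written out afterwards. A minimal end-to-end sketch (the file names here are placeholders, and a tiny sample file stands in for the real 24M-line input):

```python
from collections import Counter

# Create a small sample file as a stand-in for the real large file
with open("file.txt", "w") as fp:
    fp.write("the quick brown fox\nthe lazy dog\n")

# Count words line by line: the file is never loaded whole into memory
wordcounts = Counter()
with open("file.txt") as fp:
    for line in fp:
        wordcounts.update(line.split())

# Write one "word count" pair per line to the output file
with open("wordcounts.txt", "w") as out:
    for word, count in wordcounts.items():
        out.write("%s %d\n" % (word, count))
```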

Answer 1: (score: 1)

Your code has several problems:

  • It reads the whole file into memory and then splits it into words, doubling (or tripling) the memory footprint
  • It makes two passes: first to count the words, then to count each word's occurrences

I put together a simple example using collections.Counter on several files, without your class and all the rest. text_file_list holds the list of file paths.

import collections

c = collections.Counter()
for text_file in text_file_list:
    with open(text_file) as f:
        c.update(word for line in f for word in line.split())

It loops over the files, updating the single dedicated Counter dictionary for each one. The files are read line by line and never loaded entirely into memory, so it takes some time but not much memory.
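If the output should be sorted by frequency, Counter.most_common() returns (word, count) pairs in descending order of count, which makes writing a ranked word list trivial. A small sketch:

```python
import collections

c = collections.Counter()
c.update("a b b c c c".split())

# most_common() yields (word, count) pairs, highest count first
for word, count in c.most_common():
    print(word, count)
```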

Answer 2: (score: 0)

I'd do something like this:

words = {}
with open('somefile.txt', 'r') as textf:
    for line in textf:  # iterate lazily; readlines() would load the whole file
        for word in line.split():
            words[word] = words.get(word, 0) + 1

Not very pythonic but it's the idea
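For what it's worth, the same manual tally reads a little more naturally with collections.defaultdict, which supplies the zero for missing keys automatically (the sample input here is made up):

```python
from collections import defaultdict

words = defaultdict(int)  # missing keys default to 0
for line in ["the quick fox", "the dog"]:
    for word in line.split():
        words[word] += 1
```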