Python Counter from a txt file

Time: 2015-05-28 02:53:31

Tags: python text-files counter

I would like to initialize a collections.Counter object from a text file of word frequency counts. That is, I have a file "counts.txt":

rank  wordform         abs     r        mod
   1  the           225300    29   223066.9
   2  and           157486    29   156214.4
   3  to            134478    29   134044.8
...
 999  fallen           345    29      326.6
1000  supper           368    27      325.8

I would like a Counter object wordCounts such that I can call

>>> print(wordCounts.most_common(3))
[('the', 225300), ('and', 157486), ('to', 134478)]

What is the most efficient, Pythonic way to do this?

2 answers:

Answer 0: (score: 2)

import collections

words = dict()

with open('counts.txt') as fp:
    next(fp)  # skip the header row ("rank  wordform  abs ...")
    for line in fp:
        items = line.split()
        # items[1] is the word, items[2] its absolute count
        words[items[1]] = int(items[2])

wordCounts = collections.Counter(words)
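
One detail about the last line may be worth spelling out: Counter takes a mapping as ready-made counts, but tallies the items of any other iterable. A minimal sketch, using a made-up `pairs` list that stands in for the parsed rows (the name is only for illustration):

```python
from collections import Counter

# Hypothetical sample pairs standing in for parsed (word, count) rows.
pairs = [("the", 225300), ("and", 157486), ("to", 134478)]

# A mapping is read as word -> count, which is what we want here.
from_mapping = Counter(dict(pairs))

# A plain list of tuples is tallied instead: each tuple is one item.
from_iterable = Counter(pairs)

print(from_mapping["the"])             # count taken from the data
print(from_iterable[("the", 225300)])  # how many times the tuple occurred
```

So passing the dict, as done above, is what preserves the counts from the file.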

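Answer 1: (score: 1)

Here are two versions. The first treats counts.txt as a regular text file. The second treats it as a csv file (which is what it kind of looks like).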

from collections import Counter

with open('counts.txt') as f:
    # split every line on whitespace; lines[0] is the header row
    lines = [line.strip().split() for line in f]

wordCounts = Counter({line[1]: int(line[2]) for line in lines[1:]})
print(wordCounts.most_common(3))
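
The listing in the question contains a "..." row, and a real file might also contain blank lines; indexing such a row would raise an IndexError. Below is a rough sketch of a more tolerant loader. The function name `load_counts` is my own invention, and it assumes the second column is the word and the third is an integer count:

```python
from collections import Counter


def load_counts(path):
    """Build a Counter from a whitespace-separated frequency list.

    Assumes column 2 is the word and column 3 its absolute count.
    """
    counts = Counter()
    with open(path) as f:
        next(f, None)  # skip the header row, if there is one
        for line in f:
            fields = line.split()
            # Skip blank lines and rows without a numeric count,
            # such as a literal "..." row.
            if len(fields) < 3 or not fields[2].isdigit():
                continue
            counts[fields[1]] += int(fields[2])
    return counts
```

It would be called as `wordCounts = load_counts('counts.txt')`. Note that because it uses `+=`, a word listed twice gets its counts summed, whereas the dict-based version keeps only the last one.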

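If your data file is delimited by some consistent character or string, you can use a csv.DictReader object to parse the file.

Below shows how it could be done IF your file is TAB-delimited.

Data file (edited by me to be TAB-delimited)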

rank    wordform    abs r   mod
1   the 225300  29  223066.9
2   and 157486  29  156214.4
3   to  134478  29  134044.8
999 fallen  345 29  326.6
1000    supper  368 27  325.8

Code

from csv import DictReader
from collections import Counter

with open('counts.txt') as f:
    # the header row supplies the keys: rank, wordform, abs, r, mod
    reader = DictReader(f, delimiter='\t')
    wordCounts = Counter({row['wordform']: int(row['abs']) for row in reader})

print(wordCounts.most_common(3))
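
To try the DictReader approach without creating a file, the same parsing can be run on an in-memory string. This is only a sketch: it assumes Python 3 (for `io.StringIO` with str data), and the `sample` string is a few rows of the TAB-delimited data typed out with explicit `\t` separators:

```python
import io
from csv import DictReader
from collections import Counter

# A few rows of the TAB-delimited data, written with explicit tabs.
sample = (
    "rank\twordform\tabs\tr\tmod\n"
    "1\tthe\t225300\t29\t223066.9\n"
    "2\tand\t157486\t29\t156214.4\n"
    "3\tto\t134478\t29\t134044.8\n"
)

# DictReader accepts any iterable of lines, so a StringIO works like a file.
reader = DictReader(io.StringIO(sample), delimiter='\t')
wordCounts = Counter({row['wordform']: int(row['abs']) for row in reader})

print(wordCounts.most_common(2))
```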