I want to initialize a collections.Counter object from a text file of word frequency counts. That is, I have a file "counts.txt":
rank wordform abs r mod
1 the 225300 29 223066.9
2 and 157486 29 156214.4
3 to 134478 29 134044.8
...
999 fallen 345 29 326.6
1000 supper 368 27 325.8
I would like a Counter object, wordCounts, so that I can call
>>> print(wordCounts.most_common(3))
[('the', 225300), ('and', 157486), ('to', 134478)]
What is the most efficient, Pythonic way to do this?
Answer 0 (score: 2)
import collections

words = {}
with open('counts.txt') as fp:
    next(fp)  # skip the header row: "rank wordform abs r mod"
    for line in fp:
        items = line.split()
        # items[1] is the word form, items[2] is its absolute count
        words[items[1]] = int(items[2])
wordCounts = collections.Counter(words)
Answer 1 (score: 1)
Here are two versions. The first treats counts.txt as a regular text file. The second treats it as a CSV file (which is what it looks like).
from collections import Counter

with open('counts.txt') as f:
    lines = [line.split() for line in f]
# lines[0] is the header row, so skip it
wordCounts = Counter({line[1]: int(line[2]) for line in lines[1:]})
print(wordCounts.most_common(3))
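To try the plain-text approach without a file on disk, the following is a minimal sketch that parses the sample rows from an in-memory string and feeds them into a Counter one line at a time, so no intermediate list of lines is built. The helper name `load_counts` and the `SAMPLE` string are made up for illustration. It is written for Python 3 and assumes the column layout shown in the question.

```python
import io
from collections import Counter

# Hypothetical sample mirroring the first rows of counts.txt.
SAMPLE = """rank wordform abs r mod
1 the 225300 29 223066.9
2 and 157486 29 156214.4
3 to 134478 29 134044.8
"""

def load_counts(lines):
    """Build a Counter from an iterable of 'rank wordform abs r mod' lines."""
    rows = iter(lines)
    next(rows, None)  # skip the header row (no error if the input is empty)
    counts = Counter()
    for line in rows:
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        # fields[1] is the word form, fields[2] its absolute count
        counts[fields[1]] += int(fields[2])
    return counts

word_counts = load_counts(io.StringIO(SAMPLE))
print(word_counts.most_common(3))
```

Because the counts are added with `+=`, a word form that appears on more than one line has its counts summed rather than overwritten. That is a design choice of this sketch, not something the question specifies. For a real file, `load_counts(open('counts.txt'))` inside a `with` block should behave the same way.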
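If the fields in your data file are separated by a consistent character or string, you can use a csv.DictReader object to parse the file.
The following shows how to do this if your file is TAB-delimited.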
Data file (edited by me to be TAB-delimited)
rank wordform abs r mod
1 the 225300 29 223066.9
2 and 157486 29 156214.4
3 to 134478 29 134044.8
999 fallen 345 29 326.6
1000 supper 368 27 325.8
Code
from csv import DictReader
from collections import Counter

with open('counts.txt') as f:
    reader = DictReader(f, delimiter='\t')
    # each row is a dict keyed by the header names
    wordCounts = Counter({row['wordform']: int(row['abs']) for row in reader})
print(wordCounts.most_common(3))
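If the file is separated by spaces, as it appears in the question, rather than by TABs, the same csv.DictReader approach may still work by passing a space as the delimiter. The sketch below is one way to do that under the assumption that fields never contain spaces themselves. The helper name `load_counts_csv` and the `SAMPLE` string are hypothetical, and the code targets Python 3. `skipinitialspace=True` is a standard csv option that ignores spaces immediately following a delimiter.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: the question's rows, separated by single spaces.
SAMPLE = """rank wordform abs r mod
1 the 225300 29 223066.9
2 and 157486 29 156214.4
3 to 134478 29 134044.8
"""

def load_counts_csv(f, delimiter=" "):
    """Build a Counter from a delimited file object with a header row."""
    reader = csv.DictReader(f, delimiter=delimiter, skipinitialspace=True)
    # 'wordform' and 'abs' are the column names from the header row
    return Counter({row["wordform"]: int(row["abs"]) for row in reader})

word_counts = load_counts_csv(io.StringIO(SAMPLE))
print(word_counts.most_common(3))
```

Passing `delimiter="\t"` to the same helper covers the TAB-delimited variant. Keeping the delimiter as a parameter avoids editing the data file to match the parser.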