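I want to build a data structure that records how many times each word occurs and maps the words in the correct order.
For example:
word_1 => 10 occurrences
word_2 => 5 occurrences
word_3 => 12 occurrences
word_4 => 2 occurrences
Each word also has an id that represents it:
kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}
So the ordered list would be:
ordered_vocab = [2, 0, 1, 3]
My code so far is: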
# build a vocabulary with the number of occurrences
vocab = {}
count = 0
for line in open(DATASET_FILE):
    for word in line.split():
        if word in vocab:
            vocab[word] += 1
        else:
            vocab[word] = 1
    count += 1
    if not count % 100000:
        print(count, "documents processed")
How can I do this efficiently?
Answer 0 (score: 3)
That is what Counter is for:
from collections import Counter
cnt = Counter()
with open(DATASET_FILE) as fp:
    # iterate over the file object directly so lines are read one at a time
    for line in fp:
        for word in line.split():
            cnt[word] += 1
Or (shorter and "prettier", using a generator):
from collections import Counter
with open(DATASET_FILE) as fp:
    # the generator is lazy, so it must be consumed while the file is still open
    words = (word for line in fp for word in line.split())
    cnt = Counter(words)
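To get from these counts to the ordered_vocab list in the question, one possible sketch uses Counter.most_common(), which returns (word, count) pairs sorted from most to least frequent. The counts and the kw2id mapping below are copied from the question's example; how kw2id is built in practice is not stated there, so it is treated as given.

```python
from collections import Counter

# Counts and ids copied from the example in the question.
cnt = Counter({'word_1': 10, 'word_2': 5, 'word_3': 12, 'word_4': 2})
kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

# most_common() yields (word, count) pairs, highest count first.
ordered_vocab = [kw2id[word] for word, _ in cnt.most_common()]
print(ordered_vocab)  # [2, 0, 1, 3]
```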
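Answer 1 (score: 2)
Here is a slightly faster version of your code. Sorry, I don't know numpy very well, but maybe this will help. enumerate and defaultdict(int) are the changes I made (you don't have to accept this answer, I'm just trying to help):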
from collections import defaultdict
# build a vocabulary with the number of occurrences
vocab = defaultdict(int)
with open(DATASET_FILE) as file_handle:
    # start=1 so the progress message is not printed for the very first line
    for count, line in enumerate(file_handle, start=1):
        for word in line.split():
            vocab[word] += 1
        if not count % 100000:
            print(count, "documents processed")
Starting from scratch, defaultdict(int) seems to be about twice as fast as Counter() for increments inside a for loop (running Python 3.44):
from collections import Counter
from collections import defaultdict
import time
words = " ".join(["word_" + str(x) for x in range(100)])
lines = [words for i in range(100000)]
counter_dict = Counter()
default_dict = defaultdict(int)
# time the increments on a Counter
start = time.time()
for line in lines:
    for word in line.split():
        counter_dict[word] += 1
end = time.time()
print(end - start)
# time the same increments on a defaultdict(int)
start = time.time()
for line in lines:
    for word in line.split():
        default_dict[word] += 1
end = time.time()
print(end - start)
Results:
5.353034019470215
2.554084062576294
If you want to dispute this claim, I would point you to the following question: Surprising results with Python timeit: Counter() vs defaultdict() vs dict()
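A single time.time() measurement like the one above can be noisy. A minimal sketch of the same comparison with timeit, which repeats each run so you can take the best time, might look like this. The helper name count_words and the smaller input size are assumptions made for illustration, and no particular timing is claimed, since it depends on the Python version and the machine.

```python
import timeit
from collections import Counter, defaultdict

# Smaller input than the benchmark above so the sketch runs quickly.
words = " ".join("word_" + str(x) for x in range(100))
lines = [words] * 1000

def count_words(container):
    """Hypothetical helper: increment one entry per word in `lines`."""
    for line in lines:
        for word in line.split():
            container[word] += 1
    return container

# repeat() returns one total time per repetition; the minimum is the least noisy.
counter_time = min(timeit.repeat(lambda: count_words(Counter()), number=5, repeat=3))
default_time = min(timeit.repeat(lambda: count_words(defaultdict(int)), number=5, repeat=3))

print("Counter:         ", counter_time)
print("defaultdict(int):", default_time)
```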
Answer 2 (score: 1)
You can use collections.Counter. A Counter accepts a list (or any other iterable) and automatically counts the occurrences of each element.
from collections import Counter
l = [1,2,2,3,3,3]
cnt = Counter(l)
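So, in addition to the answers above, you could build one list of all the words in the file and pass that list to Counter, instead of manually iterating over every element. Note that this approach is not suitable if the file is too large to fit in memory.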
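If the file is too large to hold as one list, a possible alternative is to feed the Counter one line at a time with Counter.update(), so that only the current line and the counts are kept in memory. The sketch below writes a small temporary file to stand in for DATASET_FILE; its contents are made up for illustration.

```python
import os
import tempfile
from collections import Counter

# Stand-in for DATASET_FILE: a small temporary file with made-up contents.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("word_1 word_2 word_3\nword_2 word_4\n")
    dataset_file = tmp.name

cnt = Counter()
with open(dataset_file) as fp:
    for line in fp:               # file objects yield one line at a time
        cnt.update(line.split())  # adds the counts for this line's words

os.remove(dataset_file)
print(cnt.most_common(1))  # [('word_2', 2)]
```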
Answer 3 (score: 0)
The string:
>>> a = 'word_1 word_2 word_3 word_2 word_4'
The ids:
>>> d = {'word_1':0, 'word_2':1, 'word_3':2, 'word_4': 3}
Generate the word counts:
>>> s = dict(zip(a.split(), map(lambda x: a.split().count(x), a.split())))
>>> s
{'word_1': 1, 'word_2': 2, 'word_3': 1, 'word_4': 1}
Generate the ordered list:
>>> a = sorted(s.items(), key=lambda x: x[1], reverse=True)
>>> ordered_list = list(map(lambda x: d[x[0]], a ))
>>> ordered_list
[1, 0, 2, 3]
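Note that a.split().count(x) rescans the whole word list for every word, so the counting step above is quadratic in the number of words. A sketch of the same steps in a single counting pass could look like the following. Breaking ties between equally frequent words by ascending id is an assumption added here to make the order explicit; the answer itself relies on the sort being stable.

```python
from collections import Counter

a = 'word_1 word_2 word_3 word_2 word_4'
d = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

# One pass over the words instead of one count() scan per word.
s = Counter(a.split())

# Sort by descending count; ties are broken by ascending id (an assumed rule).
ordered_words = sorted(s, key=lambda w: (-s[w], d[w]))
ordered_list = [d[w] for w in ordered_words]
print(ordered_list)  # [1, 0, 2, 3]
```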