Question

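I want to build a data structure that records the number of occurrences of each word and maps the words in the correct order.

For example: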

word_1 => 10 occurrences

word_2 => 5 occurrences

word_3 => 12 occurrences

word_4 => 2 occurrences

And each word has an id that represents it:

kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

So the ordered list would be:

ordered_vocab = [2, 0, 1, 3]

For example, my code so far is:

# build a vocabulary with the number of occurrences
vocab = {}
count = 0
for line in open(DATASET_FILE):
    for word in line.split():
        if word in vocab:
            vocab[word] += 1
        else:
            vocab[word] = 1
    count += 1
    if not count % 100000:
        print(count, "documents processed")

How can I do this efficiently?

Answer 1

That is what Counters are for:

from collections import Counter
cnt = Counter()

with open(DATASET_FILE) as fp:
    for line in fp:
        for word in line.split():
            cnt[word] += 1

Or (shorter and "prettier", using a generator):

from collections import Counter

with open(DATASET_FILE) as fp:
    words = (word for line in fp for word in line.split())
    cnt = Counter(words)
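
To get from the counts to the ordered id list the question asks for, one option is Counter.most_common(), which returns (word, count) pairs sorted from highest to lowest count. The following is a minimal sketch that uses the example counts and the kw2id mapping from the question instead of reading a file:

```python
from collections import Counter

# Counts and ids copied from the question's example.
counts = Counter({"word_1": 10, "word_2": 5, "word_3": 12, "word_4": 2})
kw2id = {"word_1": 0, "word_2": 1, "word_3": 2, "word_4": 3}

# most_common() yields (word, count) pairs, highest count first.
ordered_vocab = [kw2id[word] for word, _ in counts.most_common()]
print(ordered_vocab)  # [2, 0, 1, 3]
```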

Answer 2

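Here is a slightly faster version of your code. Sorry, I don't know numpy very well, but maybe this will help. enumerate and defaultdict(int) are the changes I made (you don't have to accept this answer, I'm just trying to help).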

from collections import defaultdict

# build a vocabulary with the number of occurrences
vocab = defaultdict(int)
with open(DATASET_FILE) as file_handle:
    for count, line in enumerate(file_handle, 1):
        for word in line.split():
            vocab[word] += 1
        if not count % 100000:
            print(count, "documents processed")

Starting from zero, defaultdict(int) appears to be twice as fast as Counter() for increments inside a for loop (running Python 3.44):

from collections import Counter
from collections import defaultdict
import time

words = " ".join(["word_"+str(x) for x in range(100)])
lines = [words for i in range(100000)]

counter_dict = Counter()
default_dict = defaultdict(int)

start = time.time()
for line in lines:
    for word in line.split():
        counter_dict[word] += 1
end = time.time()
print(end - start)

start = time.time()
for line in lines:
    for word in line.split():
        default_dict[word] += 1
end = time.time()
print(end - start)

Results:

5.353034019470215
2.554084062576294

If you want to dispute this claim, I refer you to the following question: Surprising results with Python timeit: Counter() vs defaultdict() vs dict()
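
Timings like the ones above depend heavily on the Python version and the machine, so it may be worth re-running the comparison yourself. Below is a rough sketch using the standard timeit module; the helper name count_into and the reduced input size (1000 lines instead of 100000) are choices made here for illustration, not part of the original benchmark:

```python
import timeit
from collections import Counter, defaultdict

words = " ".join("word_" + str(x) for x in range(100))
lines = [words] * 1000  # smaller than the original benchmark so it runs quickly


def count_into(container):
    """Increment one entry per word, as in the loops above (hypothetical helper)."""
    for line in lines:
        for word in line.split():
            container[word] += 1
    return container


# Take the best of three runs for each container type.
counter_time = min(timeit.repeat(lambda: count_into(Counter()), number=1, repeat=3))
default_time = min(timeit.repeat(lambda: count_into(defaultdict(int)), number=1, repeat=3))

print("Counter:         ", counter_time)
print("defaultdict(int):", default_time)
```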

Answer 3

You can use collections.Counter. A Counter accepts a list and automatically counts the occurrences of each element.

from collections import Counter
l = [1,2,2,3,3,3]
cnt = Counter(l)

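So, in addition to the answers above, what you can do is build a list of all the words in the file and pass that list to Counter, rather than manually iterating over each element. Note that this approach is not suitable if your file is too large to fit in memory.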
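
As a rough illustration of both variants: the sketch below uses io.StringIO with made-up contents as a stand-in for open(DATASET_FILE), first counting a complete word list in one call and then streaming line by line with Counter.update(), which avoids holding every word in memory at once.

```python
import io
from collections import Counter

# Stand-in for open(DATASET_FILE); the contents are made up for illustration.
fp = io.StringIO("word_1 word_2 word_3\nword_2 word_4\n")

# Whole-file approach: build the full word list, then count it in one call.
all_words = fp.read().split()
cnt = Counter(all_words)

# Streaming alternative: feed one line at a time, so only one line's
# worth of words is held in memory besides the counts themselves.
fp.seek(0)
streamed = Counter()
for line in fp:
    streamed.update(line.split())

print(cnt == streamed)  # True
print(cnt["word_2"])    # 2
```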

Answer 4

The string:

>>> a = 'word_1 word_2 word_3 word_2 word_4'

The ids:

>>> d = {'word_1':0, 'word_2':1, 'word_3':2, 'word_4': 3}

Generate the word counts:

>>> s = dict(zip(a.split(), map(lambda x: a.split().count(x), a.split())))
>>> s
{'word_1': 1, 'word_2': 2, 'word_3': 1, 'word_4': 1}

Generate the ordered list:

>>> a = sorted(s.items(), key=lambda x: x[1], reverse=True)
>>> ordered_list = list(map(lambda x: d[x[0]], a ))
>>> ordered_list
[1, 0, 2, 3]
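
Note that calling a.split().count(x) for every word rescans the whole list each time, which becomes slow on large inputs. A possible single-pass variant of the same steps, sketched here with collections.Counter and assuming Python 3.7 or later (where words with equal counts keep the order in which they were first seen):

```python
from collections import Counter

a = 'word_1 word_2 word_3 word_2 word_4'
d = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

# Count every word in a single pass over the split string.
s = Counter(a.split())

# most_common() sorts by count, highest first; map each word to its id.
ordered_list = [d[word] for word, _ in s.most_common()]
print(ordered_list)  # [1, 0, 2, 3]
```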

Efficient way to build a list of word occurrences in Python

4 answers: