Efficient way to build a list of word occurrences in Python

Date: 2017-10-24 18:22:34

Tags: python algorithm data-structures

I want to build a data structure that records the number of occurrences of each word and maps the words in the correct order.

For example:

word_1 => 10 occurrences
word_2 => 5 occurrences
word_3 => 12 occurrences
word_4 => 2 occurrences

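Each word also has an id that represents it:

kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

So the ordered list (ids sorted from most to least frequent) would be:

ordered_vocab = [2, 0, 1, 3]

For example, my code so far is: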

# build a vocabulary with the number of occurrences
vocab = {}
count = 0
for line in open(DATASET_FILE):
    for word in line.split():
        if word in vocab:
            vocab[word] += 1
        else:
            vocab[word] = 1
    count += 1
    if not count % 100000:
        print(count, "documents processed")

How can I do this efficiently?

4 Answers:

Answer 0: (score: 3)

That is what Counters are for:

from collections import Counter
cnt = Counter()

with open(DATASET_FILE) as fp:
    for line in fp:  # iterate lazily; readlines() would load the whole file
        for word in line.split():
            cnt[word] += 1

Or (shorter and "prettier", using a generator):

from collections import Counter

with open(DATASET_FILE) as fp:
    words = (word for line in fp for word in line.split())
    cnt = Counter(words)
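
To get from those counts to the ordered id list the question asks for, one option is `Counter.most_common()`, which returns (word, count) pairs sorted from most to least frequent. A minimal sketch, assuming the counts and the `kw2id` mapping from the question's example (the literal values are stand-ins for a real dataset):

```python
from collections import Counter

# Counts taken from the question's example; in practice `cnt` would be
# built from DATASET_FILE as in the snippets above.
cnt = Counter({'word_1': 10, 'word_2': 5, 'word_3': 12, 'word_4': 2})
kw2id = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

# most_common() yields (word, count) pairs, highest count first.
ordered_vocab = [kw2id[word] for word, _ in cnt.most_common()]
print(ordered_vocab)  # [2, 0, 1, 3]
```

Words with equal counts come out in the order they were first counted, so add an explicit sort key if ties need a specific order.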

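Answer 1: (score: 2)

Here is a slightly faster version of your code. Sorry, I don't know numpy very well, but maybe this helps. The changes I made are enumerate and defaultdict(int) (you don't have to accept this answer, I'm just trying to help).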

from collections import defaultdict

# build a vocabulary with the number of occurrences
vocab = defaultdict(int)
with open(DATASET_FILE) as file_handle:
    for count, line in enumerate(file_handle, 1):  # start at 1 so progress is not printed for line 0
        for word in line.split():
            vocab[word] += 1
        if not count % 100000:
            print(count, "documents processed")

Starting from scratch, defaultdict(int) appears to be about twice as fast as Counter() for increments in a for loop (running Python 3.44):

from collections import Counter
from collections import defaultdict
import time

words = " ".join(["word_"+str(x) for x in range(100)])
lines = [words for i in range(100000)]

counter_dict = Counter()
default_dict = defaultdict(int)

start = time.time()
for line in lines:
    for word in line.split():
        counter_dict[word] += 1
end = time.time()
print (end-start)

start = time.time()
for line in lines:
    for word in line.split():
        default_dict[word] += 1
end = time.time()
print (end-start)

Results (seconds; Counter first, then defaultdict):

5.353034019470215
2.554084062576294

If you want to dispute this claim, I suggest reading the following question: Surprising results with Python timeit: Counter() vs defaultdict() vs dict()
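
A single `time.time()` measurement like the one above is sensitive to noise from other processes. A hedged sketch of the same comparison using the standard-library `timeit` module follows. The helper name `count_with` and the smaller input size are choices made for this example, and the numbers will vary with Python version and machine:

```python
from collections import Counter, defaultdict
import timeit

words = " ".join("word_" + str(x) for x in range(100))
lines = [words] * 1000  # smaller than the original benchmark, to keep it quick


def count_with(factory):
    """Count every word in `lines` using a fresh mapping built by `factory`."""
    counts = factory()
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return counts


# Take the best of several repeats to reduce the effect of background noise.
counter_time = min(timeit.repeat(lambda: count_with(Counter), number=5, repeat=3))
default_time = min(
    timeit.repeat(lambda: count_with(lambda: defaultdict(int)), number=5, repeat=3)
)
print("Counter:    ", counter_time)
print("defaultdict:", default_time)
```

Both containers produce identical counts, so the choice between them here is only about speed and convenience.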

Answer 2: (score: 1)

You can use collections.Counter. Counter accepts a list and automatically counts the occurrences of each element.

from collections import Counter
l = [1,2,2,3,3,3]
cnt = Counter(l)

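So, in addition to the answers above, you can build a list of all the words in the file and pass that list to Counter, instead of manually iterating over each element. Note that this approach is not suitable if the file is too large to fit in memory.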
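
When the file is too large to hold as one list, a middle ground is to feed Counter one line at a time with `Counter.update()`, which adds counts to the existing totals. A minimal sketch, using an in-memory `io.StringIO` as a stand-in for `open(DATASET_FILE)` (the sample text is made up for illustration):

```python
from collections import Counter
import io

# Stand-in for open(DATASET_FILE); any iterable of lines works the same way.
fp = io.StringIO("word_1 word_2 word_3\nword_2 word_4\n")

cnt = Counter()
for line in fp:               # reads one line at a time
    cnt.update(line.split())  # adds this line's word counts to the totals

print(cnt.most_common(1))  # [('word_2', 2)]
```

Only the current line and the running totals are held in memory, so peak usage depends on vocabulary size rather than file size.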

Answer 3: (score: 0)

The string:

>>> a = 'word_1 word_2 word_3 word_2 word_4'

The ids:

>>> d = {'word_1':0, 'word_2':1, 'word_3':2, 'word_4': 3}

Generate the word counts:

>>> s = dict(zip(a.split(), map(lambda x: a.split().count(x), a.split())))
>>> s
{'word_1': 1, 'word_2': 2, 'word_3': 1, 'word_4': 1}

Generate the ordered list:

>>> a = sorted(s.items(), key=lambda x: x[1], reverse=True)
>>> ordered_list = list(map(lambda x: d[x[0]], a ))
>>> ordered_list
[1, 0, 2, 3]
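
Calling `a.split().count(x)` for every word rescans the whole list each time, so the work grows quadratically with the input length. A sketch of the same result with a single counting pass follows. Breaking ties by the smaller id is an assumption made for this example, not something the answer specifies:

```python
from collections import Counter

a = 'word_1 word_2 word_3 word_2 word_4'
d = {'word_1': 0, 'word_2': 1, 'word_3': 2, 'word_4': 3}

counts = Counter(a.split())  # one pass over the words

# Sort words by descending count; equal counts fall back to ascending id.
ordered_words = sorted(counts, key=lambda w: (-counts[w], d[w]))
ordered_list = [d[w] for w in ordered_words]
print(ordered_list)  # [1, 0, 2, 3]
```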