Question

I am trying to make this code (which I wrote myself) as fast as possible. First, here is the code:

# lemmas is a list of about 20,000 words.
# That is, lemmas = ['apple', 'dog', ... ]

# new_sents is a list of about 12,000 lists, each representing a sentence.
# That is, new_sents = [ ['Hello', 'I', 'am', 'a', 'boy'], ['Hello', 'I', 'am', 'a', 'girl'], ... ]

for x in lemmas:
    for y in lemmas:
        # prevent zero denominator
        x_count = 0.00001
        y_count = 0.00001

        xy_count = 0
        ## Dice denominator
        for i in new_sents:
            x_count += i.count(x)
            y_count += i.count(y)

            if x in i and y in i:
                xy_count += 1

        sim_score = float(xy_count) / (x_count + y_count)

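As you can see, there are far too many iterations: roughly 20,000 * 20,000 * 12,000 (about 4.8 trillion inner-loop steps), which is too large. sim_score is the Dice coefficient of two words. That is, xy_count is the number of sentences in which word x and word y appear together, and x_count and y_count are the total numbers of occurrences of x and y in new_sents, respectively.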

My code is too slow. Is there a better way?

Thanks in advance.

Answer 1

You are computing everything twice. Your score is symmetric in x and y, so you can get a 2x speedup by doing this:

for x, y in itertools.combinations(lemmas, 2):

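I am assuming you do not want to compare lemmas[0] with itself; otherwise you can use combinations_with_replacement.

The implementation will also be faster if the lookups are done against a set rather than a list. For example, if each sentence is a set, a membership test such as x in i takes constant time on average instead of scanning the whole sentence.

But you are still computing the same things many times. You can take each lemma, count it in new_sents once, and store the result.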
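
A minimal sketch of that idea, assuming Python 3. The names word_counts, sent_sets, EPS, and sim_scores are made up for this illustration, and the tiny data set is the sample from the question. Each word is counted once up front, each sentence is converted to a set, and only the unordered pairs are visited:

```python
from collections import Counter
from itertools import combinations

# Sample data taken from the question (the real lists are much larger).
lemmas = ['Hello', 'I', 'am', 'a', 'boy', 'girl']
new_sents = [['Hello', 'I', 'am', 'a', 'boy'],
             ['Hello', 'I', 'am', 'a', 'girl']]

# Count every word's total occurrences once, instead of once per pair.
word_counts = Counter(word for sent in new_sents for word in sent)

# Sets give average constant-time membership tests.
sent_sets = [set(sent) for sent in new_sents]

EPS = 0.00001  # same smoothing value the question adds to x_count and y_count

sim_scores = {}
for x, y in combinations(lemmas, 2):  # each unordered pair exactly once
    xy_count = sum(1 for s in sent_sets if x in s and y in s)
    sim_scores[(x, y)] = xy_count / (word_counts[x] + word_counts[y] + 2 * EPS)
```

This still loops over every sentence for every pair, so it removes the repeated counting but not the pair-by-sentence cost.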

Answer 2

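Here is one approach: iterate over the sentences, extract the word combinations from each one, and then count them relative to the individual word occurrences. This is much more efficient because the cost is number_of_sentences * number_of_words_per_sentence ^ 2.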

lemmas = ['apple', 'dog', 'foo', 'bar','Hello', 'I', 'am', 'a', 'boy', 'girl' ]

new_sents = [ ['Hello', 'I', 'am', 'a', 'boy'], ['Hello', 'I', 'am', 'a', 'girl']]

import itertools

#A counter is an auto updating dictionary for counting
from collections import Counter

#we initialize the counter with 1 for smoothing (avoiding 0)
lemmas = Counter({k:1 for k in lemmas})

#this is where we count the co-occurrences of words
coocurrs = Counter()

#we iterate the sentences, not the dictionary
for sentence in new_sents:
    #create all the word combinations in the sentence; set() counts each
    #pair once per sentence, and sorting keeps (a, b) and (b, a) identical
    combos = itertools.combinations(sorted(set(sentence)), 2)

    #update a count for each word in the sentence
    lemmas.update(sentence)

    #update a count for each word combinations
    coocurrs.update(combos)

probabilities = {}

#convert to "probabilities"
for xy, score in coocurrs.items():
    probabilities[xy] = score / float(lemmas[xy[0]] + lemmas[xy[1]])

print(probabilities)

Answer 3

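Whenever you have an n x n matrix whose rows and columns are the same members, the number of unique, non-repeating combinations is:

(n ^ 2 - n) / 2

In your case n = 20,000, which comes to just under 200,000,000 iterations. However, the way the code is written now, there are 400,000,000 possibilities: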

for x in lemmas:
    for y in lemmas:

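In other words, in addition to the case x == oranges and y == apples, you also get the case x == apples and y == oranges. Presumably only one of these is needed.

Finding a way to eliminate those 200,000,000 unnecessary iterations will improve the speed.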
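
To make the arithmetic concrete, here is a small check, assuming Python 3.8+ (for math.comb). The helper name unique_pairs and the four-word list are made up for this illustration; itertools.combinations is one way to visit only the unique pairs:

```python
from itertools import combinations
from math import comb

def unique_pairs(n):
    """Number of unordered pairs of distinct items: (n^2 - n) / 2."""
    return (n * n - n) // 2

# Tiny hypothetical word list to check the formula against real pairs.
small = ['apples', 'oranges', 'pears', 'plums']
pairs = list(combinations(small, 2))
assert len(pairs) == unique_pairs(len(small)) == comb(len(small), 2)

n = 20000
print(n * n)            # 400000000 ordered (x, y) pairs in the nested loops
print(unique_pairs(n))  # 199990000 unordered pairs, x != y
```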

Beyond that, my suggestion is to convert new_sents into a dictionary and remove this loop entirely:

for i in new_sents:

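Doing these two things should improve the speed. The total number of iterations then stays at 200,000,000, and the final lookups use a dictionary, which is much faster than searching a list. This faster lookup comes at the expense of memory, but for a list of length 12,000 that should not be a problem.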
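
One possible reading of "convert new_sents into a dictionary" is an inverted index that maps each word to the set of sentence indices containing it. This is my assumption, not necessarily the exact structure intended above, and the names sent_index, word_counts, and dice_like are hypothetical. A sketch in Python 3:

```python
from collections import defaultdict

# Sample data taken from the question (the real list is much larger).
new_sents = [['Hello', 'I', 'am', 'a', 'boy'],
             ['Hello', 'I', 'am', 'a', 'girl']]

sent_index = defaultdict(set)   # word -> indices of sentences containing it
word_counts = defaultdict(int)  # word -> total occurrences in new_sents

# One pass over the sentences builds both dictionaries.
for idx, sent in enumerate(new_sents):
    for word in sent:
        sent_index[word].add(idx)
        word_counts[word] += 1

def dice_like(x, y, eps=0.00001):
    """Score from the question, computed without looping over new_sents."""
    # Sentences containing both words = intersection of the two index sets.
    xy_count = len(sent_index.get(x, set()) & sent_index.get(y, set()))
    return xy_count / (word_counts.get(x, 0) + word_counts.get(y, 0) + 2 * eps)

print(dice_like('Hello', 'I'))
print(dice_like('boy', 'girl'))
```

With this layout the per-pair work is a set intersection rather than a scan of all 12,000 sentences, at the cost of holding the index in memory.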

Reducing "for loops" over big data and making improvements
