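How can I make this function more time-efficient?

Time: 2017-04-02 23:09:42

Tags: python performance pandas dictionary coding-efficiency

I have a DataFrame Series containing sentences (some of them quite long).

I also have 2 dictionaries that map words (keys) to integer counts (values).

Not every word is present in both dictionaries. Some appear in only one, and some in neither.

The DataFrame has 124011 rows. The function takes about 0.4 per string, which is far too long.

W is just a reference key into the dictionaries (weights = {}, weights[W] = {}).

Here is the function: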

from collections import Counter
import numpy as np

def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    for word in words:
        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
            if (weights[W][word]!=0):
                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
        else:
            ratios.append(0)
    if len(words)>0:
        ratios = np.divide(ratios, float(len(words)))
    ratio = np.sum(ratios)
    return ratio

Thanks.

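3 Answers:

Answer 0 (score: 1)

I think your slowdown may come from using Counter instead of a plain dict. Some discussion elsewhere suggests that parts of the dict class are written in pure C, whereas Counter is written in Python.

I suggest changing the code to use a dict and timing it to see whether it is actually faster.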
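One way to test this suggestion is a small timeit comparison. The sketch below is not from the original post: the helper names count_with_counter and count_with_dict and the sample sentence are made up for illustration. Timings depend on the machine and the Python version (recent CPython versions ship a C helper for Counter's counting loop), so no result is claimed here; measure rather than assume.

```python
from collections import Counter
import timeit

def count_with_counter(words):
    # Counter is a dict subclass that tallies hashable items.
    return Counter(words)

def count_with_dict(words):
    # Plain-dict tally using dict.get with a default of 0.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical sample input, only for the comparison.
sample_words = ("the quick brown fox jumps over the lazy dog " * 20).split()

# Both approaches must agree before their timings are worth comparing.
assert dict(count_with_counter(sample_words)) == count_with_dict(sample_words)

counter_time = timeit.timeit(lambda: count_with_counter(sample_words), number=1000)
dict_time = timeit.timeit(lambda: count_with_dict(sample_words), number=1000)
print("Counter: %.4f s, dict: %.4f s" % (counter_time, dict_time))
```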

Also, why is this code repeated?

words = string.split()
words_counts = Counter(words)
words = string.split()
words_counts = Counter(words)
ratios = []

Answer 1 (score: 1)

Let's clean it up a bit:

def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    words = string.split()
    words_counts = Counter(words)

That's redundant! Replace those 4 statements with 2:

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)

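Next:

    ratios = []
    for word in words:
        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
            if (weights[W][word]!=0):
                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
        else:
            ratios.append(0)

I'm not sure what you think that code does, and I hope you're not trying to be clever. But .keys() returns an iterable, and X in <iterable> is slower than X in <dict>. (In Python 2, .keys() builds a list, so the test is a linear scan; in Python 3 it returns a view, so the cost is mainly the extra call.) The & operator is also bitwise and does not short-circuit, so `and` is the idiomatic choice for combining two tests. Also, note: if the innermost condition (weights[W][word] != 0) fails, you don't append anything. That may be a bug, since you do append 0 in the else branch of the outer condition. (I don't know what you're trying to do, so I'm just pointing it out.) Finally, this is Python, not Perl, C or Java, so you don't need any parentheses around the test in if <test>:.

Let's go with this instead: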

    ratios = []

    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:    
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])

        else:
            ratios.append(0)
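To make the difference between & and `and` concrete, here is a minimal sketch. The dictionaries and the contains helper are made up for illustration; they only record which membership tests actually run.

```python
# Hypothetical data: "pear" has a weight but no relative weight.
weights_W = {"apple": 4, "pear": 0}
rel_W = {"apple": 2}

calls = []

def contains(mapping, word, label):
    # Record each membership test so we can see which ones were evaluated.
    calls.append(label)
    return word in mapping

# `and` short-circuits: the second test is skipped once the first is False.
calls.clear()
result_and = contains(rel_W, "pear", "rel") and contains(weights_W, "pear", "weights")
and_calls = list(calls)

# `&` is a bitwise operator: both operands are always evaluated.
calls.clear()
result_amp = contains(rel_W, "pear", "rel") & contains(weights_W, "pear", "weights")
amp_calls = list(calls)

print(and_calls)  # ['rel']
print(amp_calls)  # ['rel', 'weights']

# Membership on the dict and on its keys() gives the same answer.
assert ("apple" in weights_W) == ("apple" in weights_W.keys())
```

Both forms give the same truth value here; the `and` version simply does less work when the first test fails.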

Next:

    if len(words)>0:
        ratios = np.divide(ratios, float(len(words)))

You are guarding against division by zero. You can use the truthiness of the list for that check and skip the comparison:

    if words:
        ratios = np.divide(ratios, float(len(words)))
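As a small illustration of the truthiness check, the sketch below uses a hypothetical helper, mean_or_zero, and the built-in sum in place of np.divide and np.sum so that it runs without NumPy. It is not the original function, only the guard in isolation.

```python
def mean_or_zero(values, words):
    # An empty list is falsy, so the division is skipped when there are no words.
    if words:
        return sum(values) / float(len(words))
    return 0.0

print(mean_or_zero([], "".split()))                  # 0.0
print(mean_or_zero([1.0, 0, 2.0], "a b c".split()))  # 1.0
```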

The rest is fine, but you don't need the intermediate variable.

    ratio = np.sum(ratios)
    return ratio

After applying these changes, your function looks like this:

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)
    ratios = []

    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:    
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])

        else:
            ratios.append(0)

    if words:
        ratios = np.divide(ratios, float(len(words)))

    ratio = np.sum(ratios)
    return ratio

Looking at it a little longer, I see you are doing this:

word_counts = Counter(words)

for word in words:
    append(   word_counts[word] * ...)

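As far as I can tell, this means that if "apple" appears 6 times, you append 6 * ... to the list once per occurrence. So your list ends up with 6 separate entries of 6 * .... Are you sure that is what you want? Or should the loop be for word in word_counts, iterating over only the distinct words?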
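The difference between the two loops can be seen on a tiny example. The sentence and the weight values below are made up for illustration; only the loop shapes come from the discussion above.

```python
from collections import Counter

words = "apple apple apple pear".split()
word_counts = Counter(words)

# Hypothetical weights, chosen so the arithmetic is easy to follow.
rel_W = {"apple": 2.0, "pear": 1.0}
weights_W = {"apple": 4.0, "pear": 2.0}

# Iterating over every occurrence: "apple" contributes three times.
per_occurrence = [word_counts[w] * rel_W[w] / weights_W[w] for w in words]

# Iterating over distinct words: "apple" contributes once.
per_distinct = [word_counts[w] * rel_W[w] / weights_W[w] for w in word_counts]

print(per_occurrence)  # [1.5, 1.5, 1.5, 0.5]
print(per_distinct)    # [1.5, 0.5]
```

Which of the two is correct depends on what the ratio is meant to measure, which the question does not say.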

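Another optimization is to move lookups out of the loop. You keep looking up weights[W] and rel_weight[W] even though the value of W never changes. Let's cache those values outside the loop. Let's also cache a reference to the ratios.append method.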

def match_share(string, W, weights, rel_weight):

    words = string.split()    
    words_counts = Counter(words)
    ratios = []

    # Cache these values for speed in loop
    ratios_append = ratios.append
    weights_W = weights[W]
    rel_W = rel_weight[W]

    for word in words:
        if word in weights_W and word in rel_W:
            if weights_W[word] != 0:    
                ratios_append(words_counts[word] * rel_W[word] / weights_W[word])

        else:
            ratios_append(0)

    if words:
        ratios = np.divide(ratios, float(len(words)))

    ratio = np.sum(ratios)
    return ratio

Try it and see how it performs. Please review the notes and questions above. There may be bugs, and there may be more ways to speed this up.
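A rough way to "try it" is to check that both versions return the same value and then time them. In the sketch below the function names match_share_original and match_share_cached, the sample dictionaries, and the sentence are all made up for illustration. As a labeled assumption, the final np.divide / np.sum step is replaced with the built-in sum so the example runs without NumPy; no timing result is claimed.

```python
from collections import Counter
import timeit

def match_share_original(string, W, weights, rel_weight):
    # Loop body as posted in the question; only the final averaging is simplified.
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    for word in words:
        if (word in weights[W].keys()) & (word in rel_weight[W].keys()):
            if weights[W][word] != 0:
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
        else:
            ratios.append(0)
    return sum(ratios) / float(len(words)) if words else 0.0

def match_share_cached(string, W, weights, rel_weight):
    # Same logic with the lookups and the append method cached outside the loop.
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    ratios_append = ratios.append
    weights_W = weights[W]
    rel_W = rel_weight[W]
    for word in words:
        if word in weights_W and word in rel_W:
            if weights_W[word] != 0:
                ratios_append(words_counts[word] * rel_W[word] / weights_W[word])
        else:
            ratios_append(0)
    return sum(ratios) / float(len(words)) if words else 0.0

# Hypothetical test data.
W = "w1"
weights = {W: {"apple": 4, "pear": 0, "plum": 2}}
rel_weight = {W: {"apple": 2, "pear": 1}}
sentence = "apple pear plum kiwi apple " * 50

# The two versions must agree before the timings mean anything.
assert (match_share_original(sentence, W, weights, rel_weight)
        == match_share_cached(sentence, W, weights, rel_weight))

for func in (match_share_original, match_share_cached):
    elapsed = timeit.timeit(lambda: func(sentence, W, weights, rel_weight), number=200)
    print("%s: %.4f s" % (func.__name__, elapsed))
```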

Answer 2 (score: 1)

A profile of the function's execution would help, but here are some general ideas:

  1. You fetch some elements on every iteration unnecessarily. You can extract them before the loop, for example:

      weights_W = weights[W]
      rel_weights_W = rel_weight[W]

  2. You don't need to call .keys() on the dictionaries. These are equivalent:

      word in weights_W.keys()
      word in weights_W

  3. Try to fetch the values without first checking for them. This saves one lookup. For example, instead of:

      if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
          if (weights[W][word]!=0):

     you can do:

      word_weight = weights_W.get(word)
      if word_weight is not None:
          word_rel_weight = rel_weights_W.get(word)
          if word_rel_weight is not None:
              if word_weight != 0:  # lookup saved here
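Putting these three ideas together, one possible complete function could look like the sketch below. It is an illustration under stated assumptions, not the answerer's code: the name match_share_get is made up, the stored values are assumed never to be None, and a running total with the built-in division replaces the ratios list and the NumPy calls. That substitution gives the same sum, because the appended zeros contribute nothing.

```python
from collections import Counter

def match_share_get(string, W, weights, rel_weight):
    words = string.split()
    if not words:
        # No words: nothing to average (also avoids dividing by zero).
        return 0.0
    words_counts = Counter(words)

    # Idea 1: fetch the inner dictionaries once, before the loop.
    weights_W = weights[W]
    rel_weights_W = rel_weight[W]

    total = 0.0
    for word in words:
        # Ideas 2 and 3: one .get() per dictionary instead of a membership
        # test followed by a second lookup. Assumes no stored value is None.
        word_weight = weights_W.get(word)
        if word_weight is None or word_weight == 0:
            continue  # contributes 0 to the sum, as in the original
        word_rel_weight = rel_weights_W.get(word)
        if word_rel_weight is not None:
            total += words_counts[word] * word_rel_weight / word_weight
    return total / len(words)

# Hypothetical data for a quick check.
W = "w1"
weights = {W: {"apple": 4, "pear": 0, "plum": 2}}
rel_weight = {W: {"apple": 2, "pear": 1}}
print(match_share_get("apple apple pear kiwi", W, weights, rel_weight))  # 0.5
```

In the usage line, "apple" contributes 2 * 2 / 4 = 1.0 for each of its two occurrences, "pear" and "kiwi" contribute nothing, and the total of 2.0 is divided by 4 words.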