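I have a DataFrame series containing sentences (some of them quite long).
I also have 2 dictionaries, each with words as keys and integer counts as values.
Not every word appears in both dictionaries: some appear in only one, and some in neither.
The DataFrame has 124,011 rows. The function takes about 0.4 per string, which is too long.
`W` is just a key into the dictionaries (`weights = {}`, `weights[W] = {}`).
Here is the function: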
from collections import Counter
import numpy as np

def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    for word in words:
        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
            if (weights[W][word]!=0):
                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
        else:
            ratios.append(0)
    if len(words)>0:
        ratios = np.divide(ratios, float(len(words)))
    ratio = np.sum(ratios)
    return ratio
THX
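Answer 0 (score: 1)
I think your slowness may come from the fact that you are using Counter rather than a plain dict. Some discussion here suggests that parts of the dict class are written in pure C, whereas Counter is written in Python.
I suggest changing the code to use a dict and timing it to see whether that is actually faster.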
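One way to test this suggestion is to time both counting approaches on the same input. The sketch below is only an illustration: the sample sentence and the helper name `count_with_dict` are made up here, and the result will depend on the Python version and on the input.

```python
import timeit
from collections import Counter


def count_with_dict(words):
    """Count occurrences with a plain dict (hypothetical helper)."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# Made-up sample input, not the asker's data
words = ("the quick brown fox jumps over the lazy dog " * 50).split()

# Both approaches must agree before their speed is compared
assert count_with_dict(words) == dict(Counter(words))

t_counter = timeit.timeit(lambda: Counter(words), number=1000)
t_dict = timeit.timeit(lambda: count_with_dict(words), number=1000)
print(f"Counter: {t_counter:.4f}s  dict: {t_dict:.4f}s")
```

If the two timings are close, the counting step is probably not the bottleneck and the loop itself deserves a look.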
Also, why is this code repeated?
words = string.split()
words_counts = Counter(words)
words = string.split()
words_counts = Counter(words)
ratios = []
Answer 1 (score: 1)
Let's clean this up:
def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    words = string.split()
    words_counts = Counter(words)
That's redundant! Replace the 4 statements with 2:
def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
Next:
    ratios = []
    for word in words:
        if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
            if (weights[W][word]!=0):
                ratios.append(words_counts[word]*rel_weight[W][word]/weights[W][word])
        else:
            ratios.append(0)
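I have no idea what you think that code does, and I hope you're not trying to be tricky. `.keys()` returns an iterable (a list in Python 2, a view object in Python 3), and `X in <iterable>` is slower than `X in <dict>`: on a Python 2 list it is a linear scan, and even on a Python 3 view it costs an extra method call. Note too that `&` is the bitwise operator, so unlike `and` it evaluates both sides every time.
Also, note: if the innermost condition (`weights[W][word] != 0`) fails, you don't append anything. That may be a bug, since you do append 0 in the other `else` branch. (I don't know what you're trying to do, so I'm just pointing it out.)
Finally, this is Python, not Perl, C, or Java, so no parentheses are needed around `if <test>:`.
Let's go with this: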
    ratios = []
    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
        else:
            ratios.append(0)
Next:
    if len(words)>0:
        ratios = np.divide(ratios, float(len(words)))
You're trying to prevent division by zero. But you can use the truthiness of the list to check that, and skip the comparison:
    if words:
        ratios = np.divide(ratios, float(len(words)))
The rest is fine, but you don't really need the variable.
ratio = np.sum(ratios)
return ratio
With those changes applied, your function looks like this:
def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    for word in words:
        if word in weights[W] and word in rel_weight[W]:
            if weights[W][word] != 0:
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
        else:
            ratios.append(0)
    if words:
        ratios = np.divide(ratios, float(len(words)))
    ratio = np.sum(ratios)
    return ratio
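Looking at it a bit more, I see you're doing this:
word_counts = Counter(words)
for word in words:
    append( word_counts[word] * ...)
As far as I can tell, this means that if "apple" appears 6 times, you append 6 * ... to the list once for every occurrence of the word. So your list ends up with 6 separate entries of 6 * .... Are you sure that's what you want? Or should it be `for word in word_counts`, so that you iterate over the distinct words only?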
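The difference between the two loops is easy to see on a tiny example. This is a sketch with made-up weights and a hypothetical helper `total`; it only illustrates the question raised above and does not say which behavior is the intended one.

```python
from collections import Counter

# Made-up weights for illustration only
weights_W = {"apple": 2.0, "pear": 4.0}
rel_W = {"apple": 1.0, "pear": 2.0}

words = "apple apple apple pear".split()
word_counts = Counter(words)


def total(iterable):
    """Sum count * rel / weight over the given words, divided by len(words)."""
    terms = [word_counts[w] * rel_W[w] / weights_W[w] for w in iterable]
    return sum(terms) / len(words)


per_occurrence = total(words)        # "apple" contributes three times
per_distinct = total(word_counts)    # "apple" contributes once
print(per_occurrence, per_distinct)
```

Here the per-occurrence loop gives 1.25 and the distinct-word loop gives 0.5, so the choice changes the result, not just the speed.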
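Another optimization is to move lookups out of the loop. You keep looking up `weights[W]` and `rel_weight[W]` on every iteration, even though the value of `W` never changes. Let's cache those values outside the loop. While we're at it, let's also cache a reference to the `ratios.append` method.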
def match_share(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    # Cache these values for speed in loop
    ratios_append = ratios.append
    weights_W = weights[W]
    rel_W = rel_weight[W]
    for word in words:
        if word in weights_W and word in rel_W:
            if weights_W[word] != 0:
                ratios_append(words_counts[word] * rel_W[word] / weights_W[word])
        else:
            ratios_append(0)
    if words:
        ratios = np.divide(ratios, float(len(words)))
    ratio = np.sum(ratios)
    return ratio
Try it and see how it performs. Please go over the notes and questions raised above. There may be bugs, and there may be more ways to speed it up.
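To "try it and see", the two loop styles can be timed side by side. The following is a sketch under stated assumptions: the data is made up, the names `loop_original` and `loop_cached` are hypothetical, and the final `np.divide` / `np.sum` step is replaced by plain `sum(ratios) / len(words)` (mathematically the same, up to rounding) so that the sketch needs only the standard library.

```python
import timeit
from collections import Counter


def loop_original(string, W, weights, rel_weight):
    """Loop as in the question: repeated weights[W] lookups and .keys()."""
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    for word in words:
        if (word in weights[W].keys()) & (word in rel_weight[W].keys()):
            if weights[W][word] != 0:
                ratios.append(words_counts[word] * rel_weight[W][word] / weights[W][word])
        else:
            ratios.append(0)
    # Plain-Python stand-in for np.divide + np.sum (an assumption of this sketch)
    return sum(ratios) / len(words) if words else 0.0


def loop_cached(string, W, weights, rel_weight):
    """Loop with weights[W], rel_weight[W] and ratios.append cached."""
    words = string.split()
    words_counts = Counter(words)
    ratios = []
    ratios_append = ratios.append
    weights_W = weights[W]
    rel_W = rel_weight[W]
    for word in words:
        if word in weights_W and word in rel_W:
            if weights_W[word] != 0:
                ratios_append(words_counts[word] * rel_W[word] / weights_W[word])
        else:
            ratios_append(0)
    return sum(ratios) / len(words) if words else 0.0


# Made-up data for illustration only
weights = {"w": {"apple": 2, "pear": 4, "fig": 0}}
rel_weight = {"w": {"apple": 1, "pear": 2, "fig": 3}}
sentence = "apple pear fig plum apple " * 40

# The two versions must agree before timing means anything
assert loop_original(sentence, "w", weights, rel_weight) == loop_cached(sentence, "w", weights, rel_weight)

for func in (loop_original, loop_cached):
    seconds = timeit.timeit(lambda: func(sentence, "w", weights, rel_weight), number=2000)
    print(f"{func.__name__}: {seconds:.4f}s")
```

How much the cached version gains will depend on sentence length and Python version, so it is worth running this against real sentences from the DataFrame.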
Answer 2 (score: 1)
It would be good to have a profile of that function's execution, but here are some general ideas:
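For example, look up the per-`W` dictionaries once, outside the loop:
weights_W = weights[W]
rel_weights_W = rel_weight[W]
There is no need to call `.keys()`. These are equivalent: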
word in weights_W.keys()
word in weights_W
You can also save a lookup by fetching the value directly rather than testing for the key first. For example, instead of:
if ((word in weights[W].keys())&(word in rel_weight[W].keys())):
    if (weights[W][word]!=0):
you can do this:
word_weight = weights_W.get(word)
if word_weight is not None:
    word_rel_weight = rel_weights_W.get(word)
    if word_rel_weight is not None:
        if word_weight != 0: # lookup saved here
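Put together, the `.get()` idea might look like the sketch below. It assumes the dictionary values are numbers (never `None`), keeps the question's behavior of appending 0 only when a word is missing from either dictionary, and uses a hypothetical function name `match_share_get` with made-up data. The final averaging uses plain `sum` instead of NumPy, which is a choice made here to keep the sketch standard-library only.

```python
from collections import Counter


def match_share_get(string, W, weights, rel_weight):
    words = string.split()
    words_counts = Counter(words)
    # Fetch the per-W dictionaries once, outside the loop
    weights_W = weights[W]
    rel_weights_W = rel_weight[W]
    ratios = []
    for word in words:
        # One .get() per dictionary replaces an "in" test plus an index lookup
        word_weight = weights_W.get(word)
        word_rel_weight = rel_weights_W.get(word)
        if word_weight is None or word_rel_weight is None:
            ratios.append(0)  # word missing from at least one dictionary
        elif word_weight != 0:
            ratios.append(words_counts[word] * word_rel_weight / word_weight)
        # A zero weight appends nothing, as in the question's code
    return sum(ratios) / len(words) if words else 0.0


# Made-up data for illustration only
weights = {"w": {"apple": 2, "fig": 0}}
rel_weight = {"w": {"apple": 1, "fig": 3}}
print(match_share_get("apple fig plum apple", "w", weights, rel_weight))
```

Unlike the nested form above, this version always performs both `.get()` calls; nesting the second call inside the first check would skip it when the word is missing from `weights_W`.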