Question

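I'm working through some examples of computing PMI and applying them to a collection of tweets I have (~50k). I found that the bottleneck of the algorithm's implementation is in the defaultdict(lambda : defaultdict(int)), and I don't know why.

Here is the example I profiled; it takes a lot of memory and time: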

for term, n in p_t.items():
    positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
    negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)
    semantic_orientation[term] = positive_assoc - negative_assoc

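The part

positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)

allocates a lot of memory for some reason. I assume that because missing values return 0, the array passed to the sum function is very large.

I solved the problem with a simple "if value exists" check and a variable sum_pos.

The whole implementation comes from a blog post.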

Answer 1

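A defaultdict calls the factory function for every key that is missing. If you use it in a sum() where many keys are missing, then you do indeed create a whole load of dictionaries, and those dictionaries in turn grow to hold more keys, none of which you ever use.

Switch to using the dict.get() method here to prevent those objects from being created: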

positive_assoc = sum(pmi.get(term, {}).get(tx, 0) for tx in positive_vocab)
negative_assoc = sum(pmi.get(term, {}).get(tx, 0) for tx in negative_vocab)

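Note that the pmi.get() call returns an empty dictionary, so that the chained dict.get() call keeps working and can return the default 0 if there is no dictionary associated with a given term.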
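To see the difference, here is a minimal sketch that compares the two lookup styles on toy data. The contents of pmi, positive_vocab and terms, as well as the helper inner_entries, are made up for illustration and are not from the question.

```python
from collections import defaultdict

# Hypothetical toy data standing in for the question's pmi table and vocabulary.
pmi = defaultdict(lambda: defaultdict(int))
pmi["good"]["great"] = 1.5
positive_vocab = ["great", "nice", "love"]
terms = ["good", "bad", "ugly"]

def inner_entries(table):
    # Total number of keys stored across all inner dictionaries.
    return sum(len(inner) for inner in table.values())

# Lookups with .get() never insert anything into either level.
total_get = sum(pmi.get(t, {}).get(tx, 0) for t in terms for tx in positive_vocab)
size_after_get = (len(pmi), inner_entries(pmi))

# Plain indexing inserts a default for every missing key it touches.
total_index = sum(pmi[t][tx] for t in terms for tx in positive_vocab)
size_after_index = (len(pmi), inner_entries(pmi))

print(size_after_get)    # (1, 1)
print(size_after_index)  # (3, 9)
```

Both sums give the same result, but the indexing version leaves behind one inner dictionary per term and one zero entry per (term, word) pair. With ~50k tweets and a large vocabulary, that growth is likely where the memory goes.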

Answer 2

I like Martijn's answer... but this should work too, and you may find it more readable.

positive_assoc = sum(pmi[term][tx] for tx in positive_vocab if term in pmi and tx in pmi[term])
negative_assoc = sum(pmi[term][tx] for tx in negative_vocab if term in pmi and tx in pmi[term])
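This works because the in operator only tests membership and does not trigger the defaultdict factory, and the and short-circuits before pmi[term] is evaluated for an unknown term. A small sketch to check that, assuming a hypothetical counting_factory and toy data that are not part of the question:

```python
from collections import defaultdict

stats = {"calls": 0}

def counting_factory():
    # Hypothetical factory that records each time defaultdict invokes it.
    stats["calls"] += 1
    return defaultdict(int)

pmi = defaultdict(counting_factory)
pmi["good"]["great"] = 2.0      # one factory call, for the new "good" entry
calls_before = stats["calls"]

positive_vocab = ["great", "nice"]
term = "bad"                    # deliberately absent from pmi

# "term in pmi" is False, so pmi[term] is never evaluated and nothing is created.
positive_assoc = sum(
    pmi[term][tx] for tx in positive_vocab
    if term in pmi and tx in pmi[term]
)
calls_after = stats["calls"]

print(positive_assoc, calls_before, calls_after, "bad" in pmi)
```

The factory call count is unchanged by the guarded sum, and "bad" is still absent from pmi afterwards.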

Memory efficiency of defaultdict
