Check whether two strings contain the same set of words in Python

Question

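I'm trying to compare two sentences to see whether they contain the same set of words.
For example, comparing "today is a good day" and "is today a good day" should return true.
I'm currently using Counter from the collections module: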

from collections import Counter


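# file_ob is an already-open text file with one sentence per line.
vocab = {}
for line in file_ob:
    flag = 0
    # Linear scan: compare this line against every sentence stored so far.
    for sentence in vocab:
        if Counter(sentence.split(" ")) == Counter(line.split(" ")):
            vocab[sentence] += 1
            flag = 1
            break
    # No stored sentence matched, so record the line as a new key.
    if flag == 0:
        vocab[line] = 1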

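It seems to work fine for a few lines, but my text file has more than 1000 of them and the script never finishes executing. Is there another, more efficient way that would let me compute the result for the whole file?

Edit:

I just need a replacement for the Counter comparison, something I can swap in without any other change to the implementation.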

Answer 1

Try something like

set(sentence.split(" ")) == set(line.split(" "))

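Comparing set objects is faster than comparing Counters. Both set and Counter objects are hash-based collections, but comparing two Counters has to compare both keys and values, whereas comparing two sets only has to compare keys.
Thanks to Eric and Barmar for their input.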
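
One caveat worth checking before you switch: a set ignores how many times a word occurs, whereas a Counter does not. The following is a small sketch with made-up sentences (not taken from your file) that shows where the two comparisons disagree:

```python
from collections import Counter

a = "a b c"
b = "a a b c"   # same words as `a`, but "a" appears twice

# Sets only compare which words are present.
same_as_sets = set(a.split(" ")) == set(b.split(" "))

# Counters also compare how often each word appears.
same_as_counters = Counter(a.split(" ")) == Counter(b.split(" "))

print(same_as_sets)      # True
print(same_as_counters)  # False
```

If repeated words should make two sentences different, the set comparison is not a drop-in replacement.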

Your full code would look like this:

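# vocab is assumed to be an existing dictionary with around 1000 sentences as keys
# and counts as values; file_ob is an open text file.
for line in file_ob:
    line_words = set(line.split(" "))  # build the line's set once per line
    for sentence in vocab:
        if set(sentence.split(" ")) == line_words:
            vocab[sentence] += 1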

Answer 2

You really don't need two loops.

The right way to use a dictionary

Suppose you have a dict:

my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5, 'g': 6}

Your code is basically equivalent to:

for (key, value) in my_dict.items():
    if key == 'c':
        print(value)
        break
#=> 3

But the whole point of a dict (and of set, Counter, ...) is being able to get the value you want directly:

my_dict['c']
#=> 3

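If your dict has 1000 values, the first example will on average be roughly 500 times slower than the second, because the loop has to check about 500 keys before it finds a match. Here is a simple explanation I found on Reddit:

A dict is like a magic coat check room. You hand your coat over and get a ticket. Whenever you give that ticket back, you immediately get your coat. You can have a lot of coats, but you still get your coat back immediately. There is a lot of magic going on inside the coat check room, but you don't really care as long as you get your coat back immediately.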
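
If you want to see the difference on your own machine, here is a rough timing sketch. The 1000-key dictionary and the function names linear_scan and direct_lookup are made up for this illustration, and the exact ratio will vary from one run to the next:

```python
import timeit

# Hypothetical dictionary with 1000 keys, built only for this comparison.
my_dict = {f"key{i}": i for i in range(1000)}

def linear_scan(d, wanted):
    # Walks the items one by one, like the inner loop in the question.
    for key, value in d.items():
        if key == wanted:
            return value
    return None

def direct_lookup(d, wanted):
    # Uses the hash table directly: no iteration over the keys.
    return d.get(wanted)

# Both approaches return the same value...
print(linear_scan(my_dict, "key500") == direct_lookup(my_dict, "key500"))

# ...but they take very different amounts of time (seconds for 2000 calls).
print(timeit.timeit(lambda: linear_scan(my_dict, "key500"), number=2000))
print(timeit.timeit(lambda: direct_lookup(my_dict, "key500"), number=2000))
```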

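Refactoring your code

You just need to find a common signature between "Today is a good day!" and "Is today a good day?". One way is to extract the words, convert them to lowercase, sort them and join them. The important thing is that the output is immutable (e.g. tuple, string, frozenset). That way it can be used directly as a member of a set or as a key in a Counter or dict, without iterating over every key.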

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

vocab = Counter()
for sentence in sentences:
    sorted_words = ' '.join(sorted(sentence.lower().split(" ")))
    vocab[sorted_words] += 1

vocab
#=> Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

Or shorter:

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = Counter(sorted_words(sentence) for sentence in sentences)
# Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

This code should be much faster than what you've tried so far.
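
Note that splitting on spaces keeps punctuation attached to the words, so "day!" and "day?" would count as different words. If your file contains punctuation, one possible variant is to extract the words with re.findall. This is only a sketch: it assumes that a word is any run of characters matched by \w+, and the function name signature is made up here.

```python
import re
from collections import Counter

sentences = ["Today is a good day!", "Is today a good day?", "a b c", "c, b, a"]

def signature(sentence):
    # Lowercase, keep only runs of word characters, then sort and join.
    words = re.findall(r"\w+", sentence.lower())
    return ' '.join(sorted(words))

vocab = Counter(signature(sentence) for sentence in sentences)
print(vocab)
#=> Counter({'a day good is today': 2, 'a b c': 2})
```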

Yet another alternative

If you want to keep the original sentences in a list, you can use setdefault:

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = {}
for sentence in sentences:
    vocab.setdefault(sorted_words(sentence), []).append(sentence)

vocab

#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
# 'a b c': ['a b c', 'c b a'],
# 'a a b c': ['a a b c']}
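
If you prefer not to call setdefault on every iteration, collections.defaultdict(list) does the same grouping. This is just a sketch of the same idea with the same sample sentences; the variable name groups is made up here.

```python
from collections import defaultdict

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

# Missing keys are created automatically with an empty list.
groups = defaultdict(list)
for sentence in sentences:
    groups[sorted_words(sentence)].append(sentence)

print(dict(groups))
#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
# 'a b c': ['a b c', 'c b a'],
# 'a a b c': ['a a b c']}
```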

Answer 3

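To account for duplicate/multiple words, your equality comparison could be:

def hash_sentence(s):
    # Join with a space so that word boundaries survive:
    # with ''.join, 'ab c' and 'a bc' would both become 'abc'.
    return hash(' '.join(sorted(s.split())))

a = 'today is a good day'
b = 'is today a good day'
c = 'today is a good day is a good day'

hash_sentence(a) == hash_sentence(b)  # True
hash_sentence(a) == hash_sentence(c)  # False

Also, note that in your implementation each sentence is recomputed n times (for sentence in vocab:).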

Answer 4

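In your code, you can move the Counter construction out of the inner loop instead of recomputing every Counter for each pair. This should speed up the algorithm by a factor proportional to the average number of tokens per string.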

from collections import Counter

# vocab is assumed to be an existing dictionary with around 1000 sentences as keys.
# Build each stored sentence's Counter once, up front.
vocab_counter = {k: Counter(k.split(" ")) for k in vocab}

for line in file_ob:
    # Build the line's Counter once, outside the inner loop.
    line_counter = Counter(line.split(" "))
    for sentence in vocab:
        if vocab_counter[sentence] == line_counter:
            vocab[sentence] += 1

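A further improvement is possible by using the Counters as the index of the dictionary, which lets you replace the linear search for a matching sentence with a lookup. The frozendict package may be useful here, so that you can use a dictionary as the key of another dictionary.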
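
A Counter is a dict subclass and so is not hashable, which is why it cannot serve as a dictionary key directly. As an alternative to the frozendict package, one standard-library option is to freeze the Counter's items into a frozenset. The sketch below assumes that approach; the helper name counter_key and the sample lines are made up for illustration.

```python
from collections import Counter

def counter_key(sentence):
    # frozenset of (word, count) pairs: hashable and independent of word order.
    # split() with no argument also drops the trailing newline of a file line.
    return frozenset(Counter(sentence.split()).items())

lines = ["today is a good day\n", "is today a good day\n", "a b c\n", "a a b c\n"]

counts = {}  # key -> number of lines with that multiset of words
for line in lines:
    key = counter_key(line)
    counts[key] = counts.get(key, 0) + 1

print(counts[counter_key("good day a is today")])  # 2
print(counts[counter_key("c b a")])                # 1
```

Each line now costs one Counter construction and one dictionary lookup, instead of a comparison against every stored sentence.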
