Question

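I am iterating over a list of dictionaries called data. I then want to check whether any word from a list called words appears in one of the key-value pairs of each dictionary in data. Every dictionary in data has the same keys.

The problem is that although the loop works, the same dictionary seems to be appended to cleandata multiple times. What am I missing?

Here is what I have so far: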

cleandata = []
data = [{'text':'some string1','another key':'some other text',...}, {'text':'some string2', 'another key':'some other text 2',...},{...},..]
words = ['word1','word2',...]

for d in data:
   for word in words:
     if word in d['text']:
       cleandata.append(d)
     else:
       continue

This gives me something like:

cleandata = [{'text':'word1','another key':'some other text',...},{'text':'word1','another key':'some other text',...},{'text':'word1','another key':'some other text',...},...{'text':'word2','another key':'some other text',...},{'text':'word2','another key':'some other text',...},... ]

Answer 1

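Break out of the inner loop as soon as a match is found, so that each dictionary is appended at most once. You also don't need the else: continue branch.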

cleandata = []
data = [{'text':'some string1 word1 word2','another key':'some other text'}, {'text':'some string2 word1', 'another key':'some other text 2'}]
words = ['word1','word2']

for d in data:
    for word in words:
        if word in d['text']:
            cleandata.append(d)
            break
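
To make the difference visible, here is a small sketch that runs both versions of the loop side by side. The sample data and the names without_break and with_break are made up for illustration and are not from the question.

```python
# Hypothetical sample data, chosen so the first dict contains both words.
data = [
    {'text': 'some string1 word1 word2', 'another key': 'some other text'},
    {'text': 'some string2 word1', 'another key': 'some other text 2'},
    {'text': 'no match here', 'another key': 'some other text 3'},
]
words = ['word1', 'word2']

# Loop as in the question: appends once for every word that matches.
without_break = []
for d in data:
    for word in words:
        if word in d['text']:
            without_break.append(d)

# Loop with break: stops checking words after the first match.
with_break = []
for d in data:
    for word in words:
        if word in d['text']:
            with_break.append(d)
            break

print(len(without_break))  # 3: the first dict is appended twice, the second once
print(len(with_break))     # 2: each matching dict is appended once
```

The dictionary that contains both word1 and word2 is what produces the duplicate in the first loop.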

Answer 2

Maybe:

for d in data:
   if any(word in d['text'] for word in words):
       cleandata.append(d)

any() short-circuits on the first match.

This can be reduced to the following list comprehension:

[d for d in data if any(word in d['text'] for word in words)]

You can also use set math to eliminate the inner loop entirely:

words = set(['word1','word2'])
for d in data:
   if set(d['text'].split()) & words:
       cleandata.append(d)

This may be more efficient (depending on the actual data, of course).

This can likewise be reduced to a list comprehension:

[d for d in data if set(d['text'].split()) & words]
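
One caveat worth checking before switching: the two approaches are not strictly equivalent. word in d['text'] is a substring test, while split() compares whole whitespace-separated tokens, so punctuation or longer words can make the results differ. A minimal sketch, with sample data invented for illustration:

```python
# Hypothetical sample data, for illustration only.
data = [
    {'text': 'contains word1 exactly'},
    {'text': 'contains word1, with a comma'},
    {'text': 'contains sword1 as a substring'},
    {'text': 'nothing relevant'},
]
words = {'word1', 'word2'}

# Substring test: matches 'word1' anywhere inside the text.
substring_matches = [d for d in data if any(word in d['text'] for word in words)]

# Set intersection: matches only whole whitespace-separated tokens.
token_matches = [d for d in data if set(d['text'].split()) & words]

print(len(substring_matches))  # 3
print(len(token_matches))      # 1
```

Which behaviour is right depends on whether you want 'sword1' or 'word1,' to count as a match.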

Iterate through a list of dictionaries to find a string value match
