I have a dictionary and a list like this:
key_labels = {'countries': ['usa','france','japan','china','germany'],
'fruits': ['mango', 'apple', 'passion-fruit', 'durion', 'bananna']}
docs = ["mango is a fruit that is very different from apple",
"i like to travel, last year i was in germany but i like france.it was lovely",
"mango bananna and apple are my favourite",
"apples are grown in USA",
"fruits have the best nutrients, particularly apple and mango",
"usa and germany were both in the race last year"]
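What I want to do is check whether the strings in docs contain any of the keywords (the values) in key_labels. If a keyword is present, the sentence should be assigned a label, which is the corresponding key in key_labels. I can do that like this: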
temp = []
for s in docs:
    for k, l in key_labels.items():
        for w in l:
            # case-insensitive substring match, so 'apple' also matches 'apples'
            if w in s.lower():
                temp.append({s: k})
The output looks like this:
#temp
[{'mango is a fruit that is very different from apple': 'fruits'},
{'mango is a fruit that is very different from apple': 'fruits'},
{'i like to travel, last year i was in germany but i like france.it was lovely': 'countries'},
{'i like to travel, last year i was in germany but i like france.it was lovely': 'countries'},
{'mango bananna and apple are my favourite': 'fruits'},
{'mango bananna and apple are my favourite': 'fruits'},
{'mango bananna and apple are my favourite': 'fruits'},
{'apples are grown in USA': 'countries'},
{'apples are grown in USA': 'fruits'},
{'fruits have the best nutrients, particularly apple and mango': 'fruits'},
{'fruits have the best nutrients, particularly apple and mango': 'fruits'},
{'usa and germany were both in the race last year': 'countries'},
{'usa and germany were both in the race last year': 'countries'}]
As you can see, the same sentence gets a label appended once for every keyword detected in it.
The output I want instead looks like this:
{"mango is a fruit that is very different from apple": {"fruits": 2},
"i like to travel, last year i was in germany but i like france.it was lovely": {"countries": 2},
"mango bananna and apple are my favourite": {"fruits": 3},
"apples are grown in USA": {"fruits": 1, "countries": 1},
"fruits have the best nutrients, particularly apple and mango": {"fruits": 2},
"usa and germany were both in the race last year": {"countries": 2}}
How can I modify my code to achieve this?
Answer 0 (score: 3)
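You can make temp a dictionary and use the dict.setdefault and dict.get methods to supply default values for the outer dict and the inner dict. Python evaluates the right-hand side of the assignment first, so setdefault has already created temp[s] by the time temp[s][k] is assigned: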
temp = {}
for s in docs:
    for k, l in key_labels.items():
        for w in l:
            if w in s.lower():
                # setdefault creates temp[s] on first use; get supplies 0 for a new label
                temp[s][k] = temp.setdefault(s, {}).get(k, 0) + 1
print(temp)
This outputs:
{'mango is a fruit that is very different from apple': {'fruits': 2}, 'i like to travel, last year i was in germany but i like france.it was lovely': {'countries': 2}, 'mango bananna and apple are my favourite': {'fruits': 3}, 'apples are grown in USA': {'countries': 1, 'fruits': 1}, 'fruits have the best nutrients, particularly apple and mango': {'fruits': 2}, 'usa and germany were both in the race last year': {'countries': 2}}
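If the one-line setdefault/get expression is hard to read, the same counting could also be sketched with collections.defaultdict and collections.Counter from the standard library. This is only an alternative under the same assumptions as above (case-insensitive substring matching); the function name label_counts and the variable counts are made up for illustration and are not part of the original code.

```python
from collections import Counter, defaultdict

key_labels = {'countries': ['usa', 'france', 'japan', 'china', 'germany'],
              'fruits': ['mango', 'apple', 'passion-fruit', 'durion', 'bananna']}
docs = ["mango is a fruit that is very different from apple",
        "i like to travel, last year i was in germany but i like france.it was lovely",
        "mango bananna and apple are my favourite",
        "apples are grown in USA",
        "fruits have the best nutrients, particularly apple and mango",
        "usa and germany were both in the race last year"]


def label_counts(docs, key_labels):
    """Hypothetical helper: count, per sentence, how many keywords of each label occur."""
    # A missing sentence gets an empty Counter; a missing label inside it starts at 0.
    counts = defaultdict(Counter)
    for s in docs:
        lowered = s.lower()  # lowercase once per sentence instead of once per keyword
        for label, keywords in key_labels.items():
            for w in keywords:
                if w in lowered:  # substring match, same as the original loop
                    counts[s][label] += 1
    # Convert to plain dicts so the result prints like the output above.
    return {s: dict(c) for s, c in counts.items()}


result = label_counts(docs, key_labels)
print(result["apples are grown in USA"])
```

As with the setdefault version, a sentence that contains no keyword at all does not appear in the result.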