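I have two lists of keywords:

    slangNames = ["Vikes", "Demmies", "D", "MS Contin"]
    riskNames = ["enough", "pop", "final", "stress", "trade"]

I also have a dictionary called overallDict that contains tweets. The key-value pairs are {ID: tweet text}, for example:

    {1: "Vikes is not enough for me", 2: "Demmies is okay", 3: "pop a D"}

I am trying to isolate only those tweets that contain at least one keyword from slangNames and at least one keyword from riskNames. In other words, a tweet must contain any keyword from slangNames as well as any keyword from riskNames. For the example above, my code should return keys 1 and 3, i.e.

    {1: "Vikes is not enough for me", 3: "pop a D"}

But my code is picking up substrings instead of whole words, so basically anything containing the letter 'D' gets picked up. How can I match these as whole words rather than substrings? Please help. Thanks!

My code so far is as follows: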
    for key in overallDict:
        if any(x in overallDict[key] for x in strippedRisks) and (any(x in overallDict[key] for x in strippedSlangs)):
            output.append(key)
Answer 0 (score: 1)
Store slangNames and riskNames as sets, split each string, and check whether any of its words appears in both sets:

    slangNames = set(["Vikes", "Demmies", "D", "MS", "Contin"])
    riskNames = set(["enough", "pop", "final", "stress", "trade"])
    d = {1: "Vikes is not enough for me", 2: "Demmies is okay", 3: "pop a D"}
    for k, v in d.items():
        spl = v.split()  # split once, on whitespace
        if any(word in slangNames for word in spl) and any(word in riskNames for word in spl):
            print(k, v)

Output:

    1 Vikes is not enough for me
    3 pop a D
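Since the question asks for a filtered dictionary rather than printed lines, the same check can be wrapped in a small function. This is only a minimal sketch: the name `filter_tweets` and the snake_case variable names are made up for illustration and are not part of the original code.

```python
def filter_tweets(tweets, slang_words, risk_words):
    """Return {id: text} for tweets with a whole word from each set."""
    result = {}
    for tweet_id, text in tweets.items():
        words = text.split()  # whitespace tokenization, as in the loop above
        has_slang = any(w in slang_words for w in words)
        has_risk = any(w in risk_words for w in words)
        if has_slang and has_risk:
            result[tweet_id] = text
    return result


slang_names = {"Vikes", "Demmies", "D", "MS", "Contin"}
risk_names = {"enough", "pop", "final", "stress", "trade"}
overall_dict = {1: "Vikes is not enough for me", 2: "Demmies is okay", 3: "pop a D"}

filtered = filter_tweets(overall_dict, slang_names, risk_names)
print(filtered)  # {1: 'Vikes is not enough for me', 3: 'pop a D'}
```

Note that `str.split()` only splits on whitespace, so a word with attached punctuation such as "D." would not be found in the set.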
Or using not set.isdisjoint:

    slangNames = set(["Vikes", "Demmies", "D", "MS", "Contin"])
    riskNames = set(["enough", "pop", "final", "stress", "trade"])
    d = {1: "Vikes is not enough for me", 2: "Demmies is okay", 3: "pop a D"}
    for k, v in d.items():
        spl = v.split()
        if not slangNames.isdisjoint(spl) and not riskNames.isdisjoint(spl):
            print(k, v)
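Using any should be the most efficient, because it short-circuits on the first match. Two sets are disjoint if their intersection is the empty set, so if not slangNames.isdisjoint(spl) is True, at least one common word appears.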
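To make the relationship between these checks concrete, here is a small illustration with made-up variable names. It shows that `not isdisjoint(...)`, a non-empty intersection, and `any(...)` agree with each other, and how a substring test differs from a whole-word test.

```python
slang_names = {"Vikes", "Demmies", "D"}
words = "pop a D".split()

# isdisjoint accepts any iterable, so the list from split() works directly.
via_isdisjoint = not slang_names.isdisjoint(words)
via_intersection = bool(slang_names & set(words))
via_any = any(w in slang_names for w in words)
print(via_isdisjoint, via_intersection, via_any)  # True True True

# Substring test versus whole-word test for the single keyword "D".
text = "Do it"
substring_hit = "D" in text                    # True: "D" is part of "Do"
word_hit = not {"D"}.isdisjoint(text.split())  # False: no word equals "D"
print(substring_hit, word_hit)  # True False
```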
If MS Contin is actually a single term, you will also need to catch that:
    import re

    slangNames = set(["Vikes", "Demmies", "D"])
    r = re.compile(r"\bMS Contin\b")  # multi-word term, matched on word boundaries
    riskNames = set(["enough", "pop", "final", "stress", "trade"])
    d = {1: "Vikes is not enough for me", 2: "Demmies is okay", 3: "pop a D"}
    for k, v in d.items():
        spl = v.split()
        if (not slangNames.isdisjoint(spl) or r.search(v)) and not riskNames.isdisjoint(spl):
            print(k, v)
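If there are several multi-word terms, or the tweets contain punctuation, one possible generalization is to compile each keyword list into a single regex with word boundaries. This is a sketch of a different technique from the set lookup above, not part of the original answer: the helper `build_pattern` and tweets 4 and 5 are invented for illustration, and it assumes every keyword starts and ends with a letter or digit so that `\b` behaves as expected.

```python
import re


def build_pattern(terms):
    """Compile the terms into one regex that matches any of them as whole words."""
    # Longest first, so a multi-word term wins over a shorter one sharing a prefix.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")


slang_re = build_pattern(["Vikes", "Demmies", "D", "MS Contin"])
risk_re = build_pattern(["enough", "pop", "final", "stress", "trade"])

tweets = {
    1: "Vikes is not enough for me",
    2: "Demmies is okay",
    3: "pop a D",
    4: "MS Contin, final trade.",  # hypothetical tweet with punctuation
    5: "Don't stress",             # hypothetical: "D" inside "Don't" must not match
}

matches = {k: v for k, v in tweets.items()
           if slang_re.search(v) and risk_re.search(v)}
print(sorted(matches))  # [1, 3, 4]
```

Matching here is case-sensitive, like the set lookup. Passing `re.IGNORECASE` to `re.compile` would relax that, at the cost of "d" also matching as a word.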