How do I match words from two lists against a string of words in Python without matching substrings?

Asked: 2015-06-22 18:50:26

Tags: python regex string loops twitter

I have two lists containing keywords:

slangNames = ["Vikes", "Demmies", "D", "MS Contin"]
riskNames = ["enough", "pop", "final", "stress", "trade"]

I also have a dictionary called overallDict that contains tweets. The key-value pairs are {ID: Tweet text}, for example:

{1:"Vikes is not enough for me", 2:"Demmies is okay", 3:"pop a D"}

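I am trying to isolate only those tweets that contain at least one keyword from both slangNames and riskNames. In other words, a tweet must contain any keyword from slangNames as well as any keyword from riskNames. So from the example above, my code should return keys 1 and 3, i.e.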

{1:"Vikes is not enough for me", 3:"pop a D"}. 

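But my code is picking up substrings instead of whole words. So basically, anything containing the letter 'D' gets picked up. How do I match these as whole words rather than as substrings? Please help. Thanks!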

My code so far is as follows:

for key in overallDict:
    if any(x in overallDict[key] for x in strippedRisks) and (any(x in overallDict[key] for x in strippedSlangs)):
        output.append(key)

1 answer:

Answer 0: (score: 1)

Store slangNames and riskNames as sets, split each string, and check whether any of its words appear in both sets:

slangNames = set(["Vikes", "Demmies", "D", "MS", "Contin"])
riskNames = set(["enough", "pop", "final", "stress", "trade"])
d =  {1: "Vikes is not enough for me", 2:"Demmies is okay", 3:"pop a D"}

for k,v in d.items():
    spl = v.split()  # split once and reuse the word list for both checks
    if any(word in slangNames for word in spl) and any(word in riskNames for word in spl):
        print(k,v)

Output:

1 Vikes is not enough for me
3 pop a D

Or use not set.isdisjoint:

slangNames = set(["Vikes", "Demmies", "D", "MS", "Contin"])
riskNames = set(["enough", "pop", "final", "stress", "trade"])
d =  {1: "Vikes is not enough for me", 2:"Demmies is okay", 3:"pop a D"}

for k,v in d.items():
    spl = v.split()
    if not slangNames.isdisjoint(spl) and not riskNames.isdisjoint(spl):
        print(k, v)

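Using any should be the most efficient, since it short-circuits on the first match. Two sets are disjoint if their intersection is the empty set, so if not slangNames.isdisjoint(spl) is true, at least one word is common to both.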
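As a small illustration of the point above, the following sketch shows that not isdisjoint gives the same answer as testing for a non-empty intersection. The names slang_names, words, shares_word, same_answer and no_overlap are made up for this example and are not part of the original code.

```python
# Hypothetical names, used only for illustration.
slang_names = {"Vikes", "Demmies", "D"}
words = "pop a D".split()

# isdisjoint accepts any iterable and returns True only when no element is shared.
shares_word = not slang_names.isdisjoint(words)

# Equivalent check: the intersection of the two collections is non-empty.
same_answer = bool(slang_names & set(words))

# A tweet with no slang keyword is disjoint from the set.
no_overlap = slang_names.isdisjoint("nothing to see here".split())

print(shares_word, same_answer, no_overlap)
```

Note that "D" matches here only because split() produced it as a separate word; a word such as "Demo" would not match the keyword "D".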

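If MS Contin is actually a single keyword, you also need to handle it separately, because split() would break it into two words: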

import re
slangNames = set(["Vikes", "Demmies", "D"])
r = re.compile(r"\bMS Contin\b")
riskNames = set(["enough", "pop", "final", "stress", "trade"])
d =  {1: "Vikes is not enough for me", 2:"Demmies is okay", 3:"pop a D"}

for k,v in d.items():
    spl = v.split()
    if (not slangNames.isdisjoint(spl) or r.search(v)) and not riskNames.isdisjoint(spl):
        print(k,v)
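If there are several multi-word keywords, or the tweets contain punctuation (for example "pop a D."), one possible generalization is to compile each keyword list into a single regular expression with word boundaries. This is only a sketch: build_pattern and filter_tweets are hypothetical helper names, the extra tweets are invented test data, and it assumes every keyword starts and ends with a letter or digit so that \b behaves as expected.

```python
import re

def build_pattern(keywords):
    # Hypothetical helper: one alternation of all keywords, wrapped in \b
    # so that only whole words (or whole phrases) match.
    # Longer keywords go first so a phrase is tried before any shorter keyword.
    ordered = sorted(keywords, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b")

def filter_tweets(tweets, slang, risk):
    # Keep tweets that contain at least one keyword from each list.
    slang_re = build_pattern(slang)
    risk_re = build_pattern(risk)
    return {k: v for k, v in tweets.items()
            if slang_re.search(v) and risk_re.search(v)}

slangNames = ["Vikes", "Demmies", "D", "MS Contin"]
riskNames = ["enough", "pop", "final", "stress", "trade"]
d = {
    1: "Vikes is not enough for me",
    2: "Demmies is okay",
    3: "pop a D",
    4: "MS Contin under stress",   # invented example with a two-word keyword
    5: "Don't trade it",           # invented example: "D" is only a substring
    6: "pop a D.",                 # invented example with trailing punctuation
}

result = filter_tweets(d, slangNames, riskNames)
for k, v in result.items():
    print(k, v)
```

The matching here is case-sensitive, like the set-based versions above; passing re.IGNORECASE to re.compile would change that, at the cost of "d" matching as a standalone word too.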