How to perform exact string matching in Python

Time: 2016-12-30 02:37:02

Tags: python string compare match

I have a set of words:

words = {'thanks giving', 'cat', 'instead of', ...}

I need to search for these exact words in the 'description' column of a table:

--------------------------------|
ID  | Description               |
--- |---------------------------|
1   | having fun   thanks giving| 
----|---------------------------|
2   |  cat eats all the food    |
----|---------------------------|
3   |  instead you can come     | 
--------------------------------

def matched_words(x, words):
    # Collect every entry of words that occurs anywhere in x (plain substring test).
    match_words = []
    for word in words:
        if word in x:
            match_words.append(word)
    return match_words

df['new_col'] = df['description'].apply(lambda x: matched_words(x, words))

Expected output:

----|---------------------------|-------------------|
ID  | Description               |matched words      |
--- |---------------------------|-------------------|
1   | having fun   thanks giving|['thanks giving']  |
----|---------------------------|------------------ |
2   |  cat eats all the food    |['cat']            |
----|---------------------------|-------------------|
3   |  instead you can come     | []                |
----------------------------------------------------

I only get single-token matches, e.g. ['cat'].

2 answers:

Answer 0: (score: 1)

The following code should give you the desired result:

import re

words = {'thanks', 'cat', 'instead of'}
phrases = [
    [1, "having fun at thanksgiving"],
    [2, "cater the food"],
    [3, "instead you can come"],
    [4, "instead of pizza"],
    [5, "thanks for all the fish"]
]

matched_words = []
matched_pairs = []
for word in words:
    for phrase in phrases:
        # Word boundary, the escaped word, then one non-word character.
        result = re.search(r'\b' + re.escape(word) + r'\W', phrase[1])
        if result:
            # group(0) includes the trailing non-word character matched by \W.
            matched_words.append(result.group(0))
            matched_pairs.append([result.group(0), phrase])

print(matched_words)
print(matched_pairs)

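The relevant part is the regex passed to re.search: r'\b', then the word, then \W. It looks for the search string starting at a word boundary (\b, which matches the empty string at the edge of a word) and ending with a non-word character (\W). This should ensure that only whole-word matches are found, with no other preprocessing of the text being searched. Note that \W has to consume a character, so a word at the very end of the text is not matched, and the matched text includes that trailing character.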
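To make the behaviour of the two endings concrete, here is a small sketch comparing a trailing \W with a trailing \b. The helper names ends_with_nonword and whole_word, and the sample strings, are made up for this illustration; only re.search and re.escape come from the standard library.

```python
import re

def ends_with_nonword(word, text):
    # Hypothetical helper: word boundary, the word, then one non-word character.
    return re.search(r'\b' + re.escape(word) + r'\W', text)

def whole_word(word, text):
    # Hypothetical helper: a word boundary on both sides, consuming no extra character.
    return re.search(r'\b' + re.escape(word) + r'\b', text)

# 'cat' inside 'cater' is rejected by both patterns.
print(ends_with_nonword('cat', 'cater the food'))   # None
print(whole_word('cat', 'cater the food'))          # None

# \W needs a character after the word, so a match at the very end of the text is missed.
print(ends_with_nonword('cat', 'feed the cat'))     # None
print(whole_word('cat', 'feed the cat').group(0))   # cat

# When \W does match, the matched text includes the trailing character.
print(repr(ends_with_nonword('cat', 'cat eats').group(0)))  # 'cat '
```

If the matched text is going to be stored, the \b ending is probably the more convenient of the two, since it returns the word without anything attached.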

Of course, you can use whatever names you like in place of words, phrases, and matched_words.

Hope this helps!
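To tie the regex approach back to the question, the following is a minimal sketch of a matched_words(text, words) helper that could be passed to apply. It is one possible adaptation, not code from either answer. It assumes the word list from the question ('thanks giving', 'cat', 'instead of'), uses \b on both sides, and additionally treats any run of whitespace inside a phrase as equivalent, which is an assumption added here. The pandas line is left as a comment because it relies on a df with a 'description' column.

```python
import re

def matched_words(text, words):
    """Return the entries of words that occur in text as whole words or phrases."""
    found = []
    for word in sorted(words):  # sorted only to make the output order repeatable
        # Allow any run of whitespace wherever the phrase contains a space.
        body = r'\s+'.join(re.escape(part) for part in word.split())
        if re.search(r'\b' + body + r'\b', text):
            found.append(word)
    return found

words = {'thanks giving', 'cat', 'instead of'}
rows = ['having fun   thanks giving', ' cat eats all the food', ' instead you can come']
for row in rows:
    print(matched_words(row, words))
# Prints ['thanks giving'], then ['cat'], then []

# With pandas (assumed to be installed, and df assumed to have a 'description' column):
# df['new_col'] = df['description'].apply(lambda x: matched_words(x, words))
```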

Answer 1: (score: 0)

import re

words = {'thanks', 'cat', 'instead of'}

samples = [
    (1, 'having fun at thanksgiving'),
    (2, 'cater the food'),
    (3, 'instead you can come'),
    (4, 'instead of you can come'),
]

for sample_id, description in samples:
    for word in words:
        # A word boundary on both sides keeps 'cat' from matching inside 'cater'.
        if re.search(r'\b' + re.escape(word) + r'\b', description):
            print("'%s' in '%s'" % (word, description))
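As a possible variation on the loop above, the words can be combined into a single compiled pattern so each description is scanned once. This is only a sketch: the name pattern, the longest-first ordering, and sample 5 are additions made for illustration and are not part of the original answer.

```python
import re

words = {'thanks', 'cat', 'instead of'}

# Longest entries first, so a longer phrase wins over a shorter one that shares its prefix.
alternatives = sorted(words, key=lambda w: (-len(w), w))
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in alternatives) + r')\b')

samples = [
    (1, 'having fun at thanksgiving'),
    (2, 'cater the food'),
    (3, 'instead you can come'),
    (4, 'instead of you can come'),
    (5, 'thanks, the cat is fed'),  # hypothetical extra sample with two matches
]

for sample_id, description in samples:
    # findall returns every whole-word match, in order of appearance.
    print(sample_id, pattern.findall(description))
```

Samples 1 to 3 give empty lists, sample 4 gives ['instead of'], and sample 5 gives ['thanks', 'cat'].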