I have a set of words:
words = {'thanks giving', 'cat', 'instead of', etc...}
I need to search for these exact words in the table column 'description'.
----|--------------------------|
ID  | Description              |
----|--------------------------|
1   | having fun thanks giving |
----|--------------------------|
2   | cat eats all the food    |
----|--------------------------|
3   | instead you can come     |
----|--------------------------|
def matched_words(x, words):
    match_words = []
    for word in words:
        if word in x:
            match_words.append(word)
    return match_words

df['new_col'] = df['description'].apply(lambda x: matched_words(x, words))
Expected output:
----|--------------------------|-------------------|
ID  | Description              | matched words     |
----|--------------------------|-------------------|
1   | having fun thanks giving | ['thanks giving'] |
----|--------------------------|-------------------|
2   | cat eats all the food    | ['cat']           |
----|--------------------------|-------------------|
3   | instead you can come     | []                |
----|--------------------------|-------------------|
I only get single-token matches, e.g. ['cat'].
Answer 0 (score: 1)
The following code should give you the desired result:
import re

words = {'thanks', 'cat', 'instead of'}
phrases = [
    [1, "having fun at thanksgiving"],
    [2, "cater the food"],
    [3, "instead you can come"],
    [4, "instead of pizza"],
    [5, "thanks for all the fish"]
]
matched_words = []
matched_pairs = []
for word in words:
    for phrase in phrases:
        # \b anchors the start of the word; \W requires a non-word character after it
        result = re.search(r'\b' + word + r'\W', phrase[1])
        if result:
            matched_words.append(result.group(0))
            matched_pairs.append([result.group(0), phrase])

print()
print(matched_words)
print(matched_pairs)
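The relevant part, i.e. the regex bit re.search(r'\b' + word + r'\W', phrase[1]), looks for the search string starting at a word boundary \b (the empty string at the edge of a word) and ending with a non-word character \W. This should ensure that we only find whole-word matches. There is no need to do anything else to the text being searched.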
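One thing worth checking with this pattern is what happens when the word is the last thing in the text. The sketch below is only an illustration, not part of the original answer: the sample string and the variable names are made up, and re.escape is added as a precaution in case a search term contains regex metacharacters. It compares a trailing \W, which has to consume a real character, with a trailing \b, which is zero-width.

```python
import re

# Hypothetical sample: the search word sits at the very end of the text.
text = "I saw a cat"
word = "cat"

# A trailing \W needs an actual non-word character after the word,
# so nothing is found when the word ends the string.
ends_with_nonword = re.search(r'\b' + re.escape(word) + r'\W', text)

# A trailing \b is zero-width and also matches at the end of the string.
ends_with_boundary = re.search(r'\b' + re.escape(word) + r'\b', text)

print(ends_with_nonword)            # None
print(ends_with_boundary.group(0))  # cat
```

When the \W variant does match, group(0) also includes the trailing non-word character (for example a space), which may or may not be what you want in the result list.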
Of course, you can use whatever names you like instead of words, phrases, matched_words and matched_pairs.
Hope this helps!
Answer 1 (score: 0)
import re

words = {'thanks', 'cat', 'instead of'}
samples = [
    (1, 'having fun at thanksgiving'),
    (2, 'cater the food'),
    (3, 'instead you can come'),
    (4, 'instead of you can come'),
]
for sample_id, description in samples:
    for word in words:
        # \b on both sides matches the word only as a whole word or phrase
        if re.search(r'\b' + word + r'\b', description):
            print("'%s' in '%s'" % (word, description))
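To get the list-valued column asked for in the question, the same \b...\b check could be wrapped in a function that returns the matches for one description. This is a minimal sketch rather than a definitive implementation: the words set uses only the three entries spelled out in the question, re.escape and sorted are my additions (to guard against regex metacharacters and to keep the output order stable), and the descriptions list stands in for the DataFrame column.

```python
import re

def matched_words(text, words):
    """Return the entries of `words` that occur in `text` as whole words or phrases."""
    return [
        word for word in sorted(words)
        if re.search(r'\b' + re.escape(word) + r'\b', text)
    ]

# Assumed sample data, taken from the question's table.
words = {'thanks giving', 'cat', 'instead of'}
descriptions = [
    'having fun thanks giving',
    'cat eats all the food',
    'instead you can come',
]

results = [matched_words(d, words) for d in descriptions]
print(results)  # [['thanks giving'], ['cat'], []]

# With pandas, the function could be applied to the column as in the question:
# df['new_col'] = df['description'].apply(lambda x: matched_words(x, words))
```

Because the function takes a single string, it can be passed to apply unchanged, and multi-word entries such as 'thanks giving' are matched as a phrase rather than token by token.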