I have a list of strings to search for.
strings = ['Tea','Baseball','Onus']
My DataFrame is
itemid desc
0 101 tea leaves
1 201 baseball gloves
2 221 teas leaves from Onus Green Tea Co.
I want to get something like this, with partial matches ignored:
itemid desc matches
0 101 tea leaves [Tea]
1 201 baseball gloves [Baseball]
2 221 teas leaves from Onus Green Tea Co. [Tea, Onus]
This is what I am doing:
import re
df['desc'] = df.desc.str.split(' ')
df['desc'].str.findall('|'.join(strings),flags=re.IGNORECASE)
But this gives me a Series of tuples containing only empty strings:
0 [(, , , , , ), (, , , , , ), (, , , , , )]
1 [(, , , , , ), (, , , , , ), (, , , , , )]
2 [(, , , , , ), (, , , , , ), (, , , , , )]
Please help me fix this.
Edit: I don't want partial matches. I have updated the example to reflect this.
Answer 0 (score: 4)
You don't need to split the desc column.
import re
import pandas as pd
strings = ['Tea','Baseball','Onus']
df = pd.DataFrame({"desc": ['tea leaves', 'baseball gloves', 'tea leaves from Onus Green Tea Co.']})
df['matches'] = df['desc'].str.findall('|'.join(strings),flags=re.IGNORECASE)
print(df['matches'])
Output:
0 [tea]
1 [baseball]
2 [tea, Onus, Tea]
Name: matches, dtype: object
Answer 1 (score: 1)
Try using contains with a regex alternation:
strings = ['Tea','Baseball','Onus']
rgx = '\\b(?:' + '|'.join(strings) + ')\\b'
# case=False is needed so that 'Tea' also matches 'tea'
df[df.desc.str.contains(rgx, case=False, regex=True, na=False)]
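A minimal, self-contained sketch of this filtering approach, assuming a case-insensitive search is wanted. The fourth row ('teaspoon set') is a hypothetical addition, not part of the question's data, included only to show a partial match being rejected by the word boundaries.

```python
import pandas as pd

strings = ['Tea', 'Baseball', 'Onus']
df = pd.DataFrame({
    'itemid': [101, 201, 221, 301],
    'desc': ['tea leaves',
             'baseball gloves',
             'teas leaves from Onus Green Tea Co.',
             'teaspoon set'],  # hypothetical row: partial match only
})

# The non-capturing group (?:...) avoids the pandas warning about match groups
rgx = r'\b(?:' + '|'.join(strings) + r')\b'

# case=False ignores case; na=False treats missing values as non-matches
mask = df['desc'].str.contains(rgx, case=False, regex=True, na=False)
print(df[mask])
```

Note that contains only filters rows; it does not tell you which terms matched, so the findall-based answers are closer to the requested output.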
Answer 2 (score: 1)
We can use Series.str.findall with the inline ignore-case regex flag (?i), so that we don't have to import re:
df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')
itemid desc Matches
0 101 tea leaves [tea]
1 201 baseball gloves [baseball]
2 221 tea leaves from Onus Green Tea Co. [tea, Onus, Tea]
To remove duplicates, we convert the matches to uppercase and build a set:
df['Matches'] = (
df['desc'].str.findall(f'(?i)({"|".join(strings)})')
.apply(lambda x: list(set(map(str.upper, x))))
)
itemid desc Matches
0 101 tea leaves [TEA]
1 201 baseball gloves [BASEBALL]
2 221 tea leaves from Onus Green Tea Co. [TEA, ONUS]
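One caveat: a set does not guarantee element order, so the lists may not always come out as shown. A possible variant, sketched here as an alternative rather than as part of the answer, uses dict.fromkeys to drop duplicates while keeping the order of first occurrence:

```python
import pandas as pd

strings = ['Tea', 'Baseball', 'Onus']
df = pd.DataFrame({'desc': ['tea leaves',
                            'baseball gloves',
                            'tea leaves from Onus Green Tea Co.']})

found = df['desc'].str.findall(f'(?i)({"|".join(strings)})')

# dict.fromkeys keeps the first occurrence of each key, in insertion order
df['Matches'] = found.apply(lambda x: list(dict.fromkeys(s.upper() for s in x)))
print(df)
```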
To exclude partial matches, we can use the word boundary \b:
strings = ['\\b' + f + '\\b' for f in strings]
df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')
itemid desc Matches
0 101 tea leaves [tea]
1 201 baseball gloves [baseball]
2 221 teas leaves from Onus Green Tea Co. [Onus, Tea]
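If the matches should be reported exactly as spelled in strings and in that list's order, as in the question's desired output, one possible sketch is to test each term separately. The helper name find_terms is hypothetical, and re.escape is added on the assumption that a search term might contain regex metacharacters.

```python
import re
import pandas as pd

strings = ['Tea', 'Baseball', 'Onus']
df = pd.DataFrame({
    'itemid': [101, 201, 221],
    'desc': ['tea leaves',
             'baseball gloves',
             'teas leaves from Onus Green Tea Co.'],
})

# One compiled whole-word, case-insensitive pattern per search term
patterns = {s: re.compile(r'\b' + re.escape(s) + r'\b', re.IGNORECASE)
            for s in strings}

def find_terms(text):
    """Return the terms from `strings` that occur as whole words in text."""
    return [s for s, p in patterns.items() if p.search(text)]

df['matches'] = df['desc'].apply(find_terms)
print(df)
```

Each term appears at most once per row, so no separate deduplication step is needed. The trade-off is one regex search per term per row instead of a single findall call.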