Searching for a list of strings in a pandas column

Time: 2019-11-25 12:52:30

Tags: python pandas

I have a list of strings that I want to search for.

strings = ['Tea','Baseball','Onus']

My data frame is

   itemid   desc
0  101      tea leaves
1  201      baseball gloves
2  221      teas leaves from Onus Green Tea Co.

I want to get something like this, matching whole words only (no partial matches)

   itemid   desc                                 matches
0  101      tea leaves                           [Tea]
1  201      baseball gloves                      [Baseball]
2  221      teas leaves from Onus Green Tea Co.   [Tea, Onus]

This is what I am doing

import re
df['desc'] = df.desc.str.split(' ')
df['desc'].str.findall('|'.join(strings),flags=re.IGNORECASE)

But this gives me a series of empty tuples separated by commas

0     [(, , , , , ), (, , , , , ), (, , , , , )]
1     [(, , , , , ), (, , , , , ), (, , , , , )]
2     [(, , , , , ), (, , , , , ), (, , , , , )]

Please help me solve this.

Edit: I don't want partial matches. I have updated the example to reflect this.

3 answers:

Answer 0: (score: 4)

You don't need to split the desc column.

import re
import pandas as pd
strings = ['Tea','Baseball','Onus']     
df = pd.DataFrame({"desc": ['tea leaves', 'baseball gloves', 'tea leaves from Onus Green Tea Co.']})
df['matches'] = df['desc'].str.findall('|'.join(strings),flags=re.IGNORECASE)
print(df['matches'])

Output:

0               [tea]
1          [baseball]
2    [tea, Onus, Tea]
Name: matches, dtype: object

Answer 1: (score: 1)

Try using str.contains with a regex alternation:

strings = ['Tea','Baseball','Onus']
rgx = '\\b(?:' + '|'.join(strings) + ')\\b'
# case=False makes 'Tea' match 'tea', as the question requires
df[df.desc.str.contains(rgx, case=False, regex=True, na=False)]
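
Note that str.contains returns a boolean mask, so this approach filters rows rather than listing which words matched. The following is a minimal sketch of that behaviour, assuming the data frame from the question plus one extra row ('teaspoon set') that is hypothetical and added only to show a non-match:

```python
import pandas as pd

strings = ['Tea', 'Baseball', 'Onus']
df = pd.DataFrame({
    'itemid': [101, 201, 221, 301],
    'desc': ['tea leaves', 'baseball gloves',
             'teas leaves from Onus Green Tea Co.',
             'teaspoon set'],  # hypothetical row, not in the question
})

# Non-capturing group wrapped in word boundaries: whole words only.
rgx = r'\b(?:' + '|'.join(strings) + r')\b'

# Boolean mask: True where at least one whole word matches, ignoring case.
mask = df['desc'].str.contains(rgx, case=False, regex=True, na=False)

# Rows that contain a match (a filter, not a list of matched words).
filtered = df[mask]
print(filtered)
```

If the list of matched words is needed, as in the question, str.findall is probably the better fit.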

Answer 2: (score: 1)

We can use Series.str.findall with the inline ignore-case regex flag (?i), so that we don't need import re

df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')

   itemid                                desc           Matches
0     101                          tea leaves             [tea]
1     201                     baseball gloves        [baseball]
2     221  tea leaves from Onus Green Tea Co.  [tea, Onus, Tea]

To remove duplicates, we convert the matched strings to uppercase and build a set

df['Matches'] = (
    df['desc'].str.findall(f'(?i)({"|".join(strings)})')
    .apply(lambda x: list(set(map(str.upper, x))))
)
   itemid                                desc      Matches
0     101                          tea leaves        [TEA]
1     201                     baseball gloves   [BASEBALL]
2     221  tea leaves from Onus Green Tea Co.  [TEA, ONUS]
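
One caveat: a set has no guaranteed order, so the order of the words inside each list may vary. If the order of first appearance matters, a sketch along these lines might help. It assumes the same data as above, and the helper name unique_upper is made up for illustration:

```python
import pandas as pd

strings = ['Tea', 'Baseball', 'Onus']
df = pd.DataFrame({'desc': ['tea leaves', 'baseball gloves',
                            'tea leaves from Onus Green Tea Co.']})

# Inline (?i) flag at the start of the pattern: case-insensitive matching.
pattern = '(?i)(?:' + '|'.join(strings) + ')'

def unique_upper(found):
    # Hypothetical helper: dict.fromkeys keeps the first occurrence
    # of each key, in insertion order.
    return list(dict.fromkeys(word.upper() for word in found))

df['Matches'] = df['desc'].str.findall(pattern).apply(unique_upper)
print(df)
```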

Edit regarding partial matches

For this, we can use the word boundary \b

strings = ['\\b' + f + '\\b' for f in strings]

df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')
   itemid                                 desc      Matches
0     101                           tea leaves        [tea]
1     201                      baseball gloves   [baseball]
2     221  teas leaves from Onus Green Tea Co.  [Onus, Tea]
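
Putting the pieces together, here is one possible end-to-end sketch that reproduces the output asked for in the question, with each term spelled as in strings and listed in the order of that list. The helper name canonical_matches is hypothetical, and re.escape is an addition not used in the answers above; it only matters if a search term contains regex metacharacters:

```python
import re
import pandas as pd

strings = ['Tea', 'Baseball', 'Onus']
df = pd.DataFrame({
    'itemid': [101, 201, 221],
    'desc': ['tea leaves', 'baseball gloves',
             'teas leaves from Onus Green Tea Co.'],
})

# Whole-word alternation; re.escape guards against regex metacharacters.
pattern = r'\b(?:' + '|'.join(re.escape(s) for s in strings) + r')\b'

def canonical_matches(found):
    # Hypothetical helper: compare case-insensitively, then report each
    # term as spelled in `strings`, once, in the order of `strings`.
    seen = {word.lower() for word in found}
    return [s for s in strings if s.lower() in seen]

df['matches'] = (
    df['desc']
    .str.findall(pattern, flags=re.IGNORECASE)
    .apply(canonical_matches)
)
print(df)
```

Because of the word boundaries, 'teas' in the last row is not counted as a match for 'Tea'; only the standalone 'Tea' in 'Green Tea Co.' is.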