Return all substrings from a list of strings found in a DataFrame column

Asked: 2019-09-15 23:14:53

Tags: python pandas numpy lambda

I need to search a df column and return every substring from a list that occurs in each row.

myList = ['a cat', 'the dog', 'a cow']

example df
'col A'
there was a cat with the dog
the cow was brown
the dog was sick

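This splits each row into single words, so it only ever returns single words and never the multi-word phrases in the list: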

df['col B'] = df['col A'].apply(lambda x: ';'.join([word for word in x.split() if word in (myList)]))

I also tried adding np.any...

df['col B'] = df['col A'].apply(lambda x: ';'.join(np.any(word for word in df['col A'] if word in (myList))))

It needs to return:

'col B'
a cat;the dog
NaN
the dog

2 answers:

Answer 0 (score: 1)

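You can use str.extractall with the list joined into a single alternation pattern, then join the matches for each row: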

s = df.col.str.extractall(f'({"|".join(myList)})')
res = s.groupby(s.index.get_level_values(0))[0].agg(';'.join)
df.loc[res.index, 'new'] = res

                            col            new
0  there was a cat with the dog  a cat;the dog
1             the cow was brown            NaN
2              the dog was sick        the dog
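
For reference, the approach above can be sketched as a self-contained script. This is a minimal sketch, not the answerer's exact code: it assumes the column is named `col` as in the output shown, and it adds `re.escape` (not in the original answer) so that a phrase containing regex metacharacters such as `.` or `(` would still be matched literally.

```python
import re

import pandas as pd

myList = ['a cat', 'the dog', 'a cow']
df = pd.DataFrame({'col': ['there was a cat with the dog',
                           'the cow was brown',
                           'the dog was sick']})

# Build one alternation pattern with a single capture group.
# re.escape is an addition here: it makes each phrase match literally.
pattern = '(' + '|'.join(re.escape(p) for p in myList) + ')'

# extractall returns one row per match, indexed by (original row, match number).
matches = df['col'].str.extractall(pattern)

# Join the matches belonging to each original row (index level 0).
res = matches.groupby(level=0)[0].agg(';'.join)

# Assignment aligns on the index, so rows without a match are left as NaN.
df['new'] = res
print(df)
```

With this approach the phrases are joined in the order they appear in the text, not in the order of `myList`.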

Answer 1 (score: 0)

This should work; you were close:

import numpy as np

df['col B'] = df['col A'].apply(lambda x: ';'.join([m for m in myList if m in x])).replace('',np.nan)

Result:

                          col A          col B
0  there was a cat with the dog  a cat;the dog
1             the cow was brown            NaN
2              the dog was sick        the dog
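
The same idea can be written out as a runnable sketch with a named helper. The helper name `find_phrases` is hypothetical and only used for illustration; the logic is the plain substring test from the one-liner above, returning NaN directly instead of replacing empty strings afterwards.

```python
import numpy as np
import pandas as pd

myList = ['a cat', 'the dog', 'a cow']
df = pd.DataFrame({'col A': ['there was a cat with the dog',
                             'the cow was brown',
                             'the dog was sick']})


def find_phrases(text, phrases):
    """Return the phrases contained in text, joined by ';', or NaN if none."""
    # Plain substring test: loop over the phrases, not over the words in text.
    hits = [p for p in phrases if p in text]
    return ';'.join(hits) if hits else np.nan


df['col B'] = df['col A'].apply(lambda text: find_phrases(text, myList))
print(df)
```

Two caveats with this sketch: the hits come out in the order of `myList` rather than the order they occur in the text, and because `in` is a raw substring test, 'a cat' would also be found inside a string such as 'a caterpillar'.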