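This must have been answered elsewhere, but I can't find a link. I have a df containing some arbitrary text and a list of words W. I want to assign a new column to df that holds the word from W that matched. For example, given df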
T
dog
dog and meerkat
cat
If W = "dog", then I would like
T                new
dog              dog
dog and meerkat  dog
cat
What I have so far is
df[df['T'].str.contains('|'.join(W), case=False)]
but this only gives me the rows that match, i.e.:
T
dog
dog and meerkat
Any ideas or pointers?
Answer 0 (score: 2)
You can use Series.where, which sets the rows that do not match to NaN:
W = ['dog']
df['new'] = df['T'].where(df['T'].str.contains('|'.join(W), case=False))
print(df)
                 T              new
0              dog              dog
1  dog and meerkat  dog and meerkat
2              cat              NaN
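A caveat about the pattern: `'|'.join(W)` joins whatever it iterates over, so a plain string is split into single characters. The sketch below illustrates the difference; the extra row `'good'` is a made-up example added only to expose the problem.

```python
import pandas as pd

# Sample frame; the 'good' row is hypothetical and only there to show the pitfall.
df = pd.DataFrame({'T': ['dog', 'dog and meerkat', 'cat', 'good']})

# Joining a plain string iterates over its characters.
pattern_from_str = '|'.join('dog')      # 'd|o|g'
# Joining a list keeps each word intact.
pattern_from_list = '|'.join(['dog'])   # 'dog'

# 'good' contains the letters d, o and g, but not the word 'dog'.
mask_str = df['T'].str.contains(pattern_from_str, case=False)
mask_list = df['T'].str.contains(pattern_from_list, case=False)

# Keep the text only where the whole word matched.
df['new'] = df['T'].where(mask_list)
print(df)
```

With the string version the `'good'` row is treated as a match; with the list version it is not, which is why W is best kept as a list even when it holds a single word.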
Alternatively, assign through a boolean mask with df.loc:
W = ['dog']
df.loc[df['T'].str.contains('|'.join(W), case=False), 'new'] = df['T']
print(df)
                 T              new
0              dog              dog
1  dog and meerkat  dog and meerkat
2              cat              NaN
Another possible solution is numpy.where, which lets you set a value for the rows that do not match:
import numpy as np
W = ['dog']
df['new'] = np.where(df['T'].str.contains('|'.join(W), case=False), df['T'], 'nothing')
print(df)
                 T              new
0              dog              dog
1  dog and meerkat  dog and meerkat
2              cat          nothing
But if you only need the matched values from the list, use str.extract and wrap the pattern in () so that it forms a capture group:
W = ['dog', 'rabbit']
df['new'] = df['T'].str.extract('(' + '|'.join(W) + ')', expand=False)
print(df)
                 T  new
0              dog  dog
1  dog and meerkat  dog
2              cat  NaN
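str.extract returns only the first match in each row. If a row can contain several words from W, one possible variation is str.findall, sketched below. The `re.escape` call, the `\b` word boundaries, the case-insensitive flag and the extra sample row are assumptions added for illustration, not part of the answer above.

```python
import re
import pandas as pd

# Sample frame; the last row is hypothetical and contains two words from W.
df = pd.DataFrame({'T': ['dog', 'dog and meerkat', 'cat', 'Rabbit chases dog']})
W = ['dog', 'rabbit']

# Escape each word so regex metacharacters are matched literally, and wrap
# the alternation in \b so that 'dog' does not match inside 'dogma'.
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in W) + r')\b'

# findall returns a list of matches per row; join each list with a space.
matches = df['T'].str.findall(pattern, flags=re.IGNORECASE)
df['new'] = matches.str.join(' ')
print(df)
```

Rows without a match end up with an empty string rather than NaN, and matches keep the casing found in the text.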
Answer 1 (score: 2)
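Thinking outside the box:
take the dot product of a boolean array with an array of words. True times a string gives the string, and False gives an empty string.
df['T'].str.contains('dog').to_numpy()[:, None].dot(pd.Index(['dog']))
df.assign(new=df['T'].str.contains('dog').to_numpy()[:, None].dot(pd.Index(['dog'])))
                 T  new
0              dog  dog
1  dog and meerkat  dog
2              cat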
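The same dot-product idea can be extended to several words at once. The following is a minimal sketch under a few assumptions: the second word `'meerkat'`, the trailing-space separator and the literal (non-regex) matching are choices made here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'T': ['dog', 'dog and meerkat', 'cat']})
W = ['dog', 'meerkat']  # hypothetical word list for this sketch

# One boolean column per word: does the row contain that word?
mask = np.column_stack(
    [df['T'].str.contains(w, case=False, regex=False).to_numpy() for w in W]
)

# Words as an object array, each with a trailing space as a separator.
words = np.array([w + ' ' for w in W], dtype=object)

# True * 'dog ' == 'dog ' and False * 'dog ' == '', so the dot product
# concatenates the words whose mask entry is True.
joined = mask.astype(object).dot(words)
df['new'] = pd.Series(joined, index=df.index).str.strip()
print(df)
```

Each row receives every matching word in the order given by W, and rows with no match receive an empty string.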