How do I assign the output of `str.contains` to a Pandas column?

Asked: 2017-01-17 21:08:40

Tags: python pandas

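This must have been answered elsewhere, but I couldn't find a link. I have a df containing some arbitrary text and a list of words W. I want to add a new column to df that holds the word from W that matched. For example, given df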

   T
   dog
   dog and meerkat
   cat

and W = ['dog'], I want

   T
   dog                dog
   dog and meerkat    dog
   cat

What I have so far is

df[df['T'].str.contains('|'.join(W), case=False)]  # df.T is the transpose, so use df['T']

but this only gives me the matching rows, i.e.:

   T
   dog
   dog and meerkat

Any ideas or pointers?

2 answers:

Answer 0 (score: 2)

You can use Series.where; rows that do not match get NaN:

W = ['dog']  # a list of words; '|'.join('dog') would produce 'd|o|g'
df['new'] = df['T'].where(df['T'].str.contains('|'.join(W), case=False))
print (df)
                 T              new
0              dog              dog
1  dog and meerkat  dog and meerkat
2              cat              NaN
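For reference, here is a self-contained sketch of the Series.where approach. The DataFrame construction, the second word in W, and the use of re.escape are my assumptions for illustration, not part of the original answer; escaping simply guards against words that contain regex metacharacters.

```python
import re
import pandas as pd

# Sample data mirroring the question (assumed construction)
df = pd.DataFrame({'T': ['dog', 'dog and meerkat', 'cat']})
W = ['dog', 'rabbit']  # must be a list; joining a plain string splits it into characters

# Escape each word so it is matched literally, then OR them together
pattern = '|'.join(re.escape(w) for w in W)
mask = df['T'].str.contains(pattern, case=False)

# Keep the original text where the mask is True, NaN elsewhere
df['new'] = df['T'].where(mask)
print(df)
```

Note that this copies the whole cell text into the new column, not just the matched word.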

Alternatively, use DataFrame.loc:

W = ['dog']  # a list of words; '|'.join('dog') would produce 'd|o|g'
df.loc[df['T'].str.contains('|'.join(W), case=False), 'new'] = df['T']
print (df)
                 T              new
0              dog              dog
1  dog and meerkat  dog and meerkat
2              cat              NaN

Another possible solution is numpy.where, which lets you supply a value for the rows that do not match:

import numpy as np

W = ['dog']  # a list of words; '|'.join('dog') would produce 'd|o|g'
df['new'] = np.where(df['T'].str.contains('|'.join(W), case=False), df['T'], 'nothing')
print (df)
                 T              new
0              dog              dog
1  dog and meerkat  dog and meerkat
2              cat          nothing

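But if you only need the matched values from the list, use extract, wrapping the pattern in an opening and a closing () so that it forms a capture group: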

W = ['dog', 'rabbit']
df['new'] = df['T'].str.extract('(' + '|'.join(W) + ')', expand=False)  # expand=False returns a Series
print (df)
                 T  new
0              dog  dog
1  dog and meerkat  dog
2              cat  NaN

See "Extracting substrings" in the pandas text-handling docs.
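A slightly more defensive variant of the extract approach might look like the sketch below. The sample rows, re.escape, and flags=re.IGNORECASE are my additions for illustration; the original answer matches case-sensitively and does not escape the words.

```python
import re
import pandas as pd

# Sample data (assumed); mixed case and a row with two candidate words
df = pd.DataFrame({'T': ['Dog', 'dog and meerkat', 'cat', 'rabbit and dog']})
W = ['dog', 'rabbit']

# One capture group containing all words, each escaped to match literally
pattern = '(' + '|'.join(re.escape(w) for w in W) + ')'

# extract returns the first (leftmost) match per row; expand=False gives a Series
df['new'] = df['T'].str.extract(pattern, flags=re.IGNORECASE, expand=False)
print(df)
```

The extracted value is the text as it appears in the row, so 'Dog' stays capitalized, and a row containing several words from W yields only the leftmost one.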

Answer 1 (score: 2)

Thinking outside the box

Take the dot product of the boolean array with an array of the words:

# Indexing a Series with [:, None] is not supported in recent pandas, so convert to NumPy first
mask = df['T'].str.contains('dog').to_numpy()[:, None]  # boolean column vector, shape (n, 1)
df.assign(new=mask.dot(pd.Index(['dog'])))

                    T  new
0                 dog  dog
1     dog and meerkat  dog
2                 cat
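The dot product works here because Python treats bool as an integer when multiplying a string: True * 'dog' is 'dog' and False * 'dog' is the empty string. The sketch below spells out that same per-row logic without the dot product; the DataFrame construction and the name word are assumptions for illustration.

```python
import pandas as pd

# Sample data mirroring the question (assumed construction)
df = pd.DataFrame({'T': ['dog', 'dog and meerkat', 'cat']})
word = 'dog'

# bool is a subclass of int, so string repetition applies
assert True * word == 'dog'
assert False * word == ''

mask = df['T'].str.contains(word)

# Per-row equivalent of the boolean-times-word product
df['new'] = [word if hit else '' for hit in mask]
print(df)
```

Unlike the Series.where and extract versions, non-matching rows end up with an empty string rather than NaN.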