I have a DataFrame with several columns of substance names, like this:
import random
import pandas as pd

values = ['oil', 'gas', 'water']
modifier = ['dirty', 'crude', 'fuel']
wordzip = [modifier, values]
# Each cell is a random "<modifier> <value>" pair: 7 rows, one cell per column
data = [[wordzip[0][random.randint(0,2)] + ' ' + wordzip[1][random.randint(0,2)] for c in wordzip[0]] for i in range(7)]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C'])
df
             A            B            C
0    dirty gas    crude oil  dirty water
1  dirty water     fuel gas    dirty gas
2  dirty water     fuel gas    dirty oil
3     fuel oil  crude water    crude gas
4  dirty water     fuel oil  dirty water
5    crude oil   fuel water    dirty oil
6   fuel water    crude gas  crude water
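I want to create a new column that holds, for each row, the value from whichever column contains the word "oil". The final df should look like this: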
             A            B            C          D
0    dirty gas    crude oil  dirty water  crude oil
1  dirty water     fuel gas    dirty gas        NaN
2  dirty water     fuel gas    dirty oil  dirty oil
3     fuel oil  crude water    crude gas   fuel oil
4  dirty water     fuel oil  dirty water   fuel oil
5    crude oil   fuel water    dirty oil  crude oil
6   fuel water    crude gas  crude water        NaN
I tried df[['A', 'B', 'C']].apply(lambda x: x.str.contains('oil')), but that returns a boolean DataFrame rather than the values themselves.
Answer 0 (score: 3)
Let's use stack + extract:
df['D'] = df.stack().str.extract(r'(.* oil)').groupby(level=0).first()[0]
df
             A            B            C          D
0    dirty gas    crude oil  dirty water  crude oil
1  dirty water     fuel gas    dirty gas        NaN
2  dirty water     fuel gas    dirty oil  dirty oil
3     fuel oil  crude water    crude gas   fuel oil
4  dirty water     fuel oil  dirty water   fuel oil
5    crude oil   fuel water    dirty oil  crude oil
6   fuel water    crude gas  crude water        NaN
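To make the one-liner easier to follow, here is a minimal sketch that splits it into its intermediate steps. It assumes the fixed table from the question instead of random data, and the names rows, stacked, matches and first_match are illustrative only.

```python
import pandas as pd

# Fixed copy of the example table from the question (assumed, not random)
rows = [
    ['dirty gas', 'crude oil', 'dirty water'],
    ['dirty water', 'fuel gas', 'dirty gas'],
    ['dirty water', 'fuel gas', 'dirty oil'],
    ['fuel oil', 'crude water', 'crude gas'],
    ['dirty water', 'fuel oil', 'dirty water'],
    ['crude oil', 'fuel water', 'dirty oil'],
    ['fuel water', 'crude gas', 'crude water'],
]
df = pd.DataFrame(rows, columns=['A', 'B', 'C'])

# 1. stack() turns the frame into one long Series indexed by (row, column)
stacked = df.stack()

# 2. extract() keeps the text matching the capture group, NaN elsewhere
matches = stacked.str.extract(r'(.* oil)')

# 3. group by the original row label and take the first non-null match
first_match = matches.groupby(level=0).first()[0]

df['D'] = first_match
print(df)
```

Because first() skips missing values, rows without any "oil" cell end up as NaN in D, and rows with several "oil" cells keep the leftmost one.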
Answer 1 (score: 1)
Something like this:
import pandas as pd
import random
values = ['oil', 'gas', 'water']
modifier = ['dirty', 'crude', 'fuel']
wordzip = [modifier, values]
data = [[wordzip[0][random.randint(0,2)] + ' ' + wordzip[1][random.randint(0,2)] for c in wordzip[0]] for i in range(7)]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C'])
# Keep only the cells that contain 'oil'; every other cell becomes NaN
temp = df[df[['A', 'B', 'C']].apply(lambda x: x.str.contains('oil'))]
# Take the first non-null value per row, checking A, then B, then C
df['D'] = temp.A.combine_first(temp.B).combine_first(temp.C)
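Chaining combine_first by hand gets awkward with more than a few columns. A possible generalization, sketched here under the assumption that the fixed table from the question is used and that cols lists the columns in priority order, folds the chain with functools.reduce. The names rows, cols, mask and temp are illustrative.

```python
from functools import reduce

import pandas as pd

# Fixed copy of the example table from the question (assumed, not random)
rows = [
    ['dirty gas', 'crude oil', 'dirty water'],
    ['dirty water', 'fuel gas', 'dirty gas'],
    ['dirty water', 'fuel gas', 'dirty oil'],
    ['fuel oil', 'crude water', 'crude gas'],
    ['dirty water', 'fuel oil', 'dirty water'],
    ['crude oil', 'fuel water', 'dirty oil'],
    ['fuel water', 'crude gas', 'crude water'],
]
df = pd.DataFrame(rows, columns=['A', 'B', 'C'])

cols = ['A', 'B', 'C']  # columns to search, leftmost has priority

# Boolean mask of cells containing 'oil', then blank out everything else
mask = df[cols].apply(lambda s: s.str.contains('oil'))
temp = df[cols].where(mask)

# Equivalent to temp.A.combine_first(temp.B).combine_first(temp.C)
df['D'] = reduce(lambda acc, c: acc.combine_first(temp[c]), cols[1:], temp[cols[0]])
print(df)
```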
Answer 2 (score: 1)
Use applymap together with bfill:
# bfill(axis=1) pulls each row's first 'oil' value leftward into column A
df[df.applymap(lambda x: 'oil' in x)].bfill(axis=1).loc[:, 'A']
Out[80]:
0          NaN
1          NaN
2     fuel oil
3    crude oil
4    crude oil
5     fuel oil
6          NaN
Name: A, dtype: object
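The output above comes from a different random draw than the table in the question. As a reproducible sketch of the same mask-then-backfill idea, the version below assumes the fixed table from the question and swaps applymap for a column-wise str.contains plus DataFrame.where, since newer pandas releases deprecate applymap in favor of DataFrame.map. The names rows and mask are illustrative.

```python
import pandas as pd

# Fixed copy of the example table from the question (assumed, not random)
rows = [
    ['dirty gas', 'crude oil', 'dirty water'],
    ['dirty water', 'fuel gas', 'dirty gas'],
    ['dirty water', 'fuel gas', 'dirty oil'],
    ['fuel oil', 'crude water', 'crude gas'],
    ['dirty water', 'fuel oil', 'dirty water'],
    ['crude oil', 'fuel water', 'dirty oil'],
    ['fuel water', 'crude gas', 'crude water'],
]
df = pd.DataFrame(rows, columns=['A', 'B', 'C'])

# True where a cell contains 'oil'
mask = df.apply(lambda s: s.str.contains('oil'))

# Blank out non-matching cells, back-fill along each row, and read the
# first column, which now holds the leftmost 'oil' value (or NaN)
df['D'] = df.where(mask).bfill(axis=1).iloc[:, 0]
print(df)
```

Selecting the first column by position with iloc keeps the snippet independent of the column names.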
Answer 3 (score: 0)
This answer concatenates the strings across each row, then uses a regex extract to get the desired output:
# insert temporary columns containing spaces for this regex implementation to work
df.insert(1,'a',' ')
df.insert(3,'b',' ')
# this regex contains a capture group which will get 'oil' instances and the preceding word
df['D'] = df.sum(axis=1).str.extract(r'([a-z]+ oil)', expand=False)
# remove the temporary columns
df.drop(['a', 'b'], axis=1, inplace=True)
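The temporary columns exist only to put a space between neighboring cells. One way to avoid them, sketched here as an alternative rather than as the answer's own method, is to join each row with an explicit separator before extracting. It assumes the fixed table from the question; rows and joined are illustrative names.

```python
import pandas as pd

# Fixed copy of the example table from the question (assumed, not random)
rows = [
    ['dirty gas', 'crude oil', 'dirty water'],
    ['dirty water', 'fuel gas', 'dirty gas'],
    ['dirty water', 'fuel gas', 'dirty oil'],
    ['fuel oil', 'crude water', 'crude gas'],
    ['dirty water', 'fuel oil', 'dirty water'],
    ['crude oil', 'fuel water', 'dirty oil'],
    ['fuel water', 'crude gas', 'crude water'],
]
df = pd.DataFrame(rows, columns=['A', 'B', 'C'])

# Join the cells of each row with a single space, no helper columns needed
joined = df[['A', 'B', 'C']].apply(' '.join, axis=1)

# expand=False returns a Series holding the first "<word> oil" match per row
df['D'] = joined.str.extract(r'([a-z]+ oil)', expand=False)
print(df)
```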