My DataFrame looks like this:
id k1 k2 same
1 re_setup oo_setup true
2 oo_setup oo_setup true
3 alerting bounce false
4 bounce re_oversetup false
5 re_oversetup alerting false
6 alerting_s re_setup false
7 re_oversetup oo_setup true
8 alerting bounce false
So I need to classify the rows by whether or not they contain the string 'setup'.
A simple expected output would be:
id k1 k2 same
1 re_setup oo_setup true
2 oo_setup oo_setup true
3 alerting bounce false
4 bounce re_setup false
5 re_setup alerting false
6 alerting_s re_setup false
7 re_setup oo_setup true
8 alerting bounce false
I have tried the following, but I get an error when I select multiple columns in the expression.
data['same'] = data[data['k1', 'k2'].str.contains('setup')==True]
Answer 0 (score: 2)
I think you need apply with str.contains, because str.contains works only on a Series (a single column):
print (data[['k1', 'k2']].apply(lambda x: x.str.contains('setup')))
k1 k2
0 True True
1 True True
2 False False
3 False True
4 True False
5 False True
6 True True
7 False False
Then add DataFrame.all to check whether all values in each row are True:
data['same'] = data[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).all(axis=1)
print (data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_setup False
4 5 re_setup alerting False
5 6 alerting_s re_setup False
6 7 re_setup oo_setup True
7 8 alerting bounce False
Or use DataFrame.any to check for at least one True per row:
data['same'] = data[['k1', 'k2']].applymap(lambda x: 'setup' in x).any(axis=1)
print (data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_setup True
4 5 re_setup alerting True
5 6 alerting_s re_setup True
6 7 re_setup oo_setup True
7 8 alerting bounce False
Another solution, using applymap for an element-wise check:
data['same'] = data[['k1', 'k2']].applymap(lambda x: 'setup' in x).all(axis=1)
print (data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_setup False
4 5 re_setup alerting False
5 6 alerting_s re_setup False
6 7 re_setup oo_setup True
7 8 alerting bounce False
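One caveat with the element-wise check: `'setup' in x` raises a TypeError when a cell is NaN or None, and `str.contains` leaves such cells missing by default. The sketch below assumes that missing values should count as "does not contain", and uses the `na=False` and `regex=False` parameters of `str.contains`. The three-row frame and the names `mask`, `all_match` and `any_match` are made up for illustration; they are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values (not the data from the question)
data = pd.DataFrame({
    "k1": ["re_setup", None, "alerting"],
    "k2": ["oo_setup", "re_setup", np.nan],
})

# na=False: treat missing cells as "does not contain"
# regex=False: match 'setup' as literal text rather than as a regular expression
mask = data[["k1", "k2"]].apply(
    lambda col: col.str.contains("setup", na=False, regex=False)
)

all_match = mask.all(axis=1)  # True only if every column contains 'setup'
any_match = mask.any(axis=1)  # True if at least one column contains 'setup'

print(all_match.tolist())
print(any_match.tolist())
```

Also note that pandas 2.1 and later deprecate `applymap` in favor of `DataFrame.map`, so the column-wise `apply` form may be the safer choice across versions.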
If there are only 2 columns, simply chain the conditions with & (like all) or | (like any):
data['same'] = data['k1'].str.contains('setup') & data['k2'].str.contains('setup')
print (data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_setup False
4 5 re_setup alerting False
5 6 alerting_s re_setup False
6 7 re_setup oo_setup True
7 8 alerting bounce False
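For anyone who wants to run this, here is a minimal self-contained sketch that rebuilds the frame from the question's input table and shows both operators side by side. The column names `both` and `either` are only illustrative and do not come from the question.

```python
import pandas as pd

# Input data copied from the question (without the 'same' column)
data = pd.DataFrame({
    "id": range(1, 9),
    "k1": ["re_setup", "oo_setup", "alerting", "bounce",
           "re_oversetup", "alerting_s", "re_oversetup", "alerting"],
    "k2": ["oo_setup", "oo_setup", "bounce", "re_oversetup",
           "alerting", "re_setup", "oo_setup", "bounce"],
})

# One boolean Series per column
in_k1 = data["k1"].str.contains("setup")
in_k2 = data["k2"].str.contains("setup")

data["both"] = in_k1 & in_k2    # behaves like .all(axis=1)
data["either"] = in_k1 | in_k2  # behaves like .any(axis=1)

print(data)
```

Keeping the two masks in named variables avoids the operator-precedence trouble that can appear when comparisons are written inline around & and |.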
Answer 1 (score: 1)
Here is another, generic reduce operation that does not need apply:
In [114]: np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
Out[114]: array([ True, True, False, True, True, True, True, False], dtype=bool)
Details
In [115]: df['same'] = np.logical_or.reduce(
[df[c].str.contains('setup') for c in ['k1', 'k2']])
In [116]: df
Out[116]:
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_oversetup True
4 5 re_oversetup alerting True
5 6 alerting_s re_setup True
6 7 re_oversetup oo_setup True
7 8 alerting bounce False
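The same pattern should also cover the "all columns" case by swapping in `np.logical_and.reduce`. A minimal sketch, assuming the frame from the question and a hypothetical `cols` list that can be extended to more than two columns:

```python
import numpy as np
import pandas as pd

# Input data copied from the question
df = pd.DataFrame({
    "id": range(1, 9),
    "k1": ["re_setup", "oo_setup", "alerting", "bounce",
           "re_oversetup", "alerting_s", "re_oversetup", "alerting"],
    "k2": ["oo_setup", "oo_setup", "bounce", "re_oversetup",
           "alerting", "re_setup", "oo_setup", "bounce"],
})

cols = ["k1", "k2"]  # hypothetical list of columns to check
masks = [df[c].str.contains("setup") for c in cols]

any_match = np.logical_or.reduce(masks)   # at least one column contains 'setup'
all_match = np.logical_and.reduce(masks)  # every column contains 'setup'

print(any_match)
print(all_match)
```

Both calls return a plain NumPy boolean array, which can be assigned straight to a column such as df['same'].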
Timings
Small
In [111]: df.shape
Out[111]: (8, 4)
In [108]: %timeit np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
1000 loops, best of 3: 421 µs per loop
In [109]: %timeit df[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(1)
1000 loops, best of 3: 2.01 ms per loop
Large
In [110]: df.shape
Out[110]: (40000, 4)
In [112]: %timeit np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
10 loops, best of 3: 59.5 ms per loop
In [113]: %timeit df[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(1)
10 loops, best of 3: 88.4 ms per loop
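To reproduce the comparison outside IPython, something like the following sketch could be used. How the 40000-row frame was built is not stated above, so repeating the 8-row sample 5000 times is an assumption, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# 8-row sample from the question
base = pd.DataFrame({
    "id": range(1, 9),
    "k1": ["re_setup", "oo_setup", "alerting", "bounce",
           "re_oversetup", "alerting_s", "re_oversetup", "alerting"],
    "k2": ["oo_setup", "oo_setup", "bounce", "re_oversetup",
           "alerting", "re_setup", "oo_setup", "bounce"],
})

# Assumption: the large frame is the small one repeated 5000 times (40000 rows)
df = pd.concat([base] * 5000, ignore_index=True)

def with_reduce():
    return np.logical_or.reduce([df[c].str.contains("setup") for c in ["k1", "k2"]])

def with_apply():
    return df[["k1", "k2"]].apply(lambda x: x.str.contains("setup")).any(axis=1)

# Check that both approaches agree before comparing their speed
same_result = np.array_equal(with_reduce(), with_apply().to_numpy())

# Best of 3 repeats, 3 calls each, reported per call
t_reduce = min(timeit.repeat(with_reduce, number=3, repeat=3)) / 3
t_apply = min(timeit.repeat(with_apply, number=3, repeat=3)) / 3

print(f"results equal: {same_result}")
print(f"logical_or.reduce: {t_reduce * 1000:.1f} ms per call")
print(f"apply + any:       {t_apply * 1000:.1f} ms per call")
```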