Question

My DataFrame looks like this:

    id  k1            k2            same
    1   re_setup      oo_setup      true
    2   oo_setup      oo_setup      true
    3   alerting      bounce        false
    4   bounce        re_oversetup  false
    5   re_oversetup  alerting      false
    6   alerting_s    re_setup      false
    7   re_oversetup  oo_setup      true
    8   alerting      bounce        false

So I need to classify the rows by whether or not they contain the string 'setup'.

And the simple output would be:

    id  k1          k2        same
    1   re_setup    oo_setup  true
    2   oo_setup    oo_setup  true
    3   alerting    bounce    false
    4   bounce      re_setup  false
    5   re_setup    alerting  false
    6   alerting_s  re_setup  false
    7   re_setup    oo_setup  true
    8   alerting    bounce    false

I have tried this, but I get an error when I select multiple columns in the expression.

data['same'] = data[data['k1', 'k2'].str.contains('setup')==True]

Answer 1

I think you need apply with str.contains, because str.contains only works on a Series (a single column):

print (data[['k1', 'k2']].apply(lambda x: x.str.contains('setup')))
      k1     k2
0   True   True
1   True   True
2  False  False
3  False   True
4   True  False
5  False   True
6   True   True
7  False  False
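
For reference, here is a minimal self-contained sketch of why the attempt in the question fails and why apply helps. The frame is retyped from the simplified table in the question; the names `cols`, `mask`, `tuple_lookup_failed` and `has_str_accessor` are only illustrative and do not appear in the original post.

```python
import pandas as pd

# Sample data retyped from the simplified table in the question.
data = pd.DataFrame({
    'id': range(1, 9),
    'k1': ['re_setup', 'oo_setup', 'alerting', 'bounce',
           're_setup', 'alerting_s', 're_setup', 'alerting'],
    'k2': ['oo_setup', 'oo_setup', 'bounce', 're_setup',
           'alerting', 're_setup', 'oo_setup', 'bounce'],
})

# data['k1', 'k2'] looks up ONE column labelled by the tuple ('k1', 'k2'),
# which does not exist here, so pandas raises KeyError.
try:
    data['k1', 'k2']
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True

# A list of labels selects a DataFrame, and a DataFrame has no .str accessor.
cols = data[['k1', 'k2']]
has_str_accessor = hasattr(cols, 'str')

# apply passes each column to the lambda as a Series, where .str is available.
mask = cols.apply(lambda x: x.str.contains('setup'))
print(tuple_lookup_failed, has_str_accessor)
print(mask)
```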

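Then add DataFrame.all to check whether all values in each row are True (axis is passed by keyword, since recent pandas versions no longer accept it positionally for any):

data['same'] = data[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).all(axis=1)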
print (data)
   id          k1        k2   same
0   1    re_setup  oo_setup   True
1   2    oo_setup  oo_setup   True
2   3    alerting    bounce  False
3   4      bounce  re_setup  False
4   5    re_setup  alerting  False
5   6  alerting_s  re_setup  False
6   7    re_setup  oo_setup   True
7   8    alerting    bounce  False
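
One caveat that the sample data does not exercise: if a column holds missing values, str.contains returns a missing value for those cells rather than False. The sketch below assumes a hypothetical frame with one NaN and passes `na=False` so the mask stays strictly boolean before the row-wise reduction.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in k2 (not part of the question's data).
data = pd.DataFrame({
    'k1': ['re_setup', 'oo_setup', 'alerting'],
    'k2': ['oo_setup', np.nan, 'bounce'],
})

# na=False makes str.contains report False for missing cells.
mask = data[['k1', 'k2']].apply(lambda x: x.str.contains('setup', na=False))
data['same'] = mask.all(axis=1)
print(data)
```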

Or use DataFrame.any to check for at least one True per row:

data['same'] = data[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(axis=1)
print (data)
   id          k1        k2   same
0   1    re_setup  oo_setup   True
1   2    oo_setup  oo_setup   True
2   3    alerting    bounce  False
3   4      bounce  re_setup   True
4   5    re_setup  alerting   True
5   6  alerting_s  re_setup   True
6   7    re_setup  oo_setup   True
7   8    alerting    bounce  False

Another solution uses applymap for an element-wise check:

data['same'] = data[['k1', 'k2']].applymap(lambda x: 'setup' in x).all(axis=1)
print (data)
   id          k1        k2   same
0   1    re_setup  oo_setup   True
1   2    oo_setup  oo_setup   True
2   3    alerting    bounce  False
3   4      bounce  re_setup  False
4   5    re_setup  alerting  False
5   6  alerting_s  re_setup  False
6   7    re_setup  oo_setup   True
7   8    alerting    bounce  False
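
A hedged variant of the element-wise approach: from pandas 2.1 on, applymap is deprecated in favour of DataFrame.map, and a plain `'setup' in x` raises TypeError on non-string cells such as NaN. The sketch below is one way to cope with both; `has_setup`, `elementwise`, `same_all` and `same_any` are hypothetical names, and the small frame is made up for illustration.

```python
import pandas as pd

# Made-up frame with one missing cell to show the guard in action.
data = pd.DataFrame({
    'k1': ['re_setup', 'bounce', 'alerting'],
    'k2': ['oo_setup', 're_setup', None],
})

def has_setup(value):
    # Only test real strings; None/NaN cells count as "does not contain".
    return isinstance(value, str) and 'setup' in value

subset = data[['k1', 'k2']]
# DataFrame.map exists from pandas 2.1; older versions only offer applymap.
elementwise = subset.map if hasattr(subset, 'map') else subset.applymap
mask = elementwise(has_setup)

data['same_all'] = mask.all(axis=1)  # every selected column contains 'setup'
data['same_any'] = mask.any(axis=1)  # at least one selected column does
print(data)
```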

If there are only 2 columns, simply chain the conditions with & (equivalent to all) or | (equivalent to any):

data['same'] = data['k1'].str.contains('setup') & data['k2'].str.contains('setup')
print (data)
   id          k1        k2   same
0   1    re_setup  oo_setup   True
1   2    oo_setup  oo_setup   True
2   3    alerting    bounce  False
3   4      bounce  re_setup  False
4   5    re_setup  alerting  False
5   6  alerting_s  re_setup  False
6   7    re_setup  oo_setup   True
7   8    alerting    bounce  False
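
Only the & form is shown above, so here is a short sketch of both operators side by side. The three-row frame and the column names `both` and `either` are illustrative assumptions.

```python
import pandas as pd

data = pd.DataFrame({
    'k1': ['re_setup', 'bounce', 'alerting'],
    'k2': ['oo_setup', 're_setup', 'bounce'],
})

in_k1 = data['k1'].str.contains('setup')
in_k2 = data['k2'].str.contains('setup')

data['both'] = in_k1 & in_k2    # row-wise AND, same result as .all(axis=1)
data['either'] = in_k1 | in_k2  # row-wise OR, same result as .any(axis=1)
print(data)
```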

Answer 2

Here is another, more general approach that uses a reduce operation and does not need apply:

In [114]: np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
Out[114]: array([ True,  True, False,  True,  True,  True,  True, False], dtype=bool)

Details

In [115]: df['same'] = np.logical_or.reduce(
                         [df[c].str.contains('setup') for c in ['k1', 'k2']])

In [116]: df
Out[116]:
   id            k1            k2   same
0   1      re_setup      oo_setup   True
1   2      oo_setup      oo_setup   True
2   3      alerting        bounce  False
3   4        bounce  re_oversetup   True
4   5  re_oversetup      alerting   True
5   6    alerting_s      re_setup   True
6   7  re_oversetup      oo_setup   True
7   8      alerting        bounce  False
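
The same reduction works with np.logical_and when every column must match. A minimal sketch, assuming the frame from the question; `contains_in_columns` and its `require_all` parameter are hypothetical and not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 9),
    'k1': ['re_setup', 'oo_setup', 'alerting', 'bounce',
           're_oversetup', 'alerting_s', 're_oversetup', 'alerting'],
    'k2': ['oo_setup', 'oo_setup', 'bounce', 're_oversetup',
           'alerting', 're_setup', 'oo_setup', 'bounce'],
})

def contains_in_columns(frame, columns, pattern, require_all=False):
    """Hypothetical helper: combine per-column str.contains masks row-wise."""
    # regex=False treats the pattern as a literal substring; na=False maps
    # missing cells to False.
    masks = [frame[c].str.contains(pattern, regex=False, na=False) for c in columns]
    op = np.logical_and if require_all else np.logical_or
    return op.reduce(masks)

any_mask = contains_in_columns(df, ['k1', 'k2'], 'setup')
all_mask = contains_in_columns(df, ['k1', 'k2'], 'setup', require_all=True)
print(any_mask)
print(all_mask)
```

The list comprehension scales to any number of columns, which is the main reason to prefer it over chaining | or & by hand.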

Timings

Small

In [111]: df.shape
Out[111]: (8, 4)

In [108]: %timeit np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
1000 loops, best of 3: 421 µs per loop

In [109]: %timeit df[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(1)
1000 loops, best of 3: 2.01 ms per loop

Large

In [110]: df.shape
Out[110]: (40000, 4)

In [112]: %timeit np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
10 loops, best of 3: 59.5 ms per loop

In [113]: %timeit df[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(1)
10 loops, best of 3: 88.4 ms per loop
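
To repeat the comparison outside IPython, something like the following sketch could be used. How the 40000-row frame was built is not stated in the answer, so repeating the eight sample rows 5000 times is an assumption, and the frame here holds only the k1 and k2 columns; `with_reduce` and `with_apply` are illustrative names. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame({
    'k1': ['re_setup', 'oo_setup', 'alerting', 'bounce',
           're_oversetup', 'alerting_s', 're_oversetup', 'alerting'],
    'k2': ['oo_setup', 'oo_setup', 'bounce', 're_oversetup',
           'alerting', 're_setup', 'oo_setup', 'bounce'],
})
# Assumption: repeat the 8 sample rows 5000 times to reach 40000 rows.
large = pd.concat([small] * 5000, ignore_index=True)

def with_reduce(frame):
    return np.logical_or.reduce([frame[c].str.contains('setup') for c in ['k1', 'k2']])

def with_apply(frame):
    return frame[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(axis=1)

for name, func in [('reduce', with_reduce), ('apply', with_apply)]:
    # Average over 10 calls on the large frame.
    seconds = timeit.timeit(lambda: func(large), number=10) / 10
    print(f'{name}: {seconds * 1000:.1f} ms per call')
```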
