New column based on row and column conditions in pandas (Python)

Time: 2018-05-18 21:17:45

Tags: python pandas

I have a sample dataset:

import pandas as pd
import numpy as np

d = {

 'ID': ['A','B','C','D','E'],
 'index_1':[2,0,2,-2,0],
 'index_2':[-2,-2,0,0,0],
 'index_3':[2,2,2,2,0],
 'index_4':[2,2,0,-2,0],
 'index_total':[2,2,2,2,2]
}
df = pd.DataFrame(d)

It looks like this:

   ID   index_1  index_2  index_3   index_4   index_total
0   A        2       -2        2        2            2
1   B        0       -2        2        2            2
2   C        2        0        2        0            2
3   D       -2        0        2       -2            2
4   E        0        0        0        0            2

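I want to create a column named "flag" for each row, based on the following conditions:

  1. If any of the columns 'index_1', 'index_2', 'index_3', 'index_4' contains the value -2 AND 'index_total' = 2, then flag = 1
  2. If the columns 'index_1', 'index_2', 'index_3', 'index_4' contain only the value 0 AND 'index_total' = 2, then flag = 1
  3. Otherwise, flag = 0

Expected output: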

        ID   index_1  index_2  index_3   index_4   index_total   flag
    0    A        2       -2        2        2            2        1
    1    B        0       -2        2        2            2        1
    2    C        2        0        2        0            2        0
    3    D       -2        0        2       -2            2        1
    4    E        0        0        0        0            2        1
    

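My attempts (note that I loop over the column names index_1, index_2, index_3 and index_4 instead of writing them out, because my actual dataset has more than 70 index_ columns):

First attempt: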

    for colname in df.columns:
        if "index_" in colname:
            df[colname] = df[colname].astype(int)  
     #making sure the numbers are all integer for comparison
        if ((df[colname] == -2).any() and df['index_total']==2):
             df['flag'] = 1
      #this doesn't work , it's going by columns not rows
    

Second attempt:

     for index, row in df.iterrows():    
        for colname in df.columns:
           if "index_" in colname:
               if( (df[colname][index] == -2).any() and df['index_total']==2 ):
                    df['flag'] = 1
     # i stopped writing the other conditions because this one doesn't work
    

3 answers:

Answer 0 (score: 2)

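First condition:

# cols is assumed to list the index columns to check
cols = ['index_1', 'index_2', 'index_3', 'index_4']

df[cols].eq(-2).any(axis=1) & df['index_total'].eq(2)

# True for rows 0, 1 and 3; np.where on this mask gives (array([0, 1, 3], dtype=int64),)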

Second condition:

df[cols].eq(0).all(axis=1) & df['index_total'].eq(2)

# True for row 4 only; np.where on this mask gives (array([4], dtype=int64),)

Use np.where to create the new column:

c1 = df[cols].eq(-2).any(axis=1) & df['index_total'].eq(2)
c2 = df[cols].eq(0).all(axis=1) & df['index_total'].eq(2)

df['Flag'] = np.where(c1 | c2, 1, 0)

  ID  index_1  index_2  index_3  index_4  index_total       Flag
0  A        2       -2        2        2            2          1
1  B        0       -2        2        2            2          1
2  C        2        0        2        0            2          0
3  D       -2        0        2       -2            2          1
4  E        0        0        0        0            2          1
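
With 70+ index_ columns, writing out `cols` by hand is impractical. A minimal sketch of one way to build it, assuming every column of interest starts with `index_` and that `index_total` is the only such column to leave out (the frame below is just the question's sample data, and `total_is_2` is a name introduced here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D', 'E'],
    'index_1': [2, 0, 2, -2, 0],
    'index_2': [-2, -2, 0, 0, 0],
    'index_3': [2, 2, 2, 2, 0],
    'index_4': [2, 2, 0, -2, 0],
    'index_total': [2, 2, 2, 2, 2],
})

# Assumption: all columns to test share the 'index_' prefix,
# and index_total must be excluded from that group.
cols = [c for c in df.columns if c.startswith('index_') and c != 'index_total']

total_is_2 = df['index_total'].eq(2)
c1 = df[cols].eq(-2).any(axis=1) & total_is_2   # at least one -2 in the row
c2 = df[cols].eq(0).all(axis=1) & total_is_2    # only zeros in the row

df['Flag'] = np.where(c1 | c2, 1, 0)
```

Passing `axis=1` by keyword matters on recent pandas releases, where the positional form `any(1)` is no longer accepted.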

Answer 1 (score: 2)

any, all, and boolean masking

(Comments inline.)

# sub-select your column of interest
i = df.filter(regex=r'index_\d+')
# this is a common mask, we'll compute it once and use later
j = df['index_total'].eq(2)

m1 = i.eq(-2).any(axis=1) & j   # first condition
m2 = i.eq(0).all(axis=1) & j    # second condition
# compute the union of the masks and convert to int
df['flag'] = (m1 | m2).astype(int)

df
  ID  index_1  index_2  index_3  index_4  index_total  flag
0  A        2       -2        2        2            2     1
1  B        0       -2        2        2            2     1
2  C        2        0        2        0            2     0
3  D       -2        0        2       -2            2     1
4  E        0        0        0        0            2     1
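
In the sample data every `index_total` is 2, so the common mask `j` never changes the outcome. The sketch below uses a hypothetical variation of the data (rows A and E get `index_total` = 0, which is not in the original question) to show the mask switching those rows off, and that the regex `index_\d+` leaves `index_total` out of the selection:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D', 'E'],
    'index_1': [2, 0, 2, -2, 0],
    'index_2': [-2, -2, 0, 0, 0],
    'index_3': [2, 2, 2, 2, 0],
    'index_4': [2, 2, 0, -2, 0],
    # Hypothetical values: rows A and E no longer have index_total == 2
    'index_total': [0, 2, 2, 2, 0],
})

# 'index_total' has no digit after the underscore, so it is not selected
i = df.filter(regex=r'index_\d+')
j = df['index_total'].eq(2)

# Both conditions share j, so it can be applied once to their union
df['flag'] = ((i.eq(-2).any(axis=1) | i.eq(0).all(axis=1)) & j).astype(int)
```

Rows A and E satisfy the value conditions but fail the `index_total` check, so only B and D end up flagged.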

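Answer 2 (score: 1)

Write a function that takes a row and performs the logic:

Since you said you have many columns, we will use the built-in any and all. This assumes index_total is the last column and ID is the first.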

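def functo(row):
    # Use .iloc for positional access; plain row[-1] is treated as a label lookup on newer pandas
    if any(i == -2 for i in row.iloc[1:-1]) and row.iloc[-1] == 2:
        return 1
    elif all(i == 0 for i in row.iloc[1:-1]) and row.iloc[-1] == 2:
        return 1
    else:
        return 0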

And apply it to your dataframe:

df['flag'] = df.apply(functo, axis=1)

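We use axis=1 so that the function is applied to each row rather than to each column.

Also, a tip: I would avoid naming columns index, because in pandas terminology the index refers to the row labels.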
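
If the column order cannot be relied on, a label-based variant avoids the "first column / last column" assumption. This is only a sketch: `flag_row` and `index_cols` are hypothetical names, and the sample frame deliberately places `index_total` second to show that position no longer matters.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D', 'E'],
    'index_total': [2, 2, 2, 2, 2],   # deliberately not the last column
    'index_1': [2, 0, 2, -2, 0],
    'index_2': [-2, -2, 0, 0, 0],
    'index_3': [2, 2, 2, 2, 0],
    'index_4': [2, 2, 0, -2, 0],
})

# Assumption: columns to test share the 'index_' prefix, except index_total
index_cols = [c for c in df.columns if c.startswith('index_') and c != 'index_total']

def flag_row(row):
    # Both conditions require index_total == 2, so check it first
    if row['index_total'] != 2:
        return 0
    values = row[index_cols]
    has_minus_two = any(v == -2 for v in values)
    all_zero = all(v == 0 for v in values)
    return int(has_minus_two or all_zero)

df['flag'] = df.apply(flag_row, axis=1)
```

A row-wise apply runs Python code once per row, so on large frames the boolean-mask approaches in the other answers are likely to be faster.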