Question

I am trying to derive new columns a and b from the following DataFrame:

      a_x  b_x    a_y  b_y
0   13.67  0.0  13.67  0.0
1   13.42  0.0  13.42  0.0
2   13.52  1.0  13.17  1.0
3   13.61  1.0  13.11  1.0
4   12.68  1.0  13.06  1.0
5   12.70  1.0  12.93  1.0
6   13.60  1.0    NaN  NaN
7   12.89  1.0    NaN  NaN
8   11.68  1.0    NaN  NaN
9     NaN  NaN   8.87  0.0
10    NaN  NaN   8.77  0.0
11    NaN  NaN   7.97  0.0

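If b_x or b_y is 0.0 (when both are present, they hold the same value), then a_x and a_y share the same value, so I take either pair as the new columns a and b. If b_x or b_y is 1.0, then a_x and a_y differ, so I take the mean of a_x and a_y as a and the common value of b_x and b_y as b.

If only one of the pairs (a_x, b_x) or (a_y, b_y) is non-null, I use the existing values as a and b.

My expected result looks like this: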

      a_x  b_x    a_y  b_y       a  b
0   13.67  0.0  13.67  0.0  13.670  0
1   13.42  0.0  13.42  0.0  13.420  0
2   13.52  1.0  13.17  1.0  13.345  1
3   13.61  1.0  13.11  1.0  13.360  1
4   12.68  1.0  13.06  1.0  12.870  1
5   12.70  1.0  12.93  1.0  12.815  1
6   13.60  1.0    NaN  NaN  13.600  1
7   12.89  1.0    NaN  NaN  12.890  1
8   11.68  1.0    NaN  NaN  11.680  1
9     NaN  NaN   8.87  0.0   8.870  0
10    NaN  NaN   8.77  0.0   8.770  0
11    NaN  NaN   7.97  0.0   7.970  0

How can I get the result above? Thanks.
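
For anyone who wants to reproduce the problem, the input frame shown above can be rebuilt with something like the following. This is only a sketch: the variable name `df` and the default integer index are assumptions taken from the printed table.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Input data copied from the table in the question
df = pd.DataFrame({
    'a_x': [13.67, 13.42, 13.52, 13.61, 12.68, 12.70, 13.60, 12.89, 11.68, nan, nan, nan],
    'b_x': [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, nan, nan, nan],
    'a_y': [13.67, 13.42, 13.17, 13.11, 13.06, 12.93, nan, nan, nan, 8.87, 8.77, 7.97],
    'b_y': [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, nan, nan, nan, 0.0, 0.0, 0.0],
})
print(df)
```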

Answer 1

Use:

import numpy as np
import pandas as pd

# select all a_* and b_* columns
b = df.filter(like='b')
a = df.filter(like='a')
# test whether at least one b value in the row is 0 or 1
m1 = b.eq(0).any(axis=1)
m2 = b.eq(1).any(axis=1)

# row-wise mean of the a columns (NaN is skipped by default)
a1 = a.mean(axis=1)
# forward fill missing values across columns and select the last column
b1 = b.ffill(axis=1).iloc[:, -1]
a2 = a.ffill(axis=1).iloc[:, -1]

# new DataFrame built from the 2 conditions
df1 = pd.DataFrame(np.select([m1, m2], [[a2, b1], [a1, b1]]), index=['a','b']).T
# join to the original
df = df.join(df1)
print(df)
      a_x  b_x    a_y  b_y       a    b
0   13.67  0.0  13.67  0.0  13.670  0.0
1   13.42  0.0  13.42  0.0  13.420  0.0
2   13.52  1.0  13.17  1.0  13.345  1.0
3   13.61  1.0  13.11  1.0  13.360  1.0
4   12.68  1.0  13.06  1.0  12.870  1.0
5   12.70  1.0  12.93  1.0  12.815  1.0
6   13.60  1.0    NaN  NaN  13.600  1.0
7   12.89  1.0    NaN  NaN  12.890  1.0
8   11.68  1.0    NaN  NaN  11.680  1.0
9     NaN  NaN   8.87  0.0   8.870  0.0
10    NaN  NaN   8.77  0.0   8.770  0.0
11    NaN  NaN   7.97  0.0   7.970  0.0
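
The `np.select` call above passes 1-D row masks together with 2-row choices, so it relies on NumPy broadcasting the masks across both rows. A small sketch with made-up arrays (the names `cond0`, `choice0` and so on are purely illustrative) shows the mechanics:

```python
import numpy as np

# Two 1-D conditions, one entry per "row" of the data
cond0 = np.array([True, False, False])
cond1 = np.array([False, True, False])

# Each choice stacks two result rows (think: the 'a' values and the 'b' values)
choice0 = np.array([[1, 2, 3], [10, 20, 30]])
choice1 = np.array([[4, 5, 6], [40, 50, 60]])

# The conditions are broadcast against the 2 x 3 choices;
# positions where no condition holds get the default value 0
out = np.select([cond0, cond1], [choice0, choice1])
print(out)
# [[ 1  5  0]
#  [10 50  0]]
```

The result has one row per output column, which is why the answer labels the rows with `index=['a','b']` and then transposes.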

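However, I think the solution can be simplified, because the mean can be used for both conditions (the mean of identical values is the value itself, and mean skips NaN by default):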

b = df.filter(like='b')
a = df.filter(like='a')

# row-wise mean of the a columns; NaN is skipped by default
a1 = a.mean(axis=1)
# forward fill across columns and select the last column
b1 = b.ffill(axis=1).iloc[:, -1]

df['a'] = a1
df['b'] = b1
print(df)
      a_x  b_x    a_y  b_y       a    b
0   13.67  0.0  13.67  0.0  13.670  0.0
1   13.42  0.0  13.42  0.0  13.420  0.0
2   13.52  1.0  13.17  1.0  13.345  1.0
3   13.61  1.0  13.11  1.0  13.360  1.0
4   12.68  1.0  13.06  1.0  12.870  1.0
5   12.70  1.0  12.93  1.0  12.815  1.0
6   13.60  1.0    NaN  NaN  13.600  1.0
7   12.89  1.0    NaN  NaN  12.890  1.0
8   11.68  1.0    NaN  NaN  11.680  1.0
9     NaN  NaN   8.87  0.0   8.870  0.0
10    NaN  NaN   8.77  0.0   8.770  0.0
11    NaN  NaN   7.97  0.0   7.970  0.0
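
The expected output in the question shows b as integers, while the results above keep it as float. A minimal sketch of one way to get there, assuming every row has at least one non-null b value (otherwise the integer cast would fail); the frame `sample` is a hypothetical four-row subset covering each case from the question:

```python
import numpy as np
import pandas as pd

# One row per case: equal values, differing values, only x present, only y present
sample = pd.DataFrame({
    'a_x': [13.67, 13.52, 13.60, np.nan],
    'b_x': [0.0, 1.0, 1.0, np.nan],
    'a_y': [13.67, 13.17, np.nan, 8.87],
    'b_y': [0.0, 1.0, np.nan, 0.0],
})

# Selecting the columns by name avoids picking up the new 'a' / 'b'
# columns if the snippet is run a second time
a_cols = sample[['a_x', 'a_y']]
b_cols = sample[['b_x', 'b_y']]

# Row-wise mean skips NaN, so it covers all cases at once
sample['a'] = a_cols.mean(axis=1)
# First non-null b value per row, then cast to integer
sample['b'] = b_cols.bfill(axis=1).iloc[:, 0].astype(int)
print(sample)
```

Back filling and taking the first column is equivalent here to forward filling and taking the last one, since b_x and b_y agree whenever both are present.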

Create multiple new columns based on multiple conditions in pandas
