I have 2 columns in a pandas DataFrame:
col_A col_B
0 1
0 0
0 1
0 1
1 0
1 0
1 1
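I want to create a new column for each combination of values in col_A and col_B, similar to get_dummies(), the only difference being that here I am working with a combination of columns rather than a single column.
Example of the desired output. In this column, col_A is 0 and col_B is 1: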
col_A_0_col_B_1
1
0
1
1
0
0
0
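I am currently using iterrows() to loop over every row, check the values, and then make the change.
Is there a shorter, more idiomatic pandas way to achieve this?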
Answer 0 (score: 3)
Convert the chained boolean mask to integers:
df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
For better performance, run the comparison on the underlying NumPy arrays via .values:
df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
Performance: this depends on the number of rows and on the mix of 0 and 1 values:
import numpy as np
import pandas as pd

np.random.seed(343)
# 10k rows of random 0/1 values
df = pd.DataFrame(np.random.choice([0,1], size=(10000, 2)), columns=['col_A','col_B'])
#print (df)
In [92]: %%timeit
...: df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
...:
870 µs ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [93]: %%timeit
...: df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
...:
201 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [94]: %%timeit
...: df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
...:
833 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [95]: %%timeit
...: df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
...:
956 µs ± 242 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [96]: %%timeit
...: df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
...:
1.61 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [97]: %%timeit
...: df['col_A_0_col_B_1'] = 0
...: df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
...:
3.07 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
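The mask above builds one column for one combination. Since the question asks for a column per combination, here is a minimal sketch of one way to get them all at once: build a label per row and pass it to get_dummies(). The variable names key, dummies and result are illustrative and not part of the original answer, and only combinations that actually occur in the data get a column.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"col_A": [0, 0, 0, 0, 1, 1, 1],
                   "col_B": [1, 0, 1, 1, 0, 0, 1]})

# One label per row, e.g. "col_A_0_col_B_1" (assumed naming scheme).
key = "col_A_" + df["col_A"].astype(str) + "_col_B_" + df["col_B"].astype(str)

# One indicator column per label; astype(int) normalizes the dtype,
# because get_dummies returns bool or uint8 depending on the pandas version.
dummies = pd.get_dummies(key).astype(int)

# Attach the indicator columns to the original frame.
result = df.join(dummies)
print(result)
```

This was not part of the timings above, so no claim is made about how it compares in speed to the single-mask approaches.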
Answer 1 (score: 1)
You can use np.where:
df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
Answer 2 (score: 0)
First create your column and assign a default value, e.g. 0 for false:
df['col_A_0_col_B_1'] = 0
Then use loc to select the rows where col_A == 0 and col_B == 1, and assign 1 to the new column for those rows:
df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
Answer 3 (score: 0)
If I understand correctly, you can do the following:
import pandas as pd
data = [[0, 1],
        [0, 0],
        [0, 1],
        [0, 1],
        [1, 0],
        [1, 0],
        [1, 1]]
df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
print(df)
Output
   col_A  col_B  col_A_0_col_B_1
0      0      1                1
1      0      0                0
2      0      1                1
3      0      1                1
4      1      0                0
5      1      0                0
6      1      1                0
Or, as an alternative:
df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
print(df)
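Answer 4 (score: 0)
You can use the ~ operator as a NOT, treating 1 and 0 as true and false respectively. Note that on integer columns ~ is a bitwise operation, so this shortcut relies on both columns containing only 0 and 1: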
df['col_A_0_col_B_1'] = ~df['col_A'] & df['col_B']
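To see why the 0/1 restriction matters, here is a small sketch comparing the bitwise shortcut with the explicit comparison from the other answers. The frame named other, with a value of 2 in col_A, is a made-up example and not data from the question.

```python
import pandas as pd

# Sample data from the question (values are only 0 and 1).
df = pd.DataFrame({"col_A": [0, 0, 0, 0, 1, 1, 1],
                   "col_B": [1, 0, 1, 1, 0, 0, 1]})

# Bitwise shortcut: ~0 is -1 (all bits set) and ~1 is -2 (lowest bit clear),
# so for 0/1 inputs the AND with col_B gives the expected flag.
bitwise = ~df["col_A"] & df["col_B"]

# Explicit comparison, as in the other answers.
explicit = ((df["col_A"] == 0) & (df["col_B"] == 1)).astype(int)

print(bitwise.tolist() == explicit.tolist())

# Hypothetical data with a value other than 0/1: ~2 is -3, whose lowest
# bit is set, so the shortcut flags a row where col_A is not 0.
other = pd.DataFrame({"col_A": [2], "col_B": [1]})
other_bitwise = ~other["col_A"] & other["col_B"]
other_explicit = ((other["col_A"] == 0) & (other["col_B"] == 1)).astype(int)
print(other_bitwise.tolist(), other_explicit.tolist())
```

When the columns might hold anything besides 0 and 1, the explicit == comparisons are the safer choice.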