Pandas feature crossing

Time: 2018-12-03 10:48:37

Tags: python pandas

I have 2 columns in a pandas DataFrame:

col_A     col_B
 0         1
 0         0
 0         1
 0         1
 1         0
 1         0
 1         1

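I want to create a new column for each combination of values of col_A and col_B, similar to get_dummies(). The only difference is that here I am trying to use a combination of columns.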

Example output: in this column, col_A has the value 0 and col_B has the value 1:

col_A_0_col_B_1

   1
   0
   1
   1
   0
   0
   0

I am currently using iterrows() to loop over every row, check the values, and then make the change.

Is there a shorter, idiomatic pandas way to achieve this?

5 answers:

Answer 0: (score: 3)

Chain the boolean masks and convert the result to integers:

df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)

For better performance, compare the underlying NumPy arrays:

df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
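As a side note, newer pandas versions document `Series.to_numpy()` as the preferred way to get the underlying array, while `.values` keeps working. The following is a minimal sketch of the same idea on the sample data from the question; the intermediate name `mask` is only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'col_A': [0, 0, 0, 0, 1, 1, 1],
                   'col_B': [1, 0, 1, 1, 0, 0, 1]})

# Compare on the underlying NumPy arrays, then cast the boolean mask to int
mask = (df['col_A'].to_numpy() == 0) & (df['col_B'].to_numpy() == 1)
df['col_A_0_col_B_1'] = mask.astype(int)

print(df['col_A_0_col_B_1'].tolist())  # [1, 0, 1, 1, 0, 0, 0]
```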

Performance: this depends on the number of rows and on the mix of 0 and 1 values:

import numpy as np
import pandas as pd
np.random.seed(343)
# 10k rows of random 0/1 values
df = pd.DataFrame(np.random.choice([0,1], size=(10000, 2)), columns=['col_A','col_B'])
#print (df)

In [92]: %%timeit
    ...: df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
    ...: 
870 µs ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [93]: %%timeit
    ...: df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
    ...: 
201 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [94]: %%timeit
    ...: df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
    ...: 
833 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [95]: %%timeit
    ...: df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
    ...: 
956 µs ± 242 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [96]: %%timeit
    ...: df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
    ...: 
1.61 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [97]: %%timeit
    ...: df['col_A_0_col_B_1'] = 0
    ...: df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
    ...: 
3.07 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
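The snippets above build a single combination column. Since the question asks for one column per combination, here is a hedged sketch of one possible way to get all of them at once: build a string label per row and pass it to `pd.get_dummies`. The names `key`, `crossed` and `result` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'col_A': [0, 0, 0, 0, 1, 1, 1],
                   'col_B': [1, 0, 1, 1, 0, 0, 1]})

# One string label per row, e.g. "col_A_0_col_B_1"
key = 'col_A_' + df['col_A'].astype(str) + '_col_B_' + df['col_B'].astype(str)

# One indicator column per label that actually occurs in the data
crossed = pd.get_dummies(key).astype(int)

# Attach the indicator columns to the original frame (aligned on the index)
result = df.join(crossed)
print(result)
```

Only combinations that occur in the data get a column, so a pair of values that never appears together produces no column with this approach.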

Answer 1: (score: 1)

You can use np.where:

df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)

Answer 2: (score: 0)

First create the column and fill it with a default value, e.g. 0 for False:

df['col_A_0_col_B_1'] = 0

Then use loc to select the rows where col_A == 0 and col_B == 1, and assign 1 to the new column for those rows:

df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
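A runnable sketch of the two steps on the sample data from the question follows. The parentheses around each comparison are required, because in Python `&` binds more tightly than `==`. The variable name `rows` is only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'col_A': [0, 0, 0, 0, 1, 1, 1],
                   'col_B': [1, 0, 1, 1, 0, 0, 1]})

# Step 1: default every row to 0
df['col_A_0_col_B_1'] = 0

# Step 2: boolean row selector; each comparison is parenthesised
# because & has higher precedence than ==
rows = (df['col_A'] == 0) & (df['col_B'] == 1)
df.loc[rows, 'col_A_0_col_B_1'] = 1

print(df)
```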

Answer 3: (score: 0)

If I understand correctly, you could do the following:

import pandas as pd
data = [[0, 1],
        [0, 0],
        [0, 1],
        [0, 1],
        [1, 0],
        [1, 0],
        [1, 1]]

df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
print(df)

Output

   col_A  col_B  col_A_0_col_B_1
0      0      1                1
1      0      0                0
2      0      1                1
3      0      1                1
4      1      0                0
5      1      0                0
6      1      1                0

Or, as an alternative:

df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
print(df)

Answer 4: (score: 0)

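You can use ~ in pandas as NOT on boolean values, with 1 and 0 standing for True and False respectively. Note that on integer columns ~ is a bitwise NOT, so the line below relies on both columns containing only 0 and 1.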

df['col_A_0_col_B_1'] = ~df['col_A'] & df['col_B']
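To make clear why this works on integer columns, here is a small sketch on the sample data from the question. It shows what `~` returns for 0 and 1, and compares the result with a variant that casts to bool first. The names `via_bitwise` and `via_bool` are illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'col_A': [0, 0, 0, 0, 1, 1, 1],
                   'col_B': [1, 0, 1, 1, 0, 0, 1]})

# On integer columns ~ is a bitwise NOT: ~0 == -1 and ~1 == -2
inverted = ~df['col_A']
print(inverted.tolist())  # [-1, -1, -1, -1, -2, -2, -2]

# ANDing with a 0/1 column keeps only the lowest bit, so the result is 0/1
via_bitwise = ~df['col_A'] & df['col_B']

# Casting to bool first makes the logical intent explicit
via_bool = (~df['col_A'].astype(bool) & df['col_B'].astype(bool)).astype(int)

print(via_bitwise.tolist())  # [1, 0, 1, 1, 0, 0, 0]
```

If either column could hold values other than 0 and 1, the bool-cast variant or an explicit comparison such as `df['col_A'] == 0` is the safer choice.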