How to set repeated values in selected columns to 0 while keeping the first occurrence unchanged

Time: 2019-05-01 09:42:37

Tags: python pandas duplicates

I have a df similar to the one below, except that the material columns go up to material_19 and there are more than 1,000 clients.

Client_ID  Visit_DT   material_1  material_2  material_3  material_4
C001       2019-01-01 1           0           1           0
C002       2019-01-05 0           1           0           0
C003       2019-01-10 1           0           1           0
C001       2019-01-15 1           0           0           1
C002       2019-01-20 1           1           1           0

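The same client can use a given material several times on different dates (shown by a 1 in the same material column on multiple rows with the same Client_ID). For each client and material, I want to keep the 1 on the first row where it occurs and set the value to 0 on every later row that repeats it. The resulting df should look like this: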

Client_ID  Visit_DT   material_1  material_2  material_3  material_4
C001       2019-01-01 1           0           1           0
C002       2019-01-05 0           1           0           0
C003       2019-01-10 1           0           1           0
C001       2019-01-15 0           0           0           1
C002       2019-01-20 1           0           1           0

1 Answer:

Answer 0: (score: 1)

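material_cols = ['material_1', 'material_2', 'material_3', 'material_4']
# Per-client running total of each material column; it is 1 from the first use until the next one
# Selecting material_cols keeps Visit_DT out of the cumsum
mask = df.groupby('Client_ID')[material_cols].cumsum() == 1
# Multiplying by the boolean mask turns every repeated 1 into 0
df[material_cols] = df[material_cols] * mask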

This results in:

df
Out[27]: 
  Client_ID    Visit_DT  material_1  material_2  material_3  material_4
0      C001  2019-01-01           1           0           1           0
1      C002  2019-01-05           0           1           0           0
2      C003  2019-01-10           1           0           1           0
3      C001  2019-01-15           1           0           0           1
4      C002  2019-01-20           1           1           1           0
material_cols = ['material_1', 'material_2', 'material_3', 'material_4']
mask = df.groupby('Client_ID')[material_cols].cumsum() == 1
df[material_cols] = df[material_cols] * mask
df
Out[29]: 
  Client_ID    Visit_DT  material_1  material_2  material_3  material_4
0      C001  2019-01-01           1           0           1           0
1      C002  2019-01-05           0           1           0           0
2      C003  2019-01-10           1           0           1           0
3      C001  2019-01-15           0           0           0           1
4      C002  2019-01-20           1           0           1           0

Note that, depending on your DataFrame, you may be able to replace df[material_cols] with df.iloc[:, 2:].
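
For anyone who wants to try this end to end, here is a minimal self-contained sketch built from the sample data in the question. It rests on a few assumptions that are not in the original answer: the material columns are found by their "material_" prefix (handy when they run up to material_19), the rows are sorted by Visit_DT so that "first" means the earliest visit, and DataFrame.where is used in place of the multiplication. It also assumes the material values are only ever 0 or 1.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Client_ID": ["C001", "C002", "C003", "C001", "C002"],
    "Visit_DT": pd.to_datetime(
        ["2019-01-01", "2019-01-05", "2019-01-10", "2019-01-15", "2019-01-20"]
    ),
    "material_1": [1, 0, 1, 1, 1],
    "material_2": [0, 1, 0, 0, 1],
    "material_3": [1, 0, 1, 0, 1],
    "material_4": [0, 0, 0, 1, 0],
})

# Assumption: every material column shares the "material_" prefix
material_cols = [c for c in df.columns if c.startswith("material_")]

# Assumption: "first" means the earliest Visit_DT, so sort before accumulating
df = df.sort_values("Visit_DT", kind="stable")

# Per-client running total, restricted to the material columns
running = df.groupby("Client_ID")[material_cols].cumsum()

# Keep a value only while the running total is 1; everything else becomes 0
df[material_cols] = df[material_cols].where(running.eq(1), 0)

print(df)
```

Selecting the material columns before calling cumsum keeps Visit_DT out of the aggregation, which matters if that column is stored as a datetime or a string.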