I have a df similar to the one below, except that the material columns go up to material_19 and there are more than 1,000 clients.
Client_ID Visit_DT material_1 material_2 material_3 material_4
C001 2019-01-01 1 0 1 0
C002 2019-01-05 0 1 0 0
C003 2019-01-10 1 0 1 0
C001 2019-01-15 1 0 0 1
C002 2019-01-20 1 1 1 0
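The same client can use a material several times on different dates (a 1 appears in the same material column in several rows with the same Client_ID). In the rows where such a repeat occurs, I want to set the value in that material column to 0, keeping the 1 only in the first row where it appears. The resulting df should look like this: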
Client_ID Visit_DT material_1 material_2 material_3 material_4
C001 2019-01-01 1 0 1 0
C002 2019-01-05 0 1 0 0
C003 2019-01-10 1 0 1 0
C001 2019-01-15 0 0 0 1
C002 2019-01-20 1 0 1 0
Answer 0 (score: 1)
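material_cols = ['material_1', 'material_2', 'material_3', 'material_4']
# Running count of uses per client; restricted to the material columns so Visit_DT is not summed
mask = df.groupby('Client_ID')[material_cols].cumsum() == 1
# A 1 survives only where the running count is still 1, i.e. at the first use
df[material_cols] = df[material_cols] * mask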
This gives the following (df before and after):
df
Out[27]:
Client_ID Visit_DT material_1 material_2 material_3 material_4
0 C001 2019-01-01 1 0 1 0
1 C002 2019-01-05 0 1 0 0
2 C003 2019-01-10 1 0 1 0
3 C001 2019-01-15 1 0 0 1
4 C002 2019-01-20 1 1 1 0
material_cols = ['material_1', 'material_2', 'material_3', 'material_4']
mask = df.groupby('Client_ID')[material_cols].cumsum() == 1
df[material_cols] = df[material_cols] * mask
df
Out[29]:
Client_ID Visit_DT material_1 material_2 material_3 material_4
0 C001 2019-01-01 1 0 1 0
1 C002 2019-01-05 0 1 0 0
2 C003 2019-01-10 1 0 1 0
3 C001 2019-01-15 0 0 0 1
4 C002 2019-01-20 1 0 1 0
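To see why the mask works, here is a minimal self-contained sketch that rebuilds the sample data from the question and keeps the intermediate running count. The variable name `running` is only for illustration, and the sketch assumes the material columns hold only 0 and 1 and that the rows are already in visit order.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Client_ID": ["C001", "C002", "C003", "C001", "C002"],
    "Visit_DT": pd.to_datetime(
        ["2019-01-01", "2019-01-05", "2019-01-10", "2019-01-15", "2019-01-20"]
    ),
    "material_1": [1, 0, 1, 1, 1],
    "material_2": [0, 1, 0, 0, 1],
    "material_3": [1, 0, 1, 0, 1],
    "material_4": [0, 0, 0, 1, 0],
})
material_cols = ["material_1", "material_2", "material_3", "material_4"]

# Per-client running count of uses for each material.
# For C001's second visit (row 3) this is [2, 0, 1, 1]:
# material_1 has now been used twice, material_4 for the first time.
running = df.groupby("Client_ID")[material_cols].cumsum()

# True from a client's first use of a material until the next use
mask = running == 1

# Multiplying by the boolean mask zeroes every repeat use;
# rows with a 0 stay 0 regardless of the mask.
df[material_cols] = df[material_cols] * mask

print(df)
```

The running count reaches 2 at the second use, so the mask is False there and the 1 is zeroed, while a first use (count 1) is kept.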
Note that, depending on your DataFrame, you may be able to replace df[material_cols] with df.iloc[:, 2:].
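With 19 material columns, typing the list by hand or relying on column positions can be error-prone. One possible sketch is to pick the columns by name prefix instead. The helper name `keep_first_use` and its parameters are hypothetical, and sorting by `Visit_DT` is an assumption that "first" means the earliest visit; drop the sort if row order already defines it.

```python
import pandas as pd

def keep_first_use(df, id_col="Client_ID", date_col="Visit_DT", prefix="material_"):
    """Hypothetical helper: keep only each client's first use of every material."""
    # Select material_1 ... material_19 (or however many exist) by name
    material_cols = [c for c in df.columns if c.startswith(prefix)]

    # Assumption: "first" means earliest visit date, so order rows by date
    out = df.sort_values(date_col, kind="stable").copy()

    # Running count per client; it equals 1 only until the second use
    mask = out.groupby(id_col)[material_cols].cumsum() == 1

    # Keep values where the mask is True, write 0 everywhere else
    out[material_cols] = out[material_cols].where(mask, 0)

    # Restore the original row order
    return out.sort_index()
```

Calling `result = keep_first_use(df)` returns a new DataFrame and leaves `df` unchanged.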