I have a CSV file that looks like this:
Timestamp Status
1501 Normal
1501 Normal
1502 Delay
1503 Received
1504 Normal
1504 Delay
1505 Received
1506 Received
1507 Delay
1507 Received
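I was able to add a new "Notif" column to the DataFrame. It acts as a counter that is incremented each time the value "Received" appears in the "Status" column. My current output looks like this: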
Timestamp Status Notif
1501 Normal N0
1501 Normal N0
1502 Delay N0
1503 Received N1
1504 Normal N1
1504 Delay N1
1505 Received N2
1506 Received N3
1507 Delay N3
1507 Received N4
Now I want to remove all duplicate values in that column and keep only the first occurrence of each. The output I want is:
Timestamp Status Notif
1501 Normal N0
1501 Normal
1502 Delay
1503 Received N1
1504 Normal
1504 Delay
1505 Received N2
1506 Received N3
1507 Delay
1507 Received N4
To get the first output, with N0, N0, N0, N1, N1, N1, N2, N3, N3, N4, I used this code:
df['Notif'] = None
counter = 0
for idx, row in df.iterrows():
    if df.iloc[idx, 1] == "Received":
        counter += 1
    df.iloc[idx, -1] = "N" + str(counter)
To remove the duplicate values I used:
df.drop_duplicates(subset='Notif', keep="first")
After running the drop-duplicates code, the "Notif" column always seems to contain a strange numeric value, 400.
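Answer 0: (score: 0)
You could make the assignment part of the loop branch that finds the "Received" string. That way you don't have to remove anything; the label is only written on the right rows.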
df['Notif'] = None
counter = 0
for idx, row in df.iterrows():
    if df.iloc[idx, 1] == "Received":
        counter += 1
        df.iloc[idx, -1] = "N" + str(counter)
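For reference, here is a self-contained sketch of that loop on the sample data from the question. Two things are assumptions rather than part of the answer above: the column is initialised with empty strings instead of None so that unlabeled cells print as blanks, and the first row is given "N0" explicitly, because the loop only writes a label on "Received" rows and would otherwise leave the first row empty.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Timestamp": [1501, 1501, 1502, 1503, 1504, 1504, 1505, 1506, 1507, 1507],
    "Status": ["Normal", "Normal", "Delay", "Received", "Normal",
               "Delay", "Received", "Received", "Delay", "Received"],
})

df["Notif"] = ""        # assumption: blank cells are empty strings, not None
df.iloc[0, -1] = "N0"   # assumption: the first row carries the initial label

counter = 0
for idx in range(len(df)):
    # Only rows whose Status is "Received" get a new label
    if df.iloc[idx, 1] == "Received":
        counter += 1
        df.iloc[idx, -1] = "N" + str(counter)

print(df)
```

Using positional indexing with range(len(df)) avoids relying on the index labels being 0, 1, 2, ..., which the iterrows() version implicitly assumes when it passes idx to iloc.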
Answer 1: (score: 0)
No loop is needed (unlike in the other answer). A single statement is enough:
df.Notif = df.Notif.mask(df.Notif.duplicated(), '')
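df.Notif.duplicated() generates a bool Series that marks the duplicated values, except for the first occurrence of each (the default for keep is 'first').
It is then used as the condition in mask, which sets the empty string (the second argument) on every element where the condition is True.
I assume you want just an empty string in these rows, rather than NaN, as mentioned in one of the comments.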
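As a rough end-to-end sketch, the same idea can also replace the loop from the question. The sample data is copied from the question; building the labels with cumsum() is an assumption of this example and not something the question or this answer uses.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Timestamp": [1501, 1501, 1502, 1503, 1504, 1504, 1505, 1506, 1507, 1507],
    "Status": ["Normal", "Normal", "Delay", "Received", "Normal",
               "Delay", "Received", "Received", "Delay", "Received"],
})

# Running count of "Received" rows, turned into labels N0, N1, ...
labels = "N" + df["Status"].eq("Received").cumsum().astype(str)

# Blank out every repeat of a label, keeping only its first occurrence
df["Notif"] = labels.mask(labels.duplicated(), "")

print(df)
```

Note that drop_duplicates, as used in the question, removes whole rows and returns a new DataFrame instead of modifying df in place, so it cannot produce the desired output with blank cells.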