I label duplicated values in a file with "duplicate", using the following (working) code:
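import pandas as pd

frame = pd.read_excel(io=r"D:XXXX\test.xlsx")
df = pd.DataFrame(frame)
# One boolean mask per column pair; keep=False marks every row of a duplicate group
dup = [df.duplicated(subset=list(i), keep=False) for i in [("id","Type"),("id","Time"),("Time","Type")]]
duplicate = "duplicate"
# Label a row as soon as any of the masks flags it
for i in range(len(dup)):
    for j in range(len(dup[i])):
        if dup[i][j]:
            df.loc[j, "Attribute"] = duplicate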
The DataFrame looks something like this:
id   Type  Time
12   ab    12:00:00
11   cd    11:12:22
663  dfd   10:00:00
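But if the file has many rows, this approach becomes tedious. I am looking for a way to replace the loops, either with a lambda or with a list comprehension.
Thanks for your replies.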
Answer 0 (score: 1)
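I believe DataFrame.duplicated and Series.apply are exactly what you are looking for:
# Each pass overwrites 'Attribute'; the last pair determines the result
for i in [("id","Type"),("id","Time"),("Time","Type")]:
    df['Attribute'] = df.duplicated(subset=list(i), keep=False).apply(lambda x: "duplicate" if x else "not duplicate")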
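A solution with numpy.where:
import numpy as np
# Each pass overwrites 'Attribute'; the last pair determines the result
for i in [("id","Type"),("id","Time"),("Time","Type")]:
    df['Attribute'] = np.where(df.duplicated(subset=list(i), keep=False), "duplicate", "not duplicate")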
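Because each pass assigns `Attribute` afresh, the loops above end up reporting only the last pair, ("Time","Type"). If the intent is to flag a row when any of the pairs is duplicated, as the loop in the question does, one option is to OR the masks together before labelling. A minimal sketch; the helper name `flag_duplicates` and the three-row sample frame are made up for illustration and are not part of the original post:

```python
import numpy as np
import pandas as pd


def flag_duplicates(df, pairs, column="Attribute"):
    """Return a copy of df with `column` marking rows duplicated on any pair."""
    # One boolean Series per column pair; keep=False marks every row of a duplicate group.
    masks = [df.duplicated(subset=list(pair), keep=False) for pair in pairs]
    # Put the masks side by side and OR them row-wise.
    any_dup = pd.concat(masks, axis=1).any(axis=1)
    out = df.copy()
    out[column] = np.where(any_dup, "duplicate", "not duplicate")
    return out


# Hypothetical sample: rows 0 and 1 share (id, Type) but no other pair.
sample = pd.DataFrame({
    "id": [1, 1, 2],
    "Type": ["a", "a", "b"],
    "Time": ["10:00:00", "11:00:00", "12:00:00"],
})
pairs = [("id", "Type"), ("id", "Time"), ("Time", "Type")]
result = flag_duplicates(sample, pairs)
print(result)
```

On this sample the first two rows are labelled "duplicate" because of the ("id","Type") pair, whereas a loop that overwrites the column would leave all three rows as "not duplicate".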
I used this DataFrame as input:
    id   Type      Time
0   12     ab  12:00:00
1   12  abacd  11:12:22
2  663    dfd  10:00:00
3   11     ab  12:00:00
4  663    dfd  10:00:00
5   11   caad  11:12:22
This is the output:
    id   Type      Time      Attribute
0   12     ab  12:00:00      duplicate
1   12  abacd  11:12:22  not duplicate
2  663    dfd  10:00:00      duplicate
3   11     ab  12:00:00      duplicate
4  663    dfd  10:00:00      duplicate
5   11   caad  11:12:22  not duplicate
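To check which column pair is responsible for each flag, the per-pair masks can be kept side by side. The sketch below rebuilds the input frame by hand, with the times kept as plain strings (an assumption about how they were read from the file); the `dup_...` column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [12, 12, 663, 11, 663, 11],
    "Type": ["ab", "abacd", "dfd", "ab", "dfd", "caad"],
    "Time": ["12:00:00", "11:12:22", "10:00:00", "12:00:00", "10:00:00", "11:12:22"],
})
pairs = [("id", "Type"), ("id", "Time"), ("Time", "Type")]

# One boolean column per pair, e.g. "dup_id_Type".
breakdown = pd.DataFrame({
    "dup_" + "_".join(pair): df.duplicated(subset=list(pair), keep=False)
    for pair in pairs
})
# Rows flagged by at least one pair.
breakdown["any"] = breakdown.any(axis=1)
print(breakdown)
```

With this input, rows 2 and 4 are flagged by all three pairs, while rows 0 and 3 are flagged only by ("Time","Type").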
Hope this helps.