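Vectorized way to remove duplicates from a DataFrame

Asked: 2017-07-08 10:04:12

Tags: python python-3.x pandas

I mark duplicated values in the file with the label "duplicate", using the following (working) code: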

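import pandas as pd

frame=pd.read_excel(io=r"D:XXXX\test.xlsx")
df=pd.DataFrame(frame)  # redundant: read_excel already returns a DataFrame

# One boolean Series per column pair; keep=False marks every member of a duplicated group
dup=[df.duplicated(subset=list(i),keep=False) for i in [("id","Type"),("id","Time"),("Time","Type")]]
duplicate="duplicate"

# Label a row as soon as any of the three masks flags it
for i in range(len(dup)):
    for j in range(len(dup[i])):
        if dup[i][j]:
            df.loc[j,"Attribute"]=duplicate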

The DataFrame looks something like this:

id   Type  Time
12   ab    12:00:00
11   cd    11:12:22
663  dfd   10:00:00

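But when the file has many rows, this approach becomes slow and tedious. I am looking for a way to replace the loops, either with a lambda or with a list comprehension.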

Thanks for your replies.

1 answer:

Answer 0 (score: 1)

I believe DataFrame.duplicated combined with Series.apply is exactly what you are looking for:

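mask = pd.Series(False, index=df.index)
for i in [("id","Type"),("id","Time"),("Time","Type")]:
    mask |= df.duplicated(subset=list(i),keep=False)  # OR the masks so earlier pairs are not overwritten
df['Attribute'] = mask.apply(lambda x: "duplicate" if x else "not duplicate")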
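
The keep=False argument matters here: with the default keep="first", the first occurrence in each group would be left unmarked. A minimal sketch of the difference, using a hypothetical three-row frame (the name demo and its values are made up for illustration and are not the asker's data):

```python
import pandas as pd

# Hypothetical three-row frame, not the asker's data.
demo = pd.DataFrame({"id": [1, 1, 2], "Type": ["a", "a", "b"]})

# Default keep="first": only the later occurrences are marked.
first = demo.duplicated(subset=["id", "Type"])

# keep=False: every member of a duplicated group is marked.
every = demo.duplicated(subset=["id", "Type"], keep=False)

print(first.tolist())  # [False, True, False]
print(every.tolist())  # [True, True, False]
```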

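A solution with numpy.where (this needs import numpy as np):

mask = np.zeros(len(df), dtype=bool)
for i in [("id","Type"),("id","Time"),("Time","Type")]:
    mask |= df.duplicated(subset=list(i),keep=False).to_numpy()  # OR the masks across all pairs
df['Attribute'] = np.where(mask,"duplicate","not duplicate")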

I used this DataFrame as input:

    id   Type      Time
0   12     ab  12:00:00
1   12  abacd  11:12:22
2  663    dfd  10:00:00
3   11     ab  12:00:00
4  663    dfd  10:00:00
5   11   caad  11:12:22

This is the output:

    id   Type      Time      Attribute
0   12     ab  12:00:00      duplicate
1   12  abacd  11:12:22  not duplicate
2  663    dfd  10:00:00      duplicate
3   11     ab  12:00:00      duplicate
4  663    dfd  10:00:00      duplicate
5   11   caad  11:12:22  not duplicate
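
For anyone who wants to check the output above, here is a self-contained sketch that rebuilds the sample input and flags a row when it is duplicated on any of the three column pairs. It is one possible loop-free variant rather than the only way to do it; the names pairs, masks and any_dup are introduced here for illustration, and Time is stored as plain strings, which is an assumption about the original file.

```python
import numpy as np
import pandas as pd

# Sample input copied from the table above (Time kept as strings by assumption).
df = pd.DataFrame({
    "id": [12, 12, 663, 11, 663, 11],
    "Type": ["ab", "abacd", "dfd", "ab", "dfd", "caad"],
    "Time": ["12:00:00", "11:12:22", "10:00:00",
             "12:00:00", "10:00:00", "11:12:22"],
})

pairs = [["id", "Type"], ["id", "Time"], ["Time", "Type"]]

# One boolean array per column pair; keep=False marks all members of a group.
masks = [df.duplicated(subset=p, keep=False).to_numpy() for p in pairs]

# Element-wise OR across the three arrays: True if any pair is duplicated.
any_dup = np.logical_or.reduce(masks)

df["Attribute"] = np.where(any_dup, "duplicate", "not duplicate")
print(df)
```

Combining the masks with OR before assigning means the label reflects all three pairs, which matches the behaviour of the nested loops in the question.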

Hope this helps.