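How can I get, in Python, the ImgFileNames whose HashCodes occur more than once? Note: I want to keep only the first match and drop the rest, wherever the repeated value occurs: somewhere in between, at the end, or at any other position.
I have a dataframe like the following: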
ImgFileName HashCodes
Img_0001 - Copy.tif 162a47470f021a60
Img_0001.tif 162a47470f021a60
Img_0002.tif 1b5b5b1aa638dac8
Img_0003.tif adadadadadadadad
Img_0004.tif adadadadadadadad
Img_0005 - Copy.tif a5b8648c8c666670
Img_0005.tif a5b8648c8c666670
Img_0006.tif 71b392da6a699392
Img_0007.tif 71b392da6a699392
Img_0008.tif b1b1f2fa6bf97292
Img_0009.tif 86e82ae4c8b6c9c9
Img_0010 - Copy.tif 86e8aae4c8b6c9c9
Img_0010.tif 86e8aae4c8b6c9c9
I want output like the following:
ImgFileName HashCodes
Img_0001 - Copy.tif 162a47470f021a60
Img_0003.tif adadadadadadadad
Img_0005 - Copy.tif a5b8648c8c666670
Img_0006.tif 71b392da6a699392
Img_0009.tif 86e82ae4c8b6c9c9
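Answer 0 (score: 1)
You need duplicated with boolean indexing: first select every row whose hash is duplicated (keep=False), then exclude either the first occurrence in each group (the default, keep='first') or the last occurrence (keep='last'):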
df = df[df.duplicated('HashCodes', keep=False) & df.duplicated('HashCodes')]
print(df)
ImgFileName HashCodes
1 Img_0001.tif 162a47470f021a60
4 Img_0004.tif adadadadadadadad
6 Img_0005.tif a5b8648c8c666670
8 Img_0007.tif 71b392da6a699392
12 Img_0010.tif 86e8aae4c8b6c9c9
Or, using keep='last' to exclude the last occurrence in each group instead:
df = df[df.duplicated('HashCodes', keep=False) & df.duplicated('HashCodes', keep='last')]
print(df)
ImgFileName HashCodes
0 Img_0001 - Copy.tif 162a47470f021a60
3 Img_0003.tif adadadadadadadad
5 Img_0005 - Copy.tif a5b8648c8c666670
7 Img_0006.tif 71b392da6a699392
11 Img_0010 - Copy.tif 86e8aae4c8b6c9c9
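In the sample data every duplicated hash appears exactly twice, so the keep='last' mask leaves one row per hash. If a hash appeared three or more times, that mask would return every row of the group except the last one. The following is a minimal sketch, assuming the goal is only the first occurrence of each duplicated hash; the dataframe is rebuilt from the question's data, and the names in_dup_group, not_first and first_of_dupes are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ImgFileName": [
        "Img_0001 - Copy.tif", "Img_0001.tif", "Img_0002.tif",
        "Img_0003.tif", "Img_0004.tif", "Img_0005 - Copy.tif",
        "Img_0005.tif", "Img_0006.tif", "Img_0007.tif", "Img_0008.tif",
        "Img_0009.tif", "Img_0010 - Copy.tif", "Img_0010.tif",
    ],
    "HashCodes": [
        "162a47470f021a60", "162a47470f021a60", "1b5b5b1aa638dac8",
        "adadadadadadadad", "adadadadadadadad", "a5b8648c8c666670",
        "a5b8648c8c666670", "71b392da6a699392", "71b392da6a699392",
        "b1b1f2fa6bf97292", "86e82ae4c8b6c9c9", "86e8aae4c8b6c9c9",
        "86e8aae4c8b6c9c9",
    ],
})

# True for every row whose hash occurs more than once.
in_dup_group = df.duplicated("HashCodes", keep=False)
# True for every occurrence of a hash except the first one.
not_first = df.duplicated("HashCodes")

# First occurrence of each duplicated hash, regardless of group size.
first_of_dupes = df[in_dup_group & ~not_first]
print(first_of_dupes)
```

On this data the result has the same rows (index 0, 3, 5, 7, 11) as the keep='last' version above; the two only diverge once a hash occurs more than twice.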