I have a CSV file. It looks like this:
name,id,
AAA,1111,
BBB,2222,
CCC,3333,
DDD,2222,
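I want to know whether the id column contains any duplicates and, if it does, which values are duplicated. In this case the answer is 2222.
I already have code that determines whether duplicates exist: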
import pandas as pd
csv_file = 'C:/test.csv'
df = pd.read_csv(csv_file)
df['id'].duplicated().any()
The question is: how do I find the duplicated values themselves?
I am using Python 2.7 and pandas.
Answer 0 (score: 1)
I think you can use duplicated as a boolean mask (keep can be omitted, because keep='first' is the default). If you need the values as a plain list, add tolist:
print df['id'][df.duplicated(subset=['id'])]
3 2222
Name: id, dtype: int64
print df['id'][df.duplicated(subset=['id'])].tolist()
[2222]
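One caveat: with the default keep='first', every occurrence after the first is marked, so an id that appeared three or more times would show up more than once in that list. A minimal sketch that reports each duplicated value once, assuming the same data built inline instead of read from C:/test.csv (the names mask and dup_values are only illustrative):

```python
import pandas as pd

# Same data as in the question, built inline (assumption: no CSV file on disk).
df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC', 'DDD'],
                   'id': [1111, 2222, 3333, 2222]})

# True for every repeat of an id after its first occurrence.
mask = df['id'].duplicated()

# Select the marked ids and collapse repeats so each value is listed once.
dup_values = df.loc[mask, 'id'].unique().tolist()
```

For the sample data this gives the same single value as before; the difference only appears once an id occurs more than twice.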
You can check what duplicated returns for each keep option:
print df.duplicated(subset=['id'], keep='first')
0 False
1 False
2 False
3 True
dtype: bool
print df.duplicated(subset=['id'], keep='last')
0 False
1 True
2 False
3 False
dtype: bool
print df.duplicated(subset=['id'], keep=False)
0 False
1 True
2 False
3 True
dtype: bool
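The three keep settings differ only in which occurrence is left unmarked: the first, the last, or none. As a cross-check, counting how often each id occurs should point at the same values. This is a sketch under the assumption that the data is built inline; counts and dup_counts are illustrative names, not part of the original answer:

```python
import pandas as pd

# Same data as in the question, built inline (assumption: no CSV file on disk).
df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC', 'DDD'],
                   'id': [1111, 2222, 3333, 2222]})

# Number of occurrences of each id.
counts = df['id'].value_counts()

# Keep only ids that occur more than once.
dup_counts = counts[counts > 1]

# The ids marked by keep=False should be exactly the ids counted more than once.
marked_ids = set(df.loc[df['id'].duplicated(keep=False), 'id'])
same_ids = marked_ids == set(dup_counts.index)
```

value_counts also tells you how many times each duplicated id occurs, which the boolean mask alone does not.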
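If you need the duplicated rows themselves, index the DataFrame with the boolean mask (subset limits the comparison to the id column):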
print df[df.duplicated(subset=['id'], keep='first')]
name id
3 DDD 2222
print df[df.duplicated(subset=['id'], keep='last')]
name id
1 BBB 2222
print df[df.duplicated(subset=['id'], keep=False)]
name id
1 BBB 2222
3 DDD 2222
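If it helps to see which rows share each duplicated id, the keep=False selection can be grouped by id. A hedged sketch, again assuming inline data; dup_rows and names_by_id are hypothetical names chosen for illustration:

```python
import pandas as pd

# Same data as in the question, built inline (assumption: no CSV file on disk).
df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC', 'DDD'],
                   'id': [1111, 2222, 3333, 2222]})

# All rows whose id occurs more than once (no occurrence is spared).
dup_rows = df[df.duplicated(subset=['id'], keep=False)]

# Map each duplicated id to the list of names that share it.
names_by_id = dup_rows.groupby('id')['name'].apply(list).to_dict()
```

Here names_by_id maps 2222 to the two names BBB and DDD.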
Use drop_duplicates to remove them:
print df.drop_duplicates(subset=['id'], keep='first')
name id
0 AAA 1111
1 BBB 2222
2 CCC 3333
print df.drop_duplicates(subset=['id'], keep='last')
name id
0 AAA 1111
2 CCC 3333
3 DDD 2222
print df.drop_duplicates(subset=['id'], keep=False)
name id
0 AAA 1111
2 CCC 3333