Finding duplicates in a Python pandas data structure

Time: 2016-02-13 06:45:16

Tags: python python-2.7 pandas

I have a CSV file. It looks like this:

name,id,
AAA,1111,
BBB,2222,
CCC,3333,
DDD,2222,

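I want to know whether the id column contains duplicates and, if it does, which values are duplicated. In this case the answer is 2222.

I have code that determines whether any duplicates exist: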

import pandas as pd
csv_file = 'C:/test.csv'
df = pd.read_csv(csv_file)
df['id'].duplicated().any()

The question is how to find out which values are duplicated.

I am using Python 2.7 and pandas.

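1 answer:

Answer 0: (score: 1)

I think you can use duplicated with the subset parameter (keep can be omitted, because keep='first' is the default). If you need the values as a list, add tolist: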

print df['id'][df.duplicated(subset=['id'])]
3    2222
Name: id, dtype: int64

print df['id'][df.duplicated(subset=['id'])].tolist()
[2222]
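
If an id can occur three or more times, the mask above returns that value more than once. Here is a minimal sketch that reports each duplicated value only once, together with how often it occurs. It assumes the sample data from the question, built inline rather than read from the CSV file; the names counts, dup_counts and dup_values are only illustrative.

```python
import pandas as pd

# Sample data from the question, built inline instead of read from the CSV file.
df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC', 'DDD'],
                   'id': [1111, 2222, 3333, 2222]})

# Count how often each id occurs, then keep only the ids seen more than once.
counts = df['id'].value_counts()
dup_counts = counts[counts > 1]

# Distinct duplicated values as a plain Python list.
dup_values = dup_counts.index.tolist()
print(dup_values)
```

On this data the list contains only 2222, and dup_counts shows that it occurs twice.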

You can check what duplicated returns for each value of keep:

print df.duplicated(subset=['id'], keep='first')
0    False
1    False
2    False
3     True
dtype: bool

print df.duplicated(subset=['id'], keep='last')
0    False
1     True
2    False
3    False
dtype: bool

print df.duplicated(subset=['id'], keep=False)
0    False
1     True
2    False
3     True
dtype: bool
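
To compare the three keep options side by side, the masks can be collected into one DataFrame. This is only a sketch on the question's sample data, built inline; the column names keep_first, keep_last and keep_false are made up for the illustration.

```python
import pandas as pd

# Sample data from the question, built inline instead of read from the CSV file.
df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC', 'DDD'],
                   'id': [1111, 2222, 3333, 2222]})

# One boolean column per keep option (column names are illustrative only).
masks = pd.DataFrame({
    'keep_first': df.duplicated(subset=['id'], keep='first'),
    'keep_last': df.duplicated(subset=['id'], keep='last'),
    'keep_false': df.duplicated(subset=['id'], keep=False),
}, columns=['keep_first', 'keep_last', 'keep_false'])
print(masks)

# keep=False marks every row that either of the other two options marks.
combined = masks['keep_first'] | masks['keep_last']
print(combined.equals(masks['keep_false']))
```

So keep='first' flags the later occurrence, keep='last' flags the earlier one, and keep=False flags both.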

If you need the duplicated rows themselves, use the mask (built with subset=['id']) to select them from the DataFrame:

print df[df.duplicated(subset=['id'], keep='first')]
  name    id
3  DDD  2222

print df[df.duplicated(subset=['id'], keep='last')]
  name    id
1  BBB  2222

print df[df.duplicated(subset=['id'], keep=False)]
  name    id
1  BBB  2222
3  DDD  2222
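
Once you have all duplicated rows (keep=False), it may also help to see which names share each duplicated id. A possible sketch, again assuming the question's sample data built inline; dups and names_by_id are illustrative names, not part of pandas.

```python
import pandas as pd

# Sample data from the question, built inline instead of read from the CSV file.
df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC', 'DDD'],
                   'id': [1111, 2222, 3333, 2222]})

# All rows whose id occurs more than once.
dups = df[df.duplicated(subset=['id'], keep=False)]

# For each duplicated id, collect the names that share it.
names_by_id = dups.groupby('id')['name'].apply(list).to_dict()
print(names_by_id)
```

For this data the result maps 2222 to the names BBB and DDD.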

Use drop_duplicates to remove them:

print df.drop_duplicates(subset=['id'], keep='first')
  name    id
0  AAA  1111
1  BBB  2222
2  CCC  3333

print df.drop_duplicates(subset=['id'], keep='last')
  name    id
0  AAA  1111
2  CCC  3333
3  DDD  2222

print df.drop_duplicates(subset=['id'], keep=False)
  name    id
0  AAA  1111
2  CCC  3333
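
As a quick sanity check, drop_duplicates should give the same rows as selecting with the inverted duplicated mask, and the id column should be unique afterwards. A small sketch under the same assumption (sample data built inline; deduped and via_mask are illustrative names):

```python
import pandas as pd

# Sample data from the question, built inline instead of read from the CSV file.
df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC', 'DDD'],
                   'id': [1111, 2222, 3333, 2222]})

# Remove later occurrences of each id, keeping the first one.
deduped = df.drop_duplicates(subset=['id'], keep='first')

# The same selection expressed by inverting the duplicated mask.
via_mask = df[~df.duplicated(subset=['id'], keep='first')]

print(deduped.equals(via_mask))   # both approaches select the same rows
print(deduped['id'].is_unique)    # no duplicated ids remain
```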