How do I drop duplicates in pandas?

Asked: 2016-04-17 08:41:18

Tags: python, pandas

I have a lot of data in Excel files. I want to concatenate the data into a single Excel file, dropping duplicate records based on the id column.

df1 
   id    name   date 
0   1    cab    2017
1  11    den    2012 
2  13    ers    1998


df2 
   id    name   date 
0  11    den    2012
1  14    ces    2011 
2   4    guk    2007

I want to end up with the following concatenated file:

Concat df
   id    name   date 
0   1    cab    2017
1  11    den    2012 
2  13    ers    1998
1  14    ces    2011 
2   4    guk    2007

I tried the following, but it does not drop the duplicates. Can anyone suggest how to solve this?

pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
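For reference, `drop_duplicates()` with no arguments only removes rows that are identical in *every* column. A minimal sketch using the sample frames from the question (the constructor calls below are my own reconstruction of df1 and df2):

```python
import pandas as pd

# Reconstructed sample frames from the question
df1 = pd.DataFrame({'id': [1, 11, 13],
                    'name': ['cab', 'den', 'ers'],
                    'date': [2017, 2012, 1998]})
df2 = pd.DataFrame({'id': [11, 14, 4],
                    'name': ['den', 'ces', 'guk'],
                    'date': [2012, 2011, 2007]})

# With no arguments, drop_duplicates removes only rows that match in
# every column, so the shared (11, den, 2012) row is dropped here, but
# two rows with the same id and any other differing value would both survive.
result = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
print(result)
```

This works on the toy data because the duplicate rows are identical in all three columns; if any column differs between the copies, the duplicates remain.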

My concatenated data looks like this; the duplicate ids are still in the file.

    id                  created_at          retweet_count
0   721557296757797000  2016-04-17 04:34:00 21
1   721497712726844000  2016-04-17 00:37:14 94
2   721462059515453000  2016-04-16 22:15:33 0
3   721460623285072000  2016-04-16 22:09:51 0
4   721460397241446000  2016-04-16 22:08:57 0
5   721459817651577000  2016-04-16 22:06:39 0
6   721456334894469000  2016-04-16 21:52:48 0
7   721557296757797000  2016-04-17 04:34:00 21
8   721497712726844000  2016-04-17 00:37:14 94
9   721462059515453000  2016-04-16 22:15:33 0
10  721460623285072000  2016-04-16 22:09:51 0
11  721460397241446000  2016-04-16 22:08:57 0
12  721459817651577000  2016-04-16 22:06:39 0
13  721456334894469000  2016-04-16 21:52:48 0

2 Answers:

Answer 0 (score: 1)

I think you need to add the subset parameter to drop_duplicates so that it filters on the id column:

print(pd.concat([df1,df2]).drop_duplicates(subset='id').reset_index(drop=True))
   id name  date
0   1  cab  2017
1  11  den  2012
2  13  ers  1998
3  14  ces  2011
4   4  guk  2007
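Worth noting (not part of the original answer): subset='id' deduplicates on id alone, and the keep parameter controls which occurrence survives. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame with two rows sharing id 11 but different values
df = pd.DataFrame({'id': [1, 11, 13, 11],
                   'name': ['cab', 'den', 'ers', 'den2'],
                   'date': [2017, 2012, 1998, 2013]})

# keep='first' (the default) retains the first occurrence of each id;
# keep='last' retains the last; keep=False drops every duplicated id.
print(df.drop_duplicates(subset='id'))               # rows 0, 1, 2
print(df.drop_duplicates(subset='id', keep='last'))  # rows 0, 2, 3
print(df.drop_duplicates(subset='id', keep=False))   # rows 0, 2
```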

Edit:

I tried it with your new data and it works for me:

import pandas as pd

df = pd.DataFrame({'created_at': {0: '2016-04-17 04:34:00', 1: '2016-04-17 00:37:14', 2: '2016-04-16 22:15:33', 3: '2016-04-16 22:09:51', 4: '2016-04-16 22:08:57', 5: '2016-04-16 22:06:39', 6: '2016-04-16 21:52:48', 7: '2016-04-17 04:34:00', 8: '2016-04-17 00:37:14', 9: '2016-04-16 22:15:33', 10: '2016-04-16 22:09:51', 11: '2016-04-16 22:08:57', 12: '2016-04-16 22:06:39', 13: '2016-04-16 21:52:48'}, 'retweet_count': {0: 21, 1: 94, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 21, 8: 94, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0}, 'id': {0: 721557296757797000, 1: 721497712726844000, 2: 721462059515453000, 3: 721460623285072000, 4: 721460397241446000, 5: 721459817651577000, 6: 721456334894469000, 7: 721557296757797000, 8: 721497712726844000, 9: 721462059515453000, 10: 721460623285072000, 11: 721460397241446000, 12: 721459817651577000, 13: 721456334894469000}},
                  columns=['id','created_at','retweet_count'])
print(df)
                    id           created_at  retweet_count
0   721557296757797000  2016-04-17 04:34:00             21
1   721497712726844000  2016-04-17 00:37:14             94
2   721462059515453000  2016-04-16 22:15:33              0
3   721460623285072000  2016-04-16 22:09:51              0
4   721460397241446000  2016-04-16 22:08:57              0
5   721459817651577000  2016-04-16 22:06:39              0
6   721456334894469000  2016-04-16 21:52:48              0
7   721557296757797000  2016-04-17 04:34:00             21
8   721497712726844000  2016-04-17 00:37:14             94
9   721462059515453000  2016-04-16 22:15:33              0
10  721460623285072000  2016-04-16 22:09:51              0
11  721460397241446000  2016-04-16 22:08:57              0
12  721459817651577000  2016-04-16 22:06:39              0
13  721456334894469000  2016-04-16 21:52:48              0

print(df.dtypes)

id                int64
created_at       object
retweet_count     int64
dtype: object


print(df.drop_duplicates(subset='id').reset_index(drop=True))
                   id           created_at  retweet_count
0  721557296757797000  2016-04-17 04:34:00             21
1  721497712726844000  2016-04-17 00:37:14             94
2  721462059515453000  2016-04-16 22:15:33              0
3  721460623285072000  2016-04-16 22:09:51              0
4  721460397241446000  2016-04-16 22:08:57              0
5  721459817651577000  2016-04-16 22:06:39              0
6  721456334894469000  2016-04-16 21:52:48              0

Answer 1 (score: 0)

Another way:

df1.append(df2).groupby('id').first()
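A note on this answer (added): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea is written with pd.concat. A sketch using reconstructed versions of the question's frames:

```python
import pandas as pd

# Reconstructed sample frames from the question
df1 = pd.DataFrame({'id': [1, 11, 13], 'name': ['cab', 'den', 'ers'],
                    'date': [2017, 2012, 1998]})
df2 = pd.DataFrame({'id': [11, 14, 4], 'name': ['den', 'ces', 'guk'],
                    'date': [2012, 2011, 2007]})

# groupby('id').first() keeps the first non-null row seen for each id,
# but unlike drop_duplicates it sorts by id and moves id into the index.
out = pd.concat([df1, df2]).groupby('id').first()
print(out)
```

If you need id back as a regular column in the original row order, the drop_duplicates(subset='id') approach from the accepted answer is the more direct fit.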