我在excel文件中有很多数据。我想根据id列信息删除重复记录,将这些数据连接成一个excel文件。
df1
id name date
0 1 cab 2017
1 11 den 2012
2 13 ers 1998
df2
id name date
0 11 den 2012
1 14 ces 2011
2 4 guk 2007
我想最终得到以下的concantenated文件。
Concat df
id name date
0 1 cab 2017
1 11 den 2012
2 13 ers 1998
1 14 ces 2011
2 4 guk 2007
我在下面尝试,但它不会删除重复项。任何人都可以建议如何解决这个问题?
pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
我的连锁数据如下。重复的ID仍在文件中。
id created_at retweet_count
0 721557296757797000 2016-04-17 04:34:00 21
1 721497712726844000 2016-04-17 00:37:14 94
2 721462059515453000 2016-04-16 22:15:33 0
3 721460623285072000 2016-04-16 22:09:51 0
4 721460397241446000 2016-04-16 22:08:57 0
5 721459817651577000 2016-04-16 22:06:39 0
6 721456334894469000 2016-04-16 21:52:48 0
7 721557296757797000 2016-04-17 04:34:00 21
8 721497712726844000 2016-04-17 00:37:14 94
9 721462059515453000 2016-04-16 22:15:33 0
10 721460623285072000 2016-04-16 22:09:51 0
11 721460397241446000 2016-04-16 22:08:57 0
12 721459817651577000 2016-04-16 22:06:39 0
13 721456334894469000 2016-04-16 21:52:48 0
答案 0 :(得分:1)
我认为您需要将参数subset
添加到drop_duplicates
,以便按列id
进行过滤:
print pd.concat([df1,df2]).drop_duplicates(subset='id').reset_index(drop=True)
id name date
0 1 cab 2017
1 11 den 2012
2 13 ers 1998
3 14 ces 2011
4 4 guk 2007
编辑:
我尝试了您的新数据,对我来说它有效:
import pandas as pd
df = pd.DataFrame({'created_at': {0: '2016-04-17 04:34:00', 1: '2016-04-17 00:37:14', 2: '2016-04-16 22:15:33', 3: '2016-04-16 22:09:51', 4: '2016-04-16 22:08:57', 5: '2016-04-16 22:06:39', 6: '2016-04-16 21:52:48', 7: '2016-04-17 04:34:00', 8: '2016-04-17 00:37:14', 9: '2016-04-16 22:15:33', 10: '2016-04-16 22:09:51', 11: '2016-04-16 22:08:57', 12: '2016-04-16 22:06:39', 13: '2016-04-16 21:52:48'}, 'retweet_count': {0: 21, 1: 94, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 21, 8: 94, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0}, 'id': {0: 721557296757797000, 1: 721497712726844000, 2: 721462059515453000, 3: 721460623285072000, 4: 721460397241446000, 5: 721459817651577000, 6: 721456334894469000, 7: 721557296757797000, 8: 721497712726844000, 9: 721462059515453000, 10: 721460623285072000, 11: 721460397241446000, 12: 721459817651577000, 13: 721456334894469000}},
columns=['id','created_at','retweet_count'])
print df
id created_at retweet_count
0 721557296757797000 2016-04-17 04:34:00 21
1 721497712726844000 2016-04-17 00:37:14 94
2 721462059515453000 2016-04-16 22:15:33 0
3 721460623285072000 2016-04-16 22:09:51 0
4 721460397241446000 2016-04-16 22:08:57 0
5 721459817651577000 2016-04-16 22:06:39 0
6 721456334894469000 2016-04-16 21:52:48 0
7 721557296757797000 2016-04-17 04:34:00 21
8 721497712726844000 2016-04-17 00:37:14 94
9 721462059515453000 2016-04-16 22:15:33 0
10 721460623285072000 2016-04-16 22:09:51 0
11 721460397241446000 2016-04-16 22:08:57 0
12 721459817651577000 2016-04-16 22:06:39 0
13 721456334894469000 2016-04-16 21:52:48 0
print df.dtypes
id int64
created_at object
retweet_count int64
dtype: object
print df.drop_duplicates(subset='id').reset_index(drop=True)
id created_at retweet_count
0 721557296757797000 2016-04-17 04:34:00 21
1 721497712726844000 2016-04-17 00:37:14 94
2 721462059515453000 2016-04-16 22:15:33 0
3 721460623285072000 2016-04-16 22:09:51 0
4 721460397241446000 2016-04-16 22:08:57 0
5 721459817651577000 2016-04-16 22:06:39 0
6 721456334894469000 2016-04-16 21:52:48 0
答案 1 :(得分:0)
df1.append(df2).groupby('id').first()