I have the following code that concatenates two DataFrames, and then I try to drop all duplicates based on three columns.
import pandas as pd
old = pd.read_excel('Old.xlsx')
new = pd.read_excel('New.xlsx')
old['version'] = 'old'
new['version'] = 'new'
full_set = pd.concat([old,new],ignore_index=True)
changes = full_set.drop_duplicates(subset=['ID', 'Plan', 'Sum'], keep='first')
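The problem is that when I run len(changes) I get 4013. Having done essentially the same thing in Excel, I expected a length of 371, because when I look at those three columns there are 371 combinations of values that exist in only one of the two files.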
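To make the gap between the two counts concrete, here is a minimal sketch on made-up data (the two small frames below are hypothetical stand-ins for Old.xlsx and New.xlsx, not the real files). It assumes the goal is to keep only rows whose three-column combination appears in exactly one file. With keep='first', one copy of every combination survives, including the unchanged ones; keep=False drops every row that has a duplicate, which may be closer to the Excel count.

```python
import pandas as pd

# Hypothetical stand-ins for Old.xlsx and New.xlsx (not the real data).
old = pd.DataFrame({
    'ID': [1, 1, 2],
    'Plan': ['Vision', 'Dental', 'Vision'],
    'Sum': [11.63, 51.44, 6.63],
})
new = pd.DataFrame({
    'ID': [1, 1, 2],
    'Plan': ['Vision', 'Dental', 'Vision'],
    'Sum': [11.63, 55.00, 6.63],  # only the Dental premium changed
})
old['version'] = 'old'
new['version'] = 'new'
full_set = pd.concat([old, new], ignore_index=True)

key = ['ID', 'Plan', 'Sum']

# keep='first' keeps one copy of every combination, changed or not.
kept_first = full_set.drop_duplicates(subset=key, keep='first')

# keep=False removes all rows that have a duplicate, leaving only
# combinations that occur in a single file.
only_one_file = full_set.drop_duplicates(subset=key, keep=False)

print(len(kept_first), len(only_one_file))
```

In this toy case kept_first has one row per distinct combination, while only_one_file holds just the old and new versions of the changed Dental row. Float columns such as the premium can also differ by tiny amounts after an Excel round trip, so rounding before comparing may be worth checking.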
Here is the info for new:
u'Employee ID': [101444L,
101444L,
101444L,
101444L,
101444L,
101444L,
101444L,
103421L,
103421L,
103421L],
u'Benefit Plan Type': [u'Vision',
u'Basic Life and AD&D',
u'Medical and Rx',
u'Dental',
u'Health Advocate',
u'Long Term Disability',
u'Short Term Disability',
u'Vision',
u'Basic Life and AD&D',
u'Medical and Rx'],
u'Sum of Premium': [11.63,
49.49,
876.33,
51.44,
0.8,
36.96,
0.0,
6.63,
15.93,
438.17],
And for old:
u'Employee ID': [101444L,
101444L,
101444L,
101444L,
101444L,
101444L,
101444L,
103421L,
103421L,
103421L],
u'Benefit Plan Type': [u'Vision',
u'Basic Life and AD&D',
u'Medical and Rx',
u'Dental',
u'Health Advocate',
u'Long Term Disability',
u'Short Term Disability',
u'Vision',
u'Basic Life and AD&D',
u'Medical and Rx'],
u'Sum of Premium': [11.63,
49.49,
876.33,
51.44,
0.8,
36.96,
0.0,
6.63,
15.93,
438.17],
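As a cross-check on the sample above, the following sketch rebuilds the first three sample rows as DataFrames (written as Python 3 literals, so without the L suffix and u'' prefix) and compares them with an outer merge. It is only an illustration under the assumption that the comparison columns are named as in the dump ('Employee ID', 'Benefit Plan Type', 'Sum of Premium'), which differs from the 'ID', 'Plan', 'Sum' names used in the code earlier.

```python
import pandas as pd

# First three sample rows from the dump above; old and new are identical there.
sample = {
    'Employee ID': [101444, 101444, 101444],
    'Benefit Plan Type': ['Vision', 'Basic Life and AD&D', 'Medical and Rx'],
    'Sum of Premium': [11.63, 49.49, 876.33],
}
new = pd.DataFrame(sample)
old = pd.DataFrame(sample)

key = ['Employee ID', 'Benefit Plan Type', 'Sum of Premium']

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = old.merge(new, on=key, how='outer', indicator=True)

# Rows present in only one of the two frames.
changes = merged[merged['_merge'] != 'both']

print(len(merged), len(changes))
```

Because the sample rows are the same in both frames, every merged row is marked 'both' and changes comes out empty. On the full files, the rows marked 'left_only' or 'right_only' would be the ones present in a single file.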