Question

I have the following code that concatenates two DataFrames, and then I try to drop all duplicates based on three columns.

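import pandas as pd

old = pd.read_excel('Old.xlsx')
new = pd.read_excel('New.xlsx')
# Tag each row with the file it came from
old['version'] = 'old'
new['version'] = 'new'

# Stack both frames into a single DataFrame
full_set = pd.concat([old, new], ignore_index=True)

# Drop duplicates on the three key columns, keeping the first occurrence of each
changes = full_set.drop_duplicates(subset=['ID', 'Plan', 'Sum'], keep='first')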

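The problem I am running into is that when I run

len(changes)

I get 4013, whereas doing essentially the same thing in Excel leads me to expect a length of 371, because when I look at just those three columns, there are 371 values that exist in only one of the files.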
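One possible reading of the gap between 4013 and 371 is that keep='first' still retains one copy of every row that appears in both files, while the expected count only covers rows that appear in a single file. The sketch below illustrates that difference on tiny made-up frames (the values and the `key_cols` name are assumptions for illustration, not the real Old.xlsx / New.xlsx data); it is not a confirmed diagnosis of the actual files.

```python
import pandas as pd

# Hypothetical miniature stand-ins for Old.xlsx and New.xlsx (not the real data).
old = pd.DataFrame({
    'ID': [1, 1, 2],
    'Plan': ['Vision', 'Dental', 'Vision'],
    'Sum': [11.63, 51.44, 6.63],
})
new = pd.DataFrame({
    'ID': [1, 1, 2],
    'Plan': ['Vision', 'Dental', 'Vision'],
    'Sum': [11.63, 55.00, 6.63],  # made-up change to the Dental premium
})
old['version'] = 'old'
new['version'] = 'new'

full_set = pd.concat([old, new], ignore_index=True)
key_cols = ['ID', 'Plan', 'Sum']

# keep='first' retains one row from every group of duplicates,
# so rows present in both files still appear once in the result.
first_kept = full_set.drop_duplicates(subset=key_cols, keep='first')

# keep=False drops every member of a duplicated group, leaving only
# rows whose key combination occurs exactly once across both files.
only_in_one = full_set.drop_duplicates(subset=key_cols, keep=False)

print(len(first_kept))   # 4
print(len(only_in_one))  # 2
```

In this toy case only the two Dental rows survive with keep=False, one tagged 'old' and one tagged 'new', which is the kind of result the Excel comparison seems to describe.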

Here is a sample of the data in new:

u'Employee ID': [101444L,
  101444L,
  101444L,
  101444L,
  101444L,
  101444L,
  101444L,
  103421L,
  103421L,
  103421L],
 u'Benefit Plan Type': [u'Vision',
  u'Basic Life and AD&D',
  u'Medical and Rx',
  u'Dental',
  u'Health Advocate',
  u'Long Term Disability',
  u'Short Term Disability',
  u'Vision',
  u'Basic Life and AD&D',
  u'Medical and Rx'],
u'Sum of Premium': [11.63,
  49.49,
  876.33,
  51.44,
  0.8,
  36.96,
  0.0,
  6.63,
  15.93,
  438.17],

And for old:

 u'Employee ID': [101444L,
  101444L,
  101444L,
  101444L,
  101444L,
  101444L,
  101444L,
  103421L,
  103421L,
  103421L],
u'Benefit Plan Type': [u'Vision',
  u'Basic Life and AD&D',
  u'Medical and Rx',
  u'Dental',
  u'Health Advocate',
  u'Long Term Disability',
  u'Short Term Disability',
  u'Vision',
  u'Basic Life and AD&D',
  u'Medical and Rx'],
u'Sum of Premium': [11.63,
  49.49,
  876.33,
  51.44,
  0.8,
  36.96,
  0.0,
  6.63,
  15.93,
  438.17],
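Another way to find rows that exist in only one file is an outer merge with indicator=True. The sketch below is a rough illustration under some assumptions: it uses the column names from the dumps above (the code earlier uses 'ID', 'Plan', 'Sum', so the columns were presumably renamed somewhere), the values are invented, and rounding the premium to two decimals is only a guess at guarding against floating-point noise from Excel.

```python
import pandas as pd

key_cols = ['Employee ID', 'Benefit Plan Type', 'Sum of Premium']

# Small hypothetical frames using the column names from the dumps above.
old = pd.DataFrame({
    'Employee ID': [101444, 101444, 103421],
    'Benefit Plan Type': ['Vision', 'Dental', 'Vision'],
    'Sum of Premium': [11.63, 51.44, 6.63],
})
new = pd.DataFrame({
    'Employee ID': [101444, 101444, 103421],
    'Benefit Plan Type': ['Vision', 'Dental', 'Vision'],
    'Sum of Premium': [11.63, 51.44, 7.10],  # made-up change for illustration
})

# Round premiums to whole cents so tiny floating-point differences
# do not make otherwise equal values compare as different.
for frame in (old, new):
    frame['Sum of Premium'] = frame['Sum of Premium'].round(2)

# An outer merge on the key columns with indicator=True adds a '_merge'
# column whose values are 'left_only', 'right_only' or 'both'.
merged = old.merge(new, on=key_cols, how='outer', indicator=True)

# Keep only the rows that were found in exactly one of the two frames.
changes = merged[merged['_merge'] != 'both']
print(changes)
```

Here 'left_only' marks rows found only in old and 'right_only' rows found only in new. If either file contains repeated key combinations, the merge will pair them up and can multiply rows, so the counts may need a second look in that case.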

Trying to remove duplicates from a Pandas DataFrame

0 answers: