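Checking dataframe entries for duplicate records

Time: 2017-09-21 11:04:05

Tags: python pandas

I need to validate a dataframe df that has a few thousand records, but it looks like this: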

   id         score  status
    1   0.204728295 current
    2   0.811946622 current
    3   0.255717294 current
    4   0.283495765 loan in
    4   0.355463338 loan out
    5   0.090287194 current
    6   0.195224702 current
    7   0.743183619 transfer in
    7   0.6677402   transfer out
    8   0.685349828 current
    9   0.664626162 current
    9   0.389797469 transfer in
    10  0.359471869 current
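
For anyone who wants to reproduce this, the sample can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table above, and the default integer index is an assumption, since the real data has a few thousand rows.

```python
import pandas as pd

# Sample rows copied from the table above; the real frame is much larger.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9, 9, 10],
    "score": [0.204728295, 0.811946622, 0.255717294, 0.283495765,
              0.355463338, 0.090287194, 0.195224702, 0.743183619,
              0.6677402, 0.685349828, 0.664626162, 0.389797469,
              0.359471869],
    "status": ["current", "current", "current", "loan in", "loan out",
               "current", "current", "transfer in", "transfer out",
               "current", "current", "transfer in", "current"],
})
print(df)
```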

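The rule is that df may contain a duplicated id only if:

  • one of the duplicated entries has status 'transfer in' and the other entry with the same ID has status 'transfer out', or
  • one of the duplicated entries has status 'loan in' and the other entry with the same ID has status 'loan out'

Any case where the above does not hold needs to be captured for correction.

In the example, id = 4 has a duplicate entry, but the entries are valid because the statuses are 'loan in' and 'loan out'. The same goes for id = 7, where the statuses are 'transfer in' and 'transfer out'. But id = 9 is invalid, because the statuses are 'current' and 'transfer in'.

The output of the exercise is just the records that fail validation. In this case, it would be: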

   id         score  status
    9   0.664626162 current
    9   0.389797469 transfer in

I have worked out that I can find the duplicated records with the following:

countdf = df.groupby('id').count()
# 'id' becomes the index after groupby, so count a remaining column instead
result = df.loc[df['id'].isin(countdf[countdf['status'] > 1].index)]

But I can't figure out how to check whether the duplicated ids are mapped to valid statuses.
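
As a side note, the duplicate lookup can probably be shortened with `Series.duplicated(keep=False)`, which marks every row of a repeated id rather than only the later repeats. A minimal sketch on a cut-down version of the sample (the six rows below are picked for illustration):

```python
import pandas as pd

# Cut-down sample: ids 4 and 9 are duplicated, ids 3 and 10 are not.
df = pd.DataFrame({
    "id": [3, 4, 4, 9, 9, 10],
    "status": ["current", "loan in", "loan out",
               "current", "transfer in", "current"],
})

# keep=False flags all members of a duplicated group, not just the repeats.
dupes = df[df["id"].duplicated(keep=False)]
print(dupes)
```

This only finds the duplicates; it says nothing yet about whether their statuses form a valid pair.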

2 answers:

Answer 0: (score: 2)

Here's one way

In [2111]: conds = [['transfer in', 'transfer out'], ['loan in', 'loan out']]

In [2112]: df[df.groupby('id')['status'].transform(
                  lambda x:not any(all((x==c).any() for c in cond) 
                         for cond in conds) and len(x)>1)]
Out[2112]:
    id     score       status
10   9  0.664626      current
11   9  0.389797  transfer in

Details

In [2114]: df.groupby('id')['status'].transform(
              lambda x:not any(all((x==c).any() for c in cond) for cond in conds))
Out[2114]:
0      True
1      True
2      True
3     False
4     False
5      True
6      True
7     False
8     False
9      True
10     True
11     True
12     True
Name: status, dtype: bool

In [2115]: df.groupby('id')['status'].transform(
             lambda x:not any(all((x==c).any() for c in cond) for cond in conds) and len(x)>1)
Out[2115]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11     True
12    False
Name: status, dtype: bool
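
If the nested any/all is hard to read, the same test can be spelled out with a named function and set comparison. This is a sketch of the idea rather than the code from the answer: `is_invalid` is a made-up helper name, and the frame is a cut-down version of the sample.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [3, 4, 4, 7, 7, 9, 9, 10],
    "status": ["current", "loan in", "loan out", "transfer in",
               "transfer out", "current", "transfer in", "current"],
})

conds = [["transfer in", "transfer out"], ["loan in", "loan out"]]

def is_invalid(statuses):
    """True for a duplicated id whose statuses contain no complete pair."""
    present = set(statuses)
    # set(cond) <= present means every status of the pair occurs in the group.
    has_pair = any(set(cond) <= present for cond in conds)
    return len(statuses) > 1 and not has_pair

# Evaluate once per id, then map the per-id verdict back onto the rows.
verdict = df.groupby("id")["status"].apply(is_invalid)
invalid = df[df["id"].map(verdict)]
print(invalid)
```

Like the one-liner, this accepts any group that contains a complete pair, so an id with 'current', 'loan in' and 'loan out' would pass. If that should be rejected, comparing `present` for equality with `set(cond)` may be closer to what is wanted.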

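Answer 1: (score: 0)

I have the following somewhat ugly solution. If you are looking for an elegant answer, you probably want to look at @Zero's answer and work something out from there.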

df = df.set_index('id')  # Only if 'id' is not the index yet

index_dup = set(df.index[df.index.duplicated()])
df_dup = df.loc[sorted(index_dup)]  # .loc needs a list; recent pandas rejects a set
for _, bar in df_dup.status.groupby('id'):
    if (bar == 'current').sum() > 1:
        continue  # more than one 'current' row: invalid
    if (bar == 'loan in').sum() != (bar == 'loan out').sum():
        continue  # unmatched loan: invalid
    if (bar == 'transfer in').sum() != (bar == 'transfer out').sum():
        continue  # unmatched transfer: invalid
    index_dup.remove(bar.index[0])  # every check passed, so this id is valid

df_dup = df_dup.loc[sorted(index_dup)]
print(df_dup.sort_index())  # I changed some records to create more possible failures; they are caught

       score        status
id                        
2   0.811947       current
2   0.811947       current
7   0.743184   transfer in
7   0.743184  transfer out
7   0.667740   transfer in
9   0.664626       current
9   0.389797   transfer in
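
The same count-based checks can likely be done without the loop by tabulating statuses per id with `pd.crosstab`. A rough sketch, assuming the five status values below are the only ones that occur (the rows are made up to mirror the failures in the output above):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [2, 2, 4, 4, 7, 7, 7, 9, 9],
    "status": ["current", "current", "loan in", "loan out", "transfer in",
               "transfer out", "transfer in", "current", "transfer in"],
})

# Assumed to be the complete list of statuses.
statuses = ["current", "loan in", "loan out", "transfer in", "transfer out"]

# One row per id, one column per status; reindex guarantees every column exists.
counts = pd.crosstab(df["id"], df["status"]).reindex(columns=statuses, fill_value=0)

bad = (
    (counts.sum(axis=1) > 1)  # only duplicated ids are candidates
    & (
        (counts["current"] > 1)
        | (counts["loan in"] != counts["loan out"])
        | (counts["transfer in"] != counts["transfer out"])
    )
)
bad_ids = bad[bad].index
print(df[df["id"].isin(bad_ids)])
```

Here ids 2, 7 and 9 are reported, while id 4 passes because its 'loan in' and 'loan out' counts match.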