I need to validate a DataFrame, df. It has a few thousand records, but it looks like this:
id score status
1 0.204728295 current
2 0.811946622 current
3 0.255717294 current
4 0.283495765 loan in
4 0.355463338 loan out
5 0.090287194 current
6 0.195224702 current
7 0.743183619 transfer in
7 0.6677402 transfer out
8 0.685349828 current
9 0.664626162 current
9 0.389797469 transfer in
10 0.359471869 current
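The validation rule is that an id may be duplicated in df only when its statuses form a valid pair.
Any case where this does not hold needs to be captured for correction.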
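In the example, id = 4 has a duplicate entry, but the entries are valid because the statuses are 'loan in' and 'loan out'. The same goes for id = 7, where the statuses are 'transfer in' and 'transfer out'. However, id = 9 is invalid, because the statuses are 'current' and 'transfer in'.
The output of the exercise is just the records that fail validation. In this case, that would be: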
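Stated as code, the rule can be read as a small predicate over the statuses recorded for one id. This is only a sketch of the rule as described here: the names `VALID_PAIRS` and `is_valid_group` are hypothetical, and it assumes a repeated id must consist of exactly one valid pair.

```python
# Hypothetical names: VALID_PAIRS and is_valid_group are illustrative only.
VALID_PAIRS = [
    {"loan in", "loan out"},
    {"transfer in", "transfer out"},
]


def is_valid_group(statuses):
    """Return True if the statuses recorded for one id pass the rule."""
    statuses = list(statuses)
    if len(statuses) == 1:
        return True  # a single record for an id is always acceptable
    # Assumption: a repeated id must consist of exactly one valid pair.
    return len(statuses) == 2 and set(statuses) in VALID_PAIRS


print(is_valid_group(["loan in", "loan out"]))     # id 4 -> True
print(is_valid_group(["current", "transfer in"]))  # id 9 -> False
```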
id score status
9 0.664626162 current
9 0.389797469 transfer in
I have worked out that I can find the duplicate records using the following:
countdf = df.groupby('id').size()  # rows per id; 'id' becomes the index, so count on it directly
result = df.loc[df['id'].isin(countdf[countdf > 1].index)]
But I can't figure out how to then check whether the duplicated ids map to a valid combination of statuses.
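For reference, the same selection of rows with a repeated id can also be written with `DataFrame.duplicated` and `keep=False`. A minimal sketch on a small made-up frame (the data below is illustrative, not the real df):

```python
import pandas as pd

# Made-up miniature of the data in the question.
df = pd.DataFrame({
    "id": [3, 4, 4, 5, 9, 9],
    "status": ["current", "loan in", "loan out",
               "current", "current", "transfer in"],
})

# keep=False marks every occurrence of a repeated id, not only the later ones.
dupes = df[df.duplicated("id", keep=False)]
print(dupes)
```

This only finds the repeated ids; it does not yet say whether their statuses are a valid combination.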
Answer 0: (score: 2)
Here is one way:
In [2111]: conds = [['transfer in', 'transfer out'], ['loan in', 'loan out']]
In [2112]: df[df.groupby('id')['status'].transform(
               lambda x: not any(all((x == c).any() for c in cond)
                                 for cond in conds) and len(x) > 1)]
Out[2112]:
id score status
10 9 0.664626 current
11 9 0.389797 transfer in
Details
In [2114]: df.groupby('id')['status'].transform(
               lambda x: not any(all((x == c).any() for c in cond) for cond in conds))
Out[2114]:
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 False
8 False
9 True
10 True
11 True
12 True
Name: status, dtype: bool
In [2115]: df.groupby('id')['status'].transform(
               lambda x: not any(all((x == c).any() for c in cond) for cond in conds) and len(x) > 1)
Out[2115]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 False
Name: status, dtype: bool
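The same condition can also be written as a self-contained script with a named function and `GroupBy.filter` in place of `transform`. This is a sketch rather than the approach used above: the helper name `is_invalid` is hypothetical, and the frame is rebuilt from the sample in the question. Like the lambda, it only checks that a complete valid pair is present, so a group holding a valid pair plus extra rows would still pass.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9, 9, 10],
    "score": [0.204728295, 0.811946622, 0.255717294, 0.283495765, 0.355463338,
              0.090287194, 0.195224702, 0.743183619, 0.6677402, 0.685349828,
              0.664626162, 0.389797469, 0.359471869],
    "status": ["current", "current", "current", "loan in", "loan out",
               "current", "current", "transfer in", "transfer out",
               "current", "current", "transfer in", "current"],
})

conds = [["transfer in", "transfer out"], ["loan in", "loan out"]]


def is_invalid(group):
    """True for a repeated id whose statuses contain no complete valid pair."""
    statuses = set(group["status"])
    has_pair = any(set(cond) <= statuses for cond in conds)
    return len(group) > 1 and not has_pair


# filter() keeps all rows of every group for which the function returns True.
invalid = df.groupby("id").filter(is_invalid)
print(invalid)
```

On the sample data this selects the two rows for id = 9, at index 10 and 11.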
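Answer 1: (score: 0)
I have the following somewhat ugly solution. If you are looking for an elegant answer, I think you will want to look at @Zero's answer and work something out from there.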
df = df.set_index('id')  # only if 'id' is not the index yet
# ids that occur more than once
index_dup = set(df.index[df.index.duplicated()])
df_dup = df.loc[sorted(index_dup)]  # pass a list: recent pandas rejects a set as an indexer
for _, bar in df_dup.status.groupby(level='id'):
    # leave the id flagged as soon as one of the checks fails
    if (bar == 'current').sum() > 1:
        continue
    if (bar == 'loan in').sum() != (bar == 'loan out').sum():
        continue
    if (bar == 'transfer in').sum() != (bar == 'transfer out').sum():
        continue
    # all checks passed, so this duplicated id is valid
    index_dup.remove(bar.index[0])
df_dup = df_dup.loc[sorted(index_dup)]
print(df_dup.sort_index())  # I changed some records to create more possible failures; they are caught
score status
id
2 0.811947 current
2 0.811947 current
7 0.743184 transfer in
7 0.743184 transfer out
7 0.667740 transfer in
9 0.664626 current
9 0.389797 transfer in
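The counting checks in the loop can also be expressed without an explicit loop, for example with `pd.crosstab`. The following is a sketch under assumptions: the data is made up to include several kinds of failure, and the `statuses` list assumes these five values are the only ones that occur.

```python
import pandas as pd

# Made-up data with several kinds of failure, for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 2, 4, 4, 7, 7, 7, 9, 9],
    "status": ["current", "current", "current", "loan in", "loan out",
               "transfer in", "transfer out", "transfer in",
               "current", "transfer in"],
})

# Assumption: these are all the status values that can appear.
statuses = ["current", "loan in", "loan out", "transfer in", "transfer out"]

# One row per id, one column per status; cells hold occurrence counts.
counts = pd.crosstab(df["id"], df["status"]).reindex(columns=statuses, fill_value=0)

# Same checks as the loop: a repeated id fails if 'current' occurs more than
# once or if the in/out counts of a pair do not match.
bad = (counts.sum(axis=1) > 1) & (
    (counts["current"] > 1)
    | (counts["loan in"] != counts["loan out"])
    | (counts["transfer in"] != counts["transfer out"])
)

failed = df[df["id"].isin(bad[bad].index)]
print(failed)
```

Here ids 2, 7 and 9 are reported, while the single id 1 and the matched loan pair for id 4 are not.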