There has to be a better way to do this, please help me.
Here is an excerpt of some data I need to clean up. It contains several kinds of "duplicate" rows (not every row is duplicated):
df =
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | NaN | 34200 |
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
200 | DEF | Write Off | 611 | NaN |
300 | GHI | Paid | NaN | 247112 |
300 | GHI | Paid | 799 | NaN |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
So I have duplicate cases of the kinds shown above (pairs of rows where one copy has a NaN that the other copy fills in). Obviously, what I want is a DataFrame without duplicates, like:
LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
300 | GHI | Paid | 799 | 247112 |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |
Here is how I solved it:
# Get the LoanIDs that appear more than once:
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]
# For each repeated key, overwrite the NaNs with the one valid value:
for i in rep.index:
    df.loc[df['LoanID'] == i, 'CreditScore'] = df.loc[df['LoanID'] == i, 'CreditScore'].max()
    df.loc[df['LoanID'] == i, 'AnnualIncome'] = df.loc[df['LoanID'] == i, 'AnnualIncome'].max()
# Drop the now-identical rows
df.drop_duplicates(inplace=True)
This works and gives exactly what I need. The problem is that the real DataFrame has a few hundred thousand records, so this approach takes forever. There must be some way to do this better, right?
Answer 0 (score: 2)
Grouping by LoanID, filling missing values forward and backward within each group, and then dropping duplicates seems to work:
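A sketch of that approach (the answer's original code block did not survive, so this reconstructs it from the description; the sample data and column names are taken from the question's excerpt):

```python
import numpy as np
import pandas as pd

# Rebuild the question's excerpt as a DataFrame.
df = pd.DataFrame({
    'LoanID':       [100, 100, 200, 200, 300, 300, 400, 500],
    'CustomerID':   ['ABC', 'ABC', 'DEF', 'DEF', 'GHI', 'GHI', 'JKL', 'MNO'],
    'LoanStatus':   ['Paid', 'Paid', 'Write Off', 'Write Off',
                     'Paid', 'Paid', 'Paid', 'Paid'],
    'CreditScore':  [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    'AnnualIncome': [34200, 34200, 9800, np.nan, 247112, np.nan,
                     np.nan, np.nan],
})

# Within each LoanID group, fill NaNs forward then backward, so every
# row of the group carries the same values; rows that never had a valid
# value (e.g. LoanID 400) simply stay NaN.
cols = ['CreditScore', 'AnnualIncome']
filled = df.copy()
filled[cols] = df.groupby('LoanID')[cols].transform(lambda s: s.ffill().bfill())

# The duplicate rows are now identical, so one drop_duplicates suffices.
out = filled.drop_duplicates().reset_index(drop=True)
```

This is vectorized per group rather than looping over every repeated key with boolean masks, which is where the original approach spends its time on a few hundred thousand records.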