Python / Pandas: How to merge duplicate rows that have NaN in different columns?

Time: 2017-03-12 06:15:46

Tags: python pandas

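There must be a better way to do this, please help me.

Here is an excerpt of some data I have to clean. It contains several kinds of "duplicate" rows (not every row is duplicated):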

df =

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
   100 | ABC        | Paid       |         NaN |        34200 |
   100 | ABC        | Paid       |         724 |        34200 |
   200 | DEF        | Write Off  |         611 |         9800 |
   200 | DEF        | Write Off  |         611 |          NaN |
   300 | GHI        | Paid       |         NaN |       247112 |
   300 | GHI        | Paid       |         799 |          NaN |
   400 | JKL        | Paid       |         NaN |          NaN |
   500 | MNO        | Paid       |         444 |          NaN |
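
To make the excerpt reproducible, it can be built roughly as follows. This is only a sketch: it covers the five columns shown and leaves out whatever the elided `...` columns contain.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the excerpt above; the elided columns are omitted.
df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan, 247112, np.nan, np.nan, np.nan],
})
```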

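So I have the following kinds of duplicate cases:

  1. NaN and a valid value in the CreditScore column (LoanID = 100)
  2. NaN and a valid value in the AnnualIncome column (LoanID = 200)
  3. NaN and a valid value in the CreditScore column, and NaN and a valid value in the AnnualIncome column (LoanID = 300)
  4. LoanID 400 and 500 are "normal" cases

So, obviously, what I want is a dataframe without duplicates, like: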

    LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
    -------+------------+------------+-------------+--------------+-----
       100 | ABC        | Paid       |         724 |        34200 |
       200 | DEF        | Write Off  |         611 |         9800 |
       300 | GHI        | Paid       |         799 |       247112 |
       400 | JKL        | Paid       |         NaN |          NaN |
       500 | MNO        | Paid       |         444 |          NaN |
    

So, here is how I solved it:

    # Get the repeated keys (LoanIDs that appear more than once):
    rep = df['LoanID'].value_counts()
    rep = rep[rep > 1]
    
    # Now we get the valid number (we overwrite the NaNs)
    for i in rep.keys():
        df.loc[df['LoanID'] == i, 'CreditScore']  = df[df['LoanID'] == i]['CreditScore'].max()
        df.loc[df['LoanID'] == i, 'AnnualIncome'] = df[df['LoanID'] == i]['AnnualIncome'].max()
    
    # Drop duplicates   
    df.drop_duplicates(inplace=True)
    

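This works and does exactly what I need. The problem is that this dataframe has several hundred thousand records, so this method takes "forever". There must be some way to do this better, right?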

1 answer:

Answer 0: (score: 2)

Grouping by LoanID, filling the missing values forward and backward within each group, and then dropping duplicates seems to work.
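
A minimal sketch of the group, fill, and drop-duplicates approach, assuming the data looks like the excerpt in the question. The helper names `collapse_duplicates` and `collapse_first` are made up for illustration and are not taken from the original answer. The second helper is a different technique (`GroupBy.first()`, which takes the first non-null value per column) shown only for comparison.

```python
import numpy as np
import pandas as pd

def collapse_duplicates(df, key="LoanID"):
    """Fill NaNs forward and backward within each key group, then drop duplicates."""
    value_cols = [c for c in df.columns if c != key]
    filled = df.copy()
    # Forward fill, then backward fill, restricted to rows sharing the same key.
    filled[value_cols] = filled.groupby(key)[value_cols].ffill()
    filled[value_cols] = filled.groupby(key)[value_cols].bfill()
    return filled.drop_duplicates().reset_index(drop=True)

def collapse_first(df, key="LoanID"):
    """Alternative: keep the first non-null value of every column per key."""
    return df.groupby(key, as_index=False).first()

# Small sample mirroring the question's excerpt (elided columns omitted).
sample = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [np.nan, 724, 611, 611, np.nan, 799, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan, 247112, np.nan, np.nan, np.nan],
})

result = collapse_duplicates(sample)
alternative = collapse_first(sample)
print(result)
```

Both versions avoid the Python-level loop over LoanIDs, which is likely what makes the original approach slow on several hundred thousand rows. One difference to keep in mind: if a group contains two different non-NaN values in the same column, `collapse_duplicates` keeps both rows, while `collapse_first` silently keeps the earliest value.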