New column in a DataFrame based on conditions

Time: 2018-08-02 12:38:26

Tags: python pandas dataframe

I have a DataFrame like this:

+------------+---------------+-------------+---------------------+-------------------+
| SK_ID_CURR | CREDIT_ACTIVE | DAYS_CREDIT | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT |
+------------+---------------+-------------+---------------------+-------------------+
|     436084 | Sold          |       -2835 | -2094.0             | -2436.0           |
|     436084 | Active        |        -987 | -438.0              | NaN               |
|     436084 | Sold          |       -1875 | -1494.0             | -1494.0           |
|     436084 | Active        |       -1135 | -951.0              | NaN               |
|     436084 | Bad debt      |        -986 | NaN                 | NaN               |
|     436084 | Active        |        -968 | -845.0              | NaN               |
|     436084 | Active        |        -987 | -803.0              | NaN               |
+------------+---------------+-------------+---------------------+-------------------+

I would like to create a new column, CREDIT_LENGTH_IN_DAYS, using the following rules:

def func(x):
    if x[x['CREDIT_ACTIVE'] == 'Active']:
        return x['DAYS_CREDIT_ENDDATE'] - x['DAYS_CREDIT']
    elif x[x['CREDIT_ACTIVE'] == 'Closed'] | x[x['CREDIT_ACTIVE'] == 'Sold']:
        return x['DAYS_ENDDATE_FACT'] - x['DAYS_CREDIT']
    elif x[x['CREDIT_ACTIVE'] == 'Bad debt']:
        return x['DAYS_CREDIT']

Then I use:

df_bureau['CREDIT_LENGTH_IN_DAYS'] = df_bureau.apply(func, axis=1)

However, x[x['CREDIT_ACTIVE'] == 'Bad debt'] gives strange values rather than the actual value of x['DAYS_CREDIT'] for each row.

1 answer:

Answer 0: (score: 2)

Use numpy.select:

import numpy as np

# Boolean masks, one per rule
m1 = df_bureau['CREDIT_ACTIVE'] == 'Active'
m2 = df_bureau['CREDIT_ACTIVE'].isin(['Closed', 'Sold'])
m3 = df_bureau['CREDIT_ACTIVE'] == 'Bad debt'

# Value to use where the corresponding mask is True
v1 = df_bureau['DAYS_CREDIT_ENDDATE'] - df_bureau['DAYS_CREDIT']
v2 = df_bureau['DAYS_ENDDATE_FACT'] - df_bureau['DAYS_CREDIT']
v3 = df_bureau['DAYS_CREDIT']

# Rows matching none of the masks get NaN
df_bureau['CREDIT_LENGTH_IN_DAYS'] = np.select([m1, m2, m3], [v1, v2, v3], default=np.nan)
print(df_bureau)
   SK_ID_CURR CREDIT_ACTIVE  DAYS_CREDIT  DAYS_CREDIT_ENDDATE  \
0      436084          Sold        -2835              -2094.0   
1      436084        Active         -987               -438.0   
2      436084          Sold        -1875              -1494.0   
3      436084        Active        -1135               -951.0   
4      436084      Bad debt         -986                  NaN   
5      436084        Active         -968               -845.0   
6      436084        Active         -987               -803.0   

   DAYS_ENDDATE_FACT  CREDIT_LENGTH_IN_DAYS  
0            -2436.0                  399.0  
1                NaN                  549.0  
2            -1494.0                  381.0  
3                NaN                  184.0  
4                NaN                 -986.0  
5                NaN                  123.0  
6                NaN                  184.0  
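
For readers who want to try this without the full dataset, here is a minimal self-contained sketch of the same numpy.select pattern. The small frame `df` copies three rows from the question and adds one row with a hypothetical 'Unknown' status (not in the original data), just to show what the default value does when no condition matches.

```python
import numpy as np
import pandas as pd

# Three rows taken from the question, plus one hypothetical 'Unknown'
# status (an assumption for illustration) that matches no condition.
df = pd.DataFrame({
    'CREDIT_ACTIVE': ['Sold', 'Active', 'Bad debt', 'Unknown'],
    'DAYS_CREDIT': [-2835, -987, -986, -500],
    'DAYS_CREDIT_ENDDATE': [-2094.0, -438.0, np.nan, -100.0],
    'DAYS_ENDDATE_FACT': [-2436.0, np.nan, np.nan, np.nan],
})

# Conditions are checked in order; the first True one picks the value.
conditions = [
    df['CREDIT_ACTIVE'] == 'Active',
    df['CREDIT_ACTIVE'].isin(['Closed', 'Sold']),
    df['CREDIT_ACTIVE'] == 'Bad debt',
]
choices = [
    df['DAYS_CREDIT_ENDDATE'] - df['DAYS_CREDIT'],
    df['DAYS_ENDDATE_FACT'] - df['DAYS_CREDIT'],
    df['DAYS_CREDIT'],
]

# Rows where every condition is False fall back to the default (NaN).
df['CREDIT_LENGTH_IN_DAYS'] = np.select(conditions, choices, default=np.nan)
print(df[['CREDIT_ACTIVE', 'CREDIT_LENGTH_IN_DAYS']])
```

The 'Unknown' row ends up as NaN, while the other three rows follow the rules from the question.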

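Your solution processes each row separately, so no filtering is needed. You also need to change | to or, because each comparison yields a scalar: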

def func(x):
    if x['CREDIT_ACTIVE'] == 'Active':
        return  x['DAYS_CREDIT_ENDDATE'] - x['DAYS_CREDIT']
    elif (x['CREDIT_ACTIVE'] == 'Closed') or (x['CREDIT_ACTIVE'] == 'Sold'):
        return x['DAYS_ENDDATE_FACT'] - x['DAYS_CREDIT']
    elif x['CREDIT_ACTIVE'] == 'Bad debt':
        return x['DAYS_CREDIT']

df_bureau['CREDIT_LENGTH_IN_DAYS'] = df_bureau.apply(func, axis=1)
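
The difference between | and or can be seen in a small sketch. The names `status`, `row`, and the result variables are made up for illustration; the point is only that | combines whole columns element by element, while or needs a single True/False value, which is what each comparison produces inside apply with axis=1.

```python
import pandas as pd

status = pd.Series(['Closed', 'Sold', 'Active'])

# Whole column: use | and parenthesise each comparison.
mask = (status == 'Closed') | (status == 'Sold')
print(mask.tolist())

# Whole column with `or`: pandas cannot reduce a Series to one bool.
try:
    (status == 'Closed') or (status == 'Sold')
    or_on_series_failed = False
except ValueError:
    or_on_series_failed = True
print(or_on_series_failed)

# Single row, as passed to func by apply(..., axis=1): each comparison
# is a plain bool, so `or` is the right operator here.
row = pd.Series({'CREDIT_ACTIVE': 'Sold', 'DAYS_CREDIT': -2835})
scalar_result = (row['CREDIT_ACTIVE'] == 'Closed') or (row['CREDIT_ACTIVE'] == 'Sold')
print(scalar_result)
```

On large frames the vectorised numpy.select approach is usually the faster of the two, since apply with axis=1 calls the Python function once per row.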