Getting the sum of people across overlapping age ranges in a Pandas DataFrame

Time: 2016-11-25 08:57:17

Tags: python pandas dataset

    target_value        title    people     start end   twitter_map
0   AGE_13_TO_17      13 to 17       1        13  17  AGE_13_TO_17
1   AGE_13_TO_24      13 to 24     NaN        13  24           NaN
2   AGE_13_TO_34      13 to 34     NaN        13  34           NaN
3   AGE_13_TO_49      13 to 49     NaN        13  49           NaN
4   AGE_13_TO_54      13 to 54     NaN        13  54           NaN
5   AGE_OVER_13     Age Over 13    NaN        13   -           NaN
6   AGE_18_TO_24      18 to 24       7        18  24  AGE_18_TO_24
7   AGE_18_TO_54      18 to 54     NaN        18  54           NaN
8   AGE_OVER_18     Age Over 18    NaN        18   -           NaN
9   AGE_21_TO_34      21 to 34     NaN        21  34           NaN
10  AGE_21_TO_49      21 to 49     NaN        21  49           NaN
11  AGE_21_TO_54      21 to 54     NaN        21  54           NaN
12  AGE_25_TO_34      25 to 34      34        25  34  AGE_25_TO_34
13  AGE_25_TO_49      25 to 49     NaN        25  49           NaN
14   AGE_OVER_25    Age Over 25    NaN        25   -           NaN
15  AGE_35_TO_44      35 to 44      15        35  44  AGE_35_TO_44
16   AGE_OVER_35    Age Over 35    NaN        35   -           NaN
17  AGE_45_TO_54      45 to 54       1        45  54  AGE_45_TO_54
18   AGE_OVER_50    Age Over 50    NaN        50   -           NaN
19  AGE_55_TO_64      55 to 64       3        55  64  AGE_55_TO_64
20   AGE_OVER_65          65+        6        65   -   AGE_OVER_65
21          None       All Ages    NaN  All Ages   -           NaN

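So I have the DataFrame shown above, which gives a 'people' value for some of the age ranges, each defined by where the range starts and ends. However, some of the age ranges overlap. I need to fill in the missing 'people' values correctly, based on the values that are already known.

Expected output for the first two rows: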

    target_value        title    people     start end   twitter_map
0   AGE_13_TO_17      13 to 17       1        13  17    AGE_13_TO_17
1   AGE_13_TO_24      13 to 24       8        13  24           NaN

1 Answer:

Answer 0 (score: 2)

I will work with a simplified example:

people start end
     1    13  17
   NaN    13  24
   NaN    13  34
   NaN    13   -
     7    18  24
   NaN    18   -
    34    25  34

First, replace '-' with infinity and convert everything to float:

import numpy as np
df = df.replace({'-': np.inf}).astype(float)

Then select the rows where the number of 'people' is given; these will be the input:

df_input = df.dropna()
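
The two preparation steps above can be tried on the simplified example with a sketch like the following. The way the frame is built here is an assumption for illustration (the 'end' column is given as strings, as it might look after reading the table from text); only the replace, astype and dropna calls come from the answer.

```python
import numpy as np
import pandas as pd

# Simplified example; building it by hand like this is an assumption.
df = pd.DataFrame({
    'people': [1, np.nan, np.nan, np.nan, 7, np.nan, 34],
    'start': [13, 13, 13, 13, 18, 18, 25],
    'end': ['17', '24', '34', '-', '24', '-', '34'],
})

# Open-ended ranges ('-') become infinity, then every column becomes float.
df = df.replace({'-': np.inf}).astype(float)

# Rows with a known 'people' count; inf in 'end' is not NaN, so it is kept.
df_input = df.dropna()
```

After this, df_input should hold only the three rows for 13 to 17, 18 to 24 and 25 to 34.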

Now define the following function:

def func(row):
    return df_input.loc[
            (df_input['start'] >= row['start']) & (df_input['end'] <= row['end']),
            'people'
        ].sum()

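For each row of the DataFrame, it sums all the input numbers whose age range lies entirely within that row's range (this is where infinity comes in handy: an open-ended range accepts any end value).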
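
To see the containment condition at work on individual ranges, here is a small sketch. The helper name contained is hypothetical and the input frame is typed in directly rather than derived from df.

```python
import numpy as np
import pandas as pd

# Known rows of the simplified example, already converted to float.
df_input = pd.DataFrame({
    'people': [1.0, 7.0, 34.0],
    'start': [13.0, 18.0, 25.0],
    'end': [17.0, 24.0, 34.0],
})

def contained(start, end):
    """Boolean mask of known ranges lying entirely inside [start, end]."""
    return (df_input['start'] >= start) & (df_input['end'] <= end)

# 13 to 24 contains 13 to 17 and 18 to 24, but not 25 to 34.
mask_13_24 = contained(13, 24)
total_13_24 = df_input.loc[mask_13_24, 'people'].sum()

# "Over 18" has end = inf, so every known range starting at 18 or later fits.
mask_18_up = contained(18, np.inf)
total_18_up = df_input.loc[mask_18_up, 'people'].sum()
```

Here total_13_24 is 1 + 7 and total_18_up is 7 + 34, matching rows 1 and 5 of the answer's output.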

Finally, apply the function:

In [36]: df.apply(func, axis=1)
Out[36]: 
0     1.0
1     8.0
2    42.0
3    42.0
4     7.0
5    41.0
6    34.0
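
Putting the pieces together, a minimal end-to-end sketch might write the result back into the 'people' column, which is what the question asks for. The frame construction is again an assumption, and the NumPy broadcasting variant at the end is not part of the answer; it is only shown as a possible cross-check that avoids a Python-level loop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'people': [1, np.nan, np.nan, np.nan, 7, np.nan, 34],
    'start': [13, 13, 13, 13, 18, 18, 25],
    'end': ['17', '24', '34', '-', '24', '-', '34'],
})
df = df.replace({'-': np.inf}).astype(float)
df_input = df.dropna()

def func(row):
    # Sum the known counts whose range lies inside this row's range.
    inside = (df_input['start'] >= row['start']) & (df_input['end'] <= row['end'])
    return df_input.loc[inside, 'people'].sum()

# Overwrite 'people' with the computed totals (known rows keep their value).
df['people'] = df.apply(func, axis=1)

# Alternative (assumption, not from the answer): compare every row against
# every known row at once; rows index the targets, columns the known ranges.
start_ok = df_input['start'].to_numpy()[None, :] >= df['start'].to_numpy()[:, None]
end_ok = df_input['end'].to_numpy()[None, :] <= df['end'].to_numpy()[:, None]
totals = (start_ok & end_ok).astype(float) @ df_input['people'].to_numpy()
```

Note that the containment test only counts known ranges lying fully inside the target range. On the full table from the question, 'Age Over 50' would therefore pick up 55 to 64 and 65+ but nothing from 45 to 54, so totals for ranges that cut through a known range should be treated with care.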