Question

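We want to create a column called feature col in a DataFrame that holds the range of the current value and the previous two values, that is, the difference between the maximum and the minimum, as shown in the image. How can we compute this in pandas?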

The dataset contains multiple IDs. [![enter image description here][2]][2]

ID  Year    percentage
123 2009    0   
123 2010    -27 
123 2011    0
123 2012    -50
123 2013    3
123 2014    -3
123 2015    0
123 2016    -28
123 2017    -5

Answer 1

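Use Series.rolling together with the NumPy function np.ptp (peak to peak, i.e. maximum minus minimum), but first strip the % sign and convert the values to numbers if necessary: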

import numpy as np

df['feature_col'] = df['percentage'].str.strip('%').astype(int).rolling(3).apply(np.ptp)
print (df)
    ID  Year percentage  feature_col
0  123  2009         0%          NaN
1  123  2010       -27%          NaN
2  123  2011         0%         27.0
3  123  2012       -50%         50.0
4  123  2013         3%         53.0
5  123  2014        -3%         53.0
6  123  2015         0%          6.0
7  123  2016       -28%         28.0
8  123  2017        -5%         28.0
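
As a cross-check on the result above, the same range can also be obtained by subtracting the rolling minimum from the rolling maximum, which avoids calling a Python function per window. The following is a minimal sketch, assuming the sample data from the question with % signs attached as in the answer; the variable names `values`, `window` and `ptp_version` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (a single ID, values with a % suffix).
df = pd.DataFrame({
    "ID": [123] * 9,
    "Year": list(range(2009, 2018)),
    "percentage": ["0%", "-27%", "0%", "-50%", "3%", "-3%", "0%", "-28%", "-5%"],
})

# Strip the % sign and convert to integers before any arithmetic.
values = df["percentage"].str.strip("%").astype(int)

# Range of the current row and the two previous rows: rolling max minus rolling min.
window = values.rolling(3)
df["feature_col"] = window.max() - window.min()

# The np.ptp route from the answer, with raw=True so NumPy receives plain arrays.
ptp_version = values.rolling(3).apply(np.ptp, raw=True)

print(df)
```

Both routes leave the first two rows as NaN, because a window of three rows is not yet complete there.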

If the output should carry the % sign as well, use:

df['feature_col'] = (df['percentage'].str.strip('%')
                                     .astype(int)
                                     .rolling(3)
                                     .apply(np.ptp)
                                     .mask(lambda x: x.notna(), lambda x: x.astype('Int64').astype(str).add('%'))
                                     )
print (df)
    ID  Year percentage feature_col
0  123  2009         0%         NaN
1  123  2010       -27%         NaN
2  123  2011         0%         27%
3  123  2012       -50%         50%
4  123  2013         3%         53%
5  123  2014        -3%         53%
6  123  2015         0%          6%
7  123  2016       -28%         28%
8  123  2017        -5%         28%
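
The formatting step can also be written with Series.map and na_action="ignore", which some readers may find easier to follow than the mask call. This is only a sketch under the assumption that the rolling results are whole numbers, as they are for this sample; `spread` and `formatted` are illustrative names.

```python
import numpy as np
import pandas as pd

# Numeric values from the question, already stripped of the % sign.
values = pd.Series([0, -27, 0, -50, 3, -3, 0, -28, -5])

# Rolling range over the current row and the two previous rows.
spread = values.rolling(3).apply(np.ptp, raw=True)

# Turn each finite result into a string such as "27%".
# na_action="ignore" skips NaN, so the incomplete windows stay NaN.
formatted = spread.map(lambda v: f"{int(v)}%", na_action="ignore")

print(formatted)
```

Note that the resulting column holds strings, so it is suitable for display but not for further arithmetic.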

EDIT: If the calculation needs to be done separately for each ID group:

print (df)
    ID  Year percentage
0  123  2009         0%
1  123  2010       -27%
2  123  2011         0%
3  123  2012       -50%
4  123  2013         3%
5  124  2014        -3%
6  124  2015         0%
7  124  2016       -28%
8  124  2017        -5%


df['feature_col'] = (df['percentage'].str.strip('%')
                                     .astype(int)
                                     .groupby(df['ID'])
                                     .rolling(3)
                                     .apply(np.ptp)
                                     .reset_index(level=0, drop=True))
print (df)
    ID  Year percentage  feature_col
0  123  2009         0%          NaN
1  123  2010       -27%          NaN
2  123  2011         0%         27.0
3  123  2012       -50%         50.0
4  123  2013         3%         53.0
5  124  2014        -3%          NaN
6  124  2015         0%          NaN
7  124  2016       -28%         28.0
8  124  2017        -5%         28.0
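
A variant of the grouped calculation uses groupby(...).transform, which returns a result aligned to the original index, so no reset_index step is required. This is a sketch, assuming the rows are already sorted by Year within each ID (otherwise sort first); it uses rolling max minus rolling min in place of np.ptp.

```python
import pandas as pd

# Two-ID sample matching the edited example in the answer.
df = pd.DataFrame({
    "ID": [123] * 5 + [124] * 4,
    "Year": list(range(2009, 2018)),
    "percentage": ["0%", "-27%", "0%", "-50%", "3%", "-3%", "0%", "-28%", "-5%"],
})

values = df["percentage"].str.strip("%").astype(int)

# Within each ID, take the range over the current row and the two previous rows.
# transform keeps the original row order and index, so assignment lines up directly.
df["feature_col"] = values.groupby(df["ID"]).transform(
    lambda s: s.rolling(3).max() - s.rolling(3).min()
)

print(df)
```

The window restarts at each new ID, so the first two rows of every group are NaN.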
