Question

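I have a DataFrame indexed by dates over some period. Each column holds predictions for the value of a variable at the end of a given year. My original DataFrame looks like this: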

            2016  2017  2018
2016-01-01   0.0     1   NaN
2016-07-01   1.0     1   4.1
2017-01-01   NaN     5   3.0
2017-07-01   NaN     2   2.0

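where NaN means that no prediction exists for that year.

Because my data covers more than 20 years and most predictions are made for the next 2-3 years, my real DataFrame has 20+ columns consisting mostly of NaN values. For example, the 2005 column has predictions made in 2003-2005 but is NaN for every date from 2006 to 2020.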

I would like to transform my DataFrame into this:

            Y_0  Y_1  Y_2
2016-01-01    0    1  NaN
2016-07-01    1    1  4.1
2017-01-01    5    3  NaN
2017-07-01    2    2  NaN

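where Y_j is the prediction for year = index.year + j. That way, I would end up with a DataFrame with only 4 columns (Y_0, Y_1, Y_2, Y_3).

I actually managed to do this, but in what I believe is a very inefficient way: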


import numpy

for i in range(4):
    df[f'Y_{i}'] = numpy.nan  # create columns [Y_0, Y_1, Y_2, Y_3]

for index, row in df.iterrows():  # iterate through each row of df

    for year in row.dropna().index:  # iterate through each year where a prediction exists

        # difference between the year the prediction is for and the year it was made
        # (possible values: 0, 1, 2 or 3)
        year_diff = int(year) - index.year

        # set the values for the columns 'Y_0', 'Y_1', 'Y_2' and 'Y_3' cell by cell
        df.loc[index, f'Y_{year_diff}'] = df.loc[index, year]

df = df.iloc[:, -4:]  # after the loops: delete all but the new columns

For a DataFrame with only 1000 rows, this takes almost 3 seconds to run. Can anyone think of a better solution?

Answer 1

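Let's try stack and then compute the year difference:

import pandas as pd

# if the index is not already datetime
df.index = pd.to_datetime(df.index)

# stack() gives one row per (date, year) pair; after reset_index() the
# columns are level_0 (date), level_1 (year label) and 0 (predicted value)
df = (df.stack().reset_index()
   .assign(date_diff=lambda x: x['level_1'].astype(int) - x['level_0'].dt.year)
   .pivot(index='level_0', columns='date_diff', values=0)
   .add_prefix('Y_')
)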

Output:

date_diff   Y_0  Y_1  Y_2
level_0                  
2016-01-01  0.0  1.0  NaN
2016-07-01  1.0  1.0  4.1
2017-01-01  5.0  3.0  NaN
2017-07-01  2.0  2.0  NaN
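
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same stack-and-pivot idea on the sample data from the question. A few things in it are my own assumptions rather than part of the original post: the names `long`, `offset` and `wide` are illustrative, the year columns are assumed to be strings, and the explicit `dropna()` is there because, depending on the pandas version, `stack()` may keep NaN entries. The final `reindex` is also an addition, so that the result always has all four columns Y_0 to Y_3 and every original date, even when some offset never occurs in the data.

```python
import numpy as np
import pandas as pd

# Sample data from the question; column labels are assumed to be year strings.
df = pd.DataFrame(
    {
        "2016": [0.0, 1.0, np.nan, np.nan],
        "2017": [1.0, 1.0, 5.0, 2.0],
        "2018": [np.nan, 4.1, 3.0, 2.0],
    },
    index=pd.to_datetime(["2016-01-01", "2016-07-01", "2017-01-01", "2017-07-01"]),
)

# Long format: one row per existing prediction (date, target year, value).
# dropna() is explicit so the result does not depend on stack()'s NaN handling.
long = (
    df.stack()
    .dropna()
    .rename_axis(["date", "year"])
    .reset_index(name="value")
)

# Number of years between the target year and the date the prediction was made.
long["offset"] = long["year"].astype(int) - long["date"].dt.year

# Back to wide format, one column per offset.
wide = long.pivot(index="date", columns="offset", values="value")

# Keep every original date and force exactly the columns Y_0 .. Y_3.
wide = wide.reindex(index=df.index, columns=range(4)).add_prefix("Y_")

print(wide)
```

On the sample data, Y_3 comes out entirely NaN because no prediction there looks three years ahead. If a fixed set of columns is not needed, the `reindex` step can be dropped.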
