复杂的变换熊猫

时间:2015-07-13 08:19:41

标签: python pandas

假设我有以下数据集

table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 27), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 2, 1), 1, 2],
         [datetime.datetime(2015, 2, 3), 1, 2],
         [datetime.datetime(2015, 2, 15), 1, 2], 
         [datetime.datetime(2015, 2, 28), 1, 2],
         [datetime.datetime(2015, 3, 1), 1, 3],
         [datetime.datetime(2015, 3, 17), 1, 3],
         [datetime.datetime(2015, 3, 31), 1, 3]]

df = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])

现在,我想找到每个月的最后一个值,将其按月移动到下个月的值,最后获取这些值的累积乘积。对上述数据执行此过程应导致(执行每个步骤):

  1. 查找每个月的最后一个条目并按月移动它们将导致

                Date       Id   Value  Temp
             0 2015-01-01   1    0.5    NaN
             1 2015-01-27   1    0.5    NaN
             2 2015-01-31   1    0.5    NaN
             3 2015-02-01   1    2.0    0.5
             4 2015-02-03   1    2.0    0.5
             5 2015-02-15   1    2.0    0.5
             6 2015-02-28   1    2.0    0.5
             7 2015-03-01   1    3.0    2.0
             8 2015-03-17   1    3.0    2.0
             9 2015-03-31   1    3.0    2.0
    
  2. 使用NaN填充1,获取累积产品,然后删除temp将导致

                 Date      Id   Value  Result
             0 2015-01-01   1    0.5     1
             1 2015-01-27   1    0.5     1
             2 2015-01-31   1    0.5     1
             3 2015-02-01   1    2.0    0.5
             4 2015-02-03   1    2.0    0.5
             5 2015-02-15   1    2.0    0.5
             6 2015-02-28   1    2.0    0.5
             7 2015-03-01   1    3.0    1.0
             8 2015-03-17   1    3.0    1.0
             9 2015-03-31   1    3.0    1.0
    
  3. 我希望这很清楚。如果有人想知道为什么在地球上我想这样做是因为我有MTD数据,需要重新采样。谢谢,Tingis。

    修改 每月的条目数是"随机",因为它们可以是一个月或更短(业务数据......)

1 个答案:

答案 0 :(得分:1)

以下代码并不假设您每个月只有两行。我们的想法是首先进行分组计算,然后使用.reindex()填充一些NaN并使用向后填充填充那些NaN,因为我们已经获得了每月最后一个条目的值。

# your data
# ==================================
import pandas as pd
import datetime

table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 2, 1), 1, 2], 
         [datetime.datetime(2015, 2, 28), 1, 2],
         [datetime.datetime(2015, 3, 1), 1, 3],
         [datetime.datetime(2015, 3, 31), 1, 3]]

df = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])

# better to set Date column to index
df = df.set_index('Date')
print(df)

            Id  Value
Date                 
2015-01-01   1    0.5
2015-01-31   1    0.5
2015-02-01   1    2.0
2015-02-28   1    2.0
2015-03-01   1    3.0
2015-03-31   1    3.0

# processing
# =================================================
# get last entry from each month
df_temp = df.groupby(lambda idx: idx.month).tail(1)
# do the cumprod, reindex to have the same index as original df, backward fill
df['Result'] = df_temp['Value'].shift(1).fillna(1).cumprod().reindex(df.index).fillna(method='bfill')

print(df)

            Id  Value  Result
Date                         
2015-01-01   1    0.5     1.0
2015-01-31   1    0.5     1.0
2015-02-01   1    2.0     0.5
2015-02-28   1    2.0     0.5
2015-03-01   1    3.0     1.0
2015-03-31   1    3.0     1.0

关于后续问题:

# your data
# ==================================
import pandas as pd
import datetime

table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 27), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 2, 1), 1, 2],
         [datetime.datetime(2015, 2, 3), 1, 2],
         [datetime.datetime(2015, 2, 15), 1, 2], 
         [datetime.datetime(2015, 2, 28), 1, 2],
         [datetime.datetime(2015, 3, 1), 1, 3],
         [datetime.datetime(2015, 3, 17), 1, 3],
         [datetime.datetime(2015, 3, 31), 1, 3]]

df1 = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])
df2 = df1.copy()
df2.Id = 2

df = df1.append(df2)
# better to set Date column to index
df = df.set_index('Date')

print(df)


            Id  Value
Date                 
2015-01-01   1    0.5
2015-01-27   1    0.5
2015-01-31   1    0.5
2015-02-01   1    2.0
2015-02-03   1    2.0
2015-02-15   1    2.0
2015-02-28   1    2.0
2015-03-01   1    3.0
2015-03-17   1    3.0
2015-03-31   1    3.0
2015-01-01   2    0.5
2015-01-27   2    0.5
2015-01-31   2    0.5
2015-02-01   2    2.0
2015-02-03   2    2.0
2015-02-15   2    2.0
2015-02-28   2    2.0
2015-03-01   2    3.0
2015-03-17   2    3.0
2015-03-31   2    3.0

def my_func(group):
    # get last entry from each month
    df_temp = group.groupby(lambda idx: idx.month).tail(1)
    # do the cumprod, reindex to have the same index as original df
    group['Result'] = df_temp['Value'].shift(1).fillna(1).cumprod().reindex(group.index).fillna(method='bfill')
    return group

df.groupby('Id').apply(my_func)



            Id  Value  Result
Date                         
2015-01-01   1    0.5     1.0
2015-01-27   1    0.5     1.0
2015-01-31   1    0.5     1.0
2015-02-01   1    2.0     0.5
2015-02-03   1    2.0     0.5
2015-02-15   1    2.0     0.5
2015-02-28   1    2.0     0.5
2015-03-01   1    3.0     1.0
2015-03-17   1    3.0     1.0
2015-03-31   1    3.0     1.0
2015-01-01   2    0.5     1.0
2015-01-27   2    0.5     1.0
2015-01-31   2    0.5     1.0
2015-02-01   2    2.0     0.5
2015-02-03   2    2.0     0.5
2015-02-15   2    2.0     0.5
2015-02-28   2    2.0     0.5
2015-03-01   2    3.0     1.0
2015-03-17   2    3.0     1.0
2015-03-31   2    3.0     1.0