Python DataFrame: rolling median with group by

Time: 2017-07-02 09:40:09

Tags: python dataframe group-by median

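I have a DataFrame with three columns: date, commodity, and value. I want to add another column, median_20, holding the rolling median of the past 20 days for each commodity in df. I also want to add columns that show the value from n days earlier: for example, lag_1 shows the previous day's value for the given commodity, lag_2 shows the value from 2 days earlier, and so on. My df is very large (> 2 million rows). Note that the sample code below names the grouping column market and stores the numeric values in commodity.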

import numpy as np
import pandas as pd

dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort_values('date')  # DataFrame.sort was removed in pandas 0.20

          date  commodity     value
0   2017-01-01       GOLD -1.239422
0   2017-01-01     SILVER -0.209840
1   2017-01-02     SILVER  0.146293
1   2017-01-02       GOLD  1.422454
2   2017-01-03       GOLD  0.453222
...

2 Answers:

Answer 0 (score: 1)

I'm sure there are more efficient approaches, but in the meantime try this solution:

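# Process one market at a time, reusing a single boolean mask
for commo in df.market.unique():
    mask = df.market == commo
    df.loc[mask, 'lag_1'] = df.loc[mask, 'commodity'].shift(1)
    # pd.rolling_median was removed from pandas; use the .rolling() accessor
    df.loc[mask, 'median_20'] = df.loc[mask, 'commodity'].rolling(20).median()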
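
The lags can probably be computed without looping over markets, which may matter for a frame with millions of rows. The following is a minimal sketch, assuming the rows are already sorted by date. The helper add_lags and the small demo frame are hypothetical names introduced only for illustration:

```python
import pandas as pd

def add_lags(frame, group_col, value_col, n_lags):
    """Return a copy of frame with lag_1..lag_n columns computed within each group."""
    out = frame.copy()
    # Group once; shift(n) then moves values down n rows inside each group
    grouped = frame.groupby(group_col)[value_col]
    for n in range(1, n_lags + 1):
        out[f"lag_{n}"] = grouped.shift(n)
    return out

# Tiny hand-checkable frame, already sorted by date
demo = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-02",
                            "2017-01-02", "2017-01-03", "2017-01-03"]),
    "market": ["GOLD", "SILVER"] * 3,
    "commodity": [1.0, 10.0, 2.0, 20.0, 3.0, 30.0],
})
lagged = add_lags(demo, "market", "commodity", n_lags=2)
print(lagged)
```

Because shift counts rows rather than calendar days, lag_1 is only "the previous day" when every market has exactly one row per day.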

Answer 1 (score: 1)

Try:

import pandas as pd
import numpy as np

# create dataframe
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort_values(by='date').reset_index(drop=True)

# create columns
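# groupby().rolling() returns a (market, row) MultiIndex; drop the market level so the result aligns with df
df['median_20_temp'] = df.groupby('market')['commodity'].rolling(20).median().reset_index(level=0, drop=True)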
df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
df['lag_1'] = df.groupby('market')['commodity'].shift(1)
df['lag_2'] = df.groupby('market')['commodity'].shift(2)
df.drop(['median_20_temp'], axis=1, inplace=True)
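
One way to skip the temporary column is to do the rolling median and the one-row shift inside a single transform call. This is a sketch rather than a tuned implementation: it assumes the rows are sorted by date within each market, and it uses a hypothetical demo frame with a window of 3 instead of 20 so the numbers can be checked by hand:

```python
import pandas as pd

# Hypothetical data: GOLD and SILVER rows alternate, in date order
demo = pd.DataFrame({
    "market": ["GOLD", "SILVER"] * 5,
    "commodity": [1.0, 50.0, 5.0, 40.0, 3.0, 30.0, 9.0, 20.0, 7.0, 10.0],
})

window = 3  # the thread uses 20

# Per market: median over the current and previous (window - 1) rows,
# then shifted one row so each row only sees values strictly before it
demo["median_prev"] = demo.groupby("market")["commodity"].transform(
    lambda s: s.rolling(window).median().shift(1)
)
print(demo)
```

For GOLD the values are 1, 5, 3, 9, 7, so the fourth GOLD row gets the median of (1, 5, 3), which is 3, and the fifth gets the median of (5, 3, 9), which is 5. The first three rows of each market stay NaN.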

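Edit:

The following works with pandas 0.16.2. It relies on DataFrame.sort and pd.rolling_median, both of which were removed in later pandas releases: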

import numpy as np
import pandas as pd


np.random.seed(123)
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort('date').reset_index(drop=True)

# create columns
df['median_20_temp'] = df.groupby('market')['commodity'].apply(lambda s: pd.rolling_median(s, 20))
df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
df['lag_1'] = df.groupby('market')['commodity'].shift(1)
df['lag_2'] = df.groupby('market')['commodity'].shift(2)
df.drop(['median_20_temp'], axis=1, inplace=True)
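
A caveat for both variants: a window of 20 counts rows, not calendar days, so the two only agree when no dates are missing for a market. If gaps are possible, a time-based window may be closer to "the past 20 days". The sketch below assumes a recent pandas version and a frame sorted by date. The demo frame is hypothetical and uses a 3-day window to keep it small:

```python
import pandas as pd

# Hypothetical data with a gap: nothing is recorded on 2017-01-03 or 2017-01-04
demo = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-02",
                            "2017-01-05", "2017-01-05"]),
    "market": ["GOLD", "SILVER", "GOLD", "GOLD", "SILVER"],
    "commodity": [1.0, 10.0, 5.0, 3.0, 30.0],
})

# An offset string such as "3D" needs a datetime index; each window
# then covers the 3 days ending at (and including) the current row's date
by_days = (
    demo.set_index("date")
        .groupby("market")["commodity"]
        .rolling("3D")
        .median()
)
print(by_days)
```

The result is indexed by (market, date). On 2017-01-05 the GOLD window no longer contains the values from 2017-01-01 and 2017-01-02, whereas a row-based rolling(3) would still include them.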

I hope this helps.