我有一个包含三列的数据框,即date
,commodity
和values
。我想添加另一列median_20
,即commodity
中每个df
的过去20天的滚动中位数。另外,我想添加其他显示n
天前值的列,例如,lag_1
列显示给定commodity
前一天的值,lag_2
显示值2几天前,等等。我的df
非常大(> 200万行)。
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort('date')
date commodity value
0 2017-01-01 GOLD -1.239422
0 2017-01-01 SILVER -0.209840
1 2017-01-02 SILVER 0.146293
1 2017-01-02 GOLD 1.422454
2 2017-01-03 GOLD 0.453222
...
答案 0 :(得分:1)
我确信有更有效的方法,同时尝试这个解决方案:
for commo in df.market.unique():
df.loc[df.market==commo,'lag_1'] = df.loc[df.market==commo,'commodity'].shift(1)
df.loc[df.market==commo,'median_20'] = pd.rolling_median(df.loc[df.market==commo,'commodity'],20)
答案 1 :(得分:1)
尝试:
import pandas as pd
import numpy as np
# create dataframe
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort_values(by='date').reset_index(drop=True)
# create columns
df['median_20_temp'] = df.groupby('market')['commodity'].rolling(20).median()
df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
df['lag_1'] = df.groupby('market')['commodity'].shift(1)
df['lag_2'] = df.groupby('market')['commodity'].shift(2)
df.drop(['median_20_temp'], axis=1, inplace=True)
以下内容适用于版本0.16.2
:
import numpy as np
import pandas as pd
np.random.seed(123)
dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort('date').reset_index(drop=True)
# create columns
df['median_20_temp'] = df.groupby('market')['commodity'].apply(lambda s: pd.rolling_median(s, 20))
df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
df['lag_1'] = df.groupby('market')['commodity'].shift(1)
df['lag_2'] = df.groupby('market')['commodity'].shift(2)
df.drop(['median_20_temp'], axis=1, inplace=True)
我希望这会有所帮助。