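I have a dataset in which every observation has a start date and an end date. For a given set of dates, I want to compute a weighted average over the observations whose start date is on or before that date and whose end date is on or after it. I also want to do this separately for each group in the dataset.
I have managed to do this with a loop, but it is quite slow, and I feel there must be a better way. Any help would be much appreciated!
Here is my current code, with some test data: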
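To make the target quantity concrete, this is what the loop further down appears to compute for a single date d. The notation calc(d) and A(d) is introduced here only for illustration; the column names are the ones used in the test data.

```latex
\[
\mathrm{calc}(d) \;=\;
\frac{\sum_{i \in A(d)} \mathrm{days}(\mathrm{end\_date}_i - d)\,\cdot\,\mathrm{value}_i}
     {\sum_{i \in A(d)} \mathrm{value}_i},
\qquad
A(d) = \{\, i : \mathrm{start\_date}_i \le d \le \mathrm{end\_date}_i \,\}
\]
```

In other words, it is the average number of days remaining until end_date, weighted by value, over the observations active on d. When A(d) is empty the result is left as NaN.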
# Setup
import pandas as pd
import numpy as np
# Dates to loop over and df to hold result
dates = pd.date_range(start='10/1/2011', periods=5, freq='M')
result = pd.DataFrame(columns=["date","calc"])
result['date'] = dates
result = result.set_index('date')
# Test data
data = {"group": ['group1', 'group2'] * 5,
        "start_date": pd.to_datetime(['2011-11-02', '2011-11-03', '2011-11-02', '2011-11-01',
                                      '2011-11-04', '2011-11-04', '2011-11-04', '2011-11-07',
                                      '2011-11-07', '2011-11-07']),
        "end_date": pd.to_datetime(['2012-02-02', '2011-11-17', '2011-11-16', '2011-12-01',
                                    '2012-02-06', '2011-11-18', '2012-02-06', '2011-12-07',
                                    '2012-02-07', '2012-03-07']),
        "value": np.random.randint(100, size=10)}
df = pd.DataFrame(data)
# For one group
df2 = df[df.group == 'group1']
for date in dates:
    # Rows active on this date: start_date <= date <= end_date
    tmp = df2.loc[(df2.start_date <= date) & (df2.end_date >= date)].copy()
    if tmp.empty:
        continue  # leaves NaN in result for this date
    # Days remaining until end_date, multiplied by value
    tmp['volume_x_days'] = tmp.apply(lambda x: (x.end_date - date).days * x.value, axis=1)
    result.loc[date, "calc"] = tmp.volume_x_days.sum() / tmp.value.sum()
The output should be the value-weighted average for each date, something like this:
               calc
date
2011-10-31      NaN
2011-11-30  56.8957
2011-12-31  44.7739
2012-01-31  13.7739
2012-02-29        7
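One direction that might avoid the Python-level loop is to pair every observation with every date in a single cross join, filter to the active pairs, and aggregate with `groupby`. The following is only a sketch under assumptions: the function name `weighted_days_remaining` is hypothetical, the small fixed-value dataset replaces the random test data so the result is reproducible, and `merge(how="cross")` requires pandas 1.2 or later.

```python
import pandas as pd


def weighted_days_remaining(df, dates):
    """Value-weighted mean of days left until end_date, per group and date.

    Hypothetical helper: only rows with start_date <= date <= end_date
    contribute; (group, date) pairs with no active rows come back as NaN.
    """
    # Pair every observation with every date (len(df) * len(dates) rows).
    grid = df.merge(pd.DataFrame({"date": dates}), how="cross")
    active = grid[(grid.start_date <= grid.date) & (grid.end_date >= grid.date)].copy()
    active["days_x_value"] = (active.end_date - active.date).dt.days * active.value
    sums = active.groupby(["group", "date"])[["days_x_value", "value"]].sum()
    calc = sums.days_x_value / sums.value
    # Reindex onto the full group x date grid so empty pairs appear as NaN.
    full = pd.MultiIndex.from_product(
        [sorted(df.group.unique()), dates], names=["group", "date"]
    )
    return calc.reindex(full).unstack("group")


# Small deterministic example (values chosen for illustration only).
example = pd.DataFrame({
    "group": ["group1", "group1", "group2", "group2"],
    "start_date": pd.to_datetime(["2011-11-02", "2011-11-04", "2011-11-03", "2011-11-01"]),
    "end_date": pd.to_datetime(["2012-02-02", "2012-02-06", "2011-11-17", "2011-12-01"]),
    "value": [10, 30, 5, 20],
})
example_dates = pd.to_datetime(["2011-10-31", "2011-11-30", "2011-12-31"])

out = weighted_days_remaining(example, example_dates)
print(out)
```

The result has one row per date and one column per group. The trade-off is memory: the cross join materialises one row for every observation and date pair, so for very large inputs it may need to be done in chunks of dates.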