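My data contains 700,00 rows.
I tried using a for loop, and the process took 30 hours. Please let me know a faster way to get the result.
I am attaching a sample dataset. Each row is unique for the columns [period, dimname, facility, serv, cpt]. I want to compute the rolling monthly average of column gcr for each combination of [period-dimname-facility-cpt]. (The last column, avg6month, contains the expected result.) For better understanding, a filtered result set is attached in JPEG format.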
data.sort_values(by='period', inplace=True, ascending=True)
for fa in data.loc[(data.dimname == 'fac_cpt'), ].facility.dropna().unique():
    for pr in data.loc[(data.dimname == 'fac_cpt') & (data.facility == fa), ].cpt.dropna().unique():
        # rows belonging to the current (facility, cpt) pair
        mask = (data.dimname == 'fac_cpt') & (data.facility == fa) & (data.cpt == pr)
        data.loc[mask, ['avg6monthgcr']] = round(
            data.loc[mask, ].gcr.rolling(6, min_periods=1).mean(), 4
        )
Sample_Data:
Samples_Results:
Answer 0: (score: 0)
I managed to get the desired result with vectorized operations, so it should be the fastest approach.
import pandas as pd
data = pd.DataFrame({
    "period": [
        '3/1/2017', '3/1/2017', '3/1/2017', '3/1/2017', '3/1/2017', '3/1/2017', '3/1/2017',
        '4/1/2017', '4/1/2017', '4/1/2017', '4/1/2017', '4/1/2017', '4/1/2017', '4/1/2017'
    ],
    "dimname": [
        'fac_cpt', 'fac_cpt', 'fac_cpt', 'fac_cpt', 'fac_cpt', 'ser_cpt', 'ser_cpt',
        'fac_cpt', 'fac_cpt', 'fac_cpt', 'fac_cpt', 'fac_cpt', 'ser_cpt', 'ser_cpt'
    ],
    "facility": ['a', 'a', 'a', 'b', 'b', None, None, 'a', 'a', 'a', 'b', 'b', None, None],
    "cpt": ['p1', 'p2', 'p3', 'p1', 'p2', 'p1', 'p2', 'p1', 'p2', 'p3', 'p1', 'p2', 'p1', 'p1'],
    "ser": [None, None, None, None, None, 'c', 'c', None, None, None, None, None, 'd', 'd'],
    "gcr": [1, 10, 2, 3, 8, 12, 4, 4, 10, 2, 4, 11, 6, 2]
})
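# parse the date strings in one vectorized call
data["period"] = pd.to_datetime(data["period"])
# rolling mean of gcr over the last 6 rows within each (dimname, facility, cpt) group
data[["period", "dimname", "facility", "cpt", "gcr"]].groupby(
    ['dimname', 'facility', 'cpt']
).rolling(6, min_periods=1, on='period').mean().reset_index(
    3, drop=True  # drop the original row index (index level 3)
).reset_index().rename(columns={'gcr': 'avg6monthgcr'})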
# Output:
   | dimname | facility | cpt | avg6monthgcr | period
---------------------------------------------------------
 0 | fac_cpt | a        | p1  | 1.0          | 2017-03-01
 1 | fac_cpt | a        | p1  | 2.5          | 2017-04-01
 2 | fac_cpt | a        | p2  | 10.0         | 2017-03-01
 3 | fac_cpt | a        | p2  | 10.0         | 2017-04-01
 4 | fac_cpt | a        | p3  | 2.0          | 2017-03-01
 5 | fac_cpt | a        | p3  | 2.0          | 2017-04-01
 6 | fac_cpt | b        | p1  | 3.0          | 2017-03-01
 7 | fac_cpt | b        | p1  | 3.5          | 2017-04-01
 8 | fac_cpt | b        | p2  | 8.0          | 2017-03-01
 9 | fac_cpt | b        | p2  | 9.5          | 2017-04-01
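To see where a pair of output values comes from, here is a minimal sketch for a single group, (fac_cpt, a, p1), whose gcr values in the sample data are 1 (March) and 4 (April). The variable names are only for illustration.

```python
import pandas as pd

# gcr values for the group (fac_cpt, a, p1), ordered by period: March, then April
gcr = pd.Series([1, 4])

# Window of 6 rows; min_periods=1 lets the mean use however many rows exist so far,
# so the first row averages just itself and the second averages both rows.
avg = gcr.rolling(6, min_periods=1).mean()
print(avg.tolist())  # [1.0, 2.5]
```

Note that rolling(6) counts rows, not calendar months, so it only corresponds to a six-month window when each group has exactly one row per month.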
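I timed it on your sample dataset, but the gain was small, probably because setup overhead rather than the computation itself accounts for most of the run time, so you should try it on the full dataset.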
# your method:
27.6 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# my method:
24.9 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
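Since the sample is too small to show a difference, one way to compare the two approaches is on a larger synthetic dataset. The following is only a rough sketch: the helper names (make_data, loop_version, groupby_version) and the sizes (20 facilities, 20 cpt codes, 12 months) are hypothetical, the data is random, and the grouped version assigns the result directly via droplevel rather than reproducing the exact pipeline above.

```python
import time

import numpy as np
import pandas as pd


def make_data(n_fac=20, n_cpt=20, n_months=12, seed=0):
    """Build a synthetic frame, already ordered by period (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    periods = pd.date_range("2017-01-01", periods=n_months, freq="MS")
    idx = pd.MultiIndex.from_product(
        [periods, [f"f{i}" for i in range(n_fac)], [f"p{j}" for j in range(n_cpt)]],
        names=["period", "facility", "cpt"],
    )
    df = idx.to_frame(index=False)
    df["dimname"] = "fac_cpt"
    df["gcr"] = rng.random(len(df))
    return df


def loop_version(df):
    """Nested loops over (facility, cpt), in the spirit of the question's code."""
    out = df.copy()
    for fa in out["facility"].dropna().unique():
        for pr in out.loc[out["facility"] == fa, "cpt"].dropna().unique():
            mask = (out["facility"] == fa) & (out["cpt"] == pr)
            out.loc[mask, "avg6monthgcr"] = (
                out.loc[mask, "gcr"].rolling(6, min_periods=1).mean().round(4)
            )
    return out


def groupby_version(df):
    """One grouped rolling mean, aligned back on the original row index."""
    out = df.copy()
    rolled = (
        out.groupby(["dimname", "facility", "cpt"])["gcr"]
        .rolling(6, min_periods=1)
        .mean()
    )
    out["avg6monthgcr"] = rolled.droplevel([0, 1, 2]).round(4)
    return out


df = make_data()
t0 = time.perf_counter()
a = loop_version(df)
t1 = time.perf_counter()
b = groupby_version(df)
t2 = time.perf_counter()
print(f"loop: {t1 - t0:.3f}s, groupby: {t2 - t1:.3f}s")
```

The absolute timings depend on the machine and the pandas version; increasing n_fac and n_cpt should make the gap between the two versions more visible.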
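If you need to merge the result back into the original DataFrame, you should modify the code to preserve the original index, because merging on the index is faster. It would look like this: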
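avg_data = (
    data[["period", "dimname", "facility", "cpt", "gcr"]]
    .groupby(['dimname', 'facility', 'cpt'])
    .rolling(6, min_periods=1, on='period')
    .mean()
    .reset_index(level=3)      # move the original row index into a column named 'level_3'
    .reset_index(drop=True)    # discard the group-key index levels
    .set_index('level_3')      # restore the original row index
    .rename(columns={'gcr': 'avg6monthgcr'})
    .drop('period', axis=1)
)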
data.merge(avg_data, left_index=True, right_index=True, how='left')
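As a possible alternative to the merge, the rolling mean can be assigned straight back to the original frame by dropping the group-key levels from the result index. This is a minimal sketch on a reduced version of the sample data, not the approach shown above; it assumes a pandas version that provides Series.droplevel.

```python
import pandas as pd

data = pd.DataFrame({
    "period": pd.to_datetime(["3/1/2017", "3/1/2017", "3/1/2017", "4/1/2017", "4/1/2017"]),
    "dimname": ["fac_cpt", "fac_cpt", "ser_cpt", "fac_cpt", "fac_cpt"],
    "facility": ["a", "b", None, "a", "b"],
    "cpt": ["p1", "p1", "p1", "p1", "p1"],
    "gcr": [1, 3, 12, 4, 4],
})

# Rows must be in period order within each group before rolling.
data = data.sort_values("period")

# The result index is (dimname, facility, cpt, original row index).
rolled = (
    data.groupby(["dimname", "facility", "cpt"])["gcr"]
    .rolling(6, min_periods=1)
    .mean()
)

# Dropping the three group-key levels leaves the original row index,
# so the assignment aligns each mean with its source row.
data["avg6monthgcr"] = rolled.droplevel([0, 1, 2]).round(4)
print(data)
```

Rows whose facility is missing (the ser_cpt rows here) are excluded by groupby, so they end up with NaN in avg6monthgcr, which matches what a left merge would produce.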