Question

I have a DataFrame with 7 million rows of daily temperature differences between roughly 6,000 cities; the index is the daily date.

col_1 = City2 - City1 (sum of temp spread day hours 6am - 9pm)
col_2 = City2 - City1 (sum of temp spread night hours 10pm-6am)

I want to calculate:

positive_spread_sum = sum(City2-City1, where City2-City1 > 0)
negative_spread_sum = sum(City2-City1, where City2-City1 < 0)
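
To make the two definitions concrete, here is a minimal sketch on a made-up series of daily spreads for a single city pair. The values are hypothetical and only illustrate the filtering; `clip` is shown as an equivalent way to get the same sums.

```python
import pandas as pd

# Hypothetical daily spreads (City2 - City1) for one city pair.
spread = pd.Series([3.0, -1.5, 0.0, 2.5, -4.0])

# Sum only the positive values, then only the negative values.
positive_spread_sum = spread[spread > 0].sum()   # 3.0 + 2.5
negative_spread_sum = spread[spread < 0].sum()   # -1.5 + -4.0

# Clipping at zero yields the same sums without boolean indexing,
# because the clipped-away values contribute 0 to the total.
assert positive_spread_sum == spread.clip(lower=0).sum()
assert negative_spread_sum == spread.clip(upper=0).sum()
```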

I am doing it as follows:

pos_sum = df.groupby(['city1', 'city2']).resample('M', how={'col_1':lambda x: x[x>0].sum()})

neg_sum = df.groupby(['city1', 'city2']).resample('M', how={'col_2':lambda x: x[x<0].sum()})

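The problem is that I collapse/aggregate the same DataFrame multiple times while iterating over it. Each collapse/aggregation takes 0.019 seconds, which leads to an execution time of 37 x 2 = 74 hours.

That is inefficient.

Is there any way I can perform ONE collapse / agg / groupby and compute multiple statistics by passing multiple variables to the lambda function?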

lambda x,y: x[x>0].sum(), y[y<0].sum()
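
One way to get a function that sees both columns at once is `groupby(...).apply` with a function that receives the whole group and returns a `pd.Series`. The sketch below assumes the layout described above (daily `DatetimeIndex`, columns `city1`, `city2`, `col_1`, `col_2`); the toy data and the name `spread_sums` are hypothetical. Note that `apply` calls Python once per group, so this is about convenience rather than speed.

```python
import pandas as pd

# Hypothetical toy frame: daily index, one city pair, two spread columns.
idx = pd.date_range("2015-01-01", periods=4, freq="D")
df = pd.DataFrame(
    {
        "city1": "A",
        "city2": "B",
        "col_1": [1.0, -2.0, 3.0, -4.0],
        "col_2": [-1.0, 2.0, -3.0, 4.0],
    },
    index=idx,
)

def spread_sums(group):
    # group is a DataFrame with both columns for one (city1, city2, month).
    x, y = group["col_1"], group["col_2"]
    return pd.Series(
        {
            "positive_spread_sum": x[x > 0].sum(),
            "negative_spread_sum": y[y < 0].sum(),
        }
    )

# Group by city pair and calendar month in a single pass.
month = df.index.to_period("M").rename("month")
both = df.groupby(["city1", "city2", month])[["col_1", "col_2"]].apply(spread_sums)
print(both)
```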

Update:

I was able to do this in one groupby:

    pos_neg_sum = df.groupby(['city1', 'city2']).resample('M', how={'col_1': lambda x: x[x>0].sum(),
                                                                    'col_2': lambda y: y[y<0].sum()})

I passed a dict of multiple functions to how=.

Is this the most efficient way to do it?
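
A possible alternative, sketched under the assumption that the frame looks as described (daily `DatetimeIndex`, columns `city1`, `city2`, `col_1`, `col_2`): zero out the unwanted values once with `clip`, then run a single groupby with the built-in `sum`, which avoids calling a Python lambda for every group. The sample data and the output column names `pos_col_1` and `neg_col_2` are made up for illustration, and no timing is claimed here.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with the same layout as in the question.
dates = pd.date_range("2015-01-01", periods=90, freq="D")  # Jan to Mar 2015
rng = np.random.default_rng(0)
frames = []
for c1, c2 in [("A", "B"), ("A", "C")]:
    frames.append(
        pd.DataFrame(
            {
                "city1": c1,
                "city2": c2,
                "col_1": rng.normal(size=len(dates)),
                "col_2": rng.normal(size=len(dates)),
            },
            index=dates,
        )
    )
df = pd.concat(frames)

# Replace the values that should not count with 0, once, for the whole frame.
clipped = df.assign(
    pos_col_1=df["col_1"].clip(lower=0),
    neg_col_2=df["col_2"].clip(upper=0),
)

# One groupby over (city1, city2, month) using the built-in sum.
month = clipped.index.to_period("M").rename("month")
monthly = clipped.groupby(["city1", "city2", month])[["pos_col_1", "neg_col_2"]].sum()
print(monthly)
```

The result has one row per city pair and month, with both statistics side by side.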

Passing multiple variables to a lambda function in how= in resample()

0 answers: