Question

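I want an expanding mean of col2 within each groupby('col1') group, but the mean should exclude the row itself (it should only include the rows above it).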

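I would have expected the one_liner and two_liner versions to give the same result. two_liner is the expected result; one_liner mixes numbers across groups.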

It took me a long time to work out this solution. Can someone explain the logic? Why doesn't one_liner give the expected result?

Answer 1

You are looking for expanding().mean() followed by shift(), applied inside the groupby():

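groups = df.groupby('col1')
# Expanding mean within each group, shifted down one row so the current row is excluded.
# transform keeps the original index, so the result aligns with df on assignment.
df['one_liner'] = groups.col2.transform(lambda x: x.expanding().mean().shift())

# The same operation applied again, this time to the one_liner column.
df['two_liner'] = groups.one_liner.transform(lambda x: x.expanding().mean().shift())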

Output:

  col1  col2  one_liner  two_liner
0    a     1        NaN        NaN
1    a     2        1.0        NaN
2    a     3        1.5        1.0
3    b     4        NaN        NaN
4    b     5        4.0        NaN
5    b     6        4.5        4.0
6    c     7        NaN        NaN
7    c     8        7.0        NaN
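As a cross-check, the one_liner column above can be reproduced by hand. The following is a minimal plain-Python sketch, not part of the original answer; the helper name `shifted_expanding_mean` and the use of `None` in place of NaN are assumptions made for illustration.

```python
from collections import defaultdict

# The (col1, col2) pairs from the table above.
rows = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5), ("b", 6), ("c", 7), ("c", 8)]


def shifted_expanding_mean(rows):
    """For each (key, value) row, average the earlier values that share its key."""
    totals = defaultdict(float)  # running sum of values seen so far, per key
    counts = defaultdict(int)    # number of values seen so far, per key
    result = []
    for key, value in rows:
        # Only rows above the current one contribute; None stands in for NaN.
        result.append(totals[key] / counts[key] if counts[key] else None)
        totals[key] += value
        counts[key] += 1
    return result


print(shifted_expanding_mean(rows))
# [None, 1.0, 1.5, None, 4.0, 4.5, None, 7.0]
```

Each group starts with an empty history, which is why the first row of every group has no value.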

Explanation:

(dummy.groupby('col1').col2.shift()   # shifts col2 within each group
     .expanding().mean()              # ignores the grouping and expands over the whole series
     .reset_index(level=0, drop=True) # not really important here
)

So the chained command above is equivalent to

s1 = dummy.groupby('col1').col2.shift()
s2 = s1.expanding().mean()
s3 = s2.reset_index(level=0, drop=True)

As you can see, only s1 takes the grouping by col1 into account.
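To make the difference concrete, here is a small sketch that rebuilds the data from the output table and compares the ungrouped expanding mean with a per-group one. The variable names `mixed` and `per_group` are introduced here for illustration, and the construction of `dummy` is assumed from the table rather than taken from the original post.

```python
import pandas as pd

# Data reconstructed from the col1/col2 columns of the output table.
dummy = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "col2": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Shift within each group; the result is an ordinary Series with no grouping attached.
s1 = dummy.groupby("col1")["col2"].shift()

# Expanding over that plain Series accumulates values across group boundaries.
mixed = s1.expanding().mean()

# Doing both steps inside transform keeps the expanding window within each group.
per_group = dummy.groupby("col1")["col2"].transform(
    lambda x: x.shift().expanding().mean()
)

print(pd.DataFrame({"mixed": mixed, "per_group": per_group}))
```

In `mixed`, row 4 (the second row of group b) averages 1, 2 and 4, so values from group a leak into group b. In `per_group`, the same row is simply 4.0.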

How do I calculate a shifted expanding mean per group
