I have a pandas pivot table that was shifted earlier and now looks like this:
pivot
A B C D E
0 5.3 5.1 3.5 4.2 4.5
1 5.3 4.1 3.5 4.2 NaN
2 4.3 4.1 3.5 NaN NaN
3 4.3 4.1 NaN NaN NaN
4 4.3 NaN NaN NaN NaN
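I am trying to compute a rolling mean with a variable window (3 periods and 4 periods in this case) along each anti-diagonal, and to store the result in a new dataframe, like this: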
expected_df with a 3 periods window
A B C D E
0 4.3 4.1 3.5 4.2 4.5
expected_df with a 4 periods window
A B C D E
0 4.5 4.3 3.5 4.2 4.5
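So far I have tried subsetting the original pivot table into a separate dataframe that holds only the values inside the window for each column, and then computing the mean, like this: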
subset
A B C D E
0 4.3 4.1 3.5 4.2 4.5
1 4.3 4.1 3.5 4.2 NaN
2 4.3 4.1 3.5 NaN NaN
To do that, I tried building the following for loop:
df2 = pd.DataFrame()
size = pivot.shape[0]
window = 3
for i in range(size):
    df2[i] = pivot.iloc[size-window-i:size-i,i]
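This does not work, even though pivot.iloc[size-window-i:size-i,i] does return the values I need when I pass the indices by hand. Inside the for loop it misses the first value of the second column, and so on: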
df2
A B C D E
0 4.3 NaN NaN NaN NaN
1 4.3 4.1 NaN NaN NaN
2 4.3 4.1 3.5 NaN NaN
Does anyone have a good idea for how to compute the moving average, or how to fix the for loop? Thanks in advance for your comments.
Answer 0 (score: 5)
IIUC:
Shift every column back, so that each anti-diagonal lines up as a row:
shifted = pd.concat([df.iloc[:, i].shift(i) for i in range(df.shape[1])], axis=1)
shifted
A B C D E
0 5.3 NaN NaN NaN NaN
1 5.3 5.1 NaN NaN NaN
2 4.3 4.1 3.5 NaN NaN
3 4.3 4.1 3.5 4.2 NaN
4 4.3 4.1 3.5 4.2 4.5
Then you can take the mean.
# Change this to get the last n number of rows
shifted.iloc[-3:].mean()
A 4.3
B 4.1
C 3.5
D 4.2
E 4.5
dtype: float64
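To make the shift-then-average steps above easy to try, here is a minimal self-contained sketch. It assumes the pivot from the question is rebuilt by hand, and the helper name anti_diagonal_mean is hypothetical (it is not part of pandas).

```python
import numpy as np
import pandas as pd

nan = np.nan

# The pivot from the question, rebuilt by hand (an assumption about its layout).
pivot = pd.DataFrame({
    "A": [5.3, 5.3, 4.3, 4.3, 4.3],
    "B": [5.1, 4.1, 4.1, 4.1, nan],
    "C": [3.5, 3.5, 3.5, nan, nan],
    "D": [4.2, 4.2, nan, nan, nan],
    "E": [4.5, nan, nan, nan, nan],
})


def anti_diagonal_mean(df, window):
    """Hypothetical helper: shift column i down by i rows, then average
    the last `window` rows of each column (NaN values are skipped)."""
    shifted = pd.concat(
        [df.iloc[:, i].shift(i) for i in range(df.shape[1])], axis=1
    )
    return shifted.iloc[-window:].mean()


mean_3 = anti_diagonal_mean(pivot, 3)
mean_4 = anti_diagonal_mean(pivot, 4)

# One-row frames, in the same layout as expected_df in the question.
print(mean_3.to_frame().T)
print(mean_4.to_frame().T)
```

With this data the 3-period result matches the expected row in the question. The 4-period means for A and B come out as roughly 4.55 and 4.35, which the question lists as 4.5 and 4.3.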
Or the rolling mean:
# Change the first argument to set the window size
shifted.rolling(3, min_periods=1).mean()
A B C D E
0 5.300000 NaN NaN NaN NaN
1 5.300000 5.100000 NaN NaN NaN
2 4.966667 4.600000 3.5 NaN NaN
3 4.633333 4.433333 3.5 4.2 NaN
4 4.300000 4.100000 3.5 4.2 4.5
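I'll use strides to construct a 3-D array and average over one of its axes. This is faster, but confusing...
Also, I wouldn't actually use this. I just wanted to nail down how to grab the diagonal elements via strides. It was mostly practice for me, and I wanted to share it.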
import numpy as np
from numpy.lib.stride_tricks import as_strided as strided
a = df.values
roll = 3
r_ = roll - 1 # one less than roll
h, w = a.shape
w_ = w - 1 # one less than width
b = np.empty((h + 2 * w_ + r_, w), dtype=a.dtype)
b.fill(np.nan)
b[w_ + r_:-w_] = a
s0, s1 = b.strides
a_ = np.nanmean(strided(b, (h + w_, roll, w), (s0, s0, s1 - s0))[w_:], axis=1)
pd.DataFrame(a_, df.index, df.columns)
A B C D E
0 5.300000 NaN NaN NaN NaN
1 5.300000 5.100000 NaN NaN NaN
2 4.966667 4.600000 3.5 NaN NaN
3 4.633333 4.433333 3.5 4.2 NaN
4 4.300000 4.100000 3.5 4.2 4.5
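For comparison, the same diagonal gather can be sketched with plain integer indexing instead of strides. This is not the method used above, only an alternative that may be easier to follow. The function name diagonal_rolling_mean and the hand-built pivot are assumptions made for illustration.

```python
import warnings

import numpy as np
import pandas as pd

nan = np.nan

# The pivot from the question, rebuilt by hand (an assumption about its layout).
pivot = pd.DataFrame({
    "A": [5.3, 5.3, 4.3, 4.3, 4.3],
    "B": [5.1, 4.1, 4.1, 4.1, nan],
    "C": [3.5, 3.5, 3.5, nan, nan],
    "D": [4.2, 4.2, nan, nan, nan],
    "E": [4.5, nan, nan, nan, nan],
})


def diagonal_rolling_mean(a, roll):
    """Hypothetical helper: for output cell (i, j), average the source rows
    i - j - r for r in 0..roll-1, ignoring NaN and out-of-range rows."""
    h, w = a.shape
    i = np.arange(h)[:, None, None]     # output row
    r = np.arange(roll)[None, :, None]  # position inside the window
    j = np.arange(w)[None, None, :]     # column
    k = i - j - r                       # source row, shape (h, roll, w)
    # Clip so the lookup is always in bounds, then mask the invalid rows.
    gathered = np.where(k >= 0, a[np.clip(k, 0, h - 1), j], np.nan)
    with warnings.catch_warnings():
        # All-NaN windows make nanmean warn; the result is NaN, as intended.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.nanmean(gathered, axis=1)


result = pd.DataFrame(
    diagonal_rolling_mean(pivot.to_numpy(), 3), pivot.index, pivot.columns
)
print(result)
```

The intermediate array has shape (rows, roll, columns), so memory use grows with the window size. That is the trade-off for avoiding the stride arithmetic.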
I feel better about this than I do about using strides:
import numpy as np
from numba import njit
import warnings
@njit
def dshift(a, roll):
    h, w = a.shape
    b = np.empty((h, roll, w), dtype=np.float64)
    b.fill(np.nan)
    for r in range(roll):
        for i in range(h):
            for j in range(w):
                k = i - j - r
                if k >= 0:
                    b[i, r, j] = a[k, j]
    return b
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=RuntimeWarning)
    df_ = pd.DataFrame(np.nanmean(dshift(a, 3), axis=1), df.index, df.columns)
df_
A B C D E
0 5.300000 NaN NaN NaN NaN
1 5.300000 5.100000 NaN NaN NaN
2 4.966667 4.600000 3.5 NaN NaN
3 4.633333 4.433333 3.5 4.2 NaN
4 4.300000 4.100000 3.5 4.2 4.5
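As for the for loop in the question, two things seem to get in its way. Assigning a Series to a DataFrame column aligns on the index labels, so slices taken from different rows do not line up. And once size-window-i becomes negative, the iloc slice comes back empty. Below is a minimal sketch of one possible repair, assuming the pivot from the question is rebuilt by hand. The helper name window_subset is hypothetical.

```python
import numpy as np
import pandas as pd

nan = np.nan

# The pivot from the question, rebuilt by hand (an assumption about its layout).
pivot = pd.DataFrame({
    "A": [5.3, 5.3, 4.3, 4.3, 4.3],
    "B": [5.1, 4.1, 4.1, 4.1, nan],
    "C": [3.5, 3.5, 3.5, nan, nan],
    "D": [4.2, 4.2, nan, nan, nan],
    "E": [4.5, nan, nan, nan, nan],
})


def window_subset(df, window):
    """Hypothetical helper: collect the last `window` filled rows of each
    column into a fresh frame, padding short columns with NaN."""
    size = df.shape[0]
    columns = {}
    for i, name in enumerate(df.columns):
        stop = max(size - i, 0)        # column i is filled down to row size - i
        start = max(stop - window, 0)  # never let the slice start go negative
        # to_numpy() drops the index labels, so no alignment takes place.
        values = df.iloc[start:stop, i].to_numpy(dtype=float)
        padded = np.full(window, np.nan)
        padded[:len(values)] = values
        columns[name] = padded
    return pd.DataFrame(columns)


subset = window_subset(pivot, 3)
print(subset)
print(subset.mean())
```

For a window of 3 this reproduces the subset frame shown in the question, and its column means equal the last-three-rows mean of the shifted frame.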