Question

我对熊猫的滚动功能有一些疑问，我不太确定我要去哪里。如果我模拟两个数字测试序列：

df_index = pd.date_range(start='1990-01-01', end ='2010-01-01', freq='D')
test_df = pd.DataFrame(index=df_index)

test_df['Series1'] = np.random.randn(len(df_index))
test_df['Series2'] = np.random.randn(len(df_index))

然后很容易查看它们的滚动年度相关性：

test_df['Series1'].rolling(365).corr(test_df['Series2']).plot()

产生：

到目前为止一切都很好。如果然后我尝试使用datetime偏移量做同样的事情：

test_df['Series1'].rolling('365D').corr(test_df['Series2']).plot()

我得到一个截然不同的结果（显然是错误的）：

熊猫有问题吗？或者我有问题吗？

预先感谢您能为这个令人困扰的难题提供帮助。

Answer 1

这主要是因为两次滚动365和365D的结果不同。例如

sub = test_df.head()
sub['Series2'].rolling(2).sum()
Out[15]: 
1990-01-01         NaN
1990-01-02   -0.355230
1990-01-03    0.844281
1990-01-04    2.515529
1990-01-05    1.508412

sub['Series2'].rolling('2D').sum()
Out[16]: 
1990-01-01   -0.043692
1990-01-02   -0.355230
1990-01-03    0.844281
1990-01-04    2.515529
1990-01-05    1.508412

由于365滚动中有很多NaN，所以两个系列的两个序列的corr都不相同。

Answer 2

这非常棘手，我认为window作为 int 和 offset 的行为是不同的：

0.19.0版中的新功能是可以传递偏移量（或     转换为.rolling（）方法并使其产生可变大小     窗口基于传递的时间窗口。 对于每个时间点，     包括在指定时间内发生的所有先前值     增量。

这对于非常规时间频率索引特别有用。

您应该签出Time-aware Rolling的文档。

r1 = test_df['Series1'].rolling(window=365) # has default `min_periods=365`
r2 = test_df['Series1'].rolling(window='365D') # has default `min_periods=1`
r3 = test_df['Series1'].rolling(window=365, min_periods=1)

r1.corr(test_df['Series2']).plot()
r2.corr(test_df['Series2']).plot()
r3.corr(test_df['Series2']).plot()

此代码将为r2.corr().plot()和r3.corr().plot()产生相似的图形形状，但是请注意，计算结果仍然不同：r2.corr(test_df['Series2']) == r3.corr(test_df['Series2'])。

我认为对于常规的时间频率索引，您应该坚持使用r1。

使用日期/时间偏移量的Pandas .rolling.corr

2 个答案: