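I am trying to verify the pandas ewm.std calculation so that I can implement a one-step update in my own code. Here is the full description of the problem, with code.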
import numpy as np
import pandas as pd
mrt = pd.Series(np.random.randn(1000))
N = 100
a = 2/(1+N)
bias = (2-a)/2/(1-a)
x = mrt.iloc[-2]
ma = mrt.ewm(span=N).mean().iloc[-3]
var = mrt.ewm(span=N).var().iloc[-3]
ans = mrt.ewm(span=N).std().iloc[-2]
print(np.sqrt( bias*(1-a) * (var + a * (x- ma)**2)), ans)
(1.1352524643949702,1.1436193844674576)
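I used the standard formula. Can someone tell me why these two values are not the same? In other words, how does pandas calculate the exponentially weighted std?
EDIT: Following Julien's answer, let me give one more use case. I am plotting the ratio between the var computed by pandas and the var from the formula I inferred from the Cython code for pandas' ewm covariance. This ratio should be 1. (I guess there is a problem with my formula; it would help if someone could point it out.)
mrt = pd.Series(np.random.randn(1000))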
N = 100
a = 2./(1+N)
bias = (2-a)/2./(1-a)
mewma = mrt.ewm(span=N).mean()
var_pandas = mrt.ewm(span=N).var()
var_calculated = bias * (1-a) * (var_pandas.shift(1) + a * (mrt-mewma.shift(1))**2)
(var_calculated/var_pandas).plot()
The plot clearly shows the problem.
EDIT 2: By trial and error, I found the correct formula:
var_calculated = (1-a) * (var_pandas.shift(1) + bias * a * (mrt-mewma.shift(1))**2)
But I am not convinced that it should be the right one! Can someone explain it?
Answer 0: (score: 3)
Your question really reduces to how pandas calculates ewm.var():
In [1]:
(np.sqrt(mrt.ewm(span=span).var()) == mrt.ewm(span=span).std())[1:].value_counts()
Out[1]:
True 999
dtype: int64
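So in your example above: ans == np.sqrt(mrt.ewm(span=N).var().iloc[-2]).
To investigate how ewmvar() is computed, note that it amounts to the ewmcov routine called with input_x=input_y=mrt.
If we check the first elements: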
mrt.ewm(span=span).var()[:2].values
> array([nan, 0.00555309])
Now, taking the ewmcov routine and applying it to our specific case:
x0 = mrt.iloc[0]
x1 = mrt.iloc[1]
x2 = mrt.iloc[2]
# mean_x and mean_y are both the same, here we call it y
# This is the same as mrt.ewm(span=span).mean(), I verified that too
y0 = x0
# y1 = mrt.ewm(span=span).mean().iloc[1]
y1 = ((1-alpha)*y0 + x1)/(1+(1-alpha))
#y2 = (((1-alpha)**2+(1-alpha))*y1 + x2) / (1 + (1-alpha) + (1-alpha)**2)
cov0 = 0
cov1 = (((1-alpha) * (cov0 + ((y0 - y1)**2))) +
(1 * ((x1 - y1)**2))) / (1 + (1-alpha))
# new_wt = 1, sum_wt0 = (1-alpha), sum_wt2 = (1-alpha)**2
sum_wt = 1+(1-alpha)
sum_wt2 =1+(1-alpha)**2
numerator = sum_wt * sum_wt # (1+(1-alpha))^2 = 1 + 2(1-alpha) + (1-alpha)^2
denominator = numerator - sum_wt2 # # 2*(1-alpha)
print(np.nan,cov1*(numerator / denominator))
>(nan, 0.0055530905712123432)
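The two-element hand calculation above can be carried through the whole series. The following is a minimal sketch, not the pandas source: the function name ewm_var_recursive and its variable names are made up for illustration, and it assumes the defaults adjust=True and bias=False and an input without NaNs.

```python
import numpy as np
import pandas as pd


def ewm_var_recursive(values, span):
    """Hypothetical helper: bias-corrected EW variance via a running recursion.

    Assumes adjust=True, bias=False and no NaNs in `values`.
    """
    alpha = 2.0 / (1.0 + span)
    decay = 1.0 - alpha
    out = np.full(len(values), np.nan)  # first element stays NaN, as in pandas

    mean = values[0]   # running weighted mean
    cov = 0.0          # running biased weighted variance
    sum_wt = 1.0       # running sum of weights
    sum_wt2 = 1.0      # running sum of squared weights

    for t in range(1, len(values)):
        x = values[t]
        old_wt = decay * sum_wt            # all old weights shrink by (1 - alpha)
        sum_wt = old_wt + 1.0              # the new sample gets weight 1
        sum_wt2 = decay**2 * sum_wt2 + 1.0
        new_mean = (old_wt * mean + x) / sum_wt
        cov = (old_wt * (cov + (mean - new_mean) ** 2)
               + (x - new_mean) ** 2) / sum_wt
        mean = new_mean
        # bias correction: sum_wt^2 / (sum_wt^2 - sum_wt2)
        out[t] = cov * sum_wt**2 / (sum_wt**2 - sum_wt2)
    return out


rng = np.random.default_rng(0)
series = pd.Series(rng.standard_normal(500))
manual = ewm_var_recursive(series.to_numpy(), span=100)
reference = series.ewm(span=100).var().to_numpy()
print(np.allclose(manual, reference, equal_nan=True))
```

Only mean, cov, sum_wt and sum_wt2 are carried from one step to the next, so the loop body is itself a one-step update of the kind the question asks for.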
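Answer 1: (score: 0)
According to the documentation of the ewm function, the flag adjust=True is used by default. As explained in the links below, the exponentially weighted moving values are then computed from explicit weights rather than from the recursive relations. This is justified, especially when the series is short.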
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html https://github.com/pandas-dev/pandas/issues/8861
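As a small illustration of the difference between the two modes, the sketch below compares the last value of the ewm mean under adjust=True with an explicit weighted average, and under adjust=False with the plain recursion y_t = (1-a)*y_{t-1} + a*x_t. The series, the span and all variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.standard_normal(50))
N = 10
a = 2.0 / (1 + N)

# adjust=True: weighted average with weights (1-a)^k, newest sample has weight 1
w = (1 - a) ** np.arange(len(x) - 1, -1, -1)
mean_weighted = np.sum(w * x.to_numpy()) / np.sum(w)

# adjust=False: plain recursion started at the first sample
y = x.iloc[0]
for v in x.iloc[1:]:
    y = (1 - a) * y + a * v
mean_recursive = y

mean_adjust_true = x.ewm(span=N, adjust=True).mean().iloc[-1]
mean_adjust_false = x.ewm(span=N, adjust=False).mean().iloc[-1]

print(np.isclose(mean_weighted, mean_adjust_true))
print(np.isclose(mean_recursive, mean_adjust_false))
```

The two modes differ mostly near the start of a series; for long series the results approach each other.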
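This means that ewma and ewmvar are computed as ordinary weighted averages and variances, with the weights being the exponentially decreasing factors: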
mrt_array = np.array(mrt.tolist())
M = len(mrt_array)
weights = (1-a)**np.arange(M-1, -1, -1) # This is reverse order to match Series order
ewma = sum(weights * mrt_array) / sum(weights)
bias = sum(weights)**2 / (sum(weights)**2 - sum(weights**2))
ewmvar = bias * sum(weights * (mrt_array - ewma)**2) / sum(weights)
ewmstd = np.sqrt(ewmvar)
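The weight-based bias factor also suggests why the formula found by trial and error in EDIT 2 of the question works. For a long series, sum(weights) tends to 1/a and sum(weights**2) tends to 1/(a*(2-a)), so bias tends to (2-a)/(2*(1-a)), the constant used in the question. In the same limit the biased variance V obeys V_t = (1-a)*(V_{t-1} + a*(x_t - m_{t-1})**2); multiplying both sides by the constant bias gives the EDIT 2 form for the bias-corrected variance that pandas returns. The first attempt in the question instead multiplied an already bias-corrected var by bias again. The sketch below checks both statements numerically; the series length M and the names bias_limit and ratio_tail are chosen only for this example.

```python
import numpy as np
import pandas as pd

N = 100
a = 2.0 / (1 + N)
M = 2000  # long enough that (1-a)**M is negligible

# Bias factor from explicit weights versus its long-series limit
weights = (1 - a) ** np.arange(M - 1, -1, -1)
bias_weights = weights.sum() ** 2 / (weights.sum() ** 2 - (weights ** 2).sum())
bias_limit = (2 - a) / (2 * (1 - a))
print(bias_weights, bias_limit)

# EDIT 2 formula checked against pandas on the tail of a long series
rng = np.random.default_rng(0)
mrt = pd.Series(rng.standard_normal(M))
mewma = mrt.ewm(span=N).mean()
var_pandas = mrt.ewm(span=N).var()
var_calculated = (1 - a) * (
    var_pandas.shift(1) + bias_limit * a * (mrt - mewma.shift(1)) ** 2
)
ratio_tail = (var_calculated / var_pandas).iloc[-100:]
print(ratio_tail.min(), ratio_tail.max())
```

Near the start of the series the ratio is not 1, because there the sums of weights have not yet reached their limits.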
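Answer 2: (score: 0)
@kosnik Thanks for the answer above. I have copied your code below and built on it to answer the question here. It computes the exponential moving variance and standard deviation over the whole dataset, and the computed values match the output of .ewm().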
# Import libraries
import numpy as np
import pandas as pd
# Create DataFrame
mrt = pd.Series(np.random.randn(1000))
df = pd.DataFrame(data=mrt, columns=['data'])
# Initialize
N = 3 # Span
a = 2./(1+N) # Alpha
# Use .ewm() to calculate 'exponential moving variance' directly
var_pandas = df.ewm(span=N).var()
std_pandas = df.ewm(span=N).std()
# Initialize variable
varcalc=[]
stdcalc=[]
# Calculate exponential moving variance
for i in range(0,len(df.data)):
    z = np.array(df.data.iloc[0:i+1].tolist())
    # Get weights: w
    n = len(z)
    w = (1-a)**np.arange(n-1, -1, -1) # This is reverse order to match Series order
    # Calculate exponential moving average
    ewma = np.sum(w * z) / np.sum(w)
    # Calculate bias
    bias = np.sum(w)**2 / (np.sum(w)**2 - np.sum(w**2))
    # Calculate exponential moving variance with bias
    ewmvar = bias * np.sum(w * (z - ewma)**2) / np.sum(w)
    # Calculate standard deviation
    ewmstd = np.sqrt(ewmvar)
    # Append
    varcalc.append(ewmvar)
    stdcalc.append(ewmstd)
#print('ewmvar:',ewmvar)
#varcalc
df['var_pandas'] = var_pandas
df['varcalc'] = varcalc
df['std_pandas'] = std_pandas
df['stdcalc'] = stdcalc
df