我希望使用多个模型的加权平均值来预测看不见的数据。
当要平均的模型包含取决于Theano共享变量的确定性时,我无法使用 sample_ppc_w 。
简化示例:
(笔记本here中更完整的示例)
对 a 和 b 之间的差异进行建模,以预测新的 a 时 b 的HPD范围。 strong>可用。
数据
df = pd.DataFrame(dict(a=np.random.normal(0.1, 1.5, 1000),
b=np.random.normal(-0.05, 1.2, 1000)))
df['diff'] = df['a'] - df['b']
此notebook中显示了我遵循的平均方法。
使用计算出的 diff 观察值时,我能够成功地对多个模型进行平均。
模型1
with pm.Model() as model_t1:
mu = pm.Uniform('mu', lower=-5, upper=5)
sd = pm.Uniform('sd', lower=0, upper=10)
nu = pm.Uniform('nu', lower=0, upper=10)
y = pm.StudentT('y', mu=mu, nu=nu, sd=sd, observed=df['diff'])
trace_t1 = pm.sample(20000, tune=1000, njobs=2)
btrace_t1 = trace_t1[1000:]
模型2
with pm.Model() as model_l1:
mu = pm.Uniform('mu', lower=-5, upper=5)
b = pm.Uniform('b', lower=0, upper=10)
y = pm.Laplace('y', mu=mu, b=b, observed=df['diff'])
trace_l1 = pm.sample(20000, tune=1000, njobs=2)
btrace_l1 = trace_l1[1000:]
比较
models_dict1 = {
model_t1: btrace_t1,
model_l1: btrace_l1,
}
comparison = pm.compare(models_dict1, method='stacking', ic='WAIC')
加权平均值中的采样
traces1 = [btrace_t1, btrace_l1]
models1 = [model_t1, model_l1]
ppc_w1 = pm.sample_ppc_w(traces1, models=models1, weights=comparison1w.weight.sort_index(ascending=True))
但是我希望对看不见的数据进行预测。
我遵循的预测方法是将Theano共享变量用于 a 和 b 并使用确定性变量对 diff 进行建模,如{{3 }}。
我可以成功地使用此方法通过单个模型对看不见的数据进行预测:
使用模型1进行预测
a_shared_t = tt.shared(df['a'].values)
b_shared_t = tt.shared(df['b'].values)
with pm.Model() as model_t2:
arv = pm.Normal('arv', mu=df['a'].mean(), sd=df['a'].std(), observed=a_shared_t)
brv = pm.Normal('brv', mu=df['b'].mean(), sd=df['b'].std(), observed=b_shared_t)
diff = pm.Deterministic('diff', arv - brv)
mu = pm.Uniform('mu', lower=-5, upper=5)
sd = pm.Uniform('sd', lower=0, upper=10)
nu = pm.Uniform('nu', lower=0, upper=10)
y = pm.StudentT('y', mu=mu, nu=nu, sd=sd, observed=diff)
trace_t2 = pm.sample(20000, tune=1000, njobs=2)
btrace_t2 = trace_t2[1000:]
observed_a = 1.8
a_shared_t.set_value(np.append(df['a'].values, observed_a))
b_shared_t.set_value(np.append(df['a'].values, 0.))
ppc_t2_updated = pm.sample_ppc(btrace_t2,
samples=10000,
model=model_t2,
size=100)
sample = ppc_t2_updated['y'][:, -1]
hpd_80_diff = pm.stats.hpd(sample, alpha=0.2)
predicted_b_range = observed_a - hpd_80_diff[1], observed_a - hpd_80_diff[0]
== (-0.7764686082453378, 3.890428414268656)
使用模型2进行预测
a_shared_l = tt.shared(df['a'].values)
b_shared_l = tt.shared(df['b'].values)
with pm.Model() as model_l2:
arv = pm.Normal('arv', mu=df['a'].mean(), sd=df['a'].std(), observed=a_shared_l)
brv = pm.Normal('brv', mu=df['b'].mean(), sd=df['b'].std(), observed=b_shared_l)
diff = pm.Deterministic('diff', arv - brv)
mu = pm.Uniform('mu', lower=-5, upper=5)
b = pm.Uniform('b', lower=0, upper=10)
y = pm.Laplace('y', mu=mu, b=b, observed=diff)
trace_l2 = pm.sample(20000, tune=1000, njobs=2)
btrace_l2 = trace_l2[1000:]
observed_a = 1.8
dummy = 0.
a_shared_l.set_value(np.append(df['a'].values, observed_a))
b_shared_l.set_value(np.append(df['a'].values, dummy))
ppc_l2_updated = pm.sample_ppc(btrace_l2,
samples=10000,
model=model_l2,
size=100)
sample = ppc_l2_updated['y'][:, -1]
hpd_80_diff = pm.stats.hpd(sample, alpha=0.2)
predicted_b_range = observed_a - hpd_80_diff[1], observed_a - hpd_80_diff[0]
== (-0.8231007149866174, 3.9518915820568434)
但是,当尝试使用 pm.sample_ppc_w 从平均模型中采样时:
traces2 = [btrace_t2, btrace_l2]
models2 = [model_t2, model_l2]
ppc_w2 = pm.sample_ppc_w(traces2, models=models2)
我收到以下错误:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~/.pyenv/versions/3.6.4/envs/ganesha/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
51 try:
---> 52 return getattr(obj, method)(*args, **kwds)
53
AttributeError: 'list' object has no attribute 'repeat'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-197-2cc68ceb6ac2> in <module>()
1 traces2 = [btrace_t2, btrace_l2]
2 models2 = [model_t2, model_l2]
----> 3 ppc_w2 = pm.sample_ppc_w(traces2, models=models2)
~/.pyenv/versions/3.6.4/envs/ganesha/lib/python3.6/site-packages/pymc3/sampling.py in sample_ppc_w(traces, samples, models, weights, random_seed, progressbar)
1160
1161 obs = [x for m in models for x in m.observed_RVs]
-> 1162 variables = np.repeat(obs, n)
1163
1164 lengths = list(set([np.atleast_1d(observed).shape for observed in obs]))
~/.pyenv/versions/3.6.4/envs/ganesha/lib/python3.6/site-packages/numpy/core/fromnumeric.py in repeat(a, repeats, axis)
421
422 """
--> 423 return _wrapfunc(a, 'repeat', repeats, axis=axis)
424
425
~/.pyenv/versions/3.6.4/envs/ganesha/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
60 # a downstream library like 'pandas'.
61 except (AttributeError, TypeError):
---> 62 return _wrapit(obj, method, *args, **kwds)
63
64
~/.pyenv/versions/3.6.4/envs/ganesha/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
40 except AttributeError:
41 wrap = None
---> 42 result = getattr(asarray(obj), method)(*args, **kwds)
43 if wrap:
44 if not isinstance(result, mu.ndarray):
ValueError: operands could not be broadcast together with shape (6,) (2,)
看来这可能是由于跟踪变量的数量不匹配造成的,但是对于我来说,我不知道该如何解决。
在对包含确定性模型的模型进行平均时,是否可以使用 sample_ppc_w ?
非常感谢