Question

我正在玩Hidden Markov模型以解决股市预测问题。我的数据矩阵包含特定安全性的各种功能：

01-01-2001, .025, .012, .01
01-02-2001, -.005, -.023, .02

我适合一个简单的GaussianHMM：

from hmmlearn import GaussianHMM
mdl = GaussianHMM(n_components=3,covariance_type='diag',n_iter=1000)
mdl.fit(train[:,1:])

使用模型（λ），我可以解码观察向量，找到对应于观测向量的最可能的隐藏状态序列：

print mdl.decode(test[0:5,1:])
(72.75, array([2, 1, 2, 0, 0]))

上面，我已经解码了观察向量的隐藏状态序列O _t =（O ₁，O ₂，.. 。，O _d），其中包含测试集中的前五个实例。我想估计测试集中第六个实例的隐藏状态。我们的想法是迭代第六个实例的一组离散的可能特征值，并选择观察序列O _{t + 1}，其中最大似然argmax = P（O ₁ ，O ₂，...，O _{d + 1} |λ）。一旦我们观察到O _{d + 1}的真实特征值，我们就可以将序列（长度为5）移动一次并重新执行：

    l = 5
    for i in xrange(len(test)-l):
        values = []
        for a in arange(-0.05,0.05,.01):
            for b in arange(-0.05,0.05,.01):
                for c in arange(-0.05,0.05,.01):
                    values.append(mdl.decode(vstack((test[i:i+l,1:],array([a,b,c])))))
     print max(enumerate(values),key=lambda x: x[1])

问题在于，当我解码观测向量O _{t + 1}时，具有最高似然性的预测几乎总是相同的（例如，具有最高可能性的估计总是具有O <的特征值。 sub> d + 1 等于[0.04 0.04 0.04]并且是隐藏状态[0]）：

(555, (74.71248518927949, array([2, 1, 2, 0, 0, 0]))) [ 0.04  0.04  0.04]
(555, (69.41963358191555, array([2, 2, 0, 0, 0, 0]))) [ 0.04  0.04  0.04]
(555, (77.11516871816922, array([2, 0, 0, 0, 0, 0]))) [ 0.04  0.04  0.04]

我完全有可能误解了mdl.decode的目的，因而错误地使用了它。如果是这种情况，我怎样才能最好地迭代O _{d + 1}的可能值，然后最大化P（O ₁，O ₂，...，O _{d + 1} |λ）？

Answer 1

您的实际值是否有限[-0.05, 0.05)？

当我最初构建一个样本数据集来查看您的问题时，我在[0,1]中绘制了随机浮点数。当我这样做时，我也得到了你正在观察的相同结果 - 对于第六个条目，每个序列始终是(a,b,c)的最大值，并且始终是相同的预测类。但鉴于我的数据分布（均匀分布在0和1之间）比第六个条目的网格搜索值（-.05和{{1}之间）具有更大的集中趋势}），HMM总是选择最高值的三元组.05，因为它最接近它所训练的分布的主密度。

当我使用与我们允许的第六个条目（(.04,.04,.04)）相同的可能值范围内的均匀分布绘制重建数据集时，输出变化更大：[-0.05,0.05)选择和类别预测显示序列之间的合理分化。从您的示例数据来看，您似乎确实至少同时具有正值和负值，但您可能会尝试绘制每个要素的分布，并查看您的可能的第六个值的范围是否合理。

这是一些示例数据和评估代码。每次有新的最佳O_t+1序列时，或者当第6次观察的预测发生变化时（仅表明它们不完全相同），它将打印出一条消息。每个6元素序列的最高可能性，以及预测和最佳的第6个数据点都存储在(a,b,c)中。

首先，构建一个样本数据集：

best_per_span

现在分为训练和测试集：

import numpy as np
import pandas as pd

dates = pd.date_range(start="01-01-2001", end="31-12-2001", freq='D')
n_obs = len(dates)
n_feat = 3
values = np.random.uniform(-.05, .05, size=n_obs*n_feat).reshape((n_obs,n_feat))
df = pd.DataFrame(values, index=dates)

df.head()
                   0         1         2
2001-01-01  0.020891 -0.048750 -0.027131
2001-01-02  0.013571 -0.011283  0.041322
2001-01-03 -0.008102  0.034088 -0.029202
2001-01-04 -0.019666 -0.005705 -0.003531
2001-01-05 -0.000238 -0.039251  0.029307

在训练数据上拟合三态HMM：

train_pct = 0.7
train_size = round(train_pct*n_obs)
train_ix = np.random.choice(range(n_obs), size=train_size, replace=False)
train_dates = df.index[train_ix]

train = df.loc[train_dates]
test = df.loc[~df.index.isin(train_dates)]

train.shape # (255, 3)
test.shape # (110, 3)

现在网格搜索最佳的第6次（# hmm throws a lot of deprecation warnings, we'll suppress them. import warnings with warnings.catch_warnings(): warnings.filterwarnings("ignore",category=DeprecationWarning) # in the most recent hmmlearn we can't import GaussianHMM directly anymore. from hmmlearn import hmm mdl = hmm.GaussianHMM(n_components=3, covariance_type='diag', n_iter=1000) mdl.fit(train)）观察：

t+1

循环运行时报告输出的示例：

# length of O_t
span = 5

# final store of optimal configurations per O_t+1 sequence
best_per_span = []

# update these to demonstrate heterogenous outcomes
current_abc = None
current_pred = None

for start in range(len(test)-span):
    flag = False
    end = start + span
    first_five = test.iloc[start:end].values
    output = []
    for a in np.arange(-0.05,0.05,.01):
        for b in np.arange(-0.05,0.05,.01):
            for c in np.arange(-0.05,0.05,.01):
                sixth = np.array([a, b, c])[:, np.newaxis].T
                all_six = np.append(first_five, sixth, axis=0)
                output.append((mdl.decode(all_six), (a,b,c)))

    best = max(output, key=lambda x: x[0][0])

    best_dict = {"start":start,
                 "end":end,
                 "sixth":best[1],
                 "preds":best[0][1],
                 "lik":best[0][0]}  
    best_per_span.append(best_dict)

    # below here is all reporting
    if best_dict["sixth"] != current_abc:
        current_abc = best_dict["sixth"]
        flag = True
        print("New abc for range {}:{} = {}".format(start, end, current_abc))

    if best_dict["preds"][-1] != current_pred:
        current_pred = best_dict["preds"][-1]
        flag = True
        print("New pred for 6th position: {}".format(current_pred))

    if flag:
        print("Test sequence StartIx: {}, EndIx: {}".format(start, end))
        print("Best 6th value: {}".format(best_dict["sixth"]))
        print("Predicted hidden state sequence: {}".format(best_dict["preds"]))
        print("Likelihood: {}\n".format(best_dict["nLL"]))

New abc for range 3:8 = [-0.01, 0.01, 0.0] New pred for 6th position: 1 Test sequence StartIx: 3, EndIx: 8 Best 6th value: [-0.01, 0.01, 0.0] Predicted hidden state sequence: [0 2 2 1 0 1] Likelihood: 35.30144407374163 New abc for range 18:23 = [-0.01, -0.01, -0.01] New pred for 6th position: 2 Test sequence StartIx: 18, EndIx: 23 Best 6th value: [-0.01, -0.01, -0.01] Predicted hidden state sequence: [0 0 0 1 2 2] Likelihood: 34.31813078939214的输出示例：

best_per_span

除了报告元素之外，这不是对您的初始方法的重大更改，但它似乎确实按预期工作，而不是每次都最大化。

在GaussianHMM

1 个答案: