Decoding sequences in GaussianHMM

I'm playing around with Hidden Markov Models for a stock market prediction problem. My data matrix contains various features for a particular security:

01-01-2001, .025, .012, .01
01-02-2001, -.005, -.023, .02

I fit a simple GaussianHMM:

from hmmlearn.hmm import GaussianHMM
mdl = GaussianHMM(n_components=3, covariance_type='diag', n_iter=1000)
mdl.fit(train[:, 1:])

Using the model (λ), I can decode an observation vector to find the most likely hidden state sequence corresponding to it:

print mdl.decode(test[0:5,1:])
(72.75, array([2, 1, 2, 0, 0]))
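
Note on the hmmlearn API: decode returns the Viterbi state path together with the joint log-likelihood of the observations and that single path, while score returns log P(O | λ) summed over all state paths via the forward algorithm. A minimal sketch reusing mdl and test from above:

# Joint log-likelihood of the best single state path (Viterbi) vs.
# the marginal log-likelihood over all paths (forward algorithm).
viterbi_logprob, states = mdl.decode(test[0:5, 1:])
forward_logprob = mdl.score(test[0:5, 1:])
print(viterbi_logprob)
print(states)
print(forward_logprob)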

Above, I've decoded the hidden state sequence of the observation vector O_t = (O_1, O_2, ..., O_d) containing the first five instances of the test set. I'd like to estimate the hidden state of the sixth instance in the test set. The idea is to iterate over a discrete set of possible feature values for the sixth instance and select the observation sequence O_t+1 with the highest likelihood, argmax P(O_1, O_2, ..., O_d+1 | λ). Once we observe the true feature values of O_d+1, we can shift the sequence (of length 5) by one and do it all over again:

from numpy import arange, array, vstack

l = 5
for i in xrange(len(test)-l):
    values = []
    for a in arange(-0.05, 0.05, .01):
        for b in arange(-0.05, 0.05, .01):
            for c in arange(-0.05, 0.05, .01):
                values.append(mdl.decode(vstack((test[i:i+l, 1:], array([a, b, c])))))
    print max(enumerate(values), key=lambda x: x[1])

The problem is that when I decode the observation vector O_t+1, the prediction with the highest likelihood is almost always the same (for example, the highest-likelihood estimate always has O_d+1 feature values equal to [0.04 0.04 0.04] and hidden state [0]):

(555, (74.71248518927949, array([2, 1, 2, 0, 0, 0]))) [ 0.04  0.04  0.04]
(555, (69.41963358191555, array([2, 2, 0, 0, 0, 0]))) [ 0.04  0.04  0.04]
(555, (77.11516871816922, array([2, 0, 0, 0, 0, 0]))) [ 0.04  0.04  0.04]

It's entirely possible that I'm misunderstanding the purpose of mdl.decode and therefore using it incorrectly. If that's the case, how can I best iterate over the possible values of O_d+1 and then maximize P(O_1, O_2, ..., O_d+1 | λ)?

1 Answer

    Are your actual values bounded on [-0.05, 0.05)?

    When I initially built a sample dataset to look at your problem, I drew random floats in [0, 1]. When I did that, I got the same result you're observing: for the sixth entry, the winner was always the maximum (a, b, c) in the grid, and always the same predicted class. But given that my data's distribution (uniform between 0 and 1) had a much higher central tendency than the grid-searched values for the sixth entry (between -.05 and .05), it makes sense that the HMM kept choosing the highest triplet of values, (.04, .04, .04), since that was closest to the main density of the distribution it was trained on.

    When I rebuilt the dataset with a uniform distribution over the same range of values we allow for the sixth entry ([-0.05, 0.05)), the output was much more diverse: both the selected (a, b, c) and the class predictions showed reasonable differentiation across sequences. From your example data it looks like you have at least both positive and negative values, but you might try plotting the distribution of each feature (see the histogram sketch after the sample dataset below) and checking whether your range of candidate sixth values is reasonable.

    Here's some sample data and evaluation code. It will print out a message each time there's a new best (a, b, c) sequence, or whenever the prediction for the 6th observation changes (just to show they aren't all identical). The highest likelihood for each 6-element sequence, along with the predictions and the best 6th data point, is stored in best_per_span.

    First, build a sample dataset:

    import numpy as np
    import pandas as pd
    
    dates = pd.date_range(start="01-01-2001", end="31-12-2001", freq='D')
    n_obs = len(dates)
    n_feat = 3
    values = np.random.uniform(-.05, .05, size=n_obs*n_feat).reshape((n_obs,n_feat))
    df = pd.DataFrame(values, index=dates)
    
    df.head()
                       0         1         2
    2001-01-01  0.020891 -0.048750 -0.027131
    2001-01-02  0.013571 -0.011283  0.041322
    2001-01-03 -0.008102  0.034088 -0.029202
    2001-01-04 -0.019666 -0.005705 -0.003531
    2001-01-05 -0.000238 -0.039251  0.029307
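
    As suggested above, it's worth eyeballing each feature's distribution before grid-searching over candidate sixth values. A minimal sketch using the df just built (assumes matplotlib is available; pandas' hist draws one histogram per column):

    import matplotlib.pyplot as plt

    # One histogram per feature column; check that the bulk of the mass
    # falls inside the grid-search range [-0.05, 0.05).
    df.hist(bins=20)
    plt.show()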
    

    Now split into training and test sets:

    train_pct = 0.7
    train_size = round(train_pct*n_obs)
    train_ix = np.random.choice(range(n_obs), size=train_size, replace=False)
    train_dates = df.index[train_ix]
    
    train = df.loc[train_dates]
    test = df.loc[~df.index.isin(train_dates)]
    
    train.shape # (255, 3)
    test.shape # (110, 3)
    

    Fit a 3-state HMM on the training data:

    # hmm throws a lot of deprecation warnings, we'll suppress them.
    import warnings
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore",category=DeprecationWarning)
        # in the most recent hmmlearn we can't import GaussianHMM directly anymore.
        from hmmlearn import hmm
    
    mdl = hmm.GaussianHMM(n_components=3, covariance_type='diag', n_iter=1000)
    mdl.fit(train)
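
    Once fitted, the learned parameters can be inspected directly; a quick sketch of a few attributes the fitted GaussianHMM exposes (means_, covars_, transmat_):

    # One row/entry per hidden state.
    print("State means:\n{}".format(mdl.means_))
    print("State covariances:\n{}".format(mdl.covars_))
    print("Transition matrix:\n{}".format(mdl.transmat_))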
    

    Now grid-search for the best 6th (t+1) observation:

    # length of O_t
    span = 5
    
    # final store of optimal configurations per O_t+1 sequence
    best_per_span = []
    
    # update these to demonstrate heterogeneous outcomes
    current_abc = None
    current_pred = None
    
    for start in range(len(test)-span):
        flag = False
        end = start + span
        first_five = test.iloc[start:end].values
        output = []
        for a in np.arange(-0.05,0.05,.01):
            for b in np.arange(-0.05,0.05,.01):
                for c in np.arange(-0.05,0.05,.01):
                    sixth = np.array([a, b, c])[:, np.newaxis].T
                    all_six = np.append(first_five, sixth, axis=0)
                    output.append((mdl.decode(all_six), (a,b,c)))
    
        best = max(output, key=lambda x: x[0][0])
    
        best_dict = {"start":start,
                     "end":end,
                     "sixth":best[1],
                     "preds":best[0][1],
                     "lik":best[0][0]}  
        best_per_span.append(best_dict)
    
        # below here is all reporting
        if best_dict["sixth"] != current_abc:
            current_abc = best_dict["sixth"]
            flag = True
            print("New abc for range {}:{} = {}".format(start, end, current_abc))
    
        if best_dict["preds"][-1] != current_pred:
            current_pred = best_dict["preds"][-1]
            flag = True
            print("New pred for 6th position: {}".format(current_pred))
    
        if flag:
            print("Test sequence StartIx: {}, EndIx: {}".format(start, end))
            print("Best 6th value: {}".format(best_dict["sixth"]))
            print("Predicted hidden state sequence: {}".format(best_dict["preds"]))
            print("Likelihood: {}\n".format(best_dict["nLL"]))
    

    A sample of the reporting output as the loop runs:

    New abc for range 3:8 = [-0.01, 0.01, 0.0]
    New pred for 6th position: 1
    Test sequence StartIx: 3, EndIx: 8
    Best 6th value: [-0.01, 0.01, 0.0]
    Predicted hidden state sequence: [0 2 2 1 0 1]
    Likelihood: 35.30144407374163
    
    New abc for range 18:23 = [-0.01, -0.01, -0.01]
    New pred for 6th position: 2
    Test sequence StartIx: 18, EndIx: 23
    Best 6th value: [-0.01, -0.01, -0.01]
    Predicted hidden state sequence: [0 0 0 1 2 2]
    Likelihood: 34.31813078939214
    

    A sample of the output in best_per_span:

    [{'end': 5,
      'lik': 33.791537281734904,
      'preds': array([0, 2, 0, 1, 2, 2]),
      'sixth': [-0.01, -0.01, -0.01],
      'start': 0},
     {'end': 6,
      'lik': 33.28967307589143,
      'preds': array([0, 0, 1, 2, 2, 2]),
      'sixth': [-0.01, -0.01, -0.01],
      'start': 1},
     {'end': 7,
      'lik': 34.446813870838156,
      'preds': array([0, 1, 2, 2, 2, 2]),
      'sixth': [-0.01, -0.01, -0.01],
      'start': 2}]
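
    A small usage sketch: since best_per_span is a list of flat dicts, it drops straight into a DataFrame if you want to inspect or plot the per-window results:

    # Collect the per-window optima for easier inspection.
    results = pd.DataFrame(best_per_span)
    print(results[["start", "end", "sixth", "lik"]].head())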
    

    Apart from the reporting elements, this isn't a dramatic change to your initial approach, but it does seem to work as expected rather than maxing out every time.
