Question about setting up my input feature array to train an LSTM classifier so that it takes past observations into account

Time: 2019-04-19 17:10:11

Tags: python keras lstm

I am trying to understand how to set up an LSTM with Keras for a time-series binary classification problem. I have put together an LSTM example, but it does not seem to pick up any information from previous observations. I think my current approach only uses the feature data from the current observation.

Below is my self-contained demo code.

My question is: for the LSTM to pick up patterns from previous observations, do I need to define a sliding window so that each observation actually includes the data from the preceding observations (covering the sliding-window period), or does Keras pick this up from the features array on its own?
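For concreteness, the kind of sliding-window reshape I have in mind would look roughly like the sketch below. This is hypothetical and not part of my demo code; the names window, X_windowed and y_windowed are just for illustration, and features_array / labels_array are the arrays built further down.

import numpy as np

window = 6  # e.g. shift_value + 1 past observations per sample
# each sample holds the current observation plus the previous window - 1 rows
X_windowed = np.array([features_array[i - window + 1:i + 1]
                       for i in range(window - 1, len(features_array))])
y_windowed = labels_array[window - 1:]
# X_windowed.shape == (len(features_array) - window + 1, window, 3)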

import random
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from sklearn.model_selection import train_test_split
from keras.layers.recurrent import LSTM
from sklearn.preprocessing import LabelEncoder

# this section just generates some sample data
# the pattern we are trying to pick up on is that
# shift_value number of observations prior to a True
# label, the features are always [.5, .5, .5]

shift_value = 5
n_examples = 10000

features = []
labels = []
random.seed(1)

# create the labels
for i in range(n_examples + shift_value):
    labels.append(random.choice([True, False]))

# create the features
for label in labels:
    if label:
        features.append([.5, .5, .5]) 
    else:
        feature_1 = random.random()
        feature_2 = random.random()
        feature_3 = random.random()
        features.append([feature_1, feature_2, feature_3])

df = pd.DataFrame(features)
df['label'] = labels
df.columns = ['A', 'B', 'C', 'label']
# shift the labels so each True label appears shift_value rows after its [.5, .5, .5] features
df['label'] = df['label'].shift(shift_value)
df = df.dropna()

features_array = df[['A', 'B', 'C']].values
labels_array = df[['label']].values

# reshape the data

X_train, X_test, Y_train, Y_test = train_test_split(features_array, labels_array, test_size = .1, shuffle=False)

X_train_reshaped = np.reshape(X_train, (len(X_train), 1, X_train.shape[1]))
X_test_reshaped = np.reshape(X_test, (len(X_test), 1, X_test.shape[1]))

encoder = LabelEncoder()
Y_train_encoded = encoder.fit_transform(Y_train)
Y_test_encoded  = encoder.transform(Y_test)  # reuse the encoding fitted on the training labels

# define and run the model

neurons = 10
batch_size = 100
model = Sequential()
model.add(LSTM(neurons, 
               batch_input_shape=(batch_size,
                                  X_train_reshaped.shape[1], 
                                  X_train_reshaped.shape[2] 
                                  ),
               activation = 'sigmoid',
               stateful = False)
               )

model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train_reshaped, 
          Y_train_encoded, 
          validation_data=(X_test_reshaped, Y_test_encoded), 
          epochs=10, 
          batch_size=batch_size)

The example above never converges, and I don't think it is taking previous observations into account at all. It should be able to find the underlying pattern that the observation 5 steps before a True label always has the features [.5, .5, .5].

1 Answer:

Answer 0 (score: 1):

This is a sequence problem. Consider the learning problem as follows:

Given a sequence of length seq_length, if the input at time step t is [0.5, 0.5, 0.5], then the output at time step t + shift_value is 1; otherwise the output at time step t + shift_value is 0.

To model this learning problem you can use an LSTM that is unrolled seq_length times, with each step taking an input of size 3. Likewise, each time step has a corresponding output of size 1 (corresponding to True or False), as shown below:

[figure: an LSTM unrolled over seq_length time steps, with a 3-dimensional input and a 1-dimensional output at each step]
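As an aside, if you want to use this framing with data that comes as one long series (like the features_array / labels_array in your demo), one possibility is to chop the series into fixed-length chunks first. The sketch below assumes exactly that; to_sequences is just an illustrative name, not part of any library.

# hypothetical helper: chop one long (timesteps, 3) series and its labels
# into non-overlapping sequences of length seq_length
def to_sequences(features_array, labels_array, seq_length):
    n_seq = len(features_array) // seq_length
    X = features_array[:n_seq * seq_length].reshape(n_seq, seq_length, 3)
    Y = labels_array[:n_seq * seq_length].reshape(n_seq, seq_length, 1)
    return X, Y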

Code:

import random
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.recurrent import LSTM

shift_value = 5
seq_length = 50

def generate_data(n, shift_value, seq_length):
    # random 3-dimensional features and random binary labels for every time step
    X = np.random.rand(n, seq_length, 3)
    Y = np.random.randint(0, 2, size=(n, seq_length))
    # plant the pattern: whenever the label at step i is 1, force the features
    # shift_value steps earlier to [0.5, 0.5, 0.5]
    for j in range(len(Y)):
        for i in range(shift_value, len(Y[j])):
            if Y[j][i] == 1:
                X[j][i - shift_value] = np.array([0.5, 0.5, 0.5])
    return X, Y.reshape(n, seq_length, 1)

# Generate Train and Test Data
X_train, Y_train = generate_data(9000,shift_value,seq_length)
X_test, Y_test = generate_data(100,shift_value,seq_length)

# Train the model
neurons = 32
batch_size = 100
model = Sequential()
model.add(LSTM(neurons, 
               batch_input_shape=(batch_size, seq_length, 3),
               activation = 'relu',
               stateful = False,
               return_sequences = True))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, 
          Y_train,
          validation_data=(X_test, Y_test), 
          epochs=30, 
          batch_size=batch_size)

Output (filtered):

...
Epoch 30/30
9000/9000 [=========] - loss: 0.1650 - acc: 0.9206 - val_loss: 0.1362 - val_acc: 0.9324

Within 30 epochs it reaches about 93% validation accuracy. Even though the label is a deterministic function of the input, the model will never be 100% accurate, because the first shift_value labels of each sequence are ambiguous (they are not determined by any earlier input).
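As a rough back-of-the-envelope estimate of mine (not something measured from the training run): only those first shift_value outputs per sequence are unpredictable coin flips, so the best achievable per-step accuracy should be around:

shift_value = 5
seq_length = 50
# the last seq_length - shift_value steps per sequence are fully determined
# by the input; the first shift_value steps are 50/50 guesses at best
ceiling = ((seq_length - shift_value) + 0.5 * shift_value) / seq_length
print(ceiling)  # 0.95

The ~93% validation accuracy above sits just below that ceiling.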