How to build an LSTM model for a variable number of features in a time series

Asked: 2019-03-12 03:15:55

Tags: python tensorflow keras lstm

I have a Keras LSTM that I built for time-series data using a one-off dataset. I set it aside while working on other parts of the system, but now I'm moving it into production and need to adapt it to work with a variable number of features. Users will start with a small number of features and gradually add more:

[continous_value: 0.1, categoryA: 1]
[continous_value: 0.2, categoryA: 0, categoryB: 1]
[continous_value: 0.3, categoryA: 1, categoryB: 0]
...

Most features will "drop off" because they don't recur, so they are easy to prune by moving the window forward in time, but some do recur regularly. My LSTM is currently built around a single user's data within the window.

Each row is a 15-minute sample, and my sample data happens to have 2 continuous features and 7 categorical features. I have a 14-day seasonality (4 * 24 * 14 = 1344 timesteps), so I have been resampling into shapes like x: (1344, 14, 9) and y: (1344, 9).

Now, to let the model work for different users, I started adding "padding columns", but that isn't ideal: I have to guess what the maximum number of features will be, and the larger that number, the worse the model's predictions.
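(For completeness, the padding-column idea can be done by reindexing each user's frame against one fixed master column list. A hypothetical sketch with pandas, where ALL_COLUMNS and the extra category name are made-up placeholders, not part of the real system:)

```python
import pandas as pd

# Hypothetical master schema listing every feature any user might send
ALL_COLUMNS = ['continous_value', 'categoryA', 'categoryB', 'categoryC']

user_rows = pd.DataFrame([
    {'continous_value': 0.1, 'categoryA': 1},
    {'continous_value': 0.2, 'categoryA': 0, 'categoryB': 1},
])

# reindex adds any missing columns; fillna(0) zeroes the gaps,
# so every user's frame ends up with the same fixed width
fixed = user_rows.reindex(columns=ALL_COLUMNS).fillna(0)
print(fixed.shape)  # (2, 4)
```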

A Keras LSTM can take a variable number of timesteps by setting timesteps = None (giving x: (b, None, 9), I believe), but I can't see how to make that work with multivariate time-series data.
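For reference, the None trick varies the number of timesteps, not the number of features: the trailing feature axis stays fixed. A minimal sketch (the layer width of 16 is arbitrary):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

n_features = 9  # the feature axis must stay fixed

model = Sequential([
    # None in the timesteps slot accepts sequences of any length;
    # only the trailing (feature) dimension is locked in
    Input(shape=(None, n_features)),
    LSTM(16),
    Dense(n_features),
])
model.compile(optimizer='adam', loss='mean_squared_error')

# batches with different sequence lengths both pass through
short = np.zeros((4, 10, n_features), dtype='float32')
long_seq = np.zeros((4, 50, n_features), dtype='float32')
print(model.predict(short).shape)     # (4, 9)
print(model.predict(long_seq).shape)  # (4, 9)
```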

How would I change this to generate the data correctly?

Data wrangling:

import numpy as np

# the memory of an RNN depends on the number of timesteps you select:
# if timesteps = n, the output depends on the previous n inputs

n = 14 # well over the weekly periodicity of the data

# Create an input set spanning n timesteps, so the
# output for the current step is based on
# the values of the previous n steps

len_train = 49 * DAYS
len_test = 7 * DAYS # DAYS = 4 * 24 samples per day

# dataForTest is a project-specific helper, defined elsewhere
train, test, endog, exog = dataForTest(len_train=len_train, len_test=len_test, offset=3*DAYS)

#print(train.iloc[-1],'\n',test.iloc[-1])
print(endog, exog)

dim = len(endog) + len(exog)

window = len_test # alternatively dim * 100: using dim ensures the result reshapes evenly by dim

X_train = []
y_train = []
for i in range(n, n+(window)): 
    X_train.append(train[i - n: i].values)
    y_train.append(train[i:i+1].values)

X_train, y_train = np.array(X_train), np.array(y_train)

y_train = y_train.reshape((window,dim))

print(X_train.shape, y_train.shape)

(batch_size, timesteps, dim) = X_train.shape
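The windowing loop above can be sanity-checked on synthetic data. A sketch with placeholder sizes (a plain NumPy array stands in for the real DataFrame, and train[i] replaces the train[i:i+1].values plus reshape round-trip):

```python
import numpy as np

# Placeholder sizes, not the real dataset: n past steps predict 1 step of dim features
n, dim, window = 14, 9, 20
train = np.arange((n + window) * dim, dtype=float).reshape(-1, dim)

X, y = [], []
for i in range(n, n + window):
    X.append(train[i - n: i])  # the previous n timesteps
    y.append(train[i])         # the single step to predict

X, y = np.array(X), np.array(y)
print(X.shape, y.shape)  # (20, 14, 9) (20, 9)
```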

Model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import ModelCheckpoint

# Initialise the Sequential model
regressor = Sequential()

# units is the output dimensionality
# return_sequences=True returns the full sequence,
# which is required by the next LSTM layer

# as a great big rule-o-thumb, layers should be less than 10, and perhaps 1 per endog plus 1 for all exog
# also see: https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw/1097#1097

# columns and series come from the wrangling step (series is the source DataFrame)
alphaNh = len(columns) if len(columns) < 10 else 10 # 2-10, with 2 or 5 being common
nh = int(batch_size/(alphaNh*2*len(series.columns)))

dropout = 0.2

print('nh', nh)  

# input shape will need only the last 2 dimensions
# of your input
################# 1st layer #######################
regressor.add(LSTM(units=nh, return_sequences=True, stateful=True, batch_size=batch_size,
                   input_shape=(timesteps, dim)))

# add Dropout for regularization
# standard practice is to use 20%
# regressor.add(Dropout(dropout))

layers = (len(endog) + 1) if len(endog) > 1 else 2
print('layers', layers)
for i in range(1, layers):
  # After the first time, it's not required to 
  # specify the input_shape
  ################# layer #######################
#  if i > 5:
#      break
  if i < layers - 1:
    cell = LSTM(units=nh, return_sequences=True, stateful=True, batch_size=batch_size)
  else:
    cell = LSTM(units=nh, stateful=True, batch_size=batch_size)

  regressor.add(cell)

################# Dropout layer #################
# After the recurrent layers we apply some dropout.
# another option is to put this after each LSTM
# layer (above)
#
# standard practice is to use 20%

regressor.add(Dropout(dropout))

################# Last layer ####################
# The last layer is the fully connected layer,
# or the Dense layer
#
# It predicts all dim features for the next
# timestep, hence units=dim

regressor.add(Dense(units=dim))

# Compiling the RNN
# The loss function for a classification problem is
# cross entropy; since this is a regression problem
# the loss function is mean squared error

regressor.compile(optimizer='adam', loss='mean_squared_error')

### src: https://keras.io/callbacks/
#saves the model weights after each epoch if the monitored (training) loss decreased
###
checkpointer = ModelCheckpoint(filepath='weights.hdf5', verbose=1, monitor='loss', mode='min', save_best_only=True)

0 answers:

There are no answers yet