Shape mismatch when attempting train_on_batch

Time: 2017-11-06 23:16:09

Tags: python machine-learning keras lstm

Following up on my previous question, Keras LSTM Accuracy too high, I realized I couldn't train on my GPU because to_categorical raises a MemoryError. After some research I found that I need to use train_on_batch and split my dataset accordingly. Now I'm stuck, and I can't seem to find an answer anywhere. Here is my code:

import numpy
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout, TimeDistributed
from keras.preprocessing.sequence import pad_sequences
from pandas import read_csv
import simplejson
from keras.utils.np_utils import to_categorical

numpy.random.seed(7)

dataset = read_csv("dataset_6_cols_short.csv", delimiter=",", quotechar='"').values

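# NOTE: a dict comprehension keeps the LAST assignment for duplicate keys,
# so each name/timezone is mapped to the row index of its last occurrence;
# the resulting ids can therefore be as large as len(dataset) - 1.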
char_to_int = dict((c, i) for i, c in enumerate(dataset[:,1]))
char_to_int_timezone = dict((c, i) for i, c in enumerate(dataset[:,2]))

f = open('char_to_int_v2.txt', 'w')
simplejson.dump(char_to_int, f)
f.close()

num_classes = 6

# Length of sequence to predict
seq_length = 1

max_len = 5

dataX = []
dataY = []

for i in range(0, len(dataset) - seq_length, 1):
    start = numpy.random.randint(len(dataset)-2)
    end = numpy.random.randint(start, min(start+max_len,len(dataset)-1))
    sequence_in = dataset[start:end+1]
    sequence_out = dataset[end + 1]
    dataX.append([[char[0], char_to_int[char[1]], char_to_int_timezone[char[2]], char[3], char[4], char[5]] for char in sequence_in])
    dataY.append([char_to_int[sequence_out[1]]])

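# pad_sequences left-pads each window with zero rows up to max_len, so X
# already comes back as (n_samples, max_len, 6); the reshape below only
# re-asserts that shape (num_classes is really the number of feature columns).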
X = pad_sequences(dataX, maxlen=max_len, dtype='float32')
X = numpy.reshape(X, (X.shape[0], max_len, num_classes))

batch_size = 100

n_nxt = 1
n_prev = 5

model = Sequential()
model.add(LSTM(32, batch_input_shape=(batch_size, n_prev, num_classes), unit_forget_bias=True, return_sequences=True, stateful=True))
model.add(Dropout(0.2))
model.add(LSTM(32, batch_input_shape=(batch_size, n_prev, num_classes), unit_forget_bias=True, return_sequences=True, stateful=True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(1, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

n_epoch = 40

print('Training')
numIteration = len(X)//batch_size
for i in range(n_epoch):
    print('Epoch', i, '/', n_epoch)
    for j in range(numIteration):
        print('Batch', j, '/',numIteration,'Epoch', i)
        x = X[j*batch_size:j*batch_size+batch_size,]
        y = dataY[j*batch_size:j*batch_size+batch_size]
        print(y) # list of 100 labels, e.g. [[12], [124], [534], ...]
        y = to_categorical(y)
        print(x.shape) #(100, 5, 6)
        print(y.shape) #(100, 4700) (why 4700?)
        y = numpy.reshape(y, (y.shape[0], y.shape[1], 1))
        print(y.shape) #(100, 4700, 1)
        model.train_on_batch(x, y)
    model.reset_states()

My dataset looks like this:

"time_date","name","timezone","col1","col2","user_id"
1402,"Sugar","Chicago",1,1,3012
1402,"Milk","Chicago",1,1,3012
1802,"Tomatoes","Chicago",1,1,3012
1802,"Cucumber","Chicago",1,1,3012
...

I have several questions about this:

  • Why does to_categorical create a (100, 4700) shape? Where does the 4700 come from? (See the sketch after this list.)
  • If it really is that large, how can I pass that shape on to the Dense layer?
  • Is this the right way to split the data? I have roughly 400k rows.
  • Is the Dense layer configured correctly, given that I want a single result out of it?
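
For the first question, here is a minimal sketch that seems to reproduce the width (my assumption, based on Keras's documented default of inferring num_classes as max(label) + 1 when it is not passed; the label ids below are made up):

from keras.utils.np_utils import to_categorical
import numpy

# Hypothetical label ids; since char_to_int maps each name to a row index,
# a batch can contain ids as large as len(dataset) - 1.
y = numpy.array([12, 124, 534, 4699])
print(to_categorical(y).shape)                    # (4, 4700): width = max(y) + 1
print(to_categorical(y, num_classes=4700).shape)  # same shape, width made explicit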

My main goal is, based on up to 5 consecutive entries like these:

 [[0,0,0,0,0,0], 
  [0,0,0,0,0,0], 
  [0,0,0,0,0,0], 
  [1402,"Sugar","Chicago",1,1,3012], 
  [1402,"Milk","Chicago",1,1,3012]]

to return 1 prediction based on the second column; in this case, "Tomatoes" (the third example in my dataset).
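
To make the intent concrete, this is roughly the model shape I think I am after (only a sketch of my intent, reusing the variables from the code above; the final return_sequences=False and the Dense width of one unit per distinct name are assumptions on my part, not working code):

from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM

n_names = len(set(dataset[:, 1]))  # assumed target vocabulary: distinct names

model = Sequential()
model.add(LSTM(32, batch_input_shape=(batch_size, n_prev, num_classes),
               return_sequences=True, stateful=True))
model.add(Dropout(0.2))
model.add(LSTM(32, stateful=True))  # no return_sequences: one output per sequence
model.add(Dense(n_names, activation='softmax'))  # one probability per possible name
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Targets would then be to_categorical(y_batch, num_classes=n_names),
# i.e. shape (batch_size, n_names), matching the model output.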

Here is the actual error: ValueError: Error when checking target: expected time_distributed_1 to have shape (100, 5, 1) but got array with shape (100, 4700, 1)

I hope everything is clear. Thank you all.

Edit 1

The output of

for col in y: print(len(set(y[col])))

is:
[[[  2.00200000e+03   4.27100000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  2.00200000e+03   4.27200000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  2.00200000e+03   4.32400000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  2.00200000e+03   4.29100000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  2.00200000e+03   4.27500000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]]

 [[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
     0.00000000e+00   0.00000000e+00]
  [  1.40500000e+03   9.18000000e+02   1.18000000e+03   1.00000000e+00
     3.30000000e+01   3.14100000e+03]
  [  1.40500000e+03   2.95000000e+03   1.18000000e+03   1.00000000e+00
     3.30000000e+01   3.14100000e+03]
  [  1.80500000e+03   5.39000000e+02   1.18000000e+03   1.00000000e+00
     3.30000000e+01   3.14100000e+03]
  [  1.80500000e+03   1.10500000e+03   1.18000000e+03   1.00000000e+00
     3.30000000e+01   3.14100000e+03]]

 [[  1.00400000e+03   4.30700000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  1.00400000e+03   4.68600000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  1.00400000e+03   4.30900000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  1.00400000e+03   4.32600000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  1.00400000e+03   4.69200000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]]

 ...,
 [[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
     0.00000000e+00   0.00000000e+00]
  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
     0.00000000e+00   0.00000000e+00]
  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
     0.00000000e+00   0.00000000e+00]
  [  8.04000000e+02   1.09100000e+03   1.18000000e+03   1.00000000e+00
     3.30000000e+01   3.14100000e+03]
  [  1.20400000e+03   1.10200000e+03   1.18000000e+03   1.00000000e+00
     3.30000000e+01   3.14100000e+03]]

 [[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
     0.00000000e+00   0.00000000e+00]
  [  8.03000000e+02   4.65000000e+03   4.70000000e+03   1.00000000e+00
     3.60000000e+01   3.64100000e+03]
  [  1.00300000e+03   4.42800000e+03   4.70000000e+03   1.00000000e+00
     3.60000000e+01   3.64100000e+03]
  [  1.20300000e+03   4.55500000e+03   4.70000000e+03   1.00000000e+00
     3.60000000e+01   3.64100000e+03]
  [  1.80300000e+03   4.68800000e+03   4.70000000e+03   1.00000000e+00
     3.60000000e+01   3.64100000e+03]]

 [[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
     0.00000000e+00   0.00000000e+00]
  [  1.00100000e+03   4.32600000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  1.20100000e+03   4.33100000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  1.20100000e+03   4.33200000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]
  [  1.20100000e+03   4.33300000e+03   4.33800000e+03   1.00000000e+00
     3.40000000e+01   3.52500000e+03]]]

0 Answers

No answers yet.