Following up on my previous question, Keras LSTM Accuracy too high, I realized I can't train on my GPU because to_categorical raises a MemoryError, so after some research I figured out that I need to train in batches and split my dataset accordingly. Now I'm stuck and can't find an answer anywhere. Here is my code:
import numpy
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout, TimeDistributed
from keras.preprocessing.sequence import pad_sequences
from pandas import read_csv
import simplejson
from keras.utils.np_utils import to_categorical
numpy.random.seed(7)
dataset = read_csv("dataset_6_cols_short.csv", delimiter=",", quotechar='"').values
# map each name / timezone string to an integer index
char_to_int = dict((c, i) for i, c in enumerate(dataset[:,1]))
char_to_int_timezone = dict((c, i) for i, c in enumerate(dataset[:,2]))
f = open('char_to_int_v2.txt', 'w')
simplejson.dump(char_to_int, f)
f.close()
num_classes = 6
# Length of sequence to predict
seq_length = 1
max_len = 5
dataX = []
dataY = []
# build sequences of random length (up to max_len) and take the name of the following row as the target
for i in range(0, len(dataset) - seq_length, 1):
    start = numpy.random.randint(len(dataset)-2)
    end = numpy.random.randint(start, min(start+max_len, len(dataset)-1))
    sequence_in = dataset[start:end+1]
    sequence_out = dataset[end + 1]
    dataX.append([[char[0], char_to_int[char[1]], char_to_int_timezone[char[2]], char[3], char[4], char[5]] for char in sequence_in])
    dataY.append([char_to_int[sequence_out[1]]])
X = pad_sequences(dataX, maxlen=max_len, dtype='float32')
X = numpy.reshape(X, (X.shape[0], max_len, num_classes))
batch_size = 100
n_nxt = 1
n_prev = 5
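# stateful LSTM stack: fixed batch of 100 sequences, n_prev=5 timesteps, 6 features per timestep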
model = Sequential()
model.add(LSTM(32, batch_input_shape=(batch_size, n_prev, num_classes), unit_forget_bias=True, return_sequences=True, stateful=True))
model.add(Dropout(0.2))
model.add(LSTM(32, batch_input_shape=(batch_size, n_prev, num_classes), unit_forget_bias=True, return_sequences=True, stateful=True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(1, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
n_epoch = 40
print('Training')
numIteration = len(X)//batch_size
for i in range(n_epoch):
    print('Epoch', i, '/', n_epoch)
    for j in range(numIteration):
        print('Batch', j, '/', numIteration, 'Epoch', i)
        x = X[j*batch_size:j*batch_size+batch_size,]
        y = dataY[j*batch_size:j*batch_size+batch_size]
        print(x) # array of 100 results [[12], [124], [534], etc...]
        y = to_categorical(y)
        print(x.shape) # (100, 5, 6)
        print(y.shape) # (100, 4700) (why 4700?)
        y = numpy.reshape(y, (y.shape[0], y.shape[1], 1))
        print(y.shape) # (100, 4700, 1)
        model.train_on_batch(x, y)
    model.reset_states()
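As far as I can tell, when num_classes isn't passed, to_categorical sizes the one-hot vectors from the largest label value it sees in that call, so I assume the 4700 comes from my largest name index plus one. A tiny illustration with made-up labels (not my real data):

from keras.utils.np_utils import to_categorical
# made-up label lists just to show how the one-hot width is chosen
print(to_categorical([0, 1, 2]).shape)                    # (3, 3)
print(to_categorical([0, 1, 4699]).shape)                 # (3, 4700) -- one large index widens every row
print(to_categorical([0, 1, 2], num_classes=4700).shape)  # (3, 4700) -- fixed width regardless of batch content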
My dataset looks like this:
"time_date","name","timezone","col1","col2","user_id"
1402,"Sugar","Chicago",1,1,3012
1402,"Milk","Chicago",1,1,3012
1802,"Tomatoes","Chicago",1,1,3012
1802,"Cucumber","Chicago",1,1,3012
...
I have several questions about this.
My main goal is, given up to 5 consecutive entries like these:
[[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[1402,"Sugar","Chicago",1,1,3012],
[1402,"Milk","Chicago",1,1,3012]]
to return 1 prediction based on the second column, in this case "Tomatoes" (the third entry in my dataset).
Here is the actual error: ValueError: Error when checking target: expected time_distributed_1 to have shape (100, 5, 1) but got array with shape (100, 4700, 1)
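If I read the error right, the expected (100, 5, 1) is simply the model's own output shape, since TimeDistributed(Dense(1)) sits on top of a return_sequences=True LSTM with batch_input_shape=(100, 5, 6); a quick check like the following should print that shape:

print(model.output_shape)  # (100, 5, 1): batch_size, n_prev timesteps, 1 unit from Dense(1)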
I hope everything is clear. Thanks everyone.
Edit 1
for col in y: print(len(set(y[col])))
[[[ 2.00200000e+03 4.27100000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 2.00200000e+03 4.27200000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 2.00200000e+03 4.32400000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 2.00200000e+03 4.29100000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 2.00200000e+03 4.27500000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]]
[[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 1.40500000e+03 9.18000000e+02 1.18000000e+03 1.00000000e+00
3.30000000e+01 3.14100000e+03]
[ 1.40500000e+03 2.95000000e+03 1.18000000e+03 1.00000000e+00
3.30000000e+01 3.14100000e+03]
[ 1.80500000e+03 5.39000000e+02 1.18000000e+03 1.00000000e+00
3.30000000e+01 3.14100000e+03]
[ 1.80500000e+03 1.10500000e+03 1.18000000e+03 1.00000000e+00
3.30000000e+01 3.14100000e+03]]
[[ 1.00400000e+03 4.30700000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 1.00400000e+03 4.68600000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 1.00400000e+03 4.30900000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 1.00400000e+03 4.32600000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 1.00400000e+03 4.69200000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]]
...,
[[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 8.04000000e+02 1.09100000e+03 1.18000000e+03 1.00000000e+00
3.30000000e+01 3.14100000e+03]
[ 1.20400000e+03 1.10200000e+03 1.18000000e+03 1.00000000e+00
3.30000000e+01 3.14100000e+03]]
[[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 8.03000000e+02 4.65000000e+03 4.70000000e+03 1.00000000e+00
3.60000000e+01 3.64100000e+03]
[ 1.00300000e+03 4.42800000e+03 4.70000000e+03 1.00000000e+00
3.60000000e+01 3.64100000e+03]
[ 1.20300000e+03 4.55500000e+03 4.70000000e+03 1.00000000e+00
3.60000000e+01 3.64100000e+03]
[ 1.80300000e+03 4.68800000e+03 4.70000000e+03 1.00000000e+00
3.60000000e+01 3.64100000e+03]]
[[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 1.00100000e+03 4.32600000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 1.20100000e+03 4.33100000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 1.20100000e+03 4.33200000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]
[ 1.20100000e+03 4.33300000e+03 4.33800000e+03 1.00000000e+00
3.40000000e+01 3.52500000e+03]]]