I want to train an LSTM network to solve a tokenization problem.
My corpus looks like this (each text line is followed by a line of 0/1 labels of the same length, where 1 marks a boundary character such as a space or the sentence-final punctuation):
The computer is broken.
00010000000010010000001
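
To make sure every text/label pair stays aligned, I first run a quick sanity check over the file (just a sketch, using the same corpus_mini.txt as in the code below):

# Sanity check: lines alternate text / labels; each label line must
# match its text line in length and contain only '0' and '1'.
with open('corpus_mini.txt', encoding='utf-8') as f:
    lines = [line.rstrip('\n') for line in f]
for text, labels in zip(lines[::2], lines[1::2]):
    assert len(text) == len(labels), (text, labels)
    assert set(labels) <= {'0', '1'}, labels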
My experimental code in Python:
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.utils import np_utils
import numpy as np
np.random.seed(7)
maxlen = 100
filename = 'corpus_mini.txt'
# Odd lines are sentences, even lines are the matching 0/1 label strings.
data = [list(line.rstrip('\n')) for line in open(filename, encoding='utf-8')]
Xdata = data[::2]
Ydata = data[1::2]
chars = set()
for sentence in Xdata:
    for char in sentence:
        chars.add(char)
chars = sorted(list(chars))
char_to_int = dict((c, i) for i, c in enumerate(chars))
X = []
y = []
seq_length = 30
for s in range(0, len(Xdata)):
    sentence = Xdata[s]
    for i in range(0, len(sentence) - seq_length, 1):
        seq_in = sentence[i:i + seq_length]
        X.append([char_to_int[char] for char in seq_in])
        # The target for a window is the 0/1 label of the character just
        # past it; map it directly to a class index instead of going
        # through char_to_int, whose layout depends on the corpus.
        seq_out = Ydata[s][i + seq_length]
        y.append(int(seq_out))
# reshape X to be [samples, time steps, features]
X = np.reshape(X, (len(X), seq_length, 1))
# one hot encode the output variable
cat = np_utils.to_categorical(y)
model = Sequential()
model.add(LSTM(64, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, cat, epochs=12, batch_size=3)
I have tried many variations, but I cannot figure out what is wrong with my code. I cannot get it to learn even the spaces. Is it possible that this problem is simply not a good fit for an LSTM?
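
One variation I have been considering (only a sketch; the Embedding size, LSTM width, and batch size are arbitrary guesses, and it reuses Xdata, Ydata, chars, char_to_int, and maxlen from the code above) is to label every character of a sentence at once instead of using a sliding window, with return_sequences=True and a per-timestep sigmoid output:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# Shift character indices by 1 so that 0 is free to act as padding.
X_seqs = [[char_to_int[c] + 1 for c in s] for s in Xdata]
y_seqs = [[int(label) for label in labels] for labels in Ydata]

X_pad = pad_sequences(X_seqs, maxlen=maxlen, padding='post')
y_pad = pad_sequences(y_seqs, maxlen=maxlen, padding='post')
y_pad = y_pad[:, :, np.newaxis]  # (samples, timesteps, 1)

model = Sequential()
model.add(Embedding(input_dim=len(chars) + 1, output_dim=32, mask_zero=True))
model.add(LSTM(64, return_sequences=True))  # one output per character
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_pad, y_pad, epochs=12, batch_size=32)

With mask_zero=True the padded positions should be ignored by the loss, so whole sentences of different lengths can be trained in one batch. Would a per-character formulation like this be a better fit for tokenization, or is the sliding-window version fine in principle?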