I have been working on a Kaggle competition, but my tweaks to the model rarely help.
I have tried an LSTM model, but the accuracy is still low. I one-hot encoded the dataset and then converted the text into token sequences.
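Roughly, the one-hot encoding step looks like the sketch below; it presumably applies to the labels, since the model further down ends in a 5-unit softmax trained with categorical cross-entropy. The names `train_labels` and `test_labels` are placeholders, not from my actual code.
from tensorflow.keras.utils import to_categorical

# Placeholder names: train_labels / test_labels are assumed to hold integer class ids 0..4
# (the model below has a 5-unit softmax output).
y_train = to_categorical(train_labels, num_classes=5)
y_test = to_categorical(test_labels, num_classes=5)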
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

max_features = 10000   # vocabulary size; assumed here to match the Embedding input_dim below
maxlen = 9             # padded sequence length; assumed here to match the Embedding input_length below

# Build the vocabulary over both the train and test text
tok = Tokenizer(num_words=max_features, filters='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
tok.fit_on_texts(list(X_train) + list(X_test))
x_train = tok.texts_to_sequences(X_train)
x_test = tok.texts_to_sequences(X_test)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

# Reserve indices for the special tokens
word_index = tok.word_index
word_index['PAD'] = 0
word_index['START'] = 1
word_index['UNX'] = 2
print('Found %s unique tokens' % len(tok.word_index))
print('Average train sequence length: {}'.format(np.mean(list(map(len, x_train)), dtype=int)))
print('Average test sequence length: {}'.format(np.mean(list(map(len, x_test)), dtype=int)))

# Pad/truncate every sequence to the same length
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
The model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, Bidirectional, LSTM, Dense
from tensorflow.keras.callbacks import TensorBoard

NAME = "bilstm"  # placeholder run name for the TensorBoard log directory

model = Sequential()
model.add(Embedding(10000, 128, input_length=9))   # vocab of 10000, 128-dim embeddings, sequences of length 9
model.add(SpatialDropout1D(0.4))
model.add(Bidirectional(LSTM(128, dropout=0.4, recurrent_dropout=0.3)))
model.add(Dense(128, activation='relu'))           # input size is inferred from the BiLSTM output
model.add(Dense(128, activation='relu'))
model.add(Dense(5, activation='softmax'))          # 5 output classes

tensorboard = TensorBoard(log_dir="log/{}".format(NAME))   # forward slash avoids the invalid escape sequence
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
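For completeness, training looks roughly like this (a sketch; `y_train`, the epoch count, and the batch size are assumptions, not from my actual run):
history = model.fit(
    x_train, y_train,          # y_train is the one-hot encoded label matrix
    validation_split=0.1,      # hold out 10% of the training data for validation
    epochs=10,
    batch_size=64,
    callbacks=[tensorboard],
)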
When one label is far more frequent than the others, the accuracy is extremely low, topping out at 0.51.
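One way to confirm and compensate for that imbalance is to inspect the label counts and weight the loss per class. A minimal sketch, assuming the raw integer labels (0..4) are in a placeholder array `train_labels`:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

print(np.bincount(train_labels))   # samples per class; a heavily skewed count confirms the imbalance

# Weight the loss so rare classes count more than the dominant one
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(train_labels),
                               y=train_labels)
class_weight = dict(enumerate(weights))

model.fit(x_train, y_train, epochs=10, batch_size=64,
          class_weight=class_weight, callbacks=[tensorboard])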