NLP model accuracy stays at 0.5098 during training

Asked: 2018-07-13 08:00:00

Tags: machine-learning keras nlp lstm kaggle

I built a two-layer LSTM model (a Keras model) for Kaggle's movie review dataset: Dataset

While training the model, the accuracy stays at 0.5098 in every epoch.

I then thought the model might not be learning long-range dependencies, so I replaced the LSTM layers with bidirectional LSTMs. However, the accuracy during training was still 0.5098 in every epoch. I trained the model on a CPU for 8 hours / 35 epochs and then stopped training.
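An accuracy that never moves from exactly the same value is often a sign that the network only ever predicts a single class, so one quick sanity check is to compare 0.5098 against the label distribution. A minimal sketch of that check (assuming, as in the code below, that the sentiment labels are the fourth column of train.tsv):

import pandas as pd

train_data = pd.read_table('train.tsv')
# Fraction of each sentiment class; a majority class near 0.51 would
# explain an accuracy frozen at 0.5098.
print(train_data.iloc[:, 3].value_counts(normalize=True))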

Code:

import pandas as pd
import numpy as np
import keras
import keras.backend as k
from sentiment_utils import *


def read_glove_vectors(path):
    # Parse the GloVe text file into {word: vector} and word<->index maps.
    with open(path, encoding='utf8') as f:
        words = set()
        word_to_vec_map = {}

        for line in f:
            line = line.strip().split()
            cur_word = line[0]
            words.add(cur_word)
            word_to_vec_map[cur_word] = np.array(line[1:], dtype=np.float64)

        # Index 0 is reserved for padding, so word indices start at 1.
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map


train_data = pd.read_table('train.tsv')
X_train = train_data.iloc[:, 2]   # Phrase column
Y_train = train_data.iloc[:, 3]   # Sentiment column (labels 0-4)


from sklearn.preprocessing import OneHotEncoder
# Series has no .reshape; go through the underlying NumPy array.
Y_train = Y_train.values.reshape(Y_train.shape[0], 1)
ohe = OneHotEncoder(categorical_features=[0])  # categorical_features is removed in scikit-learn >= 0.22
Y_train = ohe.fit_transform(Y_train).toarray()


maxLen = max(len(s.split()) for s in X_train)  # longest phrase, counted in words
words_to_index, index_to_words, word_to_vec_map = read_glove_vectors("glove/glove.6B.50d.txt")
m = X_train.shape[0]



def sentance_to_indices(X_train, words_to_index, maxLen, dash_index_list, keys):
    # Convert each phrase to a fixed-length row of word indices (0 = padding / OOV).
    m = X_train.shape[0]
    X_indices = np.zeros((m, maxLen))

    for i in range(m):
        # Phrases containing a dash are skipped and stay all-zero.
        if i in dash_index_list:
            continue

        sentance_words = X_train[i].lower().strip().split()

        j = 0
        for word in sentance_words:
            if word in keys:
                X_indices[i, j] = words_to_index[word]
            j += 1  # position advances even for out-of-vocabulary words

    return X_indices






def pretrained_embedding_layer(word_to_vec_map, words_to_index):
    # Build a frozen Embedding layer initialised with the GloVe vectors.
    emb_dim = word_to_vec_map['pen'].shape[0]   # 50 for glove.6B.50d
    vocab_size = len(words_to_index) + 1        # +1 because index 0 is reserved for padding
    emb_matrix = np.zeros((vocab_size, emb_dim))

    for word, index in words_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    emb_layer = keras.layers.embeddings.Embedding(vocab_size, emb_dim, trainable=False)

    # build() must be called before set_weights() on a layer that is not yet connected.
    emb_layer.build((None,))
    emb_layer.set_weights([emb_matrix])

    return emb_layer



def get_model(input_shape, word_to_vec_map, words_to_index):
    sentance_indices = keras.layers.Input(shape=input_shape, dtype='int32')
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, words_to_index)
    embeddings = embedding_layer(sentance_indices)

    # Three stacked bidirectional LSTMs; only the last one drops the time axis.
    X = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(embeddings)
    X = keras.layers.Dropout(0.5)(X)

    X = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(X)
    X = keras.layers.Dropout(0.5)(X)

    X = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=False))(X)
    X = keras.layers.Dropout(0.5)(X)

    X = keras.layers.Dense(5)(X)   # 5 sentiment classes
    X = keras.layers.Activation('softmax')(X)

    model = keras.models.Model(sentance_indices, X)

    return model



model = get_model((maxLen,), word_to_vec_map, words_to_index)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Indices of phrases that contain a dash; these are skipped during conversion.
dash_index_list = []
for i in range(m):
    if '-' in X_train[i]:
        dash_index_list.append(i)

# A set gives O(1) membership tests inside sentance_to_indices.
keys = set(word_to_vec_map.keys())

X_train_indices = sentance_to_indices(X_train, words_to_index, maxLen, dash_index_list, keys)

model.fit(X_train_indices, Y_train, epochs=50, batch_size=32, shuffle=True)

1 Answer:

Answer 0 (score: 1)

I don't think the way you have defined your model architecture makes sense! Take a look at this example of using an LSTM on IMDB movie reviews in the Keras GitHub repository: Trains an LSTM model on the IMDB sentiment classification task.
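For comparison, a minimal sketch in the spirit of that IMDB example, reusing the question's pretrained_embedding_layer helper (the single 128-unit LSTM is illustrative, not taken from the example):

def get_simple_model(input_shape, word_to_vec_map, words_to_index):
    sentence_indices = keras.layers.Input(shape=input_shape, dtype='int32')
    embeddings = pretrained_embedding_layer(word_to_vec_map, words_to_index)(sentence_indices)
    X = keras.layers.LSTM(128)(embeddings)               # one recurrent layer is enough to start
    X = keras.layers.Dropout(0.5)(X)
    X = keras.layers.Dense(5, activation='softmax')(X)   # 5 sentiment classes
    return keras.models.Model(sentence_indices, X)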