How to correctly use get_keras_embedding() in Gensim's Word2Vec?

Asked: 2018-07-24 07:23:46

Tags: python keras gensim word2vec word-embedding

I am trying to build a translation network using an embedding and an RNN. I have trained a Gensim Word2Vec model and it learns word associations very well. However, I cannot figure out how to correctly add the layer to a Keras model. (And how to "reverse-embed" the output. But that is another question, and it has already been answered: you can't, by default.)

In Word2Vec, when you pass in a string, e.g. model.wv['hello'], you get a vector representation of the word. However, I believe that the keras.layers.Embedding layer returned by Word2Vec's get_keras_embedding() takes one-hot/tokenized input instead of string input, and the documentation offers no explanation of what the appropriate input is. I cannot figure out how to obtain the one-hot/tokenized vectors of the vocabulary that correspond 1-to-1 with the embedding layer's input.
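
For context, a minimal sketch of the mismatch (the variable names here are illustrative only):

import numpy as np
# A string lookup in gensim returns the vector directly:
vec = word2vec_model.wv['hello']  # a 300-dimensional numpy array
# The layer from get_keras_embedding() instead consumes batches of integer
# token indices, e.g. an int array of shape (batch_size, sentence_length):
embedding_layer = word2vec_model.wv.get_keras_embedding(train_embeddings=False)
dummy_batch = np.zeros((1, 64), dtype='int32')  # indices, not strings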

More detailed explanation below:

Currently, my workaround is to apply the embedding outside Keras before feeding the data to the network. Is there any harm in doing this? I would set the embedding to non-trainable anyway. So far I have noticed that memory use is extremely inefficient (50GB even before declaring the Keras model, for a collection of 64-word-long sentences), because the padded inputs and the weights have to be loaded outside the model. Maybe a generator can help.
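
As an aside, a generator along these lines could avoid materializing the fully embedded arrays up front. This is only a sketch: embed_batch is a hypothetical callable standing in for whatever per-batch embedding the workaround applies.

def batch_generator(word_lists, embed_batch, batch_size=2000):
    # Embed one batch at a time instead of holding the whole
    # (num_samples, 64, 300) arrays in memory at once.
    while True:
        for start in range(0, len(word_lists), batch_size):
            batch = word_lists[start:start + batch_size]
            yield embed_batch(batch)  # -> (x_batch, y_batch) numpy arrays

# keras_model.fit_generator(batch_generator(array_of_word_lists, embed_batch),
#                           steps_per_epoch=len(array_of_word_lists) // 2000,
#                           epochs=100)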

Below is my code. The inputs are padded to 64 words long. The Word2Vec embedding has 300 dimensions. There are probably plenty of mistakes here from repeated experiments trying to make the embedding work. Suggestions are welcome.

import gensim
word2vec_model = gensim.models.Word2Vec.load("word2vec.model")
from keras.models import Sequential
from keras.layers import Embedding, GRU, Input, Flatten, Dense, TimeDistributed, Activation, PReLU, RepeatVector, Bidirectional, Dropout
from keras.optimizers import Adam, Adadelta
from keras.callbacks import ModelCheckpoint
from keras.losses import sparse_categorical_crossentropy, mean_squared_error, cosine_proximity

keras_model = Sequential()
keras_model.add(word2vec_model.wv.get_keras_embedding(train_embeddings=False))
keras_model.add(Bidirectional(GRU(300, return_sequences=True, dropout=0.1, recurrent_dropout=0.1, activation='tanh')))
keras_model.add(TimeDistributed(Dense(600, activation='tanh')))
# keras_model.add(PReLU())
# ^ For some reason I get an error when I add the activation 'outside' as its own layer:
# int() argument must be a string, a bytes-like object or a number, not 'NoneType'
# But keras_model.add(Activation('relu')) works.
keras_model.add(Dense(source_arr.shape[1] * source_arr.shape[2]))
# size = max-output-sentence-length * embedding-dimensions to learn the embedding vector and find the nearest word in word2vec_model.wv.similar_by_vector() afterwards.
# Alternatively one can use Dense(vocab_size) and train the network to output one-hot categorical words instead.
# Remember to change Keras loss to sparse_categorical_crossentropy.
# But this won't benefit from Word2Vec.

keras_model.compile(loss=mean_squared_error,
              optimizer=Adadelta(),
              metrics=['mean_absolute_error'])
keras_model.summary()
_________________________________________________________________ 
Layer (type)                 Output Shape              Param #   
================================================================= 
embedding_19 (Embedding)     (None, None, 300)         8219700   
_________________________________________________________________ 
bidirectional_17 (Bidirectio (None, None, 600)         1081800   
_________________________________________________________________ 
activation_4 (Activation)    (None, None, 600)         0         
_________________________________________________________________ 
time_distributed_17 (TimeDis (None, None, 600)         360600    
_________________________________________________________________ 
dense_24 (Dense)             (None, None, 19200)       11539200  
================================================================= 
Total params: 21,201,300 
Trainable params: 12,981,600 
Non-trainable params: 8,219,700
_________________________________________________________________
filepath="best-weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_absolute_error', verbose=1, save_best_only=True, mode='auto')
callbacks_list = [checkpoint]
keras_model.fit(array_of_word_lists, array_of_word_lists_AFTER_being_transformed_by_word2vec, epochs=100, batch_size=2000, shuffle=True, callbacks=callbacks_list, validation_split=0.2)

Which throws an error when I try to fit the model with text:

Train on 8000 samples, validate on 2000 samples
Epoch 1/100

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-865f8b75fbc3> in <module>()
      2 checkpoint = ModelCheckpoint(filepath, monitor='val_mean_absolute_error', verbose=1, save_best_only=True, mode='auto')
      3 callbacks_list = [checkpoint]
----> 4 keras_model.fit(array_of_word_lists, array_of_word_lists_AFTER_being_transformed_by_word2vec, epochs=100, batch_size=2000, shuffle=True, callbacks=callbacks_list, validation_split=0.2)

~/virtualenv/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1040                                         initial_epoch=initial_epoch,
   1041                                         steps_per_epoch=steps_per_epoch,
-> 1042                                         validation_steps=validation_steps)
   1043 
   1044     def evaluate(self, x=None, y=None,

~/virtualenv/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
    197                     ins_batch[i] = ins_batch[i].toarray()
    198 
--> 199                 outs = f(ins_batch)
    200                 if not isinstance(outs, list):
    201                     outs = [outs]

~/virtualenv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2659                 return self._legacy_call(inputs)
   2660 
-> 2661             return self._call(inputs)
   2662         else:
   2663             if py_any(is_tensor(x) for x in inputs):

~/virtualenv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
   2612                 array_vals.append(
   2613                     np.asarray(value,
-> 2614                                dtype=tensor.dtype.base_dtype.name))
   2615         if self.feed_dict:
   2616             for key in sorted(self.feed_dict.keys()):

~/virtualenv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    490 
    491     """
--> 492     return array(a, dtype, copy=False, order=order)
    493 
    494 

ValueError: could not convert string to float: 'hello'

Below is an excerpt from Rajmak that demonstrates how to use a tokenizer to convert words into the input for a Keras embedding layer.

tokenizer = Tokenizer(num_words=MAX_NB_WORDS) 
tokenizer.fit_on_texts(all_texts) 
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
...
indices = np.arange(data.shape[0]) # get sequence of row index 
np.random.shuffle(indices) # shuffle the row indexes 
data = data[indices] # shuffle data/product-titles/x-axis
...
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0]) 
x_train = data[:-nb_validation_samples]
...
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
  

The Keras embedding layer can be obtained from Gensim Word2Vec via the word2vec.get_keras_embedding(train_embeddings=False) method, or constructed as shown below. The null word embeddings indicate the number of words not found in our pre-trained vectors (Google News in this case); here these could be unique words for the brands.

from keras.layers import Embedding
word_index = tokenizer.word_index
nb_words = min(MAX_NB_WORDS, len(word_index))+1

embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

embedding_layer = Embedding(embedding_matrix.shape[0], # or len(word_index) + 1
                            embedding_matrix.shape[1], # or EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Flatten
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation

model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(300, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(150, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(75, 3, padding='valid',activation='relu',strides=2))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(150,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='sigmoid'))

model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])

model.summary()

model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)
score = model.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Here the embedding_layer is explicitly created using:

for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)

However, if we use get_keras_embedding(), the embedding matrix is already built and fixed. I do not know how to force every word_index of the tokenizer to match the input indices that get_keras_embedding()'s Keras embedding layer expects.
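
One conceivable workaround (an untested sketch, not from the original post) is to overwrite the tokenizer's learned indices with gensim's internal ones before calling texts_to_sequences(), so the emitted indices line up with the rows of get_keras_embedding()'s weight matrix:

tokenizer.word_index = {word: word2vec.vocab[word].index
                        for word in tokenizer.word_index
                        if word in word2vec.vocab}
tokenizer.num_words = None  # otherwise remapped indices >= MAX_NB_WORDS get dropped
sequences = tokenizer.texts_to_sequences(texts)
# Caveat: gensim indices start at 0, which collides with the 0 that
# Keras pad_sequences conventionally reserves for padding.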

So, what is the correct way to use Word2Vec's get_keras_embedding() in Keras?

1 Answer:

Answer 0 (score: 4)

So I found the solution. The tokenized word index can be found via word2vec_model.wv.vocab[word].index, and the reverse lookup via word2vec_model.wv.index2word[word_index]. get_keras_embedding() takes the former as its input.
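
A quick round trip illustrates the two mappings (assuming 'hello' is in the vocabulary):

idx = word2vec_model.wv.vocab['hello'].index   # word -> integer index
word = word2vec_model.wv.index2word[idx]       # integer index -> back to 'hello'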

I do the conversion as follows:

import numpy

source_word_indices = []
for i in range(len(array_of_word_lists)):
    source_word_indices.append([])
    for j in range(len(array_of_word_lists[i])):
        word = array_of_word_lists[i][j]
        if word in word2vec_model.wv.vocab:
            word_index = word2vec_model.wv.vocab[word].index
            source_word_indices[i].append(word_index)
        else:
            # Do something. For example, leave it blank or replace with the padding character's index.
            source_word_indices[i].append(padding_index)
source = numpy.array(source_word_indices)

And then finally keras_model.fit(source, ...
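
To double-check that the indices line up with the layer's weights, one can probe the embedding directly; a small sketch, again assuming 'hello' is in the vocabulary:

import numpy as np
from keras.models import Sequential

probe = Sequential()
probe.add(word2vec_model.wv.get_keras_embedding(train_embeddings=False))
i = word2vec_model.wv.vocab['hello'].index
out = probe.predict(np.array([[i]]))  # shape: (1, 1, 300)
assert np.allclose(out[0, 0], word2vec_model.wv['hello'])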