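I'm trying to teach myself Keras, so I wrote a simple NN (really just a matrix transformation) that maps a set of integer-encoded words to pretrained embeddings of some documents associated with those words. To produce the integer encoding I used Keras's Tokenizer ("texts_to_matrix"). The tokenizer arranges the words into a dictionary ("word_index").
After training the NN, I get a weight matrix whose number of rows equals my vocabulary size. My question is: is it guaranteed that the rows of the weight matrix are ordered strictly according to the indices in the tokenizer's word_index? That would mean, for example, that the weights in row 3 correspond to the word with index 3 in the tokenizer's dictionary.
Here is my toy example: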
import pandas as pd
import numpy as np
# some random "documents" and embeddings
docs = ['machine learning','deep learning','artificial intelligence','supervised learning']
emds = np.random.rand(4,5).tolist()
df = pd.DataFrame({'searches':docs,'mean_embed':emds})
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
search_train = df['searches']
upc_embed_train = df['mean_embed']
# there are 6 distinct words in the documents
vocab_size = 6
t = Tokenizer(num_words = vocab_size,char_level=False)
t.fit_on_texts(search_train)
print("word index = ",t.word_index)
This gives us:
word index = {'learning':1,'machine':2,'deep':3,'artificial':4,'intelligence':5,'supervised':6}
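As far as I can tell, the indices are assigned by descending word frequency (ties keep their first-seen order), and `num_words` keeps only the words whose index is below that value. The plain-Python sketch below mimics that reading without Keras. It is my assumption about the behaviour, not the library's code, and `kept` and `dropped` are names I made up for illustration.

```python
from collections import Counter

docs = ['machine learning', 'deep learning',
        'artificial intelligence', 'supervised learning']

# Count every lower-cased, whitespace-split token across all documents.
counts = Counter(word for doc in docs for word in doc.lower().split())

# Rank by descending frequency, starting at 1 (index 0 is left unused).
# most_common() keeps first-seen order among words with equal counts.
word_index = {word: rank
              for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Assumed cutoff rule: only indices strictly below num_words survive.
num_words = 6
kept = {w: i for w, i in word_index.items() if i < num_words}
dropped = [w for w, i in word_index.items() if i >= num_words]

print("word index =", word_index)
print("kept =", kept)
print("dropped =", dropped)
```

If this reading is right, 'supervised' (index 6) never reaches the encoded matrix when `num_words = 6`, so a value of `len(t.word_index) + 1` may be what I actually want for `vocab_size`.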
# integer encode documents
encoded_docs_train = t.texts_to_matrix(search_train, mode='count')
output_emb_train = np.array(upc_embed_train)
output_emb_train2 = np.matrix(upc_embed_train.tolist())
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Flatten, Embedding, LSTM
model = Sequential()
# the 5 is to match the dimensions of the embedding vector
model.add(Dense(5,input_dim=vocab_size, activation='linear',use_bias=True))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
model.fit(encoded_docs_train,output_emb_train2)
tmp1 = model.layers[0].get_weights()[0]
tmp2 = model.layers[0].get_weights()[1]
# now can we guarantee that weight 2 corresponds to
# word_index with value =2 ("machine"), etc?
value_of_interest = 2
for word, counter in t.word_index.items():
    if counter == value_of_interest:
        theword = word
# subtract 1 because the count starts with 0 in the np array
weight_for_word = tmp1[(value_of_interest-1),:]
print("word = ",theword,", its weight ",weight_for_word)
It prints:
word =  machine , its weight  [-0.69765383 -0.63167644 0.62771523 -0.68510187 0.5576754]
I couldn't find anything in the Keras documentation that convinces me this is actually the case, so any help would be greatly appreciated.
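One sanity check I could run is sketched below in plain Python, so it needs neither Keras nor training. It rests on two assumptions: a linear Dense layer computes `x @ W + b`, and `texts_to_matrix(mode='count')` puts the count of the word with index j in column j, leaving column 0 empty. `dense_linear` and `count_row` are hypothetical helpers, and the weights are made up so that each row is easy to recognise.

```python
def dense_linear(x, weights, bias):
    """Return x @ weights + bias for a single input row x."""
    return [sum(x[i] * weights[i][k] for i in range(len(x))) + bias[k]
            for k in range(len(bias))]


def count_row(text, word_index, num_words):
    """Build a count vector whose column j holds the count of word index j.

    This is the layout I assume texts_to_matrix(mode='count') produces;
    column 0 is never filled because word indices start at 1.
    """
    row = [0.0] * num_words
    for word in text.lower().split():
        j = word_index.get(word)
        if j is not None and j < num_words:
            row[j] += 1.0
    return row


word_index = {'learning': 1, 'machine': 2, 'deep': 3,
              'artificial': 4, 'intelligence': 5, 'supervised': 6}
num_words = 6

# Made-up weights: every entry of row j equals j, and the bias is zero.
weights = [[float(j)] * 5 for j in range(num_words)]
bias = [0.0] * 5

x = count_row('machine', word_index, num_words)
y = dense_linear(x, weights, bias)

print("input row =", x)
print("output    =", y)
print("matches row 2:", y == weights[2], "| matches row 1:", y == weights[1])
```

If the layout assumption holds, row j of the weights (not row j-1, as in my lookup above) is the one tied to word index j, so the subtraction in my example may be off by one. On the real model, the same check would be to compare `model.predict(t.texts_to_matrix(['machine'], mode='count')) - tmp2` against the rows of `tmp1`.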