Following Mikolov's paper, I am implementing the DM model (Distributed Memory, from doc2vec) in Keras.
The documents we have to deal with are quite particular: we have many distinct documents (about 20,000) but very few distinct words (only 100).
When I use the document embeddings to classify the documents, it does not seem to work...
I don't know whether this comes from the particular data or from the way I coded DM.
Any clue would be greatly appreciated!
Here is my code:
## Libraries
import numpy as np
import pandas as pd
import keras.backend as K
from keras.models import Model
from keras.layers import Dense, Embedding, Lambda, concatenate, Input, Flatten
from keras.utils.data_utils import get_file
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras import optimizers
from keras.constraints import non_neg
import random as rd
## Parameters
seuil = 0.0001  # Mikolov subsampling threshold
loss = 'categorical_crossentropy'
dim = 20  # dimension of the word and document embeddings
lr = 0.005  # learning rate
epochs = 15
shuffle = True
steps_per_epoch = 10000
window_size = 7
seed = 123
rd.seed(seed)
V1 = 100  # number of words
V2 = 20000  # number of documents
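For reference, seuil is the subsampling threshold t from Mikolov's paper: an occurrence of a word w is kept with probability sqrt(t / f(w)), where f(w) is the relative frequency of w in the corpus. Roughly, the generator drops frequent words like this (a sketch; the helper name keep_word is mine):

def keep_word(f_w, t=seuil):
    # f_w: relative frequency of the word in the corpus
    # frequent words (f_w > t) are kept with probability sqrt(t / f_w)
    return rd.random() < min(1.0, (t / f_w) ** 0.5)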
Here is my corpus:
new_corpus =
[[2],
[33, 33, 11],
[8, 16],
[27, 10],
[13],
[],
[21, 21, 32, 10, 1],
[],
[1, 27],
... ]
The neural network used:
# context words: 2*window_size integer indices, each mapped to a dim-dimensional vector
input_words = Input((window_size*2,))
cbow_words = Embedding(input_dim=V1, output_dim=dim, input_length=window_size*2, embeddings_constraint=non_neg())(input_words)

# document: a one-hot vector of length V2, passed through its own Embedding
input_texts = Input((V2,))
cbow_texts = Embedding(input_dim=V2, output_dim=dim, input_length=V2, embeddings_constraint=non_neg())(input_texts)

# stack word and document embeddings along the sequence axis and average them
concat = concatenate([cbow_words, cbow_texts], axis=1)
lambd = Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim,))(concat)

output = Dense(V1+1, activation='softmax')(lambd)  # V1+1 because words are indexed from 1

cbow = Model(inputs=[input_words, input_texts], outputs=output)
cbow.compile(loss=loss, optimizer=optimizers.Adadelta(lr=lr, rho=0.95, epsilon=None, decay=0.0))
cbow.summary()
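For clarity, this is what I expect each layer to do to the shapes (batch dimension first):

# input_words : (None, 14)          2*window_size context word indices
# cbow_words  : (None, 14, 20)      one dim-20 vector per context word
# input_texts : (None, 20000)       the document as a one-hot vector
# cbow_texts  : (None, 20000, 20)   the Embedding maps each of the 20000 entries to a vector
# concat      : (None, 20014, 20)   word and document rows stacked along axis 1
# lambd       : (None, 20)          mean over the stacked rows
# output      : (None, 101)         softmax over the vocabulary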
It prints:
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_47 (InputLayer)           (None, 14)           0
__________________________________________________________________________________________________
input_48 (InputLayer)           (None, 20000)        0
__________________________________________________________________________________________________
embedding_47 (Embedding)        (None, 14, 20)       2000        input_47[0][0]
__________________________________________________________________________________________________
embedding_48 (Embedding)        (None, 20000, 20)    400000      input_48[0][0]
__________________________________________________________________________________________________
concatenate_24 (Concatenate)    (None, 20014, 20)    0           embedding_47[0][0]
                                                                 embedding_48[0][0]
__________________________________________________________________________________________________
lambda_24 (Lambda)              (None, 20)           0           concatenate_24[0][0]
__________________________________________________________________________________________________
dense_24 (Dense)                (None, 101)          2121        lambda_24[0][0]
==================================================================================================
Here is how the data are fed to the model through a generator:
generatedData = generate_data(new_corpus, window_size, V1)
# quick sanity check of what the generator yields
for x, y in generatedData:
    print(x, y)
The first element represents the context words.
The second element represents the document's ID as a one-hot vector.
The third element represents the word to predict as a one-hot vector.
Here are the first two samples:
[array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4]]), array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)] [[... 0. 0. 0. 0. 1. ... 0. 0. 0. 0.]]
[array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 23]]), array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)] [[... 0. 0. 1. ... 0. 0. 0. 0.]]
...
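generate_data itself is along the following lines (a simplified sketch of my generator; the subsampling step with seuil is left out here):

def generate_data(corpus, window_size, V1):
    # yields one ([context words, document one-hot], target word) sample at a time
    V2 = len(corpus)
    while True:
        for doc_id, doc in enumerate(corpus):
            doc_onehot = np.zeros((1, V2), dtype=np.float32)
            doc_onehot[0, doc_id] = 1.
            for pos, word in enumerate(doc):
                context = doc[max(0, pos - window_size):pos] + doc[pos + 1:pos + window_size + 1]
                # left-pad the context with zeros to the fixed length window_size*2
                x_words = sequence.pad_sequences([context], maxlen=window_size * 2)
                y = np_utils.to_categorical([word], V1 + 1)
                yield [x_words, doc_onehot], y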
Then I fit the model:
cbow.fit_generator(generatedData, steps_per_epoch=steps_per_epoch, epochs=epochs, shuffle=shuffle)
The model converges.
E1 = pd.DataFrame(cbow.layers[2].get_weights()[0])  # word embeddings
E2 = pd.DataFrame(cbow.layers[3].get_weights()[0])  # document embeddings
But in the end, the documents are classified no better than at random...
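By "classify" I mean, for instance, something like the following (a hypothetical sketch using scikit-learn's KMeans with an assumed number of classes, not my exact code):

from sklearn.cluster import KMeans

doc_vectors = E2.values  # the document embeddings, shape (20000, 20)
clusters = KMeans(n_clusters=5, random_state=seed).fit_predict(doc_vectors)  # n_clusters is assumed
# comparing these clusters with the known document classes shows near-random agreement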