Doc2Vec Distributed Memory model of Paragraph Vectors (PV-DM) does not classify

Asked: 2018-10-03 10:21:06

Tags: python machine-learning keras embedding doc2vec

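I am implementing the Distributed Memory model (PV-DM) in Keras, following Mikolov's paper.

The documents we are dealing with are quite unusual: there are many distinct documents (about 20,000) but very few distinct words (only 100).

When I use the document embeddings to classify the documents, it does not seem to work...

I don't know whether this is due to the particular data or to the way I coded the DM.

Any clue would be appreciated!

Here is my code: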

## Libraries

import numpy as np
import pandas as pd
import keras.backend as K
from keras.models import Model
from keras.layers import Dense, Embedding, Lambda,concatenate, Input, Flatten
from keras.utils.data_utils import get_file
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras import optimizers
from keras.constraints import non_neg
import random as rd


## Parameters

seuil = 0.0001 # Mikolov threshold
loss = 'categorical_crossentropy'
dim=20 # dimension of words and documents embeddings
lr = 0.005 # learning rate
epochs = 15
shuffle=True
steps_per_epoch = 10000
window_size = 7
seed=123
rd.seed(seed)

V1 = 100 # number of words
V2 = 20000 # number of documents

Here is my corpus:

new_corpus = [[2],
              [33, 33, 11],
              [8, 16],
              [27, 10],
              [13],
              [],
              [21, 21, 32, 10, 1],
              [],
              [1, 27],
              ... ]

The neural network used:

input_words=Input((window_size*2,))
cbow_words = Embedding(input_dim=V1, output_dim=dim, input_length=window_size*2 ,embeddings_constraint=non_neg())(input_words)

input_texts=Input((V2,))
cbow_texts = Embedding(input_dim=V2, output_dim=dim, input_length=V2 ,embeddings_constraint=non_neg())(input_texts)

concat = concatenate([cbow_words,cbow_texts],axis=1)
lambd = Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim,))(concat)
output = Dense(V1+1, activation='softmax')(lambd) # V1+1 because words are indexed from 1

cbow = Model(inputs=[input_words,input_texts], outputs=output)

cbow.compile(loss=loss, optimizer=optimizers.Adadelta(lr=lr, rho=0.95, epsilon=None, decay=0.0))
cbow.summary()

cbow.summary() prints:

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_47 (InputLayer)           (None, 14)           0                                            
__________________________________________________________________________________________________
input_48 (InputLayer)           (None, 20000)        0                                            
__________________________________________________________________________________________________
embedding_47 (Embedding)        (None, 14, 20)       2160        input_47[0][0]                   
__________________________________________________________________________________________________
embedding_48 (Embedding)        (None, 20000, 20)    400000      input_48[0][0]                   
__________________________________________________________________________________________________
concatenate_24 (Concatenate)    (None, 20000, 20)    0           embedding_47[0][0]               
                                                                 embedding_48[0][0]               
__________________________________________________________________________________________________
lambda_24 (Lambda)              (None, 20)           0           concatenate_24[0][0]             
__________________________________________________________________________________________________
dense_24 (Dense)                (None, 101)          2121        lambda_24[0][0]                  
==================================================================================================
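
A note on the summary above: input_48 has shape (None, 20000) and embedding_48 outputs (None, 20000, 20), so the document branch performs 20000 lookups per sample. Keras's Embedding layer casts its input to integer row indices. If the document input really is a one-hot vector, as described further down, every value is 0 or 1, so only rows 0 and 1 of the document table are ever read, whichever document is fed in. The sketch below mimics that lookup in plain Python; `one_hot`, `rows_looked_up`, and the toy size `n_docs` are illustrative names, not part of the original code.

```python
n_docs = 6  # toy size for illustration; the question uses V2 = 20000


def one_hot(doc_id):
    """Encode a document ID as a one-hot vector of length n_docs."""
    return [1.0 if j == doc_id else 0.0 for j in range(n_docs)]


def rows_looked_up(inputs):
    """Mimic an embedding lookup: every input value is cast to an integer row index."""
    return sorted({int(v) for v in inputs})


# One-hot input: the values are only 0.0 and 1.0, so only rows 0 and 1 are read.
one_hot_rows = [rows_looked_up(one_hot(d)) for d in range(n_docs)]

# Integer-ID input: each document reads its own row.
id_rows = [rows_looked_up([d]) for d in range(n_docs)]

print(one_hot_rows)  # the same two rows for every document
print(id_rows)       # one distinct row per document
```

Under that assumption, feeding an integer document ID through Input((1,)) with input_length=1 would give each document its own trainable row. That is a possible fix to try, not something confirmed in the question.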

This is how the data is turned into a generator:

generatedData=generate_data(new_corpus, window_size, V1)

for x, y in generatedData:
    print(x,y)

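The first element is the context (word indices).

The second element is the document ID as a one-hot vector.

The third element is the word to predict, as a one-hot vector.

Here are the first two rows: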

[array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4]]), array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)] [[... 0. 0. 0. 0. 1. ... 0. 0. 0. 0.]]
[array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 23]]), array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)] [[... 0. 0. 1. .... 0. 0. 0. 0.]]
...
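
generate_data itself is not shown. As a rough sketch of what such a generator has to produce, the hypothetical function below (pv_dm_triples, not the author's code) walks a corpus and yields one (context, doc_id, target) triple per word, left-padding the context with 0 to a fixed length of 2 * window_size, which matches the zero-padded context arrays printed above. It assumes index 0 is reserved for padding and yields plain integers instead of one-hot vectors.

```python
def pv_dm_triples(corpus, window_size, pad=0):
    """Yield (context, doc_id, target) for every word of every document.

    context holds up to window_size words before and after the target,
    left-padded with `pad` to a fixed length of 2 * window_size.
    """
    for doc_id, doc in enumerate(corpus):
        for pos, target in enumerate(doc):
            before = doc[max(0, pos - window_size):pos]
            after = doc[pos + 1:pos + 1 + window_size]
            context = before + after
            context = [pad] * (2 * window_size - len(context)) + context
            yield context, doc_id, target


# Toy corpus in the same format as new_corpus (lists of word indices).
toy_corpus = [[2], [33, 33, 11], [8, 16], []]
triples = list(pv_dm_triples(toy_corpus, window_size=2))
for triple in triples:
    print(triple)
```

For fit_generator the triples would still have to be batched into NumPy arrays, and the generator would have to loop indefinitely, since Keras expects it to keep yielding batches across epochs. Note also that a one-word document only ever produces an all-padding context, and an empty document produces no training example at all.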

Then I fit the model:

cbow.fit_generator(generatedData,steps_per_epoch=steps_per_epoch,epochs=epochs,shuffle=shuffle)

The model converges.

E1=pd.DataFrame(cbow.layers[2].get_weights()[0]) # words embedding
E2=pd.DataFrame(cbow.layers[3].get_weights()[0]) # docs embedding

In the end, the documents are classified as if at random...

0 answers