Siamese network with LSTM for sentence similarity in Keras gives periodically the same result

Asked: 2017-09-28 09:46:42

Tags: python-3.x keras lstm word2vec sentence-similarity

I'm new to Keras and I'm trying to solve a sentence similarity task with a neural network in Keras. I use word2vec as the word embedding and a Siamese network to predict how similar two sentences are. The base network of the Siamese network is an LSTM, and to merge the two base networks I use a Lambda layer with a cosine similarity metric. As dataset I use SICK, which gives each pair of sentences a score from 1 (different) to 5 (very similar).

I created the network and it runs, but I have several doubts. First of all, I'm not sure the way I feed the sentences to the LSTM is right. I take the word2vec embedding for each word, build a single array per sentence, and pad it with zeros up to seq_len so all arrays have the same length. Then I reshape it this way: data_A = embedding_A.reshape((len(embedding_A), seq_len, feature_dim))

Besides that, I'm not sure my Siamese network is correct, because many predictions for different pairs are equal and the loss doesn't change much (from 0.3300 to 0.2105 over 10 epochs, and it doesn't change much more over 100 epochs).

Can someone help me find and understand my mistakes? Thanks a lot (and sorry for my bad English).

The relevant code:

def cosine_distance(vecs):
    #I'm not sure about this function too
    y_true, y_pred = vecs
    y_true = K.l2_normalize(y_true, axis=-1)
    y_pred = K.l2_normalize(y_pred, axis=-1)
    return K.mean(1 - K.sum((y_true * y_pred), axis=-1))

def cosine_dist_output_shape(shapes):
    shape1, shape2 = shapes
    print((shape1[0], 1))
    return (shape1[0], 1)

def contrastive_loss(y_true, y_pred):
    margin = 1
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))

def create_base_network(feature_dim,seq_len):

    model = Sequential()  
    model.add(LSTM(100, batch_input_shape=(1,seq_len,feature_dim),return_sequences=True))
    model.add(Dense(50, activation='relu'))    
    model.add(Dense(10, activation='relu'))
    return model


def siamese(feature_dim,seq_len, epochs, tr_dataA, tr_dataB, tr_y, te_dataA, te_dataB, te_y):    

    base_network = create_base_network(feature_dim,seq_len)

    input_a = Input(shape=(seq_len,feature_dim,))
    input_b = Input(shape=(seq_len,feature_dim))

    processed_a = base_network(input_a)
    processed_b = base_network(input_b)

    distance = Lambda(cosine_distance, output_shape=cosine_dist_output_shape)([processed_a, processed_b])

    model = Model([input_a, input_b], distance)

    adam = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
    model.compile(optimizer=adam, loss=contrastive_loss)
    model.fit([tr_dataA, tr_dataB], tr_y,
              batch_size=128,
              epochs=epochs,
              validation_data=([te_dataA, te_dataB], te_y))


    pred = model.predict([tr_dataA, tr_dataB])
    tr_acc = compute_accuracy(pred, tr_y)
    for i in range(len(pred)):
        print (pred[i], tr_y[i])


    return model


def padding(max_len, embedding):
    for i in range(len(embedding)):
        padding = np.zeros(max_len-embedding[i].shape[0])
        embedding[i] = np.concatenate((embedding[i], padding))

    embedding = np.array(embedding)
    return embedding

def getAB(sentences_A,sentences_B, feature_dim, word2idx, idx2word, weights,max_len_def=0):
    #from_sentence_to_array : function that transforms natural language sentences 
    #into vectors of real numbers. Each word is replaced with the corresponding word2vec 
    #embedding, and words that aren't in the embedding are replaced with zeros vector.  
    embedding_A, max_len_A = from_sentence_to_array(sentences_A,word2idx, idx2word, weights)
    embedding_B, max_len_B = from_sentence_to_array(sentences_B,word2idx, idx2word, weights)

    max_len = max(max_len_A, max_len_B,max_len_def*feature_dim)

    #padding to max_len
    embedding_A = padding(max_len, embedding_A)
    embedding_B = padding(max_len, embedding_B)

    seq_len = int(max_len/feature_dim)
    print(seq_len)

    #reshape
    data_A = embedding_A.reshape((len(embedding_A), seq_len, feature_dim))
    data_B = embedding_B.reshape((len(embedding_B), seq_len, feature_dim))

    print('A,B shape: ',data_A.shape, data_B.shape)

    return data_A, data_B, seq_len



FEATURE_DIMENSION = 100
MIN_COUNT = 10
WINDOW = 5

if __name__ == '__main__':

    data = pd.read_csv('data\\train.csv', sep='\t')
    sentences_A = data['sentence_A']
    sentences_B = data['sentence_B']
    tr_y = 1- data['relatedness_score']/5

    if not (os.path.exists(EMBEDDING_PATH)  and os.path.exists(VOCAB_PATH)):    
        create_embeddings(embeddings_path=EMBEDDING_PATH, vocab_path=VOCAB_PATH,  size=FEATURE_DIMENSION, min_count=MIN_COUNT, window=WINDOW, sg=1, iter=25)
    word2idx, idx2word, weights = load_vocab_and_weights(VOCAB_PATH,EMBEDDING_PATH)

    tr_dataA, tr_dataB, seq_len = getAB(sentences_A,sentences_B, FEATURE_DIMENSION,word2idx, idx2word, weights)

    test = pd.read_csv('data\\test.csv', sep='\t')
    test_sentences_A = test['sentence_A']
    test_sentences_B = test['sentence_B']
    te_y = 1- test['relatedness_score']/5

    te_dataA, te_dataB, seq_len = getAB(test_sentences_A,test_sentences_B, FEATURE_DIMENSION,word2idx, idx2word, weights, seq_len) 

    model = siamese(FEATURE_DIMENSION, seq_len, 10, tr_dataA, tr_dataB, tr_y, te_dataA, te_dataB, te_y)


    test_a = ['this is my dog']
    test_b = ['this dog is mine']
    a,b,seq_len = getAB(test_a,test_b, FEATURE_DIMENSION,word2idx, idx2word, weights, seq_len)
    prediction  = model.predict([a, b])
    print(prediction)

Some of the results:

my prediction | true label 
0.849908 0.8
0.849908 0.8
0.849908 0.74
0.849908 0.76
0.849908 0.66
0.849908 0.72
0.849908 0.64
0.849908 0.8
0.849908 0.78
0.849908 0.8
0.849908 0.8
0.849908 0.8
0.849908 0.8
0.849908 0.74
0.849908 0.8
0.849908 0.8
0.849908 0.8
0.849908 0.66
0.849908 0.8
0.849908 0.66
0.849908 0.56
0.849908 0.8
0.849908 0.8
0.849908 0.76
0.847546 0.78
0.847546 0.8
0.847546 0.74
0.847546 0.76
0.847546 0.72
0.847546 0.8
0.847546 0.78
0.847546 0.8
0.847546 0.72
0.847546 0.8
0.847546 0.8
0.847546 0.78
0.847546 0.8
0.847546 0.78
0.847546 0.78
0.847546 0.46
0.847546 0.72
0.847546 0.8
0.847546 0.76
0.847546 0.8
0.847546 0.8
0.847546 0.8
0.847546 0.8
0.847546 0.74
0.847546 0.8
0.847546 0.72
0.847546 0.68
0.847546 0.56
0.847546 0.8
0.847546 0.78
0.847546 0.78
0.847546 0.8
0.852975 0.64
0.852975 0.78
0.852975 0.8
0.852975 0.8
0.852975 0.44
0.852975 0.72
0.852975 0.8
0.852975 0.8
0.852975 0.76
0.852975 0.8
0.852975 0.8
0.852975 0.8
0.852975 0.78
0.852975 0.8
0.852975 0.8
0.852975 0.78
0.852975 0.8
0.852975 0.8
0.852975 0.76
0.852975 0.8

2 Answers:

Answer 0 (score: 2):

You are seeing runs of identical values because the output shape of your cosine_distance function is wrong. When you apply K.mean(...) without an axis argument, the result is a scalar. To fix it, just replace K.mean(...) in cosine_distance with K.mean(..., axis=-1).
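In code, that fix might look like the following (a sketch of the adjusted cosine_distance only, assuming the same Keras backend import K as in the question; everything else stays unchanged):

def cosine_distance(vecs):
    y_true, y_pred = vecs
    y_true = K.l2_normalize(y_true, axis=-1)
    y_pred = K.l2_normalize(y_pred, axis=-1)
    # Averaging only over the last axis keeps one distance value per sample,
    # instead of collapsing the whole batch into a single scalar.
    return K.mean(1 - K.sum(y_true * y_pred, axis=-1), axis=-1)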

A more detailed explanation:

When model.predict() is called, it first pre-allocates the output array pred and then fills it with the batch predictions. From the source code training.py:

if batch_index == 0:
    # Pre-allocate the results arrays.
    for batch_out in batch_outs:
        shape = (num_samples,) + batch_out.shape[1:]
        outs.append(np.zeros(shape, dtype=batch_out.dtype))
for i, batch_out in enumerate(batch_outs):
    outs[i][batch_start:batch_end] = batch_out

In your case you have only a single output, so pred in the code above is just outs[0]. When batch_out is a scalar (e.g., the 0.847546 appearing in your results), the code above is equivalent to pred[batch_start:batch_end] = 0.847546. Since the default batch size of model.predict() is 32, you can see blocks of 32 consecutive 0.847546 values in the results you posted.
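As a toy illustration (plain numpy, not the actual Keras code), this is why a scalar batch_out produces runs of 32 identical predictions:

import numpy as np

num_samples, batch_size = 96, 32
pred = np.zeros((num_samples, 1))
for batch_start in range(0, num_samples, batch_size):
    batch_out = np.random.rand()  # one scalar for the whole batch (the bug)
    pred[batch_start:batch_start + batch_size] = batch_out
# Every block of 32 rows in pred now holds the same value.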

Another, possibly bigger, problem is that the labels are wrong. You convert the relatedness score into a label via tr_y = 1 - data['relatedness_score']/5. Now, if two sentences are "very similar", their relatedness score is 5, so tr_y for these sentences is 0.

However, in the contrastive loss, when y_true is zero the term K.maximum(margin - y_pred, 0) effectively says "these two sentences should have a cosine distance >= margin". That's the opposite of what you want the model to learn (and I don't think you need the K.square either).

Answer 1 (score: 0):

Just so this gets recorded somewhere as an answer (I saw it in the comments of the accepted answer): your contrastive loss function should be

loss = K.mean((1 - y) * K.square(d) + y * K.square(K.maximum(margin - d, 0)))

Your (1 - y) * ... and y * ... terms are swapped, which could throw off people using your example as a starting point. Apart from that, it's a great starting point.

A note on terminology: you use y_true and y_pred instead of y and d. I use y and d because y is your label, which should be 0 or 1, whereas d is not necessarily in the same range (for the cosine distance, d is actually between 0 and 2). It's not really a prediction of the y value. You just want to minimize the distance measure d when the two inputs are similar and maximize it (or push it past the margin) when they are different. Basically, the contrastive loss isn't trying to get d to predict y; it just tries to make d small when the pair is the same and large when the pair is different.
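For reference, a minimal sketch of the corrected loss under these conventions (assuming the label encoding from the question, tr_y = 1 - relatedness_score/5, so y is near 0 for very similar pairs and larger for dissimilar ones, and d is the cosine distance coming out of the Lambda layer):

def contrastive_loss(y, d):
    # y: label derived from the relatedness score, ~0 for similar pairs
    # d: cosine distance, roughly in [0, 2]
    margin = 1
    # Pull similar pairs together (small d), push dissimilar pairs past the margin.
    return K.mean((1 - y) * K.square(d) + y * K.square(K.maximum(margin - d, 0)))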