How do I apply semantic similarity to 2 separate arrays using Google's TF-Hub Universal Sentence Encoder?

Asked: 2019-12-29 03:05:15

Tags: python numpy tensorflow nlp

I am working with this notebook: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb

The example Google provides works on a single array, like this:

messages = [
    # Smartphones
    "My phone is not good.",
    "Your cellphone looks great.",

    # Weather
    "Will it snow tomorrow?",
    "Recently a lot of hurricanes have hit the US",

    # Food and health
    "An apple a day, keeps the doctors away",
    "Eating strawberries is healthy",
]

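I managed to get that working, but I want to apply it to 2 arrays of different sizes (one array has 4 elements, the other has 100 or so), and I am struggling to get an implementation working.

I want it to compare every sentence in array1 with all the sentences in array2, and export a CSV containing each sentence pair and its similarity score.

This is the code so far. It does not work properly: when I export the CSV with the similarity score in the first column, sentence 1 in the second column and sentence 2 in the third column, the scores are far from being in the right order as far as I can tell (the score of one sentence pair gets assigned to a different sentence pair):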

from absl import logging

import tensorflow as tf
import tensorflow_hub as hub

import numpy as np
import os
import pandas as pd
import re

import csv
module_url = "module/2/" 
model = hub.Module(module_url)
print ("module %s loaded" % module_url)

def embed(input):
  return model(input)
embed = hub.Module(module_url)
def flatten(listoflists):
    flattenedlist = []
    for x in listoflists:
        for y in x:
            flattenedlist.append(y)
    return flattenedlist

messages = [
# Smartphones
"My phone is not good.",
"Your cellphone looks great.",

# Weather
"Will it snow tomorrow?",
"Recently a lot of hurricanes have hit the US",

# Food and health
"An apple a day, keeps the doctors away",
"Eating strawberries is healthy",
]

messages2 = ["My phone is not turning on.", "I hate snow.", "Apples are the devil", "I like basil.", "Eating strawberries is healthy.", "An apple a day keeps the doctor away", "Your cellphone looks great", "But my cellphone doesnt look so great"]

similarity_input_placeholder = tf.compat.v1.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
print(similarity_message_encodings)
print(similarity_input_placeholder)
similarity_input_placeholder2 = tf.compat.v1.placeholder(tf.string, shape=(None))
similarity_message_encodings2 = embed(similarity_input_placeholder2)

with tf.Session() as session:
    session.run(tf.compat.v1.global_variables_initializer())
    session.run(tf.compat.v1.tables_initializer())
    message_embeddings_ = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages})
    message_embeddings_2 = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages2})

    corr = np.inner(message_embeddings_, message_embeddings_2)
    newlist = corr.tolist()
    print(newlist)
    similarityscore = flatten(newlist)

    if not os.path.exists('results.csv'):
        header_added1 = False
    else:
        header_added1 = True
    count = 0
    count2 = 1
    subtract = 1
    iterscore = iter(similarityscore)
    for score in similarityscore:
        if score > 0.4:

            with open('results for messages .csv', mode='a') as csv_writer:
                csv_writer = csv.writer(csv_writer, delimiter=',',quotechar='"', quoting=csv.QUOTE_MINIMAL)
                if not header_added1:
                    csv_writer.writerow(['Similarity Score', 'Sentences 1','Sentences 2'])
                    header_added1 = True

                for m in messages:
                    if count > len(messages)+len(messages2):
                        count -=len(messages)+len(messages2)
                    csv_writer.writerow([next(iterscore), m, messages2[count]])
            subtract=1
            count +=1
            count2 +=1

With this code, I get the following result:

[Image: similarity scores out of order]

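I am sure that "Will it snow tomorrow?" is not similar to "Apples are the devil".

My assumption is that the scores themselves are computed correctly and are just being matched to the wrong sentence pairs, but I could of course be wrong, so please correct me.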
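One way to check that assumption is to look at how `np.inner` lays out its result. The sketch below uses small made-up vectors in place of real embeddings (`a`, `b`, `flat` and `pairs` are placeholder names, not taken from the code above). Entry `[i, j]` of the result is the inner product of row `i` of the first array with row `j` of the second, so a row-major flatten puts the score for pair `(i, j)` at index `k = i * len(b) + j`.

```python
import numpy as np

# Made-up stand-ins for the two embedding arrays:
# 2 "sentences" in a, 3 "sentences" in b, 4 dimensions each.
a = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
b = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [3.0, 3.0, 0.0, 0.0]])

corr = np.inner(a, b)   # shape (len(a), len(b)) == (2, 3)
flat = corr.ravel()     # row-major: all of row 0, then all of row 1

# Recover the (i, j) pair that each flattened score belongs to.
pairs = [divmod(k, len(b)) for k in range(flat.size)]
```

If this layout holds for the real embeddings as well, the flattened score at position `k` belongs to `messages[k // len(messages2)]` and `messages2[k % len(messages2)]`, which is not the pairing the nested loop over `messages` with a separate `count` produces.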

So I am not sure how to export them in the right order (if that is indeed the problem).
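A minimal sketch of one possible way to keep scores and sentences aligned, assuming the score matrix has the layout `np.inner(message_embeddings_, message_embeddings_2)` produces, is to index the matrix directly with `[i][j]` instead of flattening it and walking a separate iterator. The function name `export_pairs`, the file name `similarity_pairs_demo.csv`, and the demo scores are all hypothetical; the scores are hand-written stand-ins, not output from the encoder.

```python
import csv
import numpy as np

def export_pairs(path, sentences1, sentences2, corr, threshold=0.4):
    """Write one CSV row per (sentence1, sentence2) pair scoring above threshold.

    corr[i][j] is expected to be the score for sentences1[i] vs sentences2[j].
    """
    rows = []
    for i, s1 in enumerate(sentences1):
        for j, s2 in enumerate(sentences2):
            score = float(corr[i][j])
            if score > threshold:
                rows.append([score, s1, s2])
    # mode="w" rewrites the file on every call, so the header is written once.
    with open(path, mode="w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Similarity Score", "Sentence 1", "Sentence 2"])
        writer.writerows(rows)
    return rows

# Hand-written scores standing in for the real np.inner(...) result.
demo1 = ["My phone is not good.", "Will it snow tomorrow?"]
demo2 = ["My phone is not turning on.", "I hate snow.", "Apples are the devil"]
demo_corr = np.array([[0.8, 0.1, 0.05],
                      [0.1, 0.6, 0.02]])
rows = export_pairs("similarity_pairs_demo.csv", demo1, demo2, demo_corr)
```

With the real data the call would look something like `export_pairs("results.csv", messages, messages2, corr)`. Unlike the append mode used in the question, this sketch overwrites the file on each run, which avoids having to track whether the header has already been written.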

0 answers:

No answers yet