The example Google provides works on a single array, as shown below:
messages = [
# Smartphones
"My phone is not good.",
"Your cellphone looks great.",
# Weather
"Will it snow tomorrow?",
"Recently a lot of hurricanes have hit the US",
# Food and health
"An apple a day, keeps the doctors away",
"Eating strawberries is healthy",
]
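I managed to get that working, but I want to apply it to 2 arrays of different sizes (one array has 4 elements, the other has 100 or so), and I am struggling to get the implementation to work.
I want it to compare each sentence in array1 against every sentence in array2, and export a CSV containing each sentence pair and its similarity score.
Here is the code so far. It does not work correctly: when I export the CSV (similarity score in the first column, sentence 1 in the second, sentence 2 in the third), the scores are, as far as I can tell, far from being in the right order, so the score of one sentence pair ends up assigned to a different sentence pair: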
from absl import logging
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import os
import pandas as pd
import re
import csv
module_url = "module/2/"
model = hub.Module(module_url)
print ("module %s loaded" % module_url)
def embed(input):
    return model(input)
embed = hub.Module(module_url)
def flatten(listoflists):
    flattenedlist = []
    for x in listoflists:
        for y in x:
            flattenedlist.append(y)
    return flattenedlist
messages = [
# Smartphones
"My phone is not good.",
"Your cellphone looks great.",
# Weather
"Will it snow tomorrow?",
"Recently a lot of hurricanes have hit the US",
# Food and health
"An apple a day, keeps the doctors away",
"Eating strawberries is healthy",
]
messages2 = ["My phone is not turning on.", "I hate snow.", "Apples are the devil", "I like basil.", "Eating strawberries is healthy.", "An apple a day keeps the doctor away", "Your cellphone looks great", "But my cellphone doesnt look so great"]
similarity_input_placeholder = tf.compat.v1.placeholder(tf.string, shape=(None))
similarity_message_encodings = embed(similarity_input_placeholder)
print(similarity_message_encodings)
print(similarity_input_placeholder)
similarity_input_placeholder2 = tf.compat.v1.placeholder(tf.string, shape=(None))
similarity_message_encodings2 = embed(similarity_input_placeholder2)
with tf.Session() as session:
    session.run(tf.compat.v1.global_variables_initializer())
    session.run(tf.compat.v1.tables_initializer())
    message_embeddings_ = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages})
    message_embeddings_2 = session.run(similarity_message_encodings, feed_dict={similarity_input_placeholder: messages2})
    corr = np.inner(message_embeddings_, message_embeddings_2)
    newlist = corr.tolist()
    print(newlist)
    similarityscore = flatten(newlist)
    if not os.path.exists('results.csv'):
        header_added1 = False
    else:
        header_added1 = True
    count = 0
    count2 = 1
    subtract = 1
    iterscore = iter(similarityscore)
    for score in similarityscore:
        if score > 0.4:
            with open('results for messages .csv', mode='a') as csv_writer:
                csv_writer = csv.writer(csv_writer, delimiter=',',quotechar='"', quoting=csv.QUOTE_MINIMAL)
                if not header_added1:
                    csv_writer.writerow(['Similarity Score', 'Sentences 1','Sentences 2'])
                    header_added1 = True
                for m in messages:
                    if count > len(messages)+len(messages2):
                        count -=len(messages)+len(messages2)
                    csv_writer.writerow([next(iterscore), m, messages2[count]])
                    subtract=1
                    count +=1
                    count2 +=1
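With this code, I get the following results:
Similarity scores out of order:
I am sure that "Will it snow tomorrow?" is not similar to "Apples are the devil".
My assumption is that the scores themselves are computed correctly and are merely written out against the wrong sentence pairs, but I could certainly be wrong, so please correct me.
So I am not sure how to export them in the right order (if that is indeed the problem).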
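For reference, one possible way to keep scores and sentence pairs aligned is sketched below. It relies on the fact that `np.inner` on two 2-D arrays returns a matrix whose entry `[i][j]` is the inner product of row `i` of the first array with row `j` of the second, so `newlist[i][j]` should belong to `messages[i]` and `messages2[j]`. Indexing the matrix directly avoids the separate counters and the flattened iterator. The helper name `write_pair_scores` and the toy scores are hypothetical and only stand in for the real embeddings; this is a sketch under those assumptions, not a tested fix for the full TensorFlow script.

```python
import csv
import io


def write_pair_scores(score_rows, sentences1, sentences2, out_file, threshold=0.4):
    """Write one CSV row per sentence pair whose score exceeds the threshold.

    Assumes score_rows[i][j] is the score of sentences1[i] against
    sentences2[j], i.e. the layout of np.inner(emb1, emb2).tolist().
    Returns the number of data rows written (header excluded).
    """
    writer = csv.writer(out_file, delimiter=',', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['Similarity Score', 'Sentences 1', 'Sentences 2'])
    written = 0
    for i, row in enumerate(score_rows):      # i indexes sentences1
        for j, score in enumerate(row):       # j indexes sentences2
            if score > threshold:
                writer.writerow([score, sentences1[i], sentences2[j]])
                written += 1
    return written


# Toy stand-in for corr.tolist(): 2 sentences against 3 sentences.
# The scores are made up for illustration only.
toy_sentences1 = ["Will it snow tomorrow?", "My phone is not good."]
toy_sentences2 = ["I hate snow.", "Apples are the devil",
                  "My phone is not turning on."]
toy_scores = [
    [0.62, 0.10, 0.05],
    [0.08, 0.12, 0.81],
]

buffer = io.StringIO()
rows_written = write_pair_scores(toy_scores, toy_sentences1, toy_sentences2, buffer)
print(buffer.getvalue())
```

With the real data, the same function could presumably be called with `newlist`, `messages`, `messages2`, and a file opened with `open(path, mode='w', newline='')` in place of the in-memory buffer.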