Question

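I am trying to find sentence similarity through word embeddings and then apply a cosine similarity score. I tried the CBOW / Skip-gram methods for the embeddings, but that did not solve the problem.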

I am doing this for product review data. I have two columns:

SNo  Product_Title                                       Customer_Review
1    101.x battery works well with Samsung smart phone   I have an Apple phone and it's not that great.
2    112.x battery works well with Samsung smart phone   I have samsung smart tv and I tell that it's not worth buying.
3    112.x battery works well with Samsung smart phone.  This charger works very well with samsung phone. It is fast charging.

The first two reviews are irrelevant, because the semantic meaning of Product_Title and Customer_Review is completely different.

How can an algorithm find the semantic meaning of the sentences and score them?

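My approach:

Text pre-processing
Train CBOW / Skip-gram on my dataset using Gensim
Do sentence-level encoding by averaging all the word vectors in the sentence
Take the cosine similarity of product_title and reviews.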
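The averaging and scoring steps above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the 3-dimensional `toy_vectors` are invented for the example and stand in for vectors that would come from a trained CBOW / Skip-gram model, and the helper names are hypothetical.

```python
import math

def average_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # no token of the sentence is in the vocabulary
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration only (assumption, not real embeddings).
toy_vectors = {
    "battery": [0.9, 0.1, 0.0],
    "charger": [0.8, 0.2, 0.1],
    "phone":   [0.7, 0.3, 0.0],
    "tv":      [0.1, 0.9, 0.2],
}

title = average_vector("battery phone".split(), toy_vectors)
review = average_vector("charger phone".split(), toy_vectors)
score = cosine_similarity(title, review)
```

Note that averaging discards word order: two sentences containing the same words in a different order get exactly the same vector.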

Problem: this could not capture the context of the sentences, so the results were very poor.

Approach 2:

Used pre-trained BERT as is, without training it on my sentences. The results did not improve either.

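1. Is there any other approach that can capture the context/semantics of a sentence?

2. How can I train BERT from scratch on my dataset, without using a pre-trained model?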

Answer 1

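Have you tried the Universal Sentence Encoder (USE) or the Multilingual Universal Sentence Encoder?

There are Colab notebooks showing how to score sentence pairs on the Semantic Textual Similarity Benchmark (STS-B), one for semantic textual similarity with USE and one for multilingual similarity.

Here is a heat map of pairwise semantic similarity scores from USE, taken from the Google AI blog post Advances in Semantic Textual Similarity. The model was trained on a large amount of web data, so it should work well for a wide variety of input data.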

Pairwise semantic similarity comparison via outputs from TensorFlow Hub Universal Sentence Encoder.
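As a rough sketch of how USE could be applied to the title/review pairs from the question: the code below assumes `tensorflow` and `tensorflow_hub` are installed and that the model is available at the TF Hub URL shown. The helper names `load_use` and `score_pairs` are made up for this example. The scoring function only needs a callable that maps a list of strings to one vector per string, so any other sentence encoder can be swapped in.

```python
import numpy as np

def load_use(url="https://tfhub.dev/google/universal-sentence-encoder/4"):
    """Load USE from TensorFlow Hub (assumed URL; downloads on first use)."""
    import tensorflow_hub as hub  # imported lazily: heavy optional dependency
    return hub.load(url)

def score_pairs(embed, titles, reviews):
    """Cosine similarity between each title and the review on the same row."""
    t = np.asarray(embed(titles), dtype=float)
    r = np.asarray(embed(reviews), dtype=float)
    # Normalise each row to unit length, guarding against zero vectors.
    t = t / np.clip(np.linalg.norm(t, axis=1, keepdims=True), 1e-12, None)
    r = r / np.clip(np.linalg.norm(r, axis=1, keepdims=True), 1e-12, None)
    # The row-wise dot product of unit vectors is the cosine similarity.
    return np.sum(t * r, axis=1)

# Usage (not run here because it downloads the model):
# embed = load_use()
# scores = score_pairs(embed, df["Product_Title"].tolist(),
#                      df["Customer_Review"].tolist())
```

A low score for a row would suggest that the review is not about the product in the title; where to put the cut-off is something to tune on labelled examples.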

Answer 2

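Here is a very thorough tutorial on how to perform sentence similarity analysis with the 50+ sentence embeddings in NLU (e.g. BERT, USE, Electra and more)! NLU supports 50+ languages and includes multilingual embeddings!
Generating a similarity matrix with NLU takes about 5 lines, and you can use 3 or more sentence embeddings at the same time in just 1 line of code. All you need is: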

import nlu

nlu.load('embed_sentence.bert embed_sentence.electra use')

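But let's keep it simple and assume we want to compute the similarity matrix for every sentence in a DataFrame.

You need to perform the following 3 steps:

1. Compute the embeddings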

predictions = nlu.load('embed_sentence.bert').predict(your_dataframe)

2. Compute the similarity matrix

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_sim_df_total(predictions, e_col):
  # Compute the cosine similarity between every pair of sentences and add one
  # new column per sentence, named after the sentence it is compared to.
  # Stack the embeddings into a matrix (one row per sentence)
  embed_mat = np.array([x for x in predictions[e_col]])
  # Cosine similarity between every pair of embeddings
  sim_mat = cosine_similarity(embed_mat, embed_mat)
  for i, v in enumerate(sim_mat):
    s = predictions.iloc[i].document
    predictions[s] = sim_mat[i]

  return predictions

sim_matrix_df = get_sim_df_total(predictions, 'embed_sentence_bert_embeddings')
sim_matrix_df

3. Plot a heat map of the similarity matrix

import matplotlib.pyplot as plt
import seaborn as sns

# Columns that hold text or embeddings rather than similarity scores
non_sim_columns = ['text', 'document', 'Title', 'embed_sentence_bert_embeddings']

def viz_sim_matrix_first_n(num_sentences=20, sim_df=sim_matrix_df):
  # Plot a heat map for the first num_sentences sentences
  fig, ax = plt.subplots(figsize=(20, 14))
  sim_df.index = sim_df.document
  # Keep only the similarity columns; skip metadata columns if present
  sim_columns = [c for c in sim_df.columns if c not in non_sim_columns]
  ax = sns.heatmap(sim_df.iloc[:num_sentences][sim_columns[:num_sentences]])

  ax.axes.set_title(f"Similarity matrix for the first {num_sentences} sentences in the dataset")
  plt.show()

viz_sim_matrix_first_n()

To learn more, check out the following links :)

Article: https://medium.com/spark-nlp/easy-sentence-similarity-with-bert-sentence-embeddings-using-john-snow-labs-nlu-ea078deb6ebf

Colab notebook with the NLU sentence similarity demo: https://colab.research.google.com/drive/1LtOdtXtRJ3_N8kYywPd5k2AJMCGcgAdN?usp=sharing

NLU website: http://nlu.johnsnowlabs.com/

How to find sentence similarity using deep learning?
