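For ELMo, FastText, and Word2Vec, I average the word embeddings within a sentence and then use HDBSCAN/KMeans clustering to group similar sentences together.
A good example of the implementation can be seen in this short article: http://ai.intelligentonlinetools.com/ml/text-clustering-word-embedding-machine-learning/
I would like to do the same thing with BERT (using the Hugging Face BERT Python package), but I am not familiar with how to extract the raw word/sentence vectors so that I can feed them into a clustering algorithm. I know that BERT can output sentence representations, so how would I actually extract the raw vectors from a sentence?
Any information would be helpful.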
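For reference, the averaging approach described in the question can be sketched roughly as follows. This is only a minimal illustration: `toy_vectors` and `sentence_vector` are hypothetical names, and the three-dimensional vectors are made up; real vectors would come from Word2Vec, FastText, or ELMo.

```python
import numpy as np

# Hypothetical toy embedding table; real word vectors would come from
# Word2Vec, FastText, or ELMo and have hundreds of dimensions.
toy_vectors = {
    "cats": np.array([1.0, 0.0, 0.0]),
    "purr": np.array([0.0, 1.0, 0.0]),
    "dogs": np.array([0.0, 0.0, 1.0]),
}

def sentence_vector(sentence, vectors, dim=3):
    """Average the vectors of the known words in a sentence."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    return np.mean([vectors[w] for w in words], axis=0)

sentences = ["Cats purr", "Dogs purr", "unknown words only"]
# One row per sentence; this matrix is what a clustering algorithm would consume
sentence_matrix = np.vstack([sentence_vector(s, toy_vectors) for s in sentences])
print(sentence_matrix.shape)
```

The resulting matrix has one row per sentence, which is the shape that HDBSCAN or KMeans implementations typically expect.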
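Answer 0 (score: 6)
You can use Sentence Transformers to generate the sentence embeddings. These embeddings are much more meaningful than the ones obtained from bert-as-service, because they have been fine-tuned so that semantically similar sentences get a higher similarity score. If you have millions of sentences or more to cluster, you can use a FAISS-based clustering algorithm, because vanilla K-means-style clustering becomes too slow at that scale.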
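Answer 1 (score: 1)
BERT adds a special [CLS] token at the beginning of each sample/sentence. After fine-tuning on a downstream task, the embedding of this [CLS] token, or pooled_output as it is called in the Hugging Face implementation, represents the sentence embedding.
But I assume you don't have labels, so you won't be able to fine-tune, and therefore you cannot use pooled_output as a sentence embedding. Instead, you should use the word embeddings in encoded_layers, which is a tensor with dimensions (12, seq_len, 768). This tensor holds the embeddings (dimension 768) from each of the 12 layers in BERT. To get the word embeddings, you can use the output of the last layer, concatenate or sum the outputs of the last 4 layers, and so on.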
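The layer-combination options above can be sketched with plain NumPy. This is only an illustration under stated assumptions: the array below is random data standing in for the (12, seq_len, 768) tensor, not real BERT output, and averaging over tokens to get one vector per sentence is a common choice that the answer itself does not prescribe.

```python
import numpy as np

# Stand-in for the per-layer hidden states described above:
# 12 layers, seq_len tokens, 768 dimensions. Random values, not real BERT output.
rng = np.random.default_rng(0)
seq_len = 7
encoded_layers = rng.normal(size=(12, seq_len, 768))

# Word embeddings from the last layer only: (seq_len, 768)
last_layer = encoded_layers[-1]

# Sum of the last four layers: (seq_len, 768)
sum_last_four = encoded_layers[-4:].sum(axis=0)

# Concatenation of the last four layers along the feature axis: (seq_len, 3072)
concat_last_four = np.concatenate(list(encoded_layers[-4:]), axis=-1)

# Assumption: one vector per sentence by averaging the token vectors (mean pooling)
sentence_embedding = last_layer.mean(axis=0)

print(last_layer.shape, sum_last_four.shape, concat_last_four.shape, sentence_embedding.shape)
```

With real, padded batches you would normally exclude the padding positions before averaging.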
Here is the script for extracting the features: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/extract_features.py
Answer 2 (score: 0)
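You first need to generate BERT embeddings for the sentences. bert-as-service provides a very easy way to generate sentence embeddings.
This is how you can generate BERT vectors for the list of sentences you need to cluster. It is explained very well in the bert-as-service repository: https://github.com/hanxiao/bert-as-service
Installation: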
pip install bert-serving-server # server
pip install bert-serving-client # client, independent of `bert-serving-server`
Download one of the pre-trained models from https://github.com/google-research/bert
Start the service:
bert-serving-start -model_dir /your_model_directory/ -num_worker=4
Generate the vectors for the list of sentences:
from bert_serving.client import BertClient
bc = BertClient()
vectors=bc.encode(your_list_of_sentences)
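This will give you a list of vectors. You can write them to a CSV file and use any clustering algorithm, since the sentences have been reduced to numbers.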
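As a rough sketch of the CSV step, assuming the encoded vectors form a matrix with one row per sentence: the random array below is only a stand-in for the output of `bc.encode(...)`, the 768 dimensions assume a BERT-base model, and the file name is hypothetical.

```python
import os
import tempfile
import numpy as np

# Stand-in for the output of bc.encode(...): 4 sentences, 768 dimensions.
# Random values, not real BERT vectors.
vectors = np.random.default_rng(0).normal(size=(4, 768))

# Hypothetical file name inside a temporary directory
csv_path = os.path.join(tempfile.mkdtemp(), "sentence_vectors.csv")
np.savetxt(csv_path, vectors, delimiter=",")

# Later, load the vectors back as a (num_sentences, dim) matrix for clustering
loaded = np.loadtxt(csv_path, delimiter=",")
print(loaded.shape)
```

The loaded matrix can then be passed to whichever clustering implementation you prefer.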
Answer 3 (score: 0)
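Not sure if you still need it, but a recent paper describes how to use document embeddings to cluster documents and extract words from each cluster to represent a topic. Here are the links: https://arxiv.org/pdf/2008.09470.pdf, https://github.com/ddangelov/Top2Vec
Inspired by the paper above, another topic modeling algorithm that uses BERT to generate sentence embeddings is described here: https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6, https://github.com/MaartenGr/BERTopic
The two libraries above provide an end-to-end solution for extracting topics from a corpus. However, if you are only interested in generating sentence embeddings, look at Gensim's doc2vec (https://radimrehurek.com/gensim/models/doc2vec.html) or at Sentence Transformers (https://github.com/UKPLab/sentence-transformers), as mentioned in the other answers. If you go with Sentence Transformers, it is advisable to train a model on your domain-specific corpus to get good results.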
Answer 4 (score: 0)
As Subham Kumar mentioned, you can use this Python 3 library to compute sentence similarity: https://github.com/UKPLab/sentence-transformers
The library has several code examples for performing clustering:
"""
This is a more complex example of clustering a large-scale dataset.
It finds local communities in a large set of sentences, i.e., groups of sentences that are highly
similar. You can freely configure the threshold for what is considered similar. A high threshold will
only find extremely similar sentences; a lower threshold will find more sentences that are less similar.
A second parameter is 'min_community_size': only communities with at least a certain number of sentences will be returned.
The method for finding the communities is extremely fast: clustering 50k sentences requires only 5 seconds (plus embedding computation).
In this example, we download a large set of questions from Quora and then find similar questions in this set.
"""
from sentence_transformers import SentenceTransformer, util
import os
import csv
import time
# Model for computing sentence embeddings. We use one trained for similar-question detection
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# We download the Quora Duplicate Questions Dataset (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)
# and find similar questions in it
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 50000  # We limit our corpus to only the first 50k questions
# Download the dataset if it does not exist locally
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)
# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        corpus_sentences.add(row['question2'])
        # Stop reading once the corpus limit is reached
        if len(corpus_sentences) >= max_corpus_size:
            break
corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)
print("Start clustering")
start_time = time.time()
# Two parameters to tune:
# min_community_size: only consider clusters that have at least 25 elements
# threshold: consider sentence pairs with a cosine similarity larger than threshold as similar
clusters = util.community_detection(corpus_embeddings, min_community_size=25, threshold=0.75)
print("Clustering done after {:.2f} sec".format(time.time() - start_time))
# Print the top 3 and bottom 3 elements of each cluster
for i, cluster in enumerate(clusters):
    print("\nCluster {}, #{} Elements ".format(i+1, len(cluster)))
    for sentence_id in cluster[0:3]:
        print("\t", corpus_sentences[sentence_id])
    print("\t", "...")
    for sentence_id in cluster[-3:]:
        print("\t", corpus_sentences[sentence_id])
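To make the roles of the similarity threshold and the minimum community size more concrete, here is a simplified NumPy sketch of threshold-based community detection. It is an illustration only and not the sentence-transformers implementation: `simple_community_detection` is a hypothetical helper, and the two-dimensional toy vectors stand in for real sentence embeddings.

```python
import numpy as np

def simple_community_detection(embeddings, threshold=0.75, min_community_size=2):
    """Greedy, simplified sketch; not the sentence-transformers implementation."""
    # Normalise rows so that the dot product equals the cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = normed @ normed.T

    # Candidate community per sentence: everything at least `threshold` similar to it
    candidates = []
    for row in similarity:
        members = np.flatnonzero(row >= threshold)
        if len(members) >= min_community_size:
            # Most similar members first
            members = members[np.argsort(-row[members])]
            candidates.append(members.tolist())

    # Largest candidates first; accept one only if it shares no member
    # with a community that was already accepted
    candidates.sort(key=len, reverse=True)
    communities, used = [], set()
    for members in candidates:
        if used.isdisjoint(members):
            communities.append(members)
            used.update(members)
    return communities

# Toy vectors: two tight pairs and one outlier
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]])
communities = simple_community_detection(toy, threshold=0.9, min_community_size=2)
print(communities)
```

Raising the threshold shrinks the candidate groups, while raising the minimum size drops small groups entirely, which mirrors how the two parameters are described above.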
"""
This is a simple application for sentence embeddings: clustering
Sentences are mapped to sentence embeddings and then k-means clustering is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]
corpus_embeddings = embedder.encode(corpus)
# Perform k-means clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
# Group the sentences by their assigned cluster
clustered_sentences = [[] for _ in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])
for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")
"""
This is a simple application for sentence embeddings: clustering
Sentences are mapped to sentence embeddings and then agglomerative clustering with a threshold is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import numpy as np
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]
corpus_embeddings = embedder.encode(corpus)
# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
# Perform agglomerative clustering; with n_clusters=None, distance_threshold determines the number of clusters
clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5)  # alternative settings: affinity='cosine', linkage='average', distance_threshold=0.4
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
# Group the sentences by their assigned cluster
clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []
    clustered_sentences[cluster_id].append(corpus[sentence_id])
for i, cluster in clustered_sentences.items():
    print("Cluster ", i+1)
    print(cluster)
    print("")
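A short note on the normalization step in the last script: once the embeddings have unit length, squared Euclidean distance and cosine similarity are tied together by ||a - b||^2 = 2 - 2 * cos(a, b), so a Euclidean-based clustering effectively groups by cosine similarity. The sketch below checks this identity numerically on random stand-in vectors, not real sentence embeddings. How distance_threshold is interpreted still depends on the linkage used.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))  # random stand-in for sentence embeddings

# Normalise each row to unit length, as in the script above
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Cosine similarity of unit vectors is just the dot product
cosine = unit @ unit.T

# Pairwise squared Euclidean distances between the unit vectors
diff = unit[:, None, :] - unit[None, :, :]
squared_dist = (diff ** 2).sum(axis=-1)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(squared_dist, 2 - 2 * cosine)
print(identity_holds)
```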