Loading a pretrained model from TF-Hub to compute Word Mover's Distance (WMD) in Gensim or spaCy

Time: 2019-09-09 14:47:55

Tags: tensorflow nlp gensim spacy tensorflow-hub

I want to compute Word Mover's Distance using embeddings from the Universal Sentence Encoder on TensorFlow Hub.

I have already tried the spaCy example for WMD-relax, which loads the 'en' model from spaCy, but I can't find a way to plug in other embeddings.

Gensim, it seems, only accepts load_word2vec_format files (file.bin) or load files (file.vec).
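For illustration only (not part of the original question): a minimal sketch of the plain-text word2vec format that gensim's load_word2vec_format accepts, assuming you have somehow obtained per-word vectors from a TF-Hub module (the tfhub_vectors dict below is purely hypothetical):

from gensim.models import KeyedVectors

# hypothetical word -> vector mapping extracted from a TF-Hub module
tfhub_vectors = {"king": [0.1, 0.2, 0.3], "queen": [0.2, 0.1, 0.4]}

dim = len(next(iter(tfhub_vectors.values())))
with open("tfhub_vectors.vec", "w", encoding="utf-8") as f:
    f.write(f"{len(tfhub_vectors)} {dim}\n")  # header line: vocab size and dimensionality
    for word, vec in tfhub_vectors.items():
        f.write(word + " " + " ".join(map(str, vec)) + "\n")

kv = KeyedVectors.load_word2vec_format("tfhub_vectors.vec", binary=False)
# WMD between two token lists (needs pyemd or POT depending on the gensim version)
print(kv.wmdistance(["king"], ["queen"]))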

As far as I know, someone has written a BERT-to-token-embeddings converter based on PyTorch, but it has not been generalized to other models on tf-hub.

Is there any other way to convert a pretrained model on tf-hub into the spaCy format or the word2vec format?

2 answers:

Answer 0 (score: 0)

You need two things.

First, tell spaCy to use external vectors for your documents, spans, or tokens. This can be done by setting user hooks:

- user_hooks["vector"] for the document vector
- user_span_hooks["vector"] for span vectors
- user_token_hooks["vector"] for token vectors

Given that you have a function that retrieves the vector of a Doc/Span/Token from TF Hub (they all have a text attribute):

import spacy
import numpy as np
import tensorflow_hub as hub


model = hub.load(TFHUB_URL)  # URL of the TF Hub module you want to use

def embed(element):
    # get the text
    text = element.text
    # then get your vector back. The signature is for batches/arrays
    results = model([text])
    # get the first element because we queried with just one text
    result = np.array(results)[0]
    return result

You can write the following pipeline component to tell spaCy how to retrieve the custom embeddings for documents, spans, and tokens:

def overwrite_vectors(doc):
    doc.user_hooks["vector"] = embed
    doc.user_span_hooks["vector"] = embed
    doc.user_token_hooks["vector"] = embed
    # pipeline components must hand the Doc back
    return doc

# add this to your nlp pipeline to get it on every document
nlp = spacy.blank('en')  # or any other Language
nlp.add_pipe(overwrite_vectors)
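Once the component is in the pipeline, every processed document (and its spans and tokens) reports the external vector. A quick check might look like this (a sketch, assuming TFHUB_URL points to a text-embedding module such as the Universal Sentence Encoder):

doc = nlp("This vector now comes from the TF Hub model")
print(doc.vector.shape)       # dimensionality of the TF Hub embedding
print(doc[0:3].vector.shape)  # spans go through the same hook
print(doc[0].vector.shape)    # tokens too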

For the part of the question about a custom distance, there is a user hook for that as well:

def word_mover_similarity(a, b):
    vector_a = a.vector
    vector_b = b.vector
    # your distance score needs to be converted to a similarity score
    similarity = TODO_IMPLEMENT(vector_a, vector_b)
    return similarity

def overwrite_similarity(doc):
    doc.user_hooks["similarity"] = word_mover_similarity
    doc.user_span_hooks["similarity"] = word_mover_similarity
    doc.user_token_hooks["similarity"] = word_mover_similarity
    # again, return the Doc so the pipeline keeps working
    return doc

# as before, add this to the pipeline
nlp.add_pipe(overwrite_similarity)
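With both components in the pipeline, calling similarity() on a pair of documents dispatches to word_mover_similarity. A usage sketch (it only runs once TODO_IMPLEMENT is replaced by a real WMD or other similarity function, e.g. the one from the second answer):

doc_a = nlp("The weather is lovely today")
doc_b = nlp("It is sunny outside")
# routed through word_mover_similarity via the user hook
print(doc_a.similarity(doc_b))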

I have an implementation for the TF Hub Universal Sentence Encoder that uses user_hooks in this way: https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub

Answer 1 (score: 0)

Here is a working implementation of WMD. You can create a WMD object and load your own embeddings:

import numpy
from wmd import WMD

embeddings_numpy_array = ...  # your array with word vectors
calc = WMD(embeddings_numpy_array, ...)
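For concreteness, a small toy sketch of what the constructor expects, loosely following the library's README (the exact signature may differ between versions): an embedding matrix plus an nbow mapping from document id to (display name, word ids, word weights).

import numpy
from wmd import WMD

# toy embedding matrix: one row per word id
embeddings = numpy.array([[0.1, 1.0], [1.0, 0.1]], dtype=numpy.float32)
# nbow: document id -> (display name, word ids, word weights)
nbow = {"first":  ("#1", [0, 1], numpy.array([1.5, 0.5], dtype=numpy.float32)),
        "second": ("#2", [0, 1], numpy.array([0.75, 0.15], dtype=numpy.float32))}
calc = WMD(embeddings, nbow, vocabulary_min=2)
print(calc.nearest_neighbors("first"))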

Or, as shown in this example, you can create your own class:

import spacy
spacy_nlp = spacy.load('en_core_web_lg')

class SpacyEmbeddings(object):
    def __getitem__(self, item):
        return spacy_nlp.vocab[item].vector # here you can return your own vector instead

calc = WMD(SpacyEmbeddings(), documents)
...
...
calc.nearest_neighbors("some text")
...
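To tie this back to the original question, __getitem__ is the place where a TF Hub model could be substituted for spaCy's vectors. A hedged sketch (the TfHubEmbeddings class and the module URL are illustrative, not part of the answer, and it assumes the lookup key can be mapped back to a word string):

import numpy as np
import tensorflow_hub as hub
from wmd import WMD

tfhub_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

class TfHubEmbeddings(object):
    def __getitem__(self, item):
        # the example above indexes by spaCy lexeme id, so map it back to text first
        text = spacy_nlp.vocab.strings[item] if isinstance(item, int) else item
        # embed the single word and return its vector
        return np.array(tfhub_model([text]))[0]

calc = WMD(TfHubEmbeddings(), documents)  # documents as in the example above
calc.nearest_neighbors("some text")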