Question

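How can one make general inferences about taxonomic relationships between entities from text? Looking at the words near "type of" in the word2vec vectors of the en_core_web_lg model, they all seem unrelated. The words near "type", however, are much more similar to it. But how can I take common phrases in text and apply some general similarity measure to infer a taxonomy from SVO triples and the like? I could take a Sense2Vec-style approach, but I am wondering whether something existing can be used without new training.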

Output of the code below:

['eradicate', 'wade', 'equator', 'educated', 'lcd', 'byproducts', 'two', 'propensity', 'rhinos', 'procrastinate']

def get_related(word):
    filtered_words = [w for w in word.vocab if w.is_lower == word.is_lower and w.prob >= -15]
    similarity = sorted(filtered_words, key=lambda w: word.similarity(w), reverse=True)
    return similarity[:10]

print([w.lower_ for w in get_related(nlp.vocab[u'type_of'])])

Answer 1

Every similarity your code retrieves is 0.0, so sorting the list has no effect.

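You are treating "type_of" as a word (more precisely, a lexeme) and assuming that spaCy will understand it as the phrase "type of". Note that the first has an underscore and the second does not. Even without the underscore, though, it is not a lexeme in the model's vocabulary. Because the model does not have enough data on "type_of" to compute a similarity score, every word you compare it with scores 0.0.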
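The effect can be reproduced without loading a model. The sketch below uses made-up three-dimensional vectors (the words and numbers are illustrative assumptions, not values from en_core_web_lg) and a hand-written cosine similarity that returns 0.0 when either vector is all zeros, which matches the all-0.0 scores described above. Because Python's sorted() is stable, equal keys leave the candidates in their original order.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "eradicate": [1.0, 0.0, 0.0],
    "kind": [0.1, 0.9, 0.2],
    "wade": [0.0, 0.0, 1.0],
}
candidates = list(toy_vectors)


def rank(query_vector):
    """Sort the candidates by similarity to the query, most similar first."""
    return sorted(candidates,
                  key=lambda w: cosine(query_vector, toy_vectors[w]),
                  reverse=True)


no_vector = [0.0, 0.0, 0.0]  # stands in for an entry without a vector, such as "type_of"
type_like = [0.3, 1.0, 0.1]  # stands in for a word that does have a vector

unranked = rank(no_vector)   # every key is 0.0, so the order is unchanged
ranked = rank(type_like)     # real scores, so the order is meaningful

print(unranked)
print(ranked)
```

With the zero vector the "ranking" is just the input order, which is why the original output looks like an arbitrary slice of the vocabulary.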

Instead, you can create a Span from the words "type of" and call similarity() on it. This needs only a small change to your code:

import spacy


def get_related(span):  # this now expects a Span instead of a Lexeme

    filtered_words = [w for w in span.vocab if
                      w.is_lower == span.text.islower()
                      and w.prob >= -15]  # filter by probability and case
                                          # (use the lowercase words if and only if the whole Span is in lowercase)
    similarity = sorted(filtered_words,
                        key=lambda w: span.similarity(w),
                        reverse=True)  # sort by the similarity of each word to the whole Span
    return similarity[:10]  # return the 10 most similar words


nlp = spacy.load('en_core_web_lg')  # load the model

print([w.lower_ for w in get_related(nlp(u'type')[:])])  # print related words for "type"
print([w.lower_ for w in get_related(nlp(u'type of')[:])])  # print related words for "type of"

Output:

['type', 'types', 'kind', 'sort', 'specific', 'example', 'particular', 'similar', 'different', 'style']

['type', 'of', 'types', 'kind', 'particular', 'sort', 'different', 'such', 'same', 'associated']

As you can see, all of the words are related to the input in some way, and the outputs for "type" and "type of" are similar but not identical.
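The overlap between the two lists follows from how a Span's vector is computed: by default, spaCy averages the vectors of the tokens in the span. A minimal sketch with made-up two-dimensional vectors (illustrative assumptions, not values from the model) shows that the averaged vector for "type of" sits between "type" and "of", so it stays close to the neighbours of "type" without matching them exactly.

```python
import math


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
type_vec = [1.0, 0.0]
of_vec = [0.0, 1.0]

# Stand-in for the vector of the span "type of": the mean of its token vectors.
span_vec = average([type_vec, of_vec])

sim_type = cosine(span_vec, type_vec)  # similarity of the span to "type"
sim_of = cosine(span_vec, of_vec)      # similarity of the span to "of"

print(span_vec, round(sim_type, 3), round(sim_of, 3))
```

In this toy setup the span is equally similar to both of its tokens and identical to neither, which is consistent with "of" entering the second list while most of the neighbours of "type" remain.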

Discovering taxonomic relationships with spaCy
