Is there a simple way to tell SpaCy to ignore stop words when using the .similarity method?

Asked: 2018-10-14 21:03:43

Tags: python nlp spacy

So right now I have a really simple program that will take a sentence and find the sentence in a given book that is most semantically similar, and print out that sentence along with the next few sentences.

import spacy
nlp = spacy.load('en_core_web_lg')

#load alice in wonderland
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
text = strip_headers(load_etext(11)).strip()

alice = nlp(text)

sentences = list(alice.sents)

mysent = nlp("example sentence, could be whatever")

best_match = None
best_similarity_value = 0
for sent in sentences:
    similarity = sent.similarity(mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent

print(sentences[sentences.index(best_match):sentences.index(best_match)+10])

I want to get better results by telling SpaCy to ignore the stop words when doing this, but I don't know the best way to go about it. Like, I could create a new blank list and append every word that isn't a stop word to it:

newlist = []
for sentence in sentences:
    for word in sentence:
        if not word.is_stop:
            newlist.append(word)

But I would have to make it more complicated than the code above, because I'd need to keep the original list of sentences intact (since the indexes would have to stay the same if I want to print out the full sentences later). Plus, if I did it that way, I would have to run this new list of lists back through SpaCy in order to use the .similarity method.

I feel like there must be a better way of going about this, and I'd really appreciate any guidance. Even if there isn't a better way than appending each non-stop word to a new list, I'd appreciate any help in creating a list of lists whose indexes line up with the original 'sentences' variable.
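For example, something along the lines of this rough, untested sketch is what I have in mind (the names here are just placeholders):

# Untested sketch: a parallel list of stop-word-free Docs whose indexes line
# up with the original `sentences` list.
filtered_sentences = [
    nlp(" ".join(word.text for word in sentence if not word.is_stop))
    for sentence in sentences
]
# filtered_sentences[i] corresponds to sentences[i], so after finding the best
# match against filtered_sentences I could still index back into `sentences`
# to print the original text.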

Thanks so much in advance!

2 Answers:

Answer 0 (Score: 1)

What you need to do is to overwrite the way spaCy computes similarity.

For the similarity computation, spaCy first computes a vector for each doc by averaging the vectors of its tokens (the token.vector attribute), and then performs cosine similarity by doing:

return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
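As a quick sanity check, you can verify that the document vector really is just the token average (a minimal sketch, assuming en_core_web_lg is installed):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("This is a sentence")
# Doc.vector should match the plain mean of the token vectors.
token_average = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, token_average))  # expected: True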

You have to tweak this slightly so that the vectors of stop words are not taken into account.

The following code should work for you:

import spacy
from spacy.lang.en import STOP_WORDS
import numpy as np
nlp = spacy.load('en_core_web_lg')
doc1 = nlp("This is a sentence")
doc2 = nlp("This is a baby")

def compute_similarity(doc1, doc2):
    # Sum the vectors of the non-stop-word tokens in each doc, average them,
    # and return the cosine similarity of the two resulting doc vectors.
    vector1 = np.zeros(300)
    vector2 = np.zeros(300)
    for token in doc1:
        if token.text not in STOP_WORDS:
            vector1 = vector1 + token.vector
    vector1 = np.divide(vector1, len(doc1))
    for token in doc2:
        if token.text not in STOP_WORDS:
            vector2 = vector2 + token.vector
    vector2 = np.divide(vector2, len(doc2))
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

print(compute_similarity(doc1, doc2))
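If you want to drop this straight into the search loop from your question, something like this untested sketch should work, since a sentence Span iterates over tokens just like a Doc:

# Untested sketch: `sentences` and `mysent` are the variables from the question.
best_match = None
best_similarity_value = 0.0
for sent in sentences:
    similarity = compute_similarity(sent, mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent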

Hope this helps!

Answer 1 (Score: 0)

Here is a slightly more elegant solution: we'll override how spacy computes document vectors under the hood, which will propagate this customization to any downstream pipeline components such as a TextCategorizer.

This is based on the documentation here: https://spacy.io/usage/processing-pipelines#custom-components-user-hooks

This solution is designed around loading pre-trained embeddings. Rather than referencing a stop-word list directly, I'm just going to assume that anything out of vocabulary for my loaded embeddings is a token I want to ignore in my document-vector calculation.

import numpy as np

class FancyDocumentVectors(object):
    def __call__(self, doc):
        doc.user_hooks["vector"] = self.vector
        return doc

    def vector(self, doc):
        """
        Constrain attention to non-zero vectors.
        Returns concatenation of mean and max pooling
        """
        # This is the part where we filter out stop words 
        # (really any token for which we couldn't calculate a vector representation).
        # If you'd rather invoke a stopword list, change the line below to something like:
        # doc_vecs = np.array([t.vector for t in doc if t.text not in STOPWORDS])
        doc_vecs = np.array([t.vector for t in doc if t.has_vector])
        if sum(doc_vecs.shape) == 0: 
            doc_vecs = np.array([doc[0].vector])

        mean_pooled = doc_vecs.mean(axis=0)
        
        # Because I'm fancy, I'm going to augment my custom document vector with 
        # some additional information. For a demonstration of the value of this 
        # approach, reference the SWEM paper: https://arxiv.org/abs/1805.09843
        max_pooled = doc_vecs.max(axis=0)
        doc_vec = np.hstack([mean_pooled, max_pooled])
        return doc_vec

        # If you're not into it, just return mean_pooled instead.
        # return mean_pooled

nlp.add_pipe(FancyDocumentVectors())

Here's a concrete example using vectors trained on stackoverflow!

First, we load the pre-trained embeddings into an empty language model.

import spacy
from gensim.models.keyedvectors import KeyedVectors

# https://github.com/vefstathiou/SO_word2vec
word_vect = KeyedVectors.load_word2vec_format("SO_vectors_200.bin", binary=True)
nlp = spacy.blank('en')
nlp.vocab.vectors = spacy.vocab.Vectors(data=word_vect.syn0, keys=word_vect.index2word) 

Default behavior before changing anything:

doc = nlp("This is a question about spacy.")
for token in doc:
  print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())

# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 4.353660220883036 -6.901098

Modified behavior after overriding the document-vector computation:

# MAGIC!
nlp.add_pipe(FancyDocumentVectors())

doc = nlp("This is a question about spacy.")
for token in doc:
  print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())

# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 24.601780061609414 109.74769
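Note that the token-level numbers are unchanged; only the document vector differs, because it is now pooled only over the tokens that actually have vectors (and augmented with max pooling). Since Doc.similarity falls back on Doc.vector, which respects the user hook, the sentence search from the question should pick up the new behavior automatically. A rough, untested sketch:

# With the hook installed, Doc.similarity uses the custom document vectors,
# so tokens without vectors (most stop words here) no longer contribute.
doc_a = nlp("This is a question about spacy.")
doc_b = nlp("Another question concerning spacy.")
print(doc_a.similarity(doc_b))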