import spacy
ws = {}
nlp = spacy.load('de_core_news_sm')
data = 'Some long text'
train_corpus = nlp(data)
train_corpus = [token.text for token in train_corpus if not token.is_stop and len(token) > 4]
test_corpus = nlp('Some short sentence')
ae = train_corpus.similarity(test_corpus)
The line ae = train_corpus.similarity(test_corpus) raises
AttributeError: 'list' object has no attribute 'similarity'
If I remove the line train_corpus = [token.text for token in train_corpus if not token.is_stop and len(token) > 4], it works, but the stop words are still included.
How can I remove the stop words so that the similarity call still works?
Edit: ae = nlp(train_corpus).similarity(test_corpus) raises
TypeError: Argument 'string' has incorrect type (expected str, got list)
Answer 0 (score: 0)
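Note that you are using a German model on English phrases. The list comprehension returns a plain Python list of strings rather than a spaCy Doc, so you need to join the remaining tokens back into a single string and pass it through nlp again to get a new Doc. Also, with your sample text the len(token) > 4 condition removes every token anyway, because "Some", "long" and "text" each have exactly four characters; that is why the code below adds a longer word.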
import spacy
nlp = spacy.load('en_core_web_sm')
# nlp = spacy.load('de_core_news_sm')
ws = {}
#data = 'Some long text'
data = 'Some long text Elephant'
train_corpus = nlp(data)
train_corpus = nlp(" ".join([token.text for token in train_corpus if not token.is_stop and len(token) > 4]))
test_corpus = nlp('Some short sentence')
ae = train_corpus.similarity(test_corpus)
print(ae)
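The filter-and-rejoin step above could be wrapped in a small helper, which also makes the "everything was filtered out" case explicit. This is only a sketch: the names filtered_text and similarity_without_stop_words are hypothetical, not part of spaCy, and the helper assumes that nlp is a loaded spaCy pipeline whose tokens expose .text and .is_stop.

```python
def filtered_text(tokens, min_len=4):
    """Join the texts of tokens that are not stop words and are longer than min_len.

    `tokens` can be a spaCy Doc or any iterable of objects with
    `.text` and `.is_stop` attributes (an assumption of this sketch).
    """
    kept = [t.text for t in tokens if not t.is_stop and len(t.text) > min_len]
    return " ".join(kept)


def similarity_without_stop_words(nlp, long_text, short_text, min_len=4):
    """Hypothetical helper: filter long_text, re-parse it, then compare.

    Returns None when no token survives the filter, instead of
    comparing against an empty document.
    """
    cleaned = filtered_text(nlp(long_text), min_len)
    if not cleaned:
        return None
    # Re-parsing the joined string yields a Doc again, so .similarity exists.
    return nlp(cleaned).similarity(nlp(short_text))
```

With nlp = spacy.load('en_core_web_sm'), calling similarity_without_stop_words(nlp, 'Some long text Elephant', 'Some short sentence') should correspond to the snippet above, while the original 'Some long text' would return None. Keep in mind that the small (sm) models do not ship with real word vectors, so the similarity scores they produce are of limited use; a model with vectors, such as en_core_web_md, is probably a better fit.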