Question

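I'm trying to add custom STOP_WORDS to spaCy. The code below adds the custom stop word "Bestellung" to the default STOP_WORDS set. The addition itself works, i.e. the set contains "Bestellung" afterwards, but when I test the custom stop word "Bestellung" with .is_stop, Python returns False.

A second test with a default stop word, "darunter", returns True. I don't understand this, because both "Bestellung" and "darunter" are in the same STOP_WORDS set.

Does anyone know why it behaves this way?

Thanks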

import spacy
from spacy.lang.de.stop_words import STOP_WORDS

STOP_WORDS.add("Bestellung")
print(STOP_WORDS)  # Shows that both tested words, "darunter" and "Bestellung", are in the STOP_WORDS set.
nlp = spacy.load("de_core_news_sm")
print(nlp.vocab["Bestellung"].is_stop)  # prints False
print(nlp.vocab["darunter"].is_stop)  # prints True


Answer 1

This is related to a bug in earlier spaCy models. It works fine in the latest spaCy. Example with the English model:

>>> import spacy
>>> nlp = spacy.load('en')
>>> from spacy.lang.en.stop_words import STOP_WORDS
>>> STOP_WORDS.add("Bestellung")
>>> print(nlp.vocab["Bestellung"].is_stop)
True

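If you want to fix this on your existing spaCy installation, you can use the following workaround, which updates the is_stop attribute for every word present in STOP_WORDS.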

nlp.vocab.add_flag(lambda s: s.lower() in spacy.lang.en.stop_words.STOP_WORDS, spacy.attrs.IS_STOP)

This was mentioned in a spaCy issue on GitHub.
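As an alternative to re-registering the flag, the is_stop attribute can also be set directly on the vocab entries for the custom words. The following is only a minimal sketch under a few assumptions: it uses spacy.blank("de") so that it runs without a downloaded model (swap in spacy.load("de_core_news_sm") if you have it), and the custom_stop_words set is a hypothetical name introduced for illustration.

```python
import spacy

# Assumption: a bare German pipeline is enough for the demonstration.
# Replace with spacy.load("de_core_news_sm") if that model is installed.
nlp = spacy.blank("de")

# Hypothetical list of custom stop words.
custom_stop_words = {"Bestellung"}

for word in custom_stop_words:
    # Each casing is a separate entry in the vocab, so flag the common variants.
    for variant in {word, word.lower(), word.upper(), word.capitalize()}:
        nlp.vocab[variant].is_stop = True

doc = nlp("Die Bestellung ist da")
flags = [(token.text, token.is_stop) for token in doc]
print(flags)
```

Because the flag is written onto the vocab entry itself, it does not depend on whether changes to the imported STOP_WORDS set are picked up by an already built vocab.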
