Question

I am trying to apply punctuation removal, stop word removal and lemmatization to a list of strings.

I tried using lemma_, is_stop and is_punct:

data = ['We will pray and hope for the best', 
    'Though it may not make landfall all week if it follows that track',
    'Heavy rains, capable of producing life-threatening flash floods, are possible']

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en")

doc = list(nlp.pipe(data))

data_clean = [[w.lemma_ for w in doc if not w.is_stop and not w.is_punct and not w.like_num] for doc in data]

I get the following error: AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'lemma_'

(The same problem occurs with is_stop and is_punct.)

Answer 1

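In the outer loop you iterate over data, the unprocessed list of strings, but you need to iterate over the Doc objects returned by nlp.pipe instead. Your variable names also invite this mix-up, because doc is used both for the list of processed documents and as the loop variable. The following naming should reduce the confusion: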

docs = list(nlp.pipe(data))
data_clean = [[w.lemma_ for w in doc if (not w.is_stop and not w.is_punct and not w.like_num)] for doc in docs]
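
To make the difference between the two loops concrete, here is a minimal sketch that runs without spaCy installed. The helper names clean and tok are hypothetical, and tok only builds a stand-in object exposing the token attributes used above (lemma_, is_stop, is_punct, like_num); it is not a spaCy API. The sketch filters a hand-built "document" and then shows what happens when raw strings are passed instead.

```python
from types import SimpleNamespace


def clean(docs):
    """Return the lemmas of each document, skipping stop words,
    punctuation and number-like tokens."""
    return [
        [w.lemma_ for w in doc
         if not (w.is_stop or w.is_punct or w.like_num)]
        for doc in docs
    ]


def tok(text, lemma=None, stop=False, punct=False, num=False):
    # Hypothetical stand-in for a spaCy Token: only the attributes
    # that clean() reads are provided.
    return SimpleNamespace(text=text, lemma_=lemma or text,
                           is_stop=stop, is_punct=punct, like_num=num)


# One hand-built "document": a plain list of token-like objects.
fake_docs = [[
    tok("Heavy", "heavy"),
    tok("rains", "rain"),
    tok(",", punct=True),
    tok("are", "be", stop=True),
    tok("possible"),
]]

cleaned = clean(fake_docs)
print(cleaned)  # [['heavy', 'rain', 'possible']]

# Passing the raw strings instead: iterating over a str yields
# one-character strings, which have none of the token attributes.
try:
    clean(["Heavy rains"])
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

With spaCy itself, the same function should accept the processed documents directly, for example clean(nlp.pipe(data)), since iterating over a Doc yields Token objects.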
