spaCy: punctuation removal, stop-word removal, and lemmatization

Time: 2019-09-01 16:17:54

Tags: python spacy

I am trying to apply punctuation removal, stop-word removal, and lemmatization to a list of strings.

I tried using lemma_, is_stop, and is_punct:

data = ['We will pray and hope for the best', 
    'Though it may not make landfall all week if it follows that track',
    'Heavy rains, capable of producing life-threatening flash floods, are possible']

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en")

doc = list(nlp.pipe(data))

data_clean = [[w.lemma_ for w in doc if not w.is_stop and not w.is_punct and not w.like_num] for doc in data]

I get the following error: AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'lemma_'

(The same problem occurs with is_stop and is_punct.)

1 answer:

Answer 0 (score: 1)

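In the outer loop you iterate over data, the unprocessed list of strings, but you need to iterate over doc, the list of Doc objects returned by nlp.pipe. Your variable naming is also unfortunate, because the loop variable reuses the name doc that already refers to that list. The following naming should be less confusing: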

docs = list(nlp.pipe(data))
data_clean = [[w.lemma_ for w in doc if (not w.is_stop and not w.is_punct and not w.like_num)] for doc in docs]
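
For reference, here is a minimal end-to-end sketch of the same fix. The helper names clean_docs and run_example are made up for illustration, and the model name "en_core_web_sm" is an assumption: the "en" shortcut used in the question no longer works in spaCy v3, so substitute whichever English model you have installed.

```python
from typing import Iterable, List


def clean_docs(docs: Iterable) -> List[List[str]]:
    """Return one list of lemmas per document.

    Each document is iterated token by token; stop words, punctuation,
    and number-like tokens are skipped. (Hypothetical helper name.)
    """
    return [
        [tok.lemma_ for tok in doc
         if not (tok.is_stop or tok.is_punct or tok.like_num)]
        for doc in docs
    ]


def run_example() -> List[List[str]]:
    # Imported here so clean_docs can be reused without loading spaCy.
    import spacy

    # Assumption: this model is installed, e.g. via
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    data = [
        "We will pray and hope for the best",
        "Though it may not make landfall all week if it follows that track",
        "Heavy rains, capable of producing life-threatening flash floods, are possible",
    ]

    # nlp.pipe yields processed Doc objects; iterate over those,
    # not over the raw strings in data.
    return clean_docs(nlp.pipe(data))
```

Calling run_example() returns one list of lemmas per input sentence. The exact lemmas and which words count as stop words depend on the model and spaCy version, so no expected output is shown here.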