I am trying to apply punctuation removal, stop word removal, and lemmatization to a list of strings.
I tried using lemma_, is_stop, and is_punct:
data = ['We will pray and hope for the best',
'Though it may not make landfall all week if it follows that track',
'Heavy rains, capable of producing life-threatening flash floods, are possible']
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load("en")
doc = list(nlp.pipe(data))
data_clean = [[w.lemma_ for w in doc if not w.is_stop and not w.is_punct and not w.like_num] for doc in data]
I get the following error: AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'lemma_'
(I get the same error for is_stop and is_punct.)
Answer 0 (score: 1)
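In the outer loop you are iterating over data, the list of unprocessed strings, but you need to iterate over the processed Doc objects that nlp.pipe returned and that you stored in doc.
Your variable names also invite confusion: inside the comprehension, "for doc in data" rebinds doc to each raw string, so the parsed documents are never used. The following naming should be less confusing: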
docs = list(nlp.pipe(data))
data_clean = [[w.lemma_ for w in doc if (not w.is_stop and not w.is_punct and not w.like_num)] for doc in docs]