I am working on tokenization, lemmatization, and stop-word removal for documents. However, spaCy throws an error saying that `token.pos_` does not accept a `str`. I believe a string is the correct input here; please correct me if I am wrong. How do I resolve this error?
words = []
classes = []
documents = []
ignore_words = ['?']
# loop through each sentence in our training data
for pattern in training_data:
    # tokenize each word in the sentence
    w = gensim.utils.simple_preprocess(str(pattern['sentence']), deacc=True)
    # add to our words list
    words.extend(w)
    # add to documents in our corpus
    documents.append((w, pattern['class']))
    # add to our classes list
    if pattern['class'] not in classes:
        classes.append(pattern['class'])
nltk.download('stopwords')
stop_words = stopwords.words('english')
stop_words.extend(["be", "use", "fig"])
words = [word for word in words if word not in stop_words]
# stem and lower each word and remove duplicates
import en_core_web_lg
nlp = en_core_web_lg.load()
print(words[0:10])
words = [token.lemma_ for token in words if token.pos_ in postags]
words = list(set(words))
AttributeError Traceback (most recent call last)
<ipython-input-72-5c31e2b5a13c> in <module>()
26
27 from spacy import tokens
---> 28 words = [token.lemma_ for token in words if token.pos in postags]
29 words = list(set(words))
30
<ipython-input-72-5c31e2b5a13c> in <listcomp>(.0)
26
27 from spacy import tokens
---> 28 words = [token.lemma_ for token in words if token.pos in postags]
29 words = list(set(words))
30
AttributeError: 'str' object has no attribute 'pos'
Answer (score: 2)
Your code builds `words` like this:

words = [word for word in words if word not in stop_words]

So each `word` is a plain `str`, not a spaCy `Token` object. That is why you are seeing the error.
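A minimal sketch of the distinction (using `spacy.blank` so no model download is needed; with a blank pipeline `.pos_` is empty, but the attribute itself exists on `Token` objects and not on strings):

```python
import spacy

# Blank English pipeline: tokenizes only, no tagger/parser required
nlp = spacy.blank("en")

word = "running"
print(hasattr(word, "pos_"))   # False: a str has no .pos_ attribute

doc = nlp("running quickly")
token = doc[0]
print(hasattr(token, "pos_"))  # True: spaCy Token objects carry .pos_
```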
To fix this:
# Build the spaCy Doc object for the sentence
doc = nlp(pattern['sentence'])
# doc.sents is a generator, so it cannot be indexed directly;
# take the tokens of the first sentence
words = [w for w in next(doc.sents)]
# Now each token exposes .pos_ and .lemma_
toks = [token.lemma_ for token in words if token.pos_ in postags]