I am working on tokenization, lemmatization, and stop-word removal for documents. However, spaCy throws an error saying that `token.pos_` does not accept a `str`. I believe a string is the correct input here; please correct me if I am wrong. How do I resolve this error?
words = []
classes = []
documents = []
ignore_words = ['?']
# loop through each sentence in our training data
for pattern in training_data:
    # tokenize each word in the sentence
    w = gensim.utils.simple_preprocess(str(pattern['sentence']), deacc=True)
    # add to our words list
    words.extend(w)
    # add to documents in our corpus
    documents.append((w, pattern['class']))
    # add to our classes list
    if pattern['class'] not in classes:
        classes.append(pattern['class'])
nltk.download('stopwords')
stop_words = stopwords.words('english')
stop_words.extend(["be", "use", "fig"])
words = [word for word in words if word not in stop_words]
# stem and lower each word and remove duplicates
import en_core_web_lg
nlp = en_core_web_lg.load()
print(words[0:10])
words = [token.lemma_ for token in words if token.pos_ in postags]
words = list(set(words))
AttributeError Traceback (most recent call last)
<ipython-input-72-5c31e2b5a13c> in <module>()
26
27 from spacy import tokens
---> 28 words = [token.lemma_ for token in words if token.pos in postags]
29 words = list(set(words))
30
<ipython-input-72-5c31e2b5a13c> in <listcomp>(.0)
26
27 from spacy import tokens
---> 28 words = [token.lemma_ for token in words if token.pos in postags]
29 words = list(set(words))
30
AttributeError: 'str' object has no attribute 'pos'
Answer (score: 2)
Your code builds `words` like this:

words = [word for word in words if word not in stop_words]

So each `word` is a plain `str`, not a spaCy `Token` object. That is why you are seeing the error.
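A minimal sketch of the distinction (using `spacy.blank` so no model download is needed; with a blank pipeline `.pos_` is empty, but the attribute itself exists on `Token` objects and not on strings):

```python
import spacy

# Blank English pipeline: tokenizes only, no tagger/parser required
nlp = spacy.blank("en")

word = "running"
print(hasattr(word, "pos_"))   # False: a str has no .pos_ attribute

doc = nlp("running quickly")
token = doc[0]
print(hasattr(token, "pos_"))  # True: spaCy Token objects carry .pos_
```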
To fix this:
# Build the spaCy Doc object for the sentence
doc = nlp(pattern['sentence'])
# doc.sents is a generator, so it cannot be indexed directly;
# take the tokens of the first sentence
words = [w for w in next(doc.sents)]
# Now each token exposes .pos_ and .lemma_
toks = [token.lemma_ for token in words if token.pos_ in postags]