I'm having a problem with spaCy stop words. Any help would be appreciated. I load the TED Talk transcripts into a pandas DataFrame:
df['parsed_transcript'] = df['transcript'].apply(nlp)
# making a list of stop words to add
my_stop_words = ["thing", "people", "way", "year", " year " "time", "lot", "day"]
# adding the list to the stop words
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True
# filtering out stop words and all non-noun words
def preprocess_texts(texts_as_csv_column):
    """Takes a column from a pandas DataFrame and converts it into a list of nouns."""
    lemmas = []
    for doc in texts_as_csv_column:
        # Append the lemmas of all nouns that are not stop words
        lemma = [token.lemma_ for token in doc if token.pos_ == 'NOUN' and not token.is_stop]
        lemmas.append(lemma)
    return lemmas
Now, when I count the word "year", the count has dropped by about 4,000, but it still appears more than 8,000 times:
count = 0
for row in df['list_of_words']:
    for word in row:
        if word == "year":
            count += 1
print(count)
Some tokens are removed completely, some only partially, and some not at all. I tried adding leading and trailing whitespace, but that didn't help. Any ideas about what I might be doing wrong? Thanks.
Answer 0 (score: 0)
The code looks correct, except that you have "year" in my_stop_words twice, and there is no comma between the second instance and "time". The two adjacent string literals are therefore concatenated, and that entry is interpreted as " year time".
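The pitfall described above is Python's implicit concatenation of adjacent string literals. A minimal sketch, independent of spaCy, showing the buggy list versus the intended one:

```python
# Adjacent string literals with no comma between them are silently
# joined into a single string, so the list ends up one entry short.
buggy = ["thing", "people", "way", "year", " year " "time", "lot", "day"]
print(buggy[4])    # " year time" - two intended entries fused into one
print(len(buggy))  # 7 entries, although 8 were written

# With the comma restored (and the redundant padded duplicate dropped),
# each stop word is a separate entry:
my_stop_words = ["thing", "people", "way", "year", "time", "lot", "day"]
print(len(my_stop_words))  # 7 distinct stop words
```

Because of the fused entry, nlp.vocab[" year time"] sets is_stop on a lexeme for that whole padded string, which never matches a single token, so plain "year" tokens are not flagged as stop words.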