Removing stop words and string.punctuation

Asked: 2017-08-04 22:15:20

Tags: python nltk punctuation

I can't figure out why this doesn't work:

import nltk
from nltk.corpus import stopwords
import string

with open('moby.txt', 'r') as f:
    moby_raw = f.read()
    stop = set(stopwords.words('english'))
    moby_tokens = nltk.word_tokenize(moby_raw)
    text_no_stop_words_punct = [t for t in moby_tokens if t not in stop or t not in string.punctuation]

    print(text_no_stop_words_punct)

Looking at the output, I get this:

[...';', 'surging', 'from', 'side', 'to', 'side', ';', 'spasmodically', 'dilating', 'and', 'contracting',...]

It seems the punctuation (and the stop words) are still there. What am I doing wrong?

3 answers:

Answer 0 (score: 7)

It has to be and, not or:

if t not in stop and t not in string.punctuation

Or:

if not (t in stop or t in string.punctuation):

Or:

all_stops = stop | set(string.punctuation)
if t not in all_stops:

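The last solution is the fastest: each token needs only a single set lookup, whereas t not in string.punctuation is a substring search on a str.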
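As a rough illustration of the combined-set variant, here is a minimal sketch that runs without NLTK. The stop set is a tiny hypothetical stand-in for set(stopwords.words('english')), and the token list is copied from the output shown in the question; neither is meant to reproduce the real corpus.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')); an assumption
# made only so the snippet runs without downloading the NLTK corpus.
stop = {'from', 'to', 'and'}

# Tokens copied from the output shown in the question.
moby_tokens = [';', 'surging', 'from', 'side', 'to', 'side', ';',
               'spasmodically', 'dilating', 'and', 'contracting']

# Merge stop words and punctuation characters once, up front,
# so that filtering needs one set lookup per token.
all_stops = stop | set(string.punctuation)

filtered = [t for t in moby_tokens if t not in all_stops]
print(filtered)
# ['surging', 'side', 'side', 'spasmodically', 'dilating', 'contracting']
```

Note that set(string.punctuation) holds single characters only, so a multi-character token such as '--' would not be removed by this check.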

Answer 1 (score: 4)

In this line, try changing or to and, so that your list only contains words that are neither stop words nor punctuation:

text_no_stop_words = [t for t in moby_tokens if t not in stop or t not in string.punctuation]

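Answer 2 (score: 1)

Close. You need and instead of or in the comparison. With or, a punctuation mark such as ";" is not in stop, so the first condition is already true and Python never checks whether the token is in string.punctuation; the token is kept.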

text_no_stop_words_punct = [t for t in moby_tokens if t not in stop and t not in string.punctuation]
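To make the difference concrete, the sketch below evaluates both conditions for three sample tokens. The stop set is again a small hypothetical stand-in for the NLTK stop word list, used here only so the snippet is self-contained.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')).
stop = {'from', 'to', 'and'}
tokens = [';', 'from', 'side']

for t in tokens:
    not_stop = t not in stop
    not_punct = t not in string.punctuation
    # "or" is true as soon as either test passes; "and" needs both.
    print(repr(t), not_stop, not_punct, not_stop or not_punct, not_stop and not_punct)

kept_with_or = [t for t in tokens if t not in stop or t not in string.punctuation]
kept_with_and = [t for t in tokens if t not in stop and t not in string.punctuation]

print(kept_with_or)   # [';', 'from', 'side']
print(kept_with_and)  # ['side']
```

With or, a token is dropped only if it is both a stop word and punctuation at the same time, which practically never happens, so nothing is filtered out.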