Question

I have the following sentence:

sentence="The other day I met with Juan and Mary"

I want to tokenize it but keep only the main words, that is: other, day, I, met, Juan, Mary. So far I have tokenized it with the nltk library as follows:

tokens=nltk.word_tokenize(sentence)

This gives me the following:

['The', 'other', 'day', 'I', 'met', 'with', 'Juan', 'and', 'Mary']

I have also tried tagging the words with nltk.pos_tag(tokens), which gives:

[('The', 'DT'), ('other', 'JJ'), ('day', 'NN'), ('I', 'PRP'), ('met', 'VBD'), ('with', 'IN'), ('Juan', 'NNP'), ('and', 'CC'), ('Mary', 'NNP')]

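With this I could remove the words I do not want myself; it is as simple as searching for the tags and deleting the corresponding tuples. However, I would like to know whether there is a more direct way to do this, or whether nltk has a command that does it on its own.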
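The manual approach described above can be sketched roughly as follows. This is only an illustration: the tagged list is copied from the nltk.pos_tag(tokens) output shown earlier so the snippet runs without nltk, and EXCLUDED_TAGS and keep_main_words are hypothetical names; which Penn Treebank tags to drop (here DT, IN and CC) is an assumption.

```python
# Tagged tokens copied from the nltk.pos_tag(tokens) output shown above.
tagged = [('The', 'DT'), ('other', 'JJ'), ('day', 'NN'), ('I', 'PRP'),
          ('met', 'VBD'), ('with', 'IN'), ('Juan', 'NNP'), ('and', 'CC'),
          ('Mary', 'NNP')]

# Assumed set of tags to discard: determiners (DT), prepositions (IN)
# and coordinating conjunctions (CC). Adjust as needed.
EXCLUDED_TAGS = {'DT', 'IN', 'CC'}


def keep_main_words(tagged_tokens, excluded=EXCLUDED_TAGS):
    """Return the words whose POS tag is not in the excluded set."""
    return [word for word, tag in tagged_tokens if tag not in excluded]


main_words = keep_main_words(tagged)
print(main_words)  # ['other', 'day', 'I', 'met', 'Juan', 'Mary']
```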

Any help would be appreciated! Thank you very much.

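Edit: this post is not only about removing stop words. It is about the different options available beyond doing it by hand with nltk.pos_tag(tokens) as described above.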

Answer 1

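As @BoarGules said in the comments, it looks like you want to remove stop words from the sentence and are looking for a direct way to do it. Here is a solution for that.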

Check this out:

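import nltk
from stop_words import get_stop_words   # third-party package: pip install stop-words
from nltk.corpus import stopwords       # needs the nltk 'stopwords' corpus to be downloaded

stop_words = list(get_stop_words('en'))         # Have around 900 stopwords
nltk_words = list(stopwords.words('english'))   # Have around 150 stopwords
stop_words.extend(nltk_words)                   # combine both lists

sentence = "The other day I met with Juan and Mary"   # Your sentence
tokens = nltk.word_tokenize(sentence)                 # needs the nltk tokenizer data to be downloaded
output = []

# Keep only the tokens that do not appear in the combined stop-word list
for word in tokens:
    if word not in stop_words:
        output.append(word)

print(output)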

It prints the following.

Output:

['The', 'day', 'I', 'met', 'Juan', 'Mary']
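'The' and 'I' presumably survive in this output because the membership test is case-sensitive while the stop-word lists are lowercase. A minimal sketch of a case-insensitive variant follows. To keep it runnable without nltk data, the tokens are copied from the question, and SAMPLE_STOP_WORDS is a small hypothetical stand-in for the combined list, not the real one.

```python
# Tokens copied from the nltk.word_tokenize output shown in the question.
tokens = ['The', 'other', 'day', 'I', 'met', 'with', 'Juan', 'and', 'Mary']

# Hypothetical stand-in for the combined stop-word list (all lowercase).
SAMPLE_STOP_WORDS = {'the', 'other', 'i', 'with', 'and'}

# Lowercase each token only for the comparison, so the original casing
# of the kept words is preserved.
filtered = [t for t in tokens if t.lower() not in SAMPLE_STOP_WORDS]
print(filtered)  # ['day', 'met', 'Juan', 'Mary']
```

With a case-insensitive check, 'I' and 'other' are dropped as well. The question wanted to keep both, so stop-word removal alone may not give exactly the desired list.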

Hope this helps! Thanks! :)

How to exclude prepositions and conjunctions when tokenizing with nltk?