NLTK re.search issue

Time: 2018-12-07 18:02:28

Tags: python python-3.x nltk

I am trying to tokenize a sentence and then remove the punctuation.

from nltk import word_tokenize
import re
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
sentence = "what's good people boy's"


tokens = word_tokenize(sentence)
tokens_nopunct = [word.lower() for word in tokens if re.search(r"\w", word)]
tokens_lemma = [lemmatizer.lemmatize(token) for token in tokens]

print(tokens_lemma)

This gives the output:

['what', "'s", 'good', 'people', 'boy', "'s"]

But I want it to produce the output: ['what', 'good', 'people', 'boy']

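I have been looking through NLTK and its documentation, which says re.search is the way to remove punctuation, but it is not working. Is there anything else wrong in my code?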

1 answer:

Answer 0: (score: 1)

This removes every element that begins with a punctuation character (not just 's):

import string

punc = set(string.punctuation)
a = ['what', "'s", 'good', 'people', 'boy', "'s"]
without_punc = list(filter(lambda x: x[0] not in punc, a))
print(without_punc)      # ['what', 'good', 'people', 'boy']
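
As a side note on why the code in the question appears to do nothing: the lemmatizer loops over tokens rather than tokens_nopunct, so the filtered list is never used. Even if it were, re.search looks for a match anywhere in the string, and "'s" contains the word character s, so it would pass the test anyway. Below is a minimal sketch of a stricter filter using re.fullmatch. The helper name keep_word_tokens is hypothetical, and the token list is hard-coded from the question's output so the sketch runs without NLTK or its data files.

```python
import re

# Tokens as shown in the question's output, hard-coded so that
# this sketch does not need NLTK or its downloaded data.
tokens = ["what", "'s", "good", "people", "boy", "'s"]


def keep_word_tokens(tokens):
    """Return lowercased tokens that consist only of word characters.

    Hypothetical helper for illustration; not part of NLTK.
    """
    return [tok.lower() for tok in tokens if re.fullmatch(r"\w+", tok)]


# re.search matches anywhere in the string, so "'s" passes the original test.
assert re.search(r"\w", "'s") is not None
# re.fullmatch requires the whole token to match, so "'s" is rejected.
assert re.fullmatch(r"\w+", "'s") is None

tokens_nopunct = keep_word_tokens(tokens)
print(tokens_nopunct)  # ['what', 'good', 'people', 'boy']
```

In the question's code, the filtered list would then be the one passed to the lemmatizer, for example [lemmatizer.lemmatize(token) for token in tokens_nopunct]. Note that this pattern also drops tokens that mix letters and punctuation, such as hyphenated words, which may or may not be what you want.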