Question

I am trying to tokenize a sentence and then remove the punctuation.

from nltk import word_tokenize
from nltk import re
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
sentence = "what's good people boy's"


tokens = word_tokenize(sentence)
tokens_nopunct = [word.lower() for word in tokens if re.search("\w",word)]
tokens_lemma = [lemmatizer.lemmatize(token) for token in tokens]

print(tokens_lemma)

This gives the output:

['what', "'s", 'good', 'people', 'boy', "'s"]

But I want it to produce the output: ['what', 'good', 'people', 'boy']

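I have been looking through NLTK and its documentation, which says re.search is the way to remove punctuation, but it isn't working. Is there anything else wrong in my code?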

Answer 1

This removes every element that begins with a punctuation character (not just 's):

import string

punc = set(string.punctuation)
a = ['what', "'s", 'good', 'people', 'boy', "'s"]
without_punc = list(filter(lambda x: x[0] not in punc, a))
print(without_punc)  # ['what', 'good', 'people', 'boy']
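
As for why the filter in the question has no effect, here is a minimal sketch using only the standard library. It assumes the token list is exactly the one shown in the question's output, and the names kept_by_search and kept_by_fullmatch are made up for illustration. re.search(r"\w", token) succeeds if the token contains a word character anywhere, so "'s" passes because of the "s". Requiring the whole token to consist of word characters, for example with re.fullmatch, is one way to drop it.

```python
import re

# Tokens as shown in the question's output (assumed to be what word_tokenize returned).
tokens = ['what', "'s", 'good', 'people', 'boy', "'s"]

# re.search(r"\w", ...) keeps any token containing at least one word character,
# so "'s" survives because of the "s".
kept_by_search = [t.lower() for t in tokens if re.search(r"\w", t)]

# re.fullmatch(r"\w+", ...) requires every character to be a word character,
# which drops "'s".
kept_by_fullmatch = [t.lower() for t in tokens if re.fullmatch(r"\w+", t)]

print(kept_by_search)
print(kept_by_fullmatch)
```

Note also that the question's code lemmatizes tokens rather than tokens_nopunct, so whatever the filter removes is put back in the final list. The lemmatizer would need to iterate over the filtered list for the filtering to show up in the output.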
