I am trying to tokenize a sentence and then remove the punctuation.
from nltk import word_tokenize
import re
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
sentence = "what's good people boy's"
tokens = word_tokenize(sentence)
tokens_nopunct = [word.lower() for word in tokens if re.search("\w",word)]
tokens_lemma = [lemmatizer.lemmatize(token) for token in tokens]
print(tokens_lemma)
This gives the output:
['what', "'s", 'good', 'people', 'boy', "'s"]
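But I want the output to be: ['what', 'good', 'people', 'boy']
I have been looking through nltk and its documentation, which says re.search is the way to remove punctuation, but it is not working. Is there anything else wrong in my code?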
Answer 0 (score: 1)
This removes every element that starts with a punctuation character (not just 's):
import string
punc = set(string.punctuation)
a = ['what', "'s", 'good', 'people', 'boy', "'s"]
without_punc = list(filter(lambda x: x[0] not in punc, a))
print(without_punc)  # ['what', 'good', 'people', 'boy']
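As for why the original attempt keeps "'s": re.search with \w only asks whether a token contains at least one word character, and the "s" in "'s" qualifies, so the token passes the filter. The question's code also lemmatizes tokens rather than tokens_nopunct, so the filtered list is never used. Below is a minimal sketch of a stricter filter that drops any token containing a punctuation character. The helper name strip_punct_tokens is made up for illustration, and the token list is hard-coded so the snippet runs without nltk data; in the real pipeline you would presumably pass the result of word_tokenize and then lemmatize the filtered list.

```python
import re
import string

tokens = ['what', "'s", 'good', 'people', 'boy', "'s"]

# The original filter: "'s" survives because its "s" matches \w.
kept_by_search = [tok for tok in tokens if re.search(r"\w", tok)]
print(kept_by_search)

punct = set(string.punctuation)

def strip_punct_tokens(tokens):
    """Hypothetical helper: keep only tokens with no punctuation character."""
    return [tok.lower() for tok in tokens
            if not any(ch in punct for ch in tok)]

clean = strip_punct_tokens(tokens)
print(clean)
```

Note that this is stricter than checking only the first character: it would also drop hyphenated words such as "well-known", so whether it fits depends on your data.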