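I am trying to load data from NLTK's Gutenberg corpus. I build a vocabulary from the corpus, excluding punctuation, and use it to create a dictionary that maps each word to an integer. However, when I later parse the sentences and try to apply the mapping, I get a KeyError because the code tries to look up '"*' in the dictionary.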
from nltk.corpus import gutenberg
import string

def strip_punctuation(sentence):
    return [word.lower() for word in sentence if word not in string.punctuation]

def build_mapping(vocab):
    word_to_int = {}
    for i, word in enumerate(vocab):
        word_to_int[word] = i
    return word_to_int

vocab = set()
for fileid in gutenberg.fileids():
    words = [w.lower() for w in gutenberg.words(fileid) if w not in string.punctuation]
    vocab = vocab.union(words)

word_to_int = build_mapping(vocab)

for fileid in gutenberg.fileids():
    for sentence in gutenberg.sents(fileid):
        sentence = strip_punctuation(sentence)
        for word in sentence:
            x = word_to_int[word]  # KeyError: '"*'
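I understand why this combination of symbols is not caught when punctuation is removed. What confuses me is that it appears in the sentences but not in the vocabulary, even though I strip punctuation from the words and from the sentences in the same way. For now, I check whether each token is in the vocabulary before applying the mapping, but I would like to know whether there is a better way to remove punctuation so that I can avoid the if statement.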
Answer 0 (score: 0)
You can do something like the following.
For Python 3:
sentence = "I can't deal ';with it!**"
characters_to_get_rid_of = ".,':;*!?" #define all characters you don't want
sentence = sentence.translate(str.maketrans("","",characters_to_get_rid_of))
print(sentence)
For Python 2:
sentence = "I can't deal ';with it!**"
characters_to_get_rid_of = ".,':;*!?" #define all characters you don't want
sentence = sentence.translate(None,characters_to_get_rid_of)
print sentence
Result:
I cant deal with it
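Applied to the tokenized corpus in the question, the same idea might look like the sketch below. It rests on two assumptions that the thread does not confirm. First, `word not in string.punctuation` is a substring test, so a multi-character token such as '"*' passes through it. Second, `gutenberg.words` and `gutenberg.sents` may split the text into tokens differently, which would explain why a token shows up in one and not the other. The helper `clean_tokens`, the changed signature of `build_mapping`, and the sample `sentences` list are hypothetical stand-ins, so the snippet runs without downloading the corpus.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tokens(tokens):
    """Lowercase tokens, strip punctuation characters, drop empty results."""
    cleaned = (tok.lower().translate(_PUNCT_TABLE) for tok in tokens)
    return [tok for tok in cleaned if tok]

def build_mapping(sentences):
    """Assign an integer to every cleaned word seen in the sentences."""
    word_to_int = {}
    for sentence in sentences:
        for word in clean_tokens(sentence):
            # setdefault only adds the word if it is not mapped yet.
            word_to_int.setdefault(word, len(word_to_int))
    return word_to_int

# Hypothetical tokenized sentences standing in for gutenberg.sents(fileid).
sentences = [["He", "said", ",", '"*', "Stop", '!"'], ["Can't", "stop", "."]]

word_to_int = build_mapping(sentences)
encoded = [[word_to_int[w] for w in clean_tokens(s)] for s in sentences]
print(encoded)
```

Because the vocabulary is built from the same cleaned tokens that are later looked up, every lookup should succeed and no membership check is needed before applying the mapping. With the real corpus, `sentences` would be replaced by the output of `gutenberg.sents(fileid)` for each file.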