Question

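I am trying to load data from NLTK's Gutenberg corpus. I build the vocabulary from the corpus, excluding punctuation, and use it to create a word-to-integer mapping dictionary. However, when I later parse the sentences and try to apply the mapping, I get a KeyError because it tries to look up '"*' in the dictionary.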

from nltk.corpus import gutenberg
import string

def strip_punctuation(sentence):
    return [word.lower() for word in sentence if word not in string.punctuation]

def build_mapping(vocab):
    word_to_int = {}
    for i, word in enumerate(vocab):
        word_to_int[word] = i
    return word_to_int

vocab = set()
for fileid in gutenberg.fileids():
    words = [w.lower() for w in gutenberg.words(fileid) if w not in string.punctuation]
    vocab = vocab.union(words)

word_to_int = build_mapping(vocab)

for fileid in gutenberg.fileids():
    for sentence in gutenberg.sents(fileid):
        sentence = strip_punctuation(sentence)
        for word in sentence:
            x = word_to_int[word] #KeyError: '"*'

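I understand why this particular combination of symbols is not caught when punctuation is removed, but since I strip punctuation from the words and from the sentences in the same way, I am confused about why it shows up in the sentences but not in the vocabulary. For now I check whether each token is in the vocabulary before applying the mapping, but I would like to know whether there is a better way to remove punctuation so that I can avoid the if statement.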
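For reference, the membership test used above can be sketched in isolation. Because `string.punctuation` is a single string, `word not in string.punctuation` is a substring test, so a multi-character token is only filtered out if it happens to appear as a contiguous run in that string. The helper name `is_all_punctuation` is hypothetical and not part of the original code; it is one possible per-character check, not the only one.

```python
import string

def is_all_punctuation(token):
    """Return True if every character of a non-empty token is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# A single punctuation character is found by the substring test.
print('"' in string.punctuation)

# '"*' is not a contiguous run inside string.punctuation, so the
# substring test does not treat it as punctuation.
print('"*' in string.punctuation)

# Checking character by character classifies it as punctuation.
print(is_all_punctuation('"*'))
```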

Answer 1

You can do something like this.

For Python 3:

sentence = "I can't deal ';with it!**"
characters_to_get_rid_of = ".,':;*!?" #define all characters you don't want
sentence = sentence.translate(str.maketrans("","",characters_to_get_rid_of))
print(sentence)

For Python 2:

sentence = "I can't deal ';with it!**"
characters_to_get_rid_of = ".,':;*!?" #define all characters you don't want
sentence = sentence.translate(None,characters_to_get_rid_of)
print sentence

Result:

I cant deal with it
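The same idea can be applied to the token lists that NLTK returns rather than to a raw string. The following is a minimal sketch for Python 3, assuming you want to delete every character in `string.punctuation` and drop tokens that end up empty; `normalize_tokens` and `_STRIP_TABLE` are hypothetical names introduced here for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_TABLE = str.maketrans("", "", string.punctuation)

def normalize_tokens(tokens):
    """Strip punctuation characters, lowercase, and drop tokens left empty."""
    cleaned = (tok.translate(_STRIP_TABLE).lower() for tok in tokens)
    return [tok for tok in cleaned if tok]

# A hand-written token list standing in for one sentence from the corpus.
sample = ["I", "can", "'", "t", "deal", '"*', "with", "it", "!"]
print(normalize_tokens(sample))
```

If the same function is used both when building the vocabulary and when processing each sentence, tokens such as '"*' should disappear from both. If the two corpus views are tokenized differently, `word_to_int.get(word)` remains a safe fallback for the lookup.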
