NLTK: word appears in the sentences but not in the vocabulary

Time: 2019-02-10 07:11:06

Tags: python-3.x nltk

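I am trying to load data from NLTK's Gutenberg corpus. I load the vocabulary from the corpus, excluding punctuation, and use it to build a dictionary that maps words to integers. However, when I later parse the sentences and try to apply the mapping, I get a KeyError because it tries to look up '"*' in the dictionary.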

from nltk.corpus import gutenberg
import string

def strip_punctuation(sentence):
    return [word.lower() for word in sentence if word not in string.punctuation]

def build_mapping(vocab):
    word_to_int = {}
    for i, word in enumerate(vocab):
        word_to_int[word] = i
    return word_to_int

vocab = set()
for fileid in gutenberg.fileids():
    words = [w.lower() for w in gutenberg.words(fileid) if w not in string.punctuation]
    vocab = vocab.union(words)

word_to_int = build_mapping(vocab)

for fileid in gutenberg.fileids():
    for sentence in gutenberg.sents(fileid):
        sentence = strip_punctuation(sentence)
        for word in sentence:
            x = word_to_int[word] #KeyError: '"*'

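I understand why this combination of symbols is not caught when punctuation is removed, but since I use the same punctuation-stripping approach for both the words and the sentences, I am confused as to why it shows up in the sentences but not in the vocabulary. For now I check whether the token is in the vocabulary before applying the mapping, but I would like to know whether there is a better way to strip punctuation so that I can avoid the if statement.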
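To illustrate the first point: `word not in string.punctuation` is a substring test against one long string, so a multi-character token such as '"*' passes through unless it happens to be a contiguous run of that string. The sketch below contrasts it with a per-character check. The helper name `is_punctuation` and the sample token list are made up for illustration and are not part of NLTK.

```python
import string

PUNCT = set(string.punctuation)

def is_punctuation(token):
    """Return True when every character of the token is ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)

# Substring test used in the question: '"*' is not a contiguous run inside
# string.punctuation, so the token is NOT filtered out.
survives_substring_test = '"*' not in string.punctuation

# Hypothetical tokens, standing in for one tokenized corpus sentence.
tokens = ["Hello", ",", '"*', "world", "!"]
cleaned = [t.lower() for t in tokens if not is_punctuation(t)]
print(survives_substring_test, cleaned)
```

This only addresses tokens made up entirely of punctuation; it does not explain why the token list from `gutenberg.sents` differs from the one from `gutenberg.words`.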

1 answer:

Answer 0 (score: 0)

You can do something like this:

For Python 3:

sentence = "I can't deal ';with it!**"
characters_to_get_rid_of = ".,':;*!?" #define all characters you don't want
sentence = sentence.translate(str.maketrans("","",characters_to_get_rid_of))
print(sentence)

For Python 2:

sentence = "I can't deal ';with it!**"
characters_to_get_rid_of = ".,':;*!?" #define all characters you don't want
sentence = sentence.translate(None,characters_to_get_rid_of)
print sentence

Result:

I cant deal with it
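One way to carry this over to the tokenized corpus is sketched below, assuming it is acceptable to delete punctuation characters from inside tokens as well. It applies the same `str.translate` table to each token, drops tokens that end up empty, and builds the vocabulary from the same cleaned sentences that are later encoded, so every lookup is guaranteed to succeed without an if statement. The `sentences` list is a hypothetical stand-in for `gutenberg.sents(fileid)`, and `clean_tokens` is a name made up for this example.

```python
import string

# Translation table that deletes every ASCII punctuation character.
STRIP_TABLE = str.maketrans("", "", string.punctuation)

def clean_tokens(tokens):
    """Lowercase tokens, strip punctuation characters, drop empty results."""
    stripped = (tok.translate(STRIP_TABLE).lower() for tok in tokens)
    return [tok for tok in stripped if tok]

# Hypothetical stand-in for the output of gutenberg.sents(fileid).
sentences = [
    ["I", "can", "'", "t", "deal", '"*', "with", "it", "!"],
    ["Deal", "with", "it", "."],
]

# Clean once, then derive both the vocabulary and the encoding from the
# same cleaned data so the two can never disagree.
cleaned = [clean_tokens(s) for s in sentences]
vocab = sorted({word for s in cleaned for word in s})
word_to_int = {word: i for i, word in enumerate(vocab)}
encoded = [[word_to_int[word] for word in s] for s in cleaned]
print(encoded)
```

Sorting the vocabulary before enumerating it keeps the word-to-integer mapping stable between runs, which iterating over a plain set does not guarantee.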