Why is the perplexity infinite for an nltk.lm bigram model when fitted with the padded vocabulary?

Time: 2019-03-05 09:40:31

Tags: python nltk

I am testing the perplexity measure of a language model on a text:

  train_sentences = nltk.sent_tokenize(train_text)
  test_sentences = nltk.sent_tokenize(test_text)

  train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in train_sentences]

  test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

  from nltk.lm.preprocessing import padded_everygram_pipeline
  from nltk.lm import MLE,Laplace
  from nltk.lm import Vocabulary

  vocab = Vocabulary(nltk.tokenize.word_tokenize(train_text), 1)

  n = 2
  print(train_tokenized_text)
  print(len(train_tokenized_text))
  train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)

  # print(list(vocab),"\n >>>>",list(padded_vocab))
  model = MLE(n)  # Train a bigram (n = 2) maximum likelihood estimation model.
  # model.fit(train_data, padded_vocab)
  model.fit(train_data, vocab)

  sentences = test_sentences
  print("len: ",len(sentences))
  print("per all", model.perplexity(test_text)) 

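When I fit the model with vocab, i.e. model.fit(train_data, vocab), the perplexity printed by print("per all", model.perplexity(test_text)) is a number (30.2). If I fit it with padded_vocab instead, which additionally contains <s> and </s>, it prints inf.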

1 answer:

Answer 0 (score: 1)

The input to perplexity is text represented as n-grams, not a list of strings. You can verify this by running:

for x in test_text:
    print ([((ngram[-1], ngram[:-1]),model.score(ngram[-1], ngram[:-1])) for ngram in x])

You should see that the tokens (n-grams) are all wrong.
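For the bigram case in the question, a corrected evaluation might look like the sketch below. It is a minimal illustration under assumptions, not the asker's exact setup: the two training sentences are toy data, and str.split stands in for nltk.tokenize.word_tokenize so that no tokenizer data needs to be downloaded. The key point is that the test sentence is padded and converted to bigram tuples before it is passed to perplexity.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import bigrams

n = 2

# Toy data (assumption): pre-tokenized with str.split instead of word_tokenize.
train_tokenized_text = [s.lower().split() for s in ["an apple", "an orange"]]
test_tokens = "an apple".lower().split()

# Fit on the padded vocabulary so that <s> and </s> are known to the model.
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
model.fit(train_data, padded_vocab)

# Pad the test sentence and turn it into bigram tuples.
test_bigrams = list(bigrams(pad_both_ends(test_tokens, n=n)))
print(test_bigrams)
# [('<s>', 'an'), ('an', 'apple'), ('apple', '</s>')]

bigram_perplexity = model.perplexity(test_bigrams)
print("per all", bigram_perplexity)
```

With this toy data the three bigrams have probabilities 1, 0.5 and 1, so the perplexity is 2 ** (1/3), roughly 1.26.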

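You will still get inf for the perplexity if your test data contains words that are out of vocabulary (with respect to the training data):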

import nltk

# Toy data, already split into sentences
train_sentences = ['an apple', 'an orange']
test_sentences = ['an apple']

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in train_sentences]

test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE,Laplace
from nltk.lm import Vocabulary

n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
# Fit on the padded vocab so that the model knows the extra tokens added to the vocab (<s>, </s>, UNK, etc.)
model.fit(train_data, padded_vocab) 

test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all", model.perplexity(test))

# out of vocab test data
test_sentences = ['an ant']
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all [oov]", model.perplexity(test))
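To see why a single out-of-vocabulary word is enough to produce inf, the unigram case above can be recomputed by hand. The sketch below is a hand-rolled illustration in plain Python, not NLTK's implementation: the function name unigram_perplexity and the add_one flag are hypothetical, and the add-one branch only illustrates the idea behind the Laplace class that the code above imports but does not use.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, test_tokens, add_one=False):
    """Perplexity of test_tokens under a unigram model built from train_tokens."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    # Assumption: the vocabulary is every seen word plus one slot for unknown words.
    vocab_size = len(counts) + 1
    log_probs = []
    for token in test_tokens:
        count = counts[token]  # 0 for a token never seen in training
        if add_one:
            prob = (count + 1) / (total + vocab_size)
        else:
            prob = count / total  # maximum likelihood estimate
        # log2(0) is undefined, so treat it as negative infinity.
        log_probs.append(math.log2(prob) if prob > 0 else float("-inf"))
    entropy = -sum(log_probs) / len(log_probs)
    return 2.0 ** entropy


train = ["an", "apple", "an", "orange"]
print(unigram_perplexity(train, ["an", "apple"]))              # 2 ** 1.5, about 2.83
print(unigram_perplexity(train, ["an", "ant"]))                # inf
print(unigram_perplexity(train, ["an", "ant"], add_one=True))  # finite
```

The unseen word gets probability 0 under maximum likelihood, its log probability is negative infinity, and that one term drives the average, and therefore the perplexity, to infinity. Any smoothing that gives unseen words a non-zero probability keeps the result finite.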