I have training text and test text. What I want to do is train a language model on the training data and use it to compute the perplexity of the test data.
Here is my code:
import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk import word_tokenize, sent_tokenize
fileTest = open("AaronPressman.txt", "r")
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len=n, max_len=n)
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = Laplace(n)
model.fit(train_data, padded_sents)
print(model.perplexity(trainTest))
When I run this code with n = 1 (unigrams), I get "1068.332393940235". With n = 2 (bigrams), I get "1644.3441077259993", and with trigrams I get 2552.2085752565313.
What is the problem?
Answer 0 (score: 1)
The way you create the test data is wrong: the training data is lowercased, but the test data is not converted to lowercase, and the start and end tokens are missing from the test data. Try this:
import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize
"""
fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"
n = 2
# Tokenize the training text (lowercased).
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
# Tokenize the test text the same way, so it is also lowercased and padded.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(textTest)]
# Discard the second return value so the training padded_sents is not overwritten.
test_data, _ = padded_everygram_pipeline(n, tokenized_text)
model = Laplace(n)
model.fit(train_data, padded_sents)
s = 0
for i, test in enumerate(test_data):
    p = model.perplexity(test)
    s += p
print("Perplexity: {0}".format(s / (i + 1)))
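The fix works because padded_everygram_pipeline applies the same <s>/</s> padding and n-gram extraction to the test sentences that the training pipeline applies. A minimal sketch (using made-up tokens, not the files above) of what that padding and extraction produce:

```python
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import everygrams

# Pad a tokenized sentence for a bigram (n=2) model:
# n-1 = 1 start symbol and 1 end symbol are added.
tokens = ["this", "is", "a", "cat"]
padded = list(pad_both_ends(tokens, n=2))
print(padded)  # ['<s>', 'this', 'is', 'a', 'cat', '</s>']

# everygrams then yields every n-gram up to max_len:
# here 6 unigrams + 5 bigrams = 11 n-grams in total.
ngrams = list(everygrams(padded, max_len=2))
print(len(ngrams))  # 11
```

Without this padding (and without lowercasing), many test n-grams fall outside the model's vocabulary, which inflates the perplexity the way the question observed.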