Question

train-unigram pseudocode

create a map counts
create a variable total_count = 0
for each line in the training_file
    split line into an array of words
    append “</s>” to the end of words
    for each word in words
        add 1 to counts[word]
        add 1 to total_count
open the model_file for writing
for each word, count in counts
    probability = counts[word]/total_count
    print word, probability to model_file
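
One possible Python rendering of the training pseudocode above, as a minimal sketch rather than a reference solution. The function names `train_unigram` and `write_model`, the tab-separated model format, and the toy sentences are assumptions made for illustration; none of them come from the original post.

```python
from collections import Counter


def train_unigram(lines):
    """Estimate unigram probabilities by relative frequency.

    `lines` is any iterable of whitespace-tokenized sentences.
    """
    counts = Counter()
    total_count = 0
    for line in lines:
        words = line.split()
        words.append("</s>")  # sentence-end symbol, as in the pseudocode
        for word in words:
            counts[word] += 1
            total_count += 1
    return {word: count / total_count for word, count in counts.items()}


def write_model(probabilities, model_path):
    """Write one "word<TAB>probability" pair per line (assumed format)."""
    with open(model_path, "w", encoding="utf-8") as model_file:
        for word, probability in sorted(probabilities.items()):
            model_file.write(f"{word}\t{probability}\n")


# Toy training data (hypothetical): 8 tokens in total, including two "</s>".
model = train_unigram(["a b c", "a b d"])
print(model["a"], model["c"], model["</s>"])
```

With real data, the list of toy sentences would be replaced by an open training file, since iterating over a file object yields its lines.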

test-unigram pseudocode

λ_1 = 0.95, λ_unk = 1 - λ_1, V = 1000000, W = 0, H = 0, unk = 0
Load Model
    create a map probabilities
    for each line in model_file
        split line into w and P
        set probabilities[w] = P
Test and Print
    for each line in test_file
        split line into an array of words
        append “</s>” to the end of words
        for each w in words
            add 1 to W
            set P = λ_unk / V
            if probabilities[w] exists
                set P += λ_1 * probabilities[w]
            else
                add 1 to unk
            add -log_2 P to H
    print “entropy = ” + H/W
    print “coverage = ” + (W-unk)/W
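
The evaluation pseudocode above could be sketched in Python roughly as follows. The function name `evaluate_unigram`, its parameter names, and the hand-written `toy_model` are assumptions for illustration only; the constants 0.95 and 1000000 are the ones given in the pseudocode.

```python
import math


def evaluate_unigram(probabilities, lines, lambda_1=0.95, vocab_size=1_000_000):
    """Return (entropy in bits per word, coverage) on the given test lines.

    Each word probability is interpolated with a uniform distribution over
    `vocab_size` words, so unknown words never get probability zero.
    """
    lambda_unk = 1 - lambda_1
    total_words = 0       # W in the pseudocode
    unknown_words = 0     # unk in the pseudocode
    neg_log_likelihood = 0.0  # H in the pseudocode

    for line in lines:
        words = line.split() + ["</s>"]
        for w in words:
            total_words += 1
            p = lambda_unk / vocab_size
            if w in probabilities:
                p += lambda_1 * probabilities[w]
            else:
                unknown_words += 1
            neg_log_likelihood += -math.log2(p)

    if total_words == 0:
        raise ValueError("test data is empty")
    entropy = neg_log_likelihood / total_words
    coverage = (total_words - unknown_words) / total_words
    return entropy, coverage


# Hypothetical model, written by hand so the example stands alone.
toy_model = {"a": 0.25, "b": 0.25, "c": 0.125, "d": 0.125, "</s>": 0.25}
entropy, coverage = evaluate_unigram(toy_model, ["a c", "a e"])
print(f"entropy = {entropy:.4f}")
print(f"coverage = {coverage:.4f}")
```

In this toy run the test data has six tokens (counting the two `</s>` symbols) and only `e` is unknown, so coverage is 5/6. To load a model from disk instead, each line of the model file would be split into a word and a probability, and the probability converted with `float` before use.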

Note: the appended token is not visible in the rendered post, so read the append steps above as appending “</s>”.

Build, train, and test a unigram model, and compute entropy and coverage

0 answers: