I know that for k-fold cross-validation I should split the corpus into k equal parts. Of these k parts, k-1 are used for training and the remaining part for testing. This process is repeated k times so that every part is used for testing exactly once.
But I don't understand what "training" and "testing" actually mean here.
My idea is (please correct me if I'm wrong):
1. Training set (k-1 of the k parts): these sets are used to build the tag transition probability and emission probability tables. A tagging algorithm (e.g. the Viterbi algorithm) is then applied using these probability tables (a rough sketch of this counting step follows the list).
2. Test set (the remaining 1 part): the remaining set is used to validate the implementation from step 1. That is, this part of the corpus is treated as untagged words, and I should run the step-1 implementation on it.
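To make step 1 concrete, here is a rough sketch of what I imagine the training step computes (the helper name count_probabilities and the [(word, tag), ...] sentence format are just my own illustration, not part of any particular library):

from collections import defaultdict

def count_probabilities(tagged_sents):
    """Estimate tag transition and word emission probabilities by
    counting over tagged sentences of the form [(word, tag), ...]."""
    transition_counts = defaultdict(lambda: defaultdict(int))
    emission_counts = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sents:
        prev_tag = '<s>'  # pseudo-tag marking the sentence start
        for word, tag in sent:
            transition_counts[prev_tag][tag] += 1
            emission_counts[tag][word] += 1
            prev_tag = tag
    # Normalise the counts into maximum-likelihood probabilities.
    transition = {prev: {t: c / sum(nxt.values()) for t, c in nxt.items()}
                  for prev, nxt in transition_counts.items()}
    emission = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
                for t, ws in emission_counts.items()}
    return transition, emission

Step 2 would then run Viterbi over the held-out fold using these two tables and compare the predicted tags against the gold tags to get an accuracy for that fold.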
Is my understanding correct? If not, please explain.
Thanks.
Answer (score: 2)
I hope this helps:
from nltk.corpus import brown
from nltk import UnigramTagger as ut

# Take the first 1000 tagged sentences of the Brown corpus.
sents = brown.tagged_sents()[:1000]
num_sents = len(sents)
k = 10
foldsize = num_sents // k
fold_accuracies = []

for i in range(k):
    # Locate the test set for this fold.
    test = sents[i * foldsize:i * foldsize + foldsize]
    # Use the rest of the sentences, not in the test fold, for training.
    train = sents[:i * foldsize] + sents[i * foldsize + foldsize:]
    # Train a unigram tagger on the training folds.
    tagger = ut(train)
    # Evaluate the accuracy on the held-out test fold.
    accuracy = tagger.evaluate(test)
    print("Fold", i)
    print('from sent', i * foldsize, 'to', i * foldsize + foldsize)
    print('accuracy =', accuracy)
    print()
    fold_accuracies.append(accuracy)

print('average accuracy =', sum(fold_accuracies) / k)
[OUT]:
Fold 0
from sent 0 to 100
accuracy = 0.785714285714
Fold 1
from sent 100 to 200
accuracy = 0.745431364216
Fold 2
from sent 200 to 300
accuracy = 0.749628896586
Fold 3
from sent 300 to 400
accuracy = 0.743798291989
Fold 4
from sent 400 to 500
accuracy = 0.803448275862
Fold 5
from sent 500 to 600
accuracy = 0.779836277467
Fold 6
from sent 600 to 700
accuracy = 0.772676371781
Fold 7
from sent 700 to 800
accuracy = 0.755679184052
Fold 8
from sent 800 to 900
accuracy = 0.706402915148
Fold 9
from sent 900 to 1000
accuracy = 0.774622079707
average accuracy = 0.761723794252
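The unigram tagger above only looks at per-word tag frequencies. If you want the folds to exercise the transition/emission probabilities and Viterbi decoding from your question, the same loop should work with NLTK's supervised HMM trainer. This is only a sketch reusing sents, k and foldsize from above; the accuracies will differ from the unigram numbers, and the Lidstone smoothing is there because unsmoothed MLE estimates give unseen words zero probability:

from nltk.tag import hmm
from nltk.probability import LidstoneProbDist

trainer = hmm.HiddenMarkovModelTrainer()
hmm_accuracies = []
for i in range(k):
    test = sents[i * foldsize:i * foldsize + foldsize]
    train = sents[:i * foldsize] + sents[i * foldsize + foldsize:]
    # train_supervised() estimates the transition and emission probabilities
    # from the tagged training folds; the resulting tagger decodes with Viterbi.
    tagger = trainer.train_supervised(
        train, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
    hmm_accuracies.append(tagger.evaluate(test))
print('average HMM accuracy =', sum(hmm_accuracies) / k)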