I know that for k-fold cross-validation I should split the corpus into k equal parts. Of these k parts, k-1 are used for training and the remaining part for testing. This process is repeated k times so that every part is used for testing exactly once.
But I don't understand what "training" and "testing" actually mean here.
My idea is (please correct me if I'm wrong):
1. Training set (k-1 of the k parts): these sets are used to build the tag transition probability and emission probability tables. A tagging algorithm (e.g. the Viterbi algorithm) is then applied using these probability tables (a rough sketch of this counting step follows the list).
2. Test set (the remaining 1 part): the remaining set is used to validate the implementation from step 1. That is, this part of the corpus is treated as untagged words, and I should run the step-1 implementation on it.
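To make step 1 concrete, here is a rough sketch of what I imagine the training step computes (the helper name count_probabilities and the [(word, tag), ...] sentence format are just my own illustration, not part of any particular library):

from collections import defaultdict

def count_probabilities(tagged_sents):
    """Estimate tag transition and word emission probabilities by
    counting over tagged sentences of the form [(word, tag), ...]."""
    transition_counts = defaultdict(lambda: defaultdict(int))
    emission_counts = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sents:
        prev_tag = '<s>'  # pseudo-tag marking the sentence start
        for word, tag in sent:
            transition_counts[prev_tag][tag] += 1
            emission_counts[tag][word] += 1
            prev_tag = tag
    # Normalise the counts into maximum-likelihood probabilities.
    transition = {prev: {t: c / sum(nxt.values()) for t, c in nxt.items()}
                  for prev, nxt in transition_counts.items()}
    emission = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
                for t, ws in emission_counts.items()}
    return transition, emission

Step 2 would then run Viterbi over the held-out fold using these two tables and compare the predicted tags against the gold tags to get an accuracy for that fold.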
Is my understanding correct? If not, please explain.
Thanks.
Answer (score: 2)
I hope this helps:
from nltk.corpus import brown
from nltk import UnigramTagger as ut

# Take the first 1000 tagged sentences of the Brown corpus.
sents = brown.tagged_sents()[:1000]
num_sents = len(sents)
k = 10
foldsize = num_sents // k
fold_accuracies = []

for i in range(k):
    # Locate the test set for this fold.
    test = sents[i * foldsize:i * foldsize + foldsize]
    # Use the rest of the sentences, not in the test fold, for training.
    train = sents[:i * foldsize] + sents[i * foldsize + foldsize:]
    # Train a unigram tagger on the training folds.
    tagger = ut(train)
    # Evaluate the accuracy on the held-out test fold.
    accuracy = tagger.evaluate(test)
    print("Fold", i)
    print('from sent', i * foldsize, 'to', i * foldsize + foldsize)
    print('accuracy =', accuracy)
    print()
    fold_accuracies.append(accuracy)

print('average accuracy =', sum(fold_accuracies) / k)
[OUT]:
Fold 0
from sent 0 to 100
accuracy = 0.785714285714
Fold 1
from sent 100 to 200
accuracy = 0.745431364216
Fold 2
from sent 200 to 300
accuracy = 0.749628896586
Fold 3
from sent 300 to 400
accuracy = 0.743798291989
Fold 4
from sent 400 to 500
accuracy = 0.803448275862
Fold 5
from sent 500 to 600
accuracy = 0.779836277467
Fold 6
from sent 600 to 700
accuracy = 0.772676371781
Fold 7
from sent 700 to 800
accuracy = 0.755679184052
Fold 8
from sent 800 to 900
accuracy = 0.706402915148
Fold 9
from sent 900 to 1000
accuracy = 0.774622079707
average accuracy = 0.761723794252
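The unigram tagger above only looks at per-word tag frequencies. If you want the folds to exercise the transition/emission probabilities and Viterbi decoding from your question, the same loop should work with NLTK's supervised HMM trainer. This is only a sketch reusing sents, k and foldsize from above; the accuracies will differ from the unigram numbers, and the Lidstone smoothing is there because unsmoothed MLE estimates give unseen words zero probability:

from nltk.tag import hmm
from nltk.probability import LidstoneProbDist

trainer = hmm.HiddenMarkovModelTrainer()
hmm_accuracies = []
for i in range(k):
    test = sents[i * foldsize:i * foldsize + foldsize]
    train = sents[:i * foldsize] + sents[i * foldsize + foldsize:]
    # train_supervised() estimates the transition and emission probabilities
    # from the tagged training folds; the resulting tagger decodes with Viterbi.
    tagger = trainer.train_supervised(
        train, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
    hmm_accuracies.append(tagger.evaluate(test))
print('average HMM accuracy =', sum(hmm_accuracies) / k)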