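I've run into a problem, but I think the solution should be simple. I'm building a model and want to test its accuracy with 10-fold cross-validation. To do that, I have to split my training corpus 90%/10% into a training part and a test part, then train my model on the 90% and test it on the 10%. I want to do this ten times, each time with a different 90%/10% split, so that every part of the corpus is eventually used as test data. Then I'll average the results of the ten 10% tests.
I tried to write a script that extracts 10% of the training corpus and writes it to a new file, but so far I haven't got it working. What I've done is count the total number of lines in the file and divide that number by 10 to find the size of each of the ten test sets I want to extract.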
trainFile = open("danish.train")
numberOfLines = 0
for line in trainFile:
    numberOfLines += 1
lengthTest = numberOfLines // 10  # integer division: whole lines per test set
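I found that my own training file consists of 3638 lines, so each test set should contain about 363 lines.
How do I write lines 1-363, 364-726, and so on to different test files?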
Answer 0 (score: 1)
Untested, but this is the basic idea:
def getNthSeg(fpath, n, segSize):
    """Get the nth segment (0-based) of segSize many lines"""
    answer = []
    with open(fpath) as f:
        for i, line in enumerate(f):
            # segment n covers line indices [segSize*n, segSize*(n+1))
            if segSize*n <= i < segSize*(n+1):
                answer.append(line)
    return answer

def getFolds(fpath, k):
    """In your case, k is 10"""
    with open(fpath) as f:
        numLines = len(f.readlines())
    # integer division; the last numLines % k lines are not assigned to any fold
    segSize = numLines // k
    answer = []
    for n in range(k):
        fold = getNthSeg(fpath, n, segSize)
        answer.append(fold)
    return answer
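Once the folds exist as lists of lines, each round of cross-validation holds one fold out for testing and trains on the rest. Here is a minimal sketch of that pairing, assuming `folds` is a list of lists of lines such as the one getFolds is meant to return. The helper name `train_test_pairs` and the toy data are made up for illustration.

```python
def train_test_pairs(folds):
    """Yield (training_lines, test_lines) once per fold."""
    for i, test in enumerate(folds):
        training = []
        for j, fold in enumerate(folds):
            # every fold except the held-out one goes into the training set
            if j != i:
                training.extend(fold)
        yield training, test


# Toy data standing in for the lines of a corpus (three folds of two lines).
folds = [["a\n", "b\n"], ["c\n", "d\n"], ["e\n", "f\n"]]
pairs = list(train_test_pairs(folds))
```

Each element of `pairs` can then be handed to whatever code trains and scores the model.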
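Answer 1 (score: 1)
Once you have the line count, go back to the beginning of the file and start copying lines to danish.train.part-01. Whenever the line number is a multiple of the 10% test-set size, open a new file for the next part.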
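#!/usr/bin/env python2.7

trainFile = open("danish.train")
numberOfLines = 0
for line in trainFile:
    numberOfLines += 1
# integer division, so the result can be used with % below
lengthTest = numberOfLines // 10

# rewind file to beginning
trainFile.seek(0)

numberOfLines = 0
file_number = 0
output = None
for line in trainFile:
    if numberOfLines % lengthTest == 0:
        # start of a new part: close the previous file, open the next one
        if output is not None:
            output.close()
        file_number += 1
        output = open('danish.train.part-%02d' % file_number, 'w')
    numberOfLines += 1
    output.write(line)

if output is not None:
    output.close()
trainFile.close()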
On this input file (sorry, I don't speak Danish!):
one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
fifteen
sixteen
seventeen
eighteen
nineteen
twenty
twenty-one
twenty-two
twenty-three
twenty-four
twenty-five
twenty-six
twenty-seven
twenty-eight
twenty-nine
thirty
this creates the files
danish.train.part-01
danish.train.part-02
danish.train.part-03
danish.train.part-04
danish.train.part-05
danish.train.part-06
danish.train.part-07
danish.train.part-08
danish.train.part-09
danish.train.part-10
and part 5, for example, contains:
thirteen
fourteen
fifteen
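Each part file is one test set. The matching training set is everything else, so it can be built by concatenating the other parts. A rough sketch, assuming the part files follow the `<prefix>.part-NN` naming used above. The function name `write_training_files` and the `.train-NN` output names are my own invention, not something the script above produces.

```python
import glob


def write_training_files(prefix):
    """For each part file, write a training file made of all the other parts."""
    # zero-padded numbers mean a plain sort keeps the parts in order
    parts = sorted(glob.glob(prefix + ".part-*"))
    written = []
    for held_out in parts:
        # e.g. danish.train.part-05 -> danish.train.train-05 (hypothetical name)
        out_name = held_out.replace(".part-", ".train-")
        with open(out_name, "w") as out:
            for part in parts:
                if part != held_out:
                    with open(part) as src:
                        out.write(src.read())
        written.append(out_name)
    return written
```

Calling `write_training_files("danish.train")` after the split would then give one training file per test part, so round 5 would train on danish.train.train-05 and test on danish.train.part-05.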
Answer 2 (score: 1)
If your file isn't very large, you can split it 90/10 like this:
with open("danish.train") as trainFile:
    lines = list(trainFile)
N = len(lines)
# integer division so the slice bounds are whole numbers
testing = lines[:N // 10]
training = lines[N // 10:]
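The same slicing idea extends to all ten rounds by moving the test window through the list and averaging the per-round results, as the question describes. This is only a sketch. `k_fold_splits` and `average_score` are hypothetical names, and `evaluate` stands for whatever function trains the model on one list of lines and returns a score on the other.

```python
def k_fold_splits(lines, k=10):
    """Yield (training, testing) pairs; each line is tested exactly once."""
    n = len(lines)
    for i in range(k):
        # integer boundaries spread any remainder across the folds
        start = i * n // k
        stop = (i + 1) * n // k
        testing = lines[start:stop]
        training = lines[:start] + lines[stop:]
        yield training, testing


def average_score(lines, evaluate, k=10):
    """Average a user-supplied evaluate(training, testing) over k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_splits(lines, k)]
    return sum(scores) / float(len(scores))


# Stand-in corpus with the same number of lines as in the question.
lines = ["line %d\n" % i for i in range(3638)]
sizes = [len(testing) for _, testing in k_fold_splits(lines)]
```

Because the boundaries are computed per fold, the 8 lines left over from 3638 // 10 end up inside test sets rather than being dropped.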