How do I define the size of each k-fold?

Asked: 2016-11-26 19:31:07

Tags: python-2.7 scikit-learn keras cross-validation

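I am currently using cross-validation to train my regression network. I don't have any labels, only specific inputs that should map to specific outputs, and the network should learn that mapping. I seem to have a problem with how the folds are being defined.

This is how I do the cross-validation: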

############################### Training setup ##################################

#Define 10 folds:
seed = 7
np.random.seed(seed)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
print "Splits"
cvscores_loss = []

for train, test in kfold.split(train_set_data_vstacked_normalized,train_set_output_vstacked):

    print "Model definition!"
    model = Sequential()

    #act = PReLU(init='normal', weights=None)
    model.add(Dense(output_dim=400,input_dim=400, init="normal",activation=K.tanh))

    #act1 = PReLU(init='normal', weights=None)
    model.add(Dense(output_dim=400,input_dim=400, init="normal",activation=K.tanh))

    #act2 = PReLU(init='normal', weights=None)
    model.add(Dense(output_dim=400, input_dim=400, init="normal",activation=K.tanh))

    act4=ELU(10000)
    model.add(Dense(output_dim=13, input_dim=300, init="normal",activation=act4))

    print "Compiling"
    model.compile(loss='mean_squared_error', optimizer='RMSprop',  metrics=["accuracy"])
    print "Compile done! "

    print '\n'

    print "Train start"
    model.fit(train_set_data_vstacked_normalized[train],train_set_output_vstacked[train], nb_epoch=10, verbose=1)

    loss, accuracy = model.evaluate(x=train_set_data_vstacked_normalized[test],y=train_set_output_vstacked[test],verbose=1)
    print
    print('loss: ', loss)
    print('accuracy: ', accuracy)
    print()
    print model.summary()
    print "New Model:"
    cvscores_loss.append(loss)


print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores_loss), np.std(cvscores_loss)))

The problem with this code is that I never enter the for loop. After printing "Splits", I get this warning message:

Splits
/home/k/.local/lib/python2.7/site-packages/sklearn/model_selection/_split.py:579: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of groups for any class cannot be less than n_splits=10.

This makes me wonder how kfold knows what the input and output dimensions of my neural network are.

Should I define that somewhere? If so, how?

2 answers:

Answer 0: (score: 1)

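The message tells you what the problem is: one of your target classes has only one member. For a stratified split into 10 folds, each class needs at least 10 members so that one can be placed in each fold.

You need to check the counts of your target classes to find the offending class and remove it.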
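
A minimal sketch of that check: count how often each target value occurs and list the values with fewer than n_splits members. The `labels` list is made-up illustrative data and `rare_classes` is a hypothetical helper, not part of the question's code or of scikit-learn. With real-valued regression targets, most values are likely to be unique, so most of them would be flagged.

```python
from collections import Counter


def rare_classes(labels, n_splits):
    """Return {label: count} for every label with fewer than n_splits members."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < n_splits}


# Made-up example targets: class 2 occurs only once.
labels = [0] * 12 + [1] * 10 + [2]

too_small = rare_classes(labels, n_splits=10)
print(too_small)
```

Any label reported here would trigger the warning shown in the question when passed to StratifiedKFold with the same n_splits.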

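Answer 1: (score: 0)

I think you are overcomplicating this. If you need cross-validation on a Keras model, you can use the Keras scikit-learn API. To do that, you need to:

Import a few things: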

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

Create a function that defines the model:

def model_creation():
    model = Sequential()
    model.add(...)
    ...
    model.compile(...)
    return model

And use the wrapper:

model = KerasClassifier(build_fn=model_creation, nb_epoch=100, batch_size=100, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
results = cross_val_score(model, X, y, cv=kfold)
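
On the original question about fold sizes: a splitter does not need the network's input or output dimensions. It only partitions sample indices (row numbers), and the stratified variant additionally looks at the class labels. As a rough sketch, this is how a plain, unstratified k-fold split sizes its folds, following the rule documented for scikit-learn's KFold (the first n_samples % n_splits folds get one extra sample). `fold_sizes` and `kfold_indices` are hypothetical helpers written for illustration, without shuffling; they are not scikit-learn functions.

```python
def fold_sizes(n_samples, n_splits):
    """Sizes of the test folds for a plain k-fold split."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds absorb the remainder, one sample each.
    return [base + 1 if i < extra else base for i in range(n_splits)]


def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes(n_samples, n_splits):
        stop = start + size
        # Everything outside [start, stop) is used for training.
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop


# Assumed example: 103 samples split into 10 folds.
print(fold_sizes(103, 10))
```

Each (train, test) pair can index the rows of the input and target arrays in the same way as the index arrays produced by `kfold.split(...)` in the question.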