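I'm following the IRIS example from TensorFlow.
In my case, all of my data is in a single CSV file rather than split into separate files, and I want to apply k-fold cross-validation to it.
I have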
data_set = tf.contrib.learn.datasets.base.load_csv(filename="mydata.csv",
                                                    target_dtype=np.int)
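How can I perform k-fold cross-validation on this dataset with a multi-layer neural network, the same way as in the IRIS example?
Answer 0 (score: 17)
I know this question is old, but in case someone is looking to do something similar, here is an expansion of ahmedhosny's answer:
The new TensorFlow Dataset API can create dataset objects from Python generators, so together with scikit-learn's KFold, one option is to create a dataset from the KFold.split() generator: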
import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold

# TensorFlow 1.x APIs (tf.contrib was removed in TensorFlow 2)
import tensorflow as tf
import tensorflow.contrib.eager as tfe
tf.enable_eager_execution()

from sklearn.datasets import load_iris
data = load_iris()
X = data['data']
y = data['target']

def make_dataset(X_data, y_data, n_splits):
    def gen():
        # Each dataset element is one fold: (X_train, y_train, X_test, y_test)
        for train_index, test_index in KFold(n_splits).split(X_data):
            X_train, X_test = X_data[train_index], X_data[test_index]
            y_train, y_test = y_data[train_index], y_data[test_index]
            yield X_train, y_train, X_test, y_test

    return tf.data.Dataset.from_generator(gen, (tf.float64, tf.float64, tf.float64, tf.float64))

dataset = make_dataset(X, y, 10)
The dataset can then be iterated over either in graph-based TensorFlow or with eager execution. Using eager execution:
for X_train, y_train, X_test, y_test in tfe.Iterator(dataset):
    ....
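To make the body of that loop concrete, here is a minimal sketch of the full per-fold procedure: train a fresh multi-layer network on each training fold and score it on the held-out fold. It uses scikit-learn's MLPClassifier as a stand-in for the TensorFlow network, which is a substitution made here for brevity and not what the IRIS tutorial uses. The layer sizes, max_iter, the number of folds, and the shuffle and random_state settings are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

data = load_iris()
X, y = data["data"], data["target"]

# shuffle=True because the iris rows are sorted by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_index, test_index in kfold.split(X):
    # A fresh model per fold, so nothing learned on one fold carries over
    # (layer sizes and max_iter are assumptions, not tuned values)
    model = MLPClassifier(hidden_layer_sizes=(10, 20, 10),
                          max_iter=1000, random_state=0)
    model.fit(X[train_index], y[train_index])
    # score() returns the accuracy on the held-out fold
    fold_scores.append(model.score(X[test_index], y[test_index]))

print("accuracy per fold:", np.round(fold_scores, 3))
print("mean accuracy:", np.mean(fold_scores))
```

The cross-validated estimate is the mean of the per-fold accuracies; the same loop structure should carry over to a TensorFlow model by replacing the fit and score calls.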
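Answer 1 (score: 11)
NNs are usually used with large datasets, where CV is not used because it is very expensive. In the case of IRIS (50 samples per species), you probably do need it. Why not use scikit-learn with different random seeds to split your data into training and test sets?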
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
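For each k in the k-fold loop, pass a different value to random_state so the data is split differently, train the network on X_train, y_train, and test it on X_test, y_test.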
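The idea of splitting with a different random seed on every repetition could look roughly like the sketch below. The list of seeds, the hidden layer size, and the use of scikit-learn's MLPClassifier as the network are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

seeds = [0, 1, 2, 3, 4]  # hypothetical choice: one seed per repetition
scores = []
for seed in seeds:
    # A different random_state produces a different train/test partition
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    # Build a new network for every split (size is an assumption)
    net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
    net.fit(X_train, y_train)
    scores.append(net.score(X_test, y_test))

print("mean accuracy: %.3f, std: %.3f" % (np.mean(scores), np.std(scores)))
```

Unlike a true k-fold split, the test sets of different repetitions can overlap here, so some samples may be tested several times and others never.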
If you don't like relying on random seeds and want a more structured k-fold split, you can use KFold from scikit-learn:
from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "b", "c", "c", "c"]
k_fold = KFold(n_splits=3)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
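One caveat with the plain KFold shown above is that it splits the rows in their stored order. The iris rows are sorted by class, so without shuffling a test fold can end up containing a single species. The sketch below compares which labels appear in each test fold for KFold and for StratifiedKFold, which preserves class proportions; the helper name classes_per_test_fold is made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

def classes_per_test_fold(splitter):
    # Collect the distinct labels that appear in each test fold
    return [sorted(set(y[test].tolist())) for _, test in splitter.split(X, y)]

plain = classes_per_test_fold(KFold(n_splits=3))
stratified = classes_per_test_fold(StratifiedKFold(n_splits=3))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

Passing shuffle=True to KFold is another way to avoid single-class folds, although it does not guarantee balanced class proportions in every fold.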
Answer 2 (score: 0)
A modification of @ahmedhosny's answer:
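from sklearn.model_selection import KFold, cross_val_score

# k (number of folds) and all_data (e.g. a pandas DataFrame) are assumed to be defined already
k_fold = KFold(n_splits=k)
train_ = []
test_ = []
# Store the index arrays of every fold for later use
for train_indices, test_indices in k_fold.split(all_data.index):
    train_.append(train_indices)
    test_.append(test_indices)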
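Since the original question is about a single CSV file, here is a small sketch of how the stored indices might be used with data loaded through pandas. The CSV contents, the column names f1, f2 and label, and k = 3 are all hypothetical; an in-memory string stands in for mydata.csv so the example is self-contained.

```python
import io
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for the contents of mydata.csv
csv_text = """f1,f2,label
0.1,1.0,0
0.2,1.1,0
0.3,1.2,1
0.4,1.3,1
0.5,1.4,2
0.6,1.5,2
"""
all_data = pd.read_csv(io.StringIO(csv_text))

k = 3
k_fold = KFold(n_splits=k)
train_, test_ = [], []
for train_indices, test_indices in k_fold.split(all_data.index):
    train_.append(train_indices)
    test_.append(test_indices)

# KFold yields positional indices, so select rows with .iloc
for fold, (tr, te) in enumerate(zip(train_, test_)):
    train_df, test_df = all_data.iloc[tr], all_data.iloc[te]
    X_train = train_df.drop(columns="label").values
    y_train = train_df["label"].values
    print(fold, X_train.shape, test_df.shape)
```

With a real file, pd.read_csv("mydata.csv") would replace the in-memory string, and the label column name would need to match the file's header.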