How can I keep the test size constant in cross-validation when using train_test_split?

Asked: 2016-06-22 17:58:49

Tags: python csv for-loop scikit-learn svm

I am working with a matrix X and a vector y holding the label for each row of that matrix. X is defined as:

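import pandas as pd

df = pd.read_csv("./data/svm_matrix_0.csv", sep=',', header=None, encoding="ISO-8859-1")
# convert_objects() has been removed from pandas; to_numeric with errors='coerce' is the closest equivalent
df2 = df.apply(pd.to_numeric, errors='coerce')
X = df2.values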

y is defined as:

df = pd.read_csv('./data/Step7_final.csv', index_col=False, encoding="ISO-8859-1")  
y = df.iloc[:, 1].values  

Then I fit an SVM to the data:

from sklearn import svm
from sklearn.model_selection import train_test_split

clf = svm.SVC(kernel='linear', C=1)    # specify the classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)  # split the data randomly into training and test sets
clf.fit(X_train, y_train)   # train the classifier

Now I want to vary the size of X_train and compute the train and test error for each size:

test_error = clf.score(X_test, y_test) 
train_error = clf.score(X_train, y_train)

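The size of X_train should increase (for example, over 15 different values), and the results should then be stored in a dictionary of the form {X_train size: (test_error, train_error)}.

I tried:

array = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
dicto = {}
for i in array:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = i)
    clf.fit(X_train,y_train)
    test = clf.score(X_test, y_test)
    train = clf.score(X_train, y_train)
    dicto[i] = test, train

print(dicto)

But it does not work, because I also change the size of X_test. Does anyone know how to adjust my code so that it varies only the size of X_train (so that the errors are computed for an increasing total dataset size)?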

1 Answer:

Answer 0: (score: 1)

What you can do is set the test data aside first...

X_train_prev, X_test_prev, y_train_prev, y_test_prev = train_test_split(X, y, test_size = 0.2)

Now run the for loop varying the train size, but score on the **previous test data**.

Like this:

array = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
dicto = {}
for i in array: 
    X_train, _, y_train, _ = train_test_split(X, y, test_size = i)
    clf.fit(X_train,y_train)   
    #use the previous test data...
    test = clf.score(X_test_prev, y_test_prev) 
    train = clf.score(X_train, y_train)
    dicto[i] = test, train

print(dicto)
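One thing worth checking in the loop above is whether rows from the held-out test set can end up in the new training split. Below is a minimal sketch that uses row indices in place of real data; the sample count, split fractions, and seeds are illustrative assumptions, not values from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

idx = np.arange(1000)  # stand-in for the row numbers of X (hypothetical size)

# First split: hold out 20% of the rows as the fixed test set.
_, test_idx = train_test_split(idx, test_size=0.2, random_state=0)

# Second, independent split of the FULL data, as done inside the loop above.
train_idx, _ = train_test_split(idx, test_size=0.5, random_state=1)

# Rows that are both in the held-out test set and in the new training set.
leaked = np.intersect1d(test_idx, train_idx)
print(len(leaked), "of", len(test_idx), "test rows were also used for training")
```

Any non-zero overlap means the test score is partly measured on rows the model has already seen.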

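But note that, because each split in the loop above is drawn at random from the full X and y, rows from X_test_prev can end up in X_train. That contaminates the test data, so the test score no longer reflects how the model performs on unseen data. To avoid this, split only the training data, so that your test data stays separate!

Like this (the line inside the for loop):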

X_train, _, y_train, _ = train_test_split(X_train_prev, y_train_prev, test_size = i)
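Putting the pieces together, the whole procedure might look like the sketch below. It is not the asker's actual pipeline. The synthetic data from make_classification, the fixed random_state values, the stratify arguments, and the names X_pool, fractions, and results are all assumptions made for illustration. It also takes "error" to mean 1 minus the accuracy returned by clf.score, and keys the dictionary by the number of training rows.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real X, y (assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out a fixed 20% test set once; it is never resplit afterwards.
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = svm.SVC(kernel="linear", C=1)
fractions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
results = {}  # {number of training rows: (test_error, train_error)}

for frac in fractions:
    # Draw a training subset from the pool only; the remainder is discarded.
    X_train, _, y_train, _ = train_test_split(
        X_pool, y_pool, train_size=frac, random_state=0, stratify=y_pool
    )
    clf.fit(X_train, y_train)
    # score() returns mean accuracy, so error is taken here as 1 - accuracy.
    test_error = 1 - clf.score(X_test, y_test)
    train_error = 1 - clf.score(X_train, y_train)
    results[len(X_train)] = (test_error, train_error)

print(results)
```

Every entry is scored against the same X_test, so the test size stays constant while the training size grows. scikit-learn also provides learning_curve in sklearn.model_selection, which computes similar curves using cross-validation and may be worth a look as an alternative.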