Question

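I want to compare the classification performance (accuracy) of different classifiers (e.g. CNN, SVM, ...) as a function of training set size. Given an image dataset (e.g. MNIST), 80% of the images are selected at random while preserving the class balance. Then 80% of that subset is selected in the same way to form the next, smaller subset. This is repeated until a small training set of about 1,000 images is reached. Each classifier is then trained on each of these subsets.

The goal is to be able to make statements such as: from a training set size of 5,000 images upward, classifier A is significantly better than classifier B.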

from sklearn.model_selection import train_test_split

# X, y: the full image array and its labels (e.g. MNIST), loaded beforehand.
# Each call keeps a stratified 80% of the previous training subset.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2, stratify=y)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, random_state=0, test_size=0.2, stratify=y_train)
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_train_2, y_train_2, random_state=0, test_size=0.2, stratify=y_train_2)
.....
.....
.....
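To make the repeated splitting concrete, here is a minimal sketch of the loop I have in mind. The helper name `nested_stratified_subsets`, its parameters, the per-step seed, and the synthetic stand-in data are all my own assumptions for illustration; only `train_test_split` comes from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def nested_stratified_subsets(X, y, keep_fraction=0.8, min_size=1000, seed=0):
    """Return a list of (X_subset, y_subset) pairs, largest first.

    Each subset is a stratified random sample of the previous one, so the
    subsets are nested and keep roughly the same class proportions.
    (Hypothetical helper, not part of scikit-learn.)
    """
    subsets = [(X, y)]
    step = 0
    # Stop as soon as one more split would drop below min_size samples.
    while int(len(subsets[-1][1]) * keep_fraction) >= min_size:
        X_prev, y_prev = subsets[-1]
        X_next, _, y_next, _ = train_test_split(
            X_prev,
            y_prev,
            train_size=keep_fraction,
            stratify=y_prev,
            random_state=seed + step,  # assumption: a different seed per step
        )
        subsets.append((X_next, y_next))
        step += 1
    return subsets


# Synthetic stand-in data (assumption): 5000 samples, 10 balanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 8))
y_demo = np.repeat(np.arange(10), 500)

subsets = nested_stratified_subsets(X_demo, y_demo)
sizes = [len(y_sub) for _, y_sub in subsets]
print(sizes)
```

Every classifier would then be trained on each entry of `subsets` and scored on one fixed held-out test set, so that differences in accuracy reflect the training size rather than a changing test set.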

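My problem is that, with the code above, I am not sure whether this is really random sampling. Would it be better to obtain the subsets with, for example, numpy.random.randint?

I would be very grateful for any help.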
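For reference, this is roughly what I mean by the NumPy alternative. It is only a sketch on made-up sizes: I use `Generator.integers` as the newer counterpart of `numpy.random.randint`, and `Generator.choice` with `replace=False` for comparison. Neither call takes the class labels into account, so neither is stratified.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # assumed dataset size, for illustration only
n_keep = 8_000      # 80% of the samples

# randint-style sampling draws every index independently, i.e. with
# replacement, so the same image can be selected more than once.
idx_with_replacement = rng.integers(0, n_samples, size=n_keep)
n_unique_with_replacement = len(np.unique(idx_with_replacement))

# choice(..., replace=False) draws distinct indices, like a shuffled split.
idx_without_replacement = rng.choice(n_samples, size=n_keep, replace=False)
n_unique_without_replacement = len(np.unique(idx_without_replacement))

print(n_unique_with_replacement, n_unique_without_replacement)
```

If I read this correctly, index sampling in the style of `randint` would give me duplicated images and fewer distinct training samples than intended, which is one of the reasons I am unsure about it.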

Random subsets of a dataset

0 answers: