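I am working on a text classification problem and am using 30-fold cross-validation. Before starting the experiments, I made sure that every class has at least 30 members. I then did the necessary text preprocessing and split the dataset into a training set and a test set.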
x_train, x_test, y_train, y_test = cross_validation.train_test_split(data['event_name_description'], data['category_id'], test_size=0.2, random_state=42)
The test set is 20% of the total data. Now, when I run the model for training, I get the following warning:
/home/hp/anaconda3/envs/tensorflow/lib/python3.5/site-packages/sklearn/cross_validation.py:553: Warning: The least populated class in y has only 23 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=30.
So it seems that after splitting the data into training and test sets, at least one class in the training set has only 23 members. Am I right?
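One way to check this guess is to count the members of each class before and after the split. The snippet below is only a minimal sketch on made-up data: the helper `least_populated_class` and the toy labels are hypothetical, and the plain shuffle stands in for a non-stratified 80/20 split rather than reproducing what `train_test_split` does internally.

```python
from collections import Counter
import random


def least_populated_class(labels):
    """Return (label, count) for the class with the fewest members."""
    counts = Counter(labels)
    return min(counts.items(), key=lambda item: item[1])


# Toy labels (hypothetical): class "a" has exactly 30 members, class "b" has 170.
labels = ["a"] * 30 + ["b"] * 170

# Plain random 80/20 split with no stratification.
rng = random.Random(42)
shuffled = labels[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.8)
y_train_toy, y_test_toy = shuffled[:cut], shuffled[cut:]

print("full data:    ", least_populated_class(labels))
print("training part:", least_populated_class(y_train_toy))
```

On the real data, the same helper could be called on `y_train`. Since 80% of 30 is 24, a class that has exactly 30 members in the full data would be expected to keep only about 24 in the training set, so a count of 23 is plausible.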