ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
This is the error I get from the following code:
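# Imports needed by the code below
import pandas as pd
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import (LogisticRegression, PassiveAggressiveClassifier,
                                  RidgeClassifier, SGDClassifier)
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# List of machine learning algorithms that will be used for predictions
estimator = [('Logistic Regression', LogisticRegression), ('Ridge Classifier', RidgeClassifier),
             ('SGD Classifier', SGDClassifier), ('Passive Aggressive Classifier', PassiveAggressiveClassifier),
             ('SVC', SVC), ('Linear SVC', LinearSVC), ('Nu SVC', NuSVC),
             ('K-Neighbors Classifier', KNeighborsClassifier),
             ('Gaussian Naive Bayes', GaussianNB), ('Multinomial Naive Bayes', MultinomialNB),
             ('Bernoulli Naive Bayes', BernoulliNB), ('Complement Naive Bayes', ComplementNB),
             ('Decision Tree Classifier', DecisionTreeClassifier),
             ('Random Forest Classifier', RandomForestClassifier), ('AdaBoost Classifier', AdaBoostClassifier),
             ('Gradient Boosting Classifier', GradientBoostingClassifier), ('Bagging Classifier', BaggingClassifier),
             ('Extra Trees Classifier', ExtraTreesClassifier), ('XGBoost', XGBClassifier)]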
# Separating independent features and dependent feature from the dataset
#X_train = titanic.drop(columns='Survived')
#y_train = titanic['Survived']
# Creating a dataframe to compare the performance of the machine learning models
comparison_cols = ['Algorithm', 'Training Time (Avg)', 'Accuracy (Avg)', 'Accuracy (3xSTD)']
comparison_df = pd.DataFrame(columns=comparison_cols)
# Generating training/validation dataset splits for cross validation
cv_split = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
# Performing cross-validation to estimate the performance of the models
for idx, est in enumerate(estimator):
    cv_results = cross_validate(est[1](), X, y, cv=cv_split)
    comparison_df.loc[idx, 'Algorithm'] = est[0]
    comparison_df.loc[idx, 'Training Time (Avg)'] = cv_results['fit_time'].mean()
    comparison_df.loc[idx, 'Accuracy (Avg)'] = cv_results['test_score'].mean()
    comparison_df.loc[idx, 'Accuracy (3xSTD)'] = cv_results['test_score'].std() * 3
comparison_df.set_index(keys='Algorithm', inplace=True)
comparison_df.sort_values(by='Accuracy (Avg)', ascending=False, inplace=True)
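I guess the cv_split part is what causes the problem.
I found a solution that uses train_test_split, but it does not behave like cv_split.
The strange thing is that I have used this same code with other Kaggle problems without any issue.
So I tried comparing the two Kaggle kernels.
Kernel that runs without the error:
print(X.shape)
print(y.shape)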
(891,9)
(891,)
array([0,1,1,1,0,0,0,0,1,1,1,1,0,0,1,1,0, 1,0,1,0,1, 1,1,0,1,0,0,1,0,0,1,1,0,0,0,1,0,0,1,0,0,0,1, 1,0,0,1,0,0,0,0,1,1,0,1,1,0,1,0,0,1,0,0,0,1, 1 .....])
=============================================================
Kernel with the problem (error):
print(X.shape)
print(y.shape)
(15035,24)
(15035,)
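array([221900., 180000., 510000., ..., 360000., 400000., 325000.])
The shapes in the two kernels look the same to me.
I don't see what the difference is between the X and y of these two kernels.
Does anyone know why this error occurs?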
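Answer 0 (score: 0)
Is your y picking up index values? I'm not sure, though. You could use StratifiedKFold instead; the following worked for me:
# shuffle=True is needed for random_state to take effect; recent scikit-learn versions raise an error otherwise
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X_train, y_train, cv=kfold)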
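The guess about y can be checked directly. A stratified splitter treats every distinct value in y as a class, so it may help to count how many members the rarest value has. The snippet below is only a minimal sketch using the standard library: least_populated_class is a hypothetical helper name, and the two lists are short samples copied from the printed arrays in the question, not the full datasets.

```python
from collections import Counter


def least_populated_class(labels):
    """Return (label, count) for the rarest distinct value in labels."""
    counts = Counter(labels)
    # Pick the (label, count) pair with the smallest count.
    return min(counts.items(), key=lambda item: item[1])


# First ten labels printed for the kernel that works: only two classes.
y_binary = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1]

# Values printed for the kernel that fails: continuous numbers,
# so each distinct value forms its own "class".
y_continuous = [221900.0, 180000.0, 510000.0, 360000.0, 400000.0, 325000.0]

print(least_populated_class(y_binary))
print(least_populated_class(y_continuous))
```

If the rarest value in the real y has a count of 1, as it does in the second sample, that matches the condition the error message describes. In that case y looks like a continuous (regression-style) target rather than class labels, and a stratified split is probably not the right tool for it.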
Answer 1 (score: 0)
I got a similar error when using train_test_split. It happened because I had passed stratify=data instead of stratify=target.