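I am trying to use three binary explanatory variables related to banking history (default, housing, and loan) to predict a binary response variable with a Logistic Regression classifier.
I have the following dataset:
Mapping that converts the text values (no/yes) to the integers 0/1: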
convert_to_binary = {'no' : 0, 'yes' : 1}
default = bank['default'].map(convert_to_binary)
housing = bank['housing'].map(convert_to_binary)
loan = bank['loan'].map(convert_to_binary)
response = bank['response'].map(convert_to_binary)
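As a side check on the mapping step, here is a minimal sketch of how `Series.map` with a dict treats values that are not keys of the dict: they become NaN. The `sample` column and its `'unknown'` entry are hypothetical and are not taken from the actual `bank` data.

```python
import pandas as pd

convert_to_binary = {'no': 0, 'yes': 1}

# Hypothetical column; 'unknown' is an assumed value, not from the real dataset
sample = pd.Series(['yes', 'no', 'unknown', 'yes'])

# Values missing from the dict are mapped to NaN rather than raising an error
mapped = sample.map(convert_to_binary)

# Counting NaNs shows whether any category was left unmapped
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist(), n_unmapped)
```

If the count is nonzero for a real column, those rows would need to be handled before fitting, since `LogisticRegression` does not accept NaN inputs.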
I stacked the three explanatory variables and the response into one array:
data = np.array([np.array(default), np.array(housing), np.array(loan),np.array(response)]).T
kfold = KFold(n_splits=3)
scores = []
for train_index, test_index in kfold.split(data):
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = response[train_index], response[test_index]
    model = LogisticRegression().fit(X_train, y_train)
    pred = model.predict(data[test_index])
    results = model.score(X_test, y_test)
    scores.append(results)
print(np.mean(scores))
My accuracy is always 100%, which I know is not correct. Shouldn't the accuracy be around 50-65%?
Am I doing something wrong?
Answer 0 (score: 0)
The split is incorrect.
This is the correct split:
X_train, X_labels = data[train_index], response[train_index]
y_test, y_labels = data[test_index], response[test_index]
model = LogisticRegression().fit(X_train, X_labels)
pred = model.predict(y_test)
acc = sklearn.metrics.accuracy_score(y_labels,pred,normalize=True)
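One thing the snippets above share is that `data` contains `response` as its fourth column, so the model is given the label as an input feature. That alone would be enough to produce 100% accuracy regardless of how the folds are split. The following is a minimal sketch of that effect on synthetic data, not the asker's actual dataset: the random 0/1 columns and the helper `cv_accuracy` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-ins for the 0/1 columns in the question (assumed data)
default = rng.integers(0, 2, n)
housing = rng.integers(0, 2, n)
loan = rng.integers(0, 2, n)
response = rng.integers(0, 2, n)  # generated independently of the features


def cv_accuracy(X, y, n_splits=3):
    """Mean test accuracy over K folds (hypothetical helper)."""
    scores = []
    for train_index, test_index in KFold(n_splits=n_splits).split(X):
        model = LogisticRegression().fit(X[train_index], y[train_index])
        scores.append(model.score(X[test_index], y[test_index]))
    return float(np.mean(scores))


# Same layout as `data` in the question: the label is the fourth feature column
leaky = np.column_stack([default, housing, loan, response])

# Feature matrix with the label left out
features = np.column_stack([default, housing, loan])

leaky_acc = cv_accuracy(leaky, response)
clean_acc = cv_accuracy(features, response)
print(leaky_acc, clean_acc)
```

With the label removed from the feature matrix, the score reflects only what the three explanatory variables can predict. Separately, if `response` is a pandas Series whose index is not the default 0..n-1 range, `response.iloc[train_index]` is the positional form of the lookup used in the loops above.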