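I am trying to predict how likely the event IS_SLIGHT is to occur, based on the time of day. However, I feel I have taken a wrong step somewhere, because the confusion matrices for both MultinomialNB and LogisticRegression give strange results: only false positives and true positives. I would expect at least some true negatives and false negatives. The low roc_auc_score is also disappointing; shouldn't it be higher? I know there is a lot to go through here, but thank you for any advice.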
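import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Set IS_SLIGHT to 1 if the word 'Slight' appears in Accident_Severity, else 0
df['IS_SLIGHT'] = df['Accident_Severity'].apply(lambda x: 1 if 'Slight' in x else 0)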
X = df.Time
y = df.IS_SLIGHT
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1)
cvec = CountVectorizer()
cvec.fit(X_train)
RT_train = pd.DataFrame(cvec.transform(X_train).todense(),columns=cvec.get_feature_names())
RT_test = pd.DataFrame(cvec.transform(X_test).todense(),columns=cvec.get_feature_names())
print(RT_train)
Out:
   afternoon  evening  morning  night
0          0        1        0      0
1          0        1        0      0
2          0        0        1      0
...
print(RT_train.shape)
Out: (1390029, 4)
print(RT_test.shape)
Out: (463344, 4)
print(y_train.shape)
Out: (1390029,)
print(y_test.shape)
Out: (463344,)
lr = LogisticRegression()
lr.fit(RT_train,y_train)
lrypred = lr.predict(RT_test)
print(metrics.accuracy_score(y_test, lrypred))
Out: 0.8502969715805104
print(metrics.confusion_matrix(y_test, lrypred))
Out: [[0 69364] [0 393980]]
nb = MultinomialNB()
nb.fit(RT_train,y_train)
ypred = nb.predict(RT_test)
print(metrics.accuracy_score(y_test,ypred))
Out: 0.8502969715805104
print(metrics.confusion_matrix(y_test,ypred))
Out: [[0 69364] [0 393980]]
y_pred_prob = nb.predict_proba(RT_test)[:,1]
lry_pred_prob = lr.predict_proba(RT_test)[:, 1]
print(y_pred_prob)
Out: [0.85453109 0.85453109 0.78504675 ... 0.85453109 0.78504675 0.86917669]
print(metrics.roc_auc_score(y_test,y_pred_prob))
Out: 0.5436548840102361
print(metrics.roc_auc_score(y_test,lry_pred_prob))
Out: 0.5436548840102361
print(classification_report(y_test, lrypred))
Out:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00     69364
           1       0.85      1.00      0.92    393980

   micro avg       0.85      0.85      0.85    463344
   macro avg       0.43      0.50      0.46    463344
weighted avg       0.72      0.85      0.78    463344
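
One possible reading of the output above, sketched on synthetic data rather than the original dataset: every predicted probability printed earlier is well above 0.5, and predict() on a binary classifier picks the more probable class, so class 0 may simply never be predicted. The example below is a minimal illustration under that assumption. The bucket rates in pos_rate, the sample size, and the choice of the base rate as an alternative threshold are all hypothetical and are not taken from the post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: four one-hot "time of day" columns,
# each with a positive rate far above 0.5 (assumed values, not from the post).
n = 20000
bucket = rng.integers(0, 4, size=n)
X = np.eye(4, dtype=int)[bucket]
pos_rate = np.array([0.85, 0.87, 0.79, 0.86])
y = (rng.random(n) < pos_rate[bucket]).astype(int)

lr = LogisticRegression().fit(X, y)
proba = lr.predict_proba(X)[:, 1]

# predict() picks the more probable class, i.e. an implicit 0.5 cutoff.
# If every probability exceeds 0.5, class 0 is never predicted.
tn, fp, fn, tp = confusion_matrix(y, lr.predict(X)).ravel()
print("default cutoff:", tn, fp, fn, tp)

# Option 1 (an assumption, not the only choice): cut at the base rate instead.
threshold = y.mean()
cm_custom = confusion_matrix(y, (proba >= threshold).astype(int))
print("base-rate cutoff:\n", cm_custom)

# Option 2: reweight the classes so the minority class counts equally.
lr_bal = LogisticRegression(class_weight="balanced").fit(X, y)
cm_bal = confusion_matrix(y, lr_bal.predict(X))
print("class_weight='balanced':\n", cm_bal)

# With only four distinct inputs the model can output at most four distinct
# scores, which limits how well it can rank positives above negatives.
auc = roc_auc_score(y, proba)
print("distinct scores:", len(np.unique(np.round(proba, 6))), "AUC:", auc)
```

In this sketch neither option makes the feature more informative. Both only move the decision boundary so that some negatives are predicted, and the AUC stays close to 0.5 because time of day alone barely separates the classes in the synthetic data.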