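I am using Python 3.7 in a Jupyter Notebook. I have successfully trained an ML model (logistic regression) and would like to create a plot that shows the predicted classes (binary values, 0 and 1) together with the probabilities the model calculated. The x-axis of the plot should show the datetime, which is proving very difficult.
Everything works fine up to the point where I create the dataframe with the probabilities, but then it gets messy. What I tried to do was combine the pred_lr dataframe (containing the predicted classes) with the probability dataframe, but that turned out to be a real struggle. There must be a more elegant way. What really matters is that the index (the timestamps, originally present only in y_test) stays unchanged and does not get shuffled.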
import pandas as pd
import numpy as np
data = pd.read_csv('testrooms_data.csv', parse_dates=['timestamp'])
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
##split dataset into test and training set
X = data.drop("value", axis=1) # X contains all the features
y = data["value"] # y contains only the label
##sampling split
from sklearn.model_selection import KFold
kf= KFold(n_splits=3)
from sklearn.model_selection import StratifiedKFold
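skf = StratifiedKFold(n_splits=3) # random_state only matters with shuffle=True; recent scikit-learn versions raise an error if it is set without shuffling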
skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
###undersampling
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=0)
X_resampled1, y_resampled1 = cc.fit_resample(X_train, y_train)
####models###
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
clf_lr = LogisticRegression(class_weight="balanced")
# fit the dataset into LogisticRegression Classifier
clf_lr = clf_lr.fit(X_resampled1, y_resampled1)
#predict on the unseen data
pred_lr = clf_lr.predict(X_test)
#obtain probabilities
prob1 = clf_lr.predict_proba(X_test)
#create dataframe with probabilities
prob_unocc=pd.Series(prob1[:,0])
probability_lr = pd.DataFrame(prob1)
probability_lr = probability_lr.rename(columns={0:'prob_unoccupied', 1:'prob_occupied'})
#dataframe of predicted classes (column: value) + probability
y_new = pd.DataFrame(pred_lr)
y_new['prob_unoccupied']=prob_unocc
y_new.head()
                     value  prob_unoccupied
TS_TIMESTAMP
2019-02-10 13:45:00    0.0              NaN
2019-02-10 14:00:00    0.0              NaN
2019-02-10 14:15:00    0.0              NaN
2019-02-10 14:30:00    0.0              NaN
2019-02-10 14:45:00    0.0              NaN
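The NaN column is consistent with how pandas aligns on index labels: a Series built from a NumPy array gets a default 0..n-1 index, so none of its labels match a timestamp index when it is assigned as a column. Below is a minimal sketch of one way to avoid this, assuming the test set is indexed by timestamp. The names `idx`, `y_bad` and the small arrays are hypothetical stand-ins for `y_test.index`, `pred_lr` and `prob1`; they are not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for y_test.index, pred_lr and prob1 from the question.
idx = pd.date_range("2019-02-10 13:45:00", periods=5, freq="15min", name="TS_TIMESTAMP")
pred_lr = np.array([0, 0, 1, 0, 1])
prob1 = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]])

# Reproduce the problem: the Series carries a 0..4 integer index, so no label
# matches the timestamp index and every aligned value becomes NaN.
y_bad = pd.DataFrame({"value": pred_lr}, index=idx)
y_bad["prob_unoccupied"] = pd.Series(prob1[:, 0])

# One possible fix: build the frame from the raw arrays and pass the
# timestamp index explicitly, so rows stay in their original order.
y_new = pd.DataFrame(
    {
        "value": pred_lr,
        "prob_unoccupied": prob1[:, 0],
        "prob_occupied": prob1[:, 1],
    },
    index=idx,
)
print(y_new.head())
```

With the real objects this would presumably read `index=y_test.index`, since `predict` and `predict_proba` return rows in the same order as `X_test`.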
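I would like a scatter plot that shows the predicted classes (for example in red and green, to indicate whether each prediction was right or wrong with respect to the ground truth in my dataset), along with the calculated probability for each timestamp, plus a dividing line parallel to the x-axis that shows whether the model's probability value points to class 1 or class 0. Any help is much appreciated!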
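A plot along these lines could be sketched with matplotlib as follows. This is only an illustration under assumptions: the helper `plot_predictions`, the column names `actual`, `predicted` and `prob_occupied`, the 0.5 threshold, and the synthetic data are all hypothetical and stand in for `y_test`, `pred_lr` and `prob1[:, 1]` on a timestamp index.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_predictions(df, threshold=0.5):
    """Scatter predicted classes (green = correct, red = wrong) over P(occupied)."""
    correct = df["predicted"] == df["actual"]
    hits, misses = df[correct], df[~correct]

    fig, ax = plt.subplots(figsize=(10, 4))
    # Predicted probability of class 1 for every timestamp.
    ax.plot(df.index, df["prob_occupied"], color="tab:blue", alpha=0.6, label="P(occupied)")
    # Predicted class, colored by agreement with the ground truth.
    ax.scatter(hits.index, hits["predicted"], color="green", label="correct prediction")
    ax.scatter(misses.index, misses["predicted"], color="red", label="wrong prediction")
    # Horizontal dividing line: above it the model favors class 1, below it class 0.
    ax.axhline(threshold, color="gray", linestyle="--", label=f"threshold = {threshold}")
    ax.set_xlabel("timestamp")
    ax.set_ylabel("predicted class / probability")
    ax.legend(loc="center left", bbox_to_anchor=(1.0, 0.5))
    fig.autofmt_xdate()  # rotate the datetime tick labels so they stay readable
    return ax


# Synthetic example data (not the real dataset).
idx = pd.date_range("2019-02-10 13:45:00", periods=8, freq="15min")
prob_occupied = np.array([0.1, 0.2, 0.7, 0.4, 0.8, 0.9, 0.3, 0.6])
df = pd.DataFrame(
    {
        "actual": [0, 0, 1, 1, 1, 1, 0, 0],
        "predicted": (prob_occupied >= 0.5).astype(int),
        "prob_occupied": prob_occupied,
    },
    index=idx,
)
ax = plot_predictions(df)
```

In a Jupyter Notebook the figure renders inline once the cell finishes. Because the frame keeps the timestamp index, matplotlib places the points on a datetime x-axis without any extra conversion.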