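I would like to output a clean DataFrame that shows the model name, the parameters used in the model, and the resulting scoring metrics. It would be even better if there were a smarter way to iterate over the metric functions (given that their arguments vary). Example picture of what I'm aiming for.
This is what I have so far: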
def train_predict_score(clf, X_train, y_train, X_test, y_test):
    clf = clf.fit(X_train, y_train)
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    result = []
    result.append(roc_auc_score(y_train, y_pred_train))
    result.append(roc_auc_score(y_test, y_pred_test))
    result.append(cohen_kappa_score(y_train, y_pred_train))
    result.append(cohen_kappa_score(y_test, y_pred_test))
    result.append(f1_score(y_train, y_pred_train, pos_label=1))
    result.append(f1_score(y_test, y_pred_test, pos_label=1))
    result.append(precision_score(y_train, y_pred_train, pos_label=1))
    result.append(precision_score(y_test, y_pred_test, pos_label=1))
    result.append(recall_score(y_train, y_pred_train, pos_label=1))
    result.append(recall_score(y_test, y_pred_test, pos_label=1))
    return result

# Initialize default models
clf1 = LogisticRegression(random_state=0)
clf2 = DecisionTreeClassifier(random_state=0)
clf3 = RandomForestClassifier(random_state=0)
clf4 = GradientBoostingClassifier(random_state=0)
results = []

# Build initial models
for clf in [clf1, clf2, clf3, clf4]:
    result = []
    result.append(clf)  # name and parameters - how can I show all info? it gets truncated
    result.append(train_predict_score(clf, X_train, y_train, X_test, y_test))  # how to parse this out into individual columns?
    results.append(result)

results = pd.DataFrame(results, columns=['clf', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 'prec_train',
                                         'prec_test', 'recall_train', 'recall_test'])
results
Answer 0 (score: 0)
Because functions are objects, you can put them in a list and simply iterate over it. For example:
def add1(x):
    return x + 1

def sub1(x):
    return x - 1

for func in [add1, sub1]:
    print(func(10))
yields
11
9
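The same idea carries over to metric functions: each function's `__name__` attribute can label its result, so the column names do not have to be hard-coded. A minimal sketch, assuming two hypothetical stand-in metrics (`accuracy` and `error_rate` are illustrative helpers, not sklearn functions, and `score_all` is a name made up for this example):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    return 1 - accuracy(y_true, y_pred)


def score_all(metrics, y_true, y_pred, suffix):
    """Apply every metric and key each result by function name plus suffix."""
    return {f"{m.__name__}_{suffix}": m(y_true, y_pred) for m in metrics}


y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
scores = score_all([accuracy, error_rate], y_true, y_pred, "test")
print(scores)  # {'accuracy_test': 0.75, 'error_rate_test': 0.25}
```

Since every metric here takes the same `(y_true, y_pred)` arguments, one loop covers all of them; metrics that need extra arguments could be wrapped with `functools.partial` first.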
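As I understand it, you want to store the model's name (e.g. LogisticRegression) and its parameters in different columns. First, you can get the parameters like this: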
clf.get_params()
This returns all model parameters as a dictionary. To get the model name, you can take the model's string representation and split it once on '('. The first element of the resulting list is the model's name. So
>>> clf
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
becomes
>>> str(clf).split('(', 1)[0]
'LogisticRegression'
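An alternative that does not depend on the format of the string representation is `type(clf).__name__`, which returns the class name directly. The sketch below also spreads the dictionary from `get_params()` over individual entries, one per parameter. `ToyClassifier` and `build_row` are hypothetical names invented so the snippet runs without sklearn installed; the only assumption is that the real estimator exposes `get_params()`, as sklearn estimators do.

```python
class ToyClassifier:
    """Hypothetical stand-in that mimics the get_params() interface."""

    def __init__(self, C=1.0, penalty="l2"):
        self.C = C
        self.penalty = penalty

    def get_params(self):
        return {"C": self.C, "penalty": self.penalty}


def build_row(clf):
    """Return one flat dict: the model name plus one entry per parameter."""
    row = {"Model": type(clf).__name__}
    for name, value in clf.get_params().items():
        row["param_" + name] = value
    return row


row = build_row(ToyClassifier(C=0.5))
print(row)  # {'Model': 'ToyClassifier', 'param_C': 0.5, 'param_penalty': 'l2'}
```

A list of such rows can be passed to `pd.DataFrame`; a parameter that a given model does not have then shows up as NaN in that model's row.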
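Here is a small example that should do what you want. It trains 3 different classifiers on sklearn's breast_cancer dataset and returns the roc_auc, f1, precision and recall scores on the train and test sets in a DataFrame: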
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# load and split example dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# classifiers with default parameters
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = SVC()
clf_list = [clf1, clf2, clf3]

results_list = []
for clf in clf_list:
    clf.fit(X_train, y_train)
    res = {}
    # extract the model name from the object string
    res['Model'] = str(clf).split('(', 1)[0]
    # get parameters via get_params() method
    res['Parameters'] = clf.get_params()
    # for every metric, record performance on train and test set
    for metric_score in [roc_auc_score, f1_score, precision_score, recall_score]:
        metric_name = metric_score.__name__
        res[metric_name + '_train'] = metric_score(y_train, clf.predict(X_train))
        res[metric_name + '_test'] = metric_score(y_test, clf.predict(X_test))
    results_list.append(res)

results_df = pd.DataFrame(results_list)
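The resulting DataFrame (the column names below appear to come from a run that shortened the metric names; the code as shown produces names such as f1_score_test):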
print(results_df.to_string())
                    Model                                         Parameters   f1_test  f1_train  precision_test  precision_train  recall_test  recall_train  roc_au_test  roc_au_train
0      LogisticRegression  {'fit_intercept': True, 'warm_start': False, '...  0.922384  0.969697        0.922384         0.966038     0.922384      0.973384     0.922384      0.959085
1  RandomForestClassifier  {'criterion': 'gini', 'warm_start': False, 'n_...  0.928137  0.998095        0.928137         1.000000     0.928137      0.996198     0.928137      0.998099
2                     SVC  {'decision_function_shape': None, 'verbose': F...  0.500000  1.000000        0.500000         1.000000     0.500000      1.000000     0.500000      1.000000
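Note: since you mentioned in your question that the DataFrame contents get truncated: this happens only for display purposes when you print the DataFrame in the console, as I did above. The content is still there when you access the corresponding cell directly.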
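To illustrate, here is a small sketch with made-up data (the `results` frame and its values are placeholders, not output of the models above). It reads a cell directly, then renders the frame with the pandas `display.max_colwidth` option lifted, which should show the full cell content instead of an ellipsis.

```python
import pandas as pd

# Placeholder row; the dict stands in for the output of get_params().
params = {"C": 1.0, "fit_intercept": True, "max_iter": 100,
          "penalty": "l2", "solver": "liblinear", "tol": 0.0001}
results = pd.DataFrame([{"Model": "LogisticRegression", "Parameters": params}])

# Direct access returns the original dict, untouched by display settings.
cell = results.loc[0, "Parameters"]
print(cell["solver"])

# Temporarily lift the column-width limit so the printed frame is not cut off.
with pd.option_context("display.max_colwidth", None):
    full_text = str(results)
print(full_text)
```

Using `pd.option_context` keeps the change local to the `with` block; `pd.set_option("display.max_colwidth", None)` would apply it for the rest of the session.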