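Iterating over functions and outputting the results in an organized pandas DataFrame

Time: 2017-01-27 19:11:36

Tags: python pandas scikit-learn

I would like to output a clean DataFrame that shows the model name, the parameters used in the model, and the resulting scoring metrics. It would be even better if there were a smarter way to iterate over the metric functions (given varying parameters). Example picture of what I'm aiming for.

Here is what I have so far: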

def train_predict_score(clf, X_train, y_train, X_test, y_test):
    clf = clf.fit(X_train, y_train)
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)

    result = []
    result.append(roc_auc_score(y_train, y_pred_train))
    result.append(roc_auc_score(y_test, y_pred_test))
    result.append(cohen_kappa_score(y_train, y_pred_train))
    result.append(cohen_kappa_score(y_test, y_pred_test))
    result.append(f1_score(y_train, y_pred_train, pos_label=1))
    result.append(f1_score(y_test, y_pred_test, pos_label=1))
    result.append(precision_score(y_train, y_pred_train, pos_label=1))
    result.append(precision_score(y_test, y_pred_test, pos_label=1))
    result.append(recall_score(y_train, y_pred_train, pos_label=1))
    result.append(recall_score(y_test, y_pred_test, pos_label=1))

    return result

# Initialize default models
clf1 = LogisticRegression(random_state=0)
clf2 = DecisionTreeClassifier(random_state=0)
clf3 = RandomForestClassifier(random_state=0)
clf4 = GradientBoostingClassifier(random_state=0)

results = []

# Build initial models
for clf in [clf1, clf2, clf3, clf4]:
    result = []
    result.append(clf) # name and parameters - how can I show all info? it gets truncated
    result.append(train_predict_score(clf, X_train, y_train, X_test, y_test)) # how to parse this out into individual columns?
    results.append(result)

results = pd.DataFrame(results, columns=['clf', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 'prec_train',
                                         'prec_test', 'recall_train', 'recall_test'])
results

1 answer:

Answer 0 (score: 0)

Iterating over functions

Because functions are objects, you can put them in a list and simply iterate over it. For example:

def add1(x):
    return x+1
def sub1(x):
    return x-1
for func in [add1, sub1]:
    print(func(10))

Output:

11
9
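The same pattern extends to metric functions that need different arguments. Below is a minimal sketch, assuming scikit-learn's metric functions and `functools.partial`; the toy label lists and the dictionary keys are invented for illustration and are not part of the original post:

```python
from functools import partial

from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels, invented for this illustration only.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# Map a column name to a callable; partial() pre-binds the extra arguments.
# A dict is used because partial objects have no __name__ attribute.
metrics = {
    "f1_pos1": partial(f1_score, pos_label=1),
    "f1_pos0": partial(f1_score, pos_label=0),
    "precision": precision_score,
    "recall": recall_score,
}

# Every callable now shares the same (y_true, y_pred) signature.
scores = {name: func(y_true, y_pred) for name, func in metrics.items()}
for name, value in scores.items():
    print(name, round(value, 3))
```

Keeping the names in the dictionary keys means the resulting `scores` dict can be merged straight into a result row.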

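Getting the model name and parameters

As I understand it, you want to store the model's name (e.g. LogisticRegression) and its parameters in different columns. First, you can get the parameters like this:

clf.get_params()

This returns all model parameters as a dictionary. To get the model name, you can take the model's string representation and split it once at '('. The first element of the resulting list is the model's name. So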

>>> clf
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

becomes

>>> str(clf).split('(', 1)[0]
LogisticRegression
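As an alternative to splitting the string, the class name can also be read directly from the object's type. A minimal sketch, assuming a scikit-learn estimator; `C=0.5` is an arbitrary value chosen only to show it reappearing in the parameter dict:

```python
from sklearn.linear_model import LogisticRegression

# C=0.5 is an arbitrary illustrative value, not taken from the question.
clf = LogisticRegression(C=0.5)

# The class name, without parsing the repr string.
model_name = type(clf).__name__

# All constructor parameters as a plain dict.
params = clf.get_params()

print(model_name)
print(params["C"])
```

This does not depend on how the estimator's repr is formatted, which may differ between scikit-learn versions.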

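Example

Here is a small example that should do what you want. It trains 3 different classifiers on sklearn's breast_cancer dataset and returns the roc_auc, f1, precision, and recall scores on the train and test sets in a DataFrame: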

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

#load and split example dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

#classifiers with default parameters
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = SVC()
clf_list = [clf1, clf2, clf3]

results_list = []

for clf in clf_list:
    clf.fit(X_train, y_train)
    # predict once per split and reuse the predictions for every metric
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)

    res = {}
    # extract the model name from the object's string representation
    res['Model'] = str(clf).split('(', 1)[0]
    # get all parameters as a dict via the get_params() method
    res['Parameters'] = clf.get_params()

    # for every metric, record performance on the train and test sets
    for metric_score in [roc_auc_score, f1_score, precision_score, recall_score]:
        # drop the '_score' suffix so the columns read e.g. 'f1_train'
        metric_name = metric_score.__name__.replace('_score', '')
        res[metric_name + '_train'] = metric_score(y_train, y_pred_train)
        res[metric_name + '_test'] = metric_score(y_test, y_pred_test)

    results_list.append(res)

results_df = pd.DataFrame(results_list)

The resulting DataFrame:

print(results_df.to_string())

                    Model                                         Parameters   f1_test  f1_train  precision_test  precision_train  recall_test  recall_train  roc_auc_test  roc_auc_train
0      LogisticRegression  {'fit_intercept': True, 'warm_start': False, '...  0.922384  0.969697        0.922384         0.966038     0.922384      0.973384      0.922384       0.959085
1  RandomForestClassifier  {'criterion': 'gini', 'warm_start': False, 'n_...  0.928137  0.998095        0.928137         1.000000     0.928137      0.996198      0.928137       0.998099
2                     SVC  {'decision_function_shape': None, 'verbose': F...  0.500000  1.000000        0.500000         1.000000     0.500000      1.000000      0.500000       1.000000
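If you would rather have each parameter in its own column instead of one dict per row, one option is `pd.json_normalize`. This is a minimal sketch under assumptions: the records below are hypothetical stand-ins shaped like the entries of `results_list`, with illustrative parameter values rather than values from a real run:

```python
import pandas as pd

# Hypothetical records shaped like the entries of results_list;
# the parameter values are illustrative, not taken from a real run.
records = [
    {"Model": "LogisticRegression", "Parameters": {"C": 1.0, "penalty": "l2"}},
    {"Model": "SVC", "Parameters": {"C": 1.0, "kernel": "rbf"}},
]

# Nested dict keys become dotted column names such as 'Parameters.C'.
# Parameters that a model does not have are filled with NaN.
flat_df = pd.json_normalize(records)
print(flat_df.columns.tolist())
```

Since different models expose different parameters, the flattened frame can become wide and sparse, so keeping the dict in a single column may still be the more readable choice.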

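Note: since you mentioned in your question that the DataFrame contents get truncated: this happens only for display purposes when you print the DataFrame in the console, as I did above. The content is still there when you access the corresponding cell directly.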
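To illustrate, here is a minimal sketch that reads a cell directly and also widens the display temporarily. It assumes pandas 1.0 or later, where `None` is accepted for `display.max_colwidth`; the one-row frame is a made-up stand-in for the results table:

```python
import pandas as pd

# A made-up one-row frame standing in for the results table.
df = pd.DataFrame({
    "Parameters": [{"fit_intercept": True, "warm_start": False,
                    "penalty": "l2", "solver": "liblinear"}],
})

# Direct cell access returns the complete dict, whatever the display settings.
full_cell = df.loc[0, "Parameters"]
print(full_cell["solver"])

# Temporarily lift the column width limit so printing shows everything.
with pd.option_context("display.max_colwidth", None):
    full_text = repr(df)
print(full_text)
```

Using `option_context` keeps the change local to the `with` block, so the default display settings are restored afterwards.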