Printing a classification report with a decision tree

Time: 2018-08-20 11:58:07

Tags: python tree scikit-learn classification

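This is the Airbnb Prediction dataset that many people have worked on. I want to print a classification report and export it to CSV. I tried print(classification_report(y_pred, y)), but that gives me an error: "ValueError: Mix type of y not allowed, got types {'continuous-multioutput', 'multiclass'}"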

I may be doing this incorrectly, so any help would be appreciated.

The code is below:

import numpy as np # linear algebra

# data processing, CSV file I/O (e.g. pd.read_csv)
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report,confusion_matrix

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter)
# will list the files in the input directory
from subprocess import check_output
df_train = pd.read_csv("train_users_2.csv")
df_test = pd.read_csv("test_users.csv")

# Get the values of the country destination for each row
labels = df_train['country_destination'].values 

# It's the output variable for the decision tree
df_train = df_train.drop(['country_destination'], axis=1) 
id_test = df_test['id']
piv_train = df_train.shape[0]

df_all = pd.concat((df_train, df_test), axis = 0, ignore_index = True)
df_all = df_all.drop(['id','date_first_booking'], axis=1)

# '-unknown-' is not treated as a missing value, so replace it with NaN.
# Assigning the result back avoids relying on inplace=True through attribute access.
df_all['gender'] = df_all['gender'].replace('-unknown-', np.nan)

print(df_all.isnull().sum())
df_all = df_all.fillna(-1)
dac = np.vstack(df_all.date_account_created.astype(str).apply(
                lambda x: list(map(int, x.split('-')))).values)
print(dac)
df_all['dac_year'] = dac[:,0]
df_all['dac_mounth'] = dac[:,1]
df_all['dac_day'] = dac[:,2]
df_all = df_all.drop(['date_account_created'], axis = 1)

tfa = np.vstack(df_all.timestamp_first_active.astype(str).apply(
                lambda x: list(map(int, [x[:4],x[4:6],x[6:8],x[8:10],x[10:12],x[12:14]]))).values)
print(tfa)
df_all['tfa_year'] = tfa[:,0]
df_all['tfa_month'] = tfa[:,1]
df_all['tfa_day'] = tfa[:,2]
df_all = df_all.drop(['timestamp_first_active'], axis=1)

# The age column contains some inconsistent values
print(df_all.age.describe()) 
av = df_all.age.values
df_all['age'] = np.where(np.logical_or(av<14, av>100), -1, av)

features = ['gender', 'signup_method', 'signup_flow', 
'language', 'affiliate_channel', 'affiliate_provider', 
'first_affiliate_tracked', 'signup_app', 
'first_device_type', 'first_browser']

for f in features:
    df_all_dummy = pd.get_dummies(df_all[f], prefix=f)
    df_all = df_all.drop([f], axis=1)
    df_all = pd.concat((df_all, df_all_dummy), axis=1)

vals = df_all.values
X = vals[:piv_train]
le = LabelEncoder()
y = le.fit_transform(labels)   
X_test = vals[piv_train:]

model = RandomForestClassifier()
model.fit(X,y)
y_pred = model.predict_proba(X_test)

ids = []  #list of ids
cts = []  #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += le.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()

1 answer:

Answer 0 (score: 0)

The code you posted runs without any errors (although I am not sure what the last for loop is for). Now you want to run the following:

print(classification_report(y_pred, y))

but it raises an error. There are two reasons for that:

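  • The first is that the arguments are in the wrong order. In classification_report(), the actual labels (the ground truth) come first and the predicted labels come second. So your call should be: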

    print(classification_report(y,y_pred))

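  • The second reason is that your y_pred is the output of model.predict_proba(). It contains the probability of every class for every sample. You cannot pass that to confusion_matrix, and classification_report likewise needs the discrete labels predicted by the model. Do this instead: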

    y_pred = model.predict(X_test)
    print(classification_report(y, y_pred))

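  • However, you will still get another error. classification_report() compares values element by element, so it needs the actual labels for the same data that was passed to model.predict(). But here you have: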

  

    y = actual labels of the training data
    y_pred = model.predict(X_test), where X_test is the test data

So you cannot compare the labels of the training data with predictions made on the test data.
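One way around this is to hold out part of the labelled training data and evaluate on that. The sketch below is only an illustration under assumptions: it uses synthetic data from make_classification in place of the Airbnb features, and the names X_val, y_val, report_df and the output filename are hypothetical. It also shows why predict_proba() and predict() outputs are treated as different target types, and one possible way to export the report to CSV using output_dict=True (available in scikit-learn 0.20 and later).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils.multiclass import type_of_target

# Synthetic stand-in for the encoded Airbnb features and labels (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

# Hold out part of the labelled data so that true labels exist
# for exactly the rows that get predicted.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# predict_proba returns one probability column per class,
# while predict returns one discrete label per row.
proba = model.predict_proba(X_val)
y_val_pred = model.predict(X_val)

# These are the two target types named in the ValueError.
print(type_of_target(proba))       # 2-D float array: 'continuous-multioutput'
print(type_of_target(y_val_pred))  # 1-D integer labels: 'multiclass'

# True labels first, predicted labels second.
print(classification_report(y_val, y_val_pred))

# Get the same report as a dict, turn it into a DataFrame, and write it to CSV.
report = classification_report(y_val, y_val_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()
report_df.to_csv("classification_report.csv")
```

With the real data, the same split would be applied to X and y from the question, while X_test (which has no labels) would only be used for the final predictions.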
