Incorrect data type for response variable when performing grid search

Time: 2019-03-04 17:01:57

Tags: python machine-learning scikit-learn grid-search

I am trying to run a grid search on an Isolation Forest in sklearn. This is my code so far:

import pandas as pd

df = pd.read_csv('/content/PS_20174392719_1491204439457_log.csv')

from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing

#Factorize Categorical Variables
df['type'] = pd.factorize(df['type'])[0]
df['nameOrig'] = pd.factorize(df['nameOrig'])[0]
df['nameDest'] = pd.factorize(df['nameDest'])[0]
df['isFraud'] = pd.factorize(df['isFraud'])[0] #Target Variable
#Normalize Continuous Variables
df['amount']= StandardScaler().fit_transform(df['amount'].values.reshape(-1,1))

df['oldbalanceOrg'] = StandardScaler().fit_transform(df['oldbalanceOrg'].values.reshape(-1,1))
df['newbalanceOrig'] = StandardScaler().fit_transform(df['newbalanceOrig'].values.reshape(-1,1))
df['oldbalanceDest'] = StandardScaler().fit_transform(df['oldbalanceDest'].values.reshape(-1,1))
df['step'] = StandardScaler().fit_transform(df['step'].values.reshape(-1,1))
df['newbalanceDest'] = StandardScaler().fit_transform(df['newbalanceDest'].values.reshape(-1,1))

del df['nameOrig']
del df['nameDest']
del df['step']

#RAM keeps crashing. Use less data
df_validation = df.iloc[4453834:] #Do grid search for more than 15% of overall data
df = df.iloc[0:636262, :]

df_validation_target = df_validation['isFraud']
df_validation = df_validation.drop(['isFraud'], axis=1)

Initially I did not explicitly cast the target variable to a categorical type, and I got the same error that I am getting now:

df_validation_target = df_validation_target.astype('category')

#Do GridSearch on the Validation Set (contains both classes)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

my_scoring_func = make_scorer(f1_score)

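params = {
    'n_estimators': [5, 10, 50, 100, 300, 500, 700, 900, 1000],
    # Duplicate key: Python keeps only the last 'max_features' entry below,
    # so this integer list is silently ignored.
    'max_features': [5, 10, 30, 50, 70, 100, 300, 500, 1000],
    'contamination': [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5],
    'max_features': [0.1, 0.2, 0.3, 0.4]
}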


from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples='auto')
grid_s = GridSearchCV(estimator=clf, param_grid=params, cv=3, scoring=my_scoring_func)

grid_s.fit(df_validation, df_validation_target)

The traceback points to the grid_s.fit line, and the error message is:

ValueError: Target is multiclass but average='binary'. Please choose another average setting.

df_validation has 4 numeric columns that are scaled to (-1,1), and two categorical variables that were factorized and contain 3 classes (0, 1, 2).
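A note on the scaling claim above: StandardScaler centers each column to zero mean and unit variance, so its output is not confined to (-1, 1). The sketch below uses made-up numbers (a hypothetical `amounts` array, not the actual dataset) to show the difference, and uses MinMaxScaler with feature_range=(-1, 1) as one possible way to get a bounded range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical skewed column standing in for something like 'amount'.
amounts = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

# Zero mean, unit variance: large values can still exceed 1 in magnitude.
standardized = StandardScaler().fit_transform(amounts)

# Linear rescaling so the minimum maps to -1 and the maximum to 1.
bounded = MinMaxScaler(feature_range=(-1, 1)).fit_transform(amounts)

print("StandardScaler range:", standardized.min(), standardized.max())
print("MinMaxScaler range:  ", bounded.min(), bounded.max())
```

Whether the range matters for the model is a separate question; the point here is only that the two scalers give different guarantees.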

The target variable is binary.

Any help would be great!

Edit: here is the result of df_validation_target.unique():

array([0, 1])
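A possible cause, offered as an assumption rather than something confirmed in the post: IsolationForest.predict returns -1 for outliers and +1 for inliers, while the target uses 0 and 1. When f1_score compares the two, it sees three distinct labels (-1, 0, 1) and treats the problem as multiclass. The sketch below reproduces the error on tiny hand-written arrays and then shows one way around it, a custom scorer that maps the predictions back to 0/1. The helper name `f1_for_outlier_labels` and the synthetic data are hypothetical, and the mapping assumes that fraud (1) corresponds to the outlier label.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def f1_for_outlier_labels(y_true, y_pred):
    """F1 for class 1 after mapping IsolationForest output to 0/1.

    Assumption: outliers (-1) correspond to fraud (1), inliers (+1) to 0.
    """
    y_pred_01 = np.where(np.asarray(y_pred) == -1, 1, 0)
    return f1_score(y_true, y_pred_01)


# Reproduce the error: {0, 1} targets against raw {-1, 1} predictions.
y_small = [0, 0, 1, 1]
raw_small = [1, 1, -1, 1]
try:
    f1_score(y_small, raw_small)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Synthetic stand-in for the real data (hypothetical values).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(285, 4)),  # "normal" rows
    rng.normal(6.0, 1.0, size=(15, 4)),   # "fraud" rows, far from the rest
])
y = np.array([0] * 285 + [1] * 15)

# Stratified folds so every test fold contains some positive examples.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
grid = GridSearchCV(
    estimator=IsolationForest(random_state=0),
    param_grid={"contamination": [0.05, 0.1]},
    scoring=make_scorer(f1_for_outlier_labels),
    cv=cv,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Casting the target to a categorical type would not help under this explanation, because the extra label comes from the predictions rather than from the target.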

0 answers:

No answers