I am trying to run a grid search on an Isolation Forest in sklearn. This is my code so far:
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('/content/PS_20174392719_1491204439457_log.csv')
#Factorize Categorical Variables
df['type'] = pd.factorize(df['type'])[0]
df['nameOrig'] = pd.factorize(df['nameOrig'])[0]
df['nameDest'] = pd.factorize(df['nameDest'])[0]
df['isFraud'] = pd.factorize(df['isFraud'])[0] #Target Variable
#Normalize Continuous Variables
df['amount']= StandardScaler().fit_transform(df['amount'].values.reshape(-1,1))
df['oldbalanceOrg'] = StandardScaler().fit_transform(df['oldbalanceOrg'].values.reshape(-1,1))
df['newbalanceOrig'] = StandardScaler().fit_transform(df['newbalanceOrig'].values.reshape(-1,1))
df['oldbalanceDest'] = StandardScaler().fit_transform(df['oldbalanceDest'].values.reshape(-1,1))
df['step'] = StandardScaler().fit_transform(df['step'].values.reshape(-1,1))
df['newbalanceDest'] = StandardScaler().fit_transform(df['newbalanceDest'].values.reshape(-1,1))
del df['nameOrig']
del df['nameDest']
del df['step']
#RAM keeps crashing. Use less data
df_validation = df.iloc[4453834:] #Do grid search for more than 15% of overall data
df = df.iloc[0:636262, :]
df_validation_target = df_validation['isFraud']
df_validation = df_validation.drop(['isFraud'], axis=1)
Initially I did not explicitly cast the target variable to a categorical type, and I got the same error I am getting now:
df_validation_target = df_validation_target.astype('category')
#Do GridSearch on the Validation Set (contains both classes)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer
my_scoring_func = make_scorer(f1_score)
params = {
    'n_estimators': [5, 10, 50, 100, 300, 500, 700, 900, 1000],
    'contamination': [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5],
    # 'max_features' was listed twice; Python keeps only the last value, so only that one is kept here
    'max_features': [0.1, 0.2, 0.3, 0.4]
}
from sklearn.ensemble import IsolationForest
clf = IsolationForest(max_samples='auto')
grid_s = GridSearchCV(estimator=clf, param_grid=params, cv=3, scoring=my_scoring_func)
grid_s.fit(df_validation, df_validation_target)
The traceback points to the grid_s.fit line, and the error message is:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
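df_validation has 4 numeric columns that are scaled to (-1, 1) and two categorical variables that were factorized and contain 3 classes (0, 1, 2).
The target variable is binary.
Any help would be great!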
Edit: this is the result of df_validation_target.unique():
array([0, 1])
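One possible explanation, sketched here under assumptions rather than confirmed on the real data: IsolationForest.predict returns -1 for outliers and 1 for inliers, while the target uses 0 and 1. f1_score would then see three distinct labels (-1, 0, 1) across target and prediction, which it treats as multiclass even though each array alone is binary. The sketch below uses synthetic data, and the helper f1_outlier_as_positive is a hypothetical name, not part of sklearn. It also assumes that fraud (label 1) should correspond to the outlier class.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, make_scorer

# Synthetic stand-in for the fraud data (assumption: not the real dataset).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
X[:10] += 6.0                      # shift ten rows so they look anomalous
y = np.zeros(200, dtype=int)
y[:10] = 1                         # 1 = fraud, 0 = normal, as in the question

clf = IsolationForest(n_estimators=50, contamination=0.05, random_state=0).fit(X)
raw_pred = clf.predict(X)          # IsolationForest labels: -1 = outlier, 1 = inlier

# Labels that f1_score sees: the union of target and prediction values.
labels_seen = np.union1d(y, raw_pred).tolist()

# Scoring the raw predictions against 0/1 targets is expected to fail.
try:
    f1_score(y, raw_pred)
    raised = False
except ValueError:
    raised = True

def f1_outlier_as_positive(y_true, y_pred):
    """Hypothetical helper: map -1 (outlier) to 1 and 1 (inlier) to 0, then score."""
    return f1_score(y_true, np.where(y_pred == -1, 1, 0))

my_scoring_func = make_scorer(f1_outlier_as_positive)
score = my_scoring_func(clf, X, y)  # same call GridSearchCV makes internally
```

If this is indeed the cause, passing scoring=my_scoring_func to GridSearchCV as in the code above should avoid the ValueError, because the scorer then compares 0/1 targets with 0/1 predictions.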